We’ve known for some time now that AI models can be made to perform erratically using adversarial examples, or subtly crafted inputs that appear ordinary to humans.
For example, in the case of chatbots that handle both text and image inputs, scholars at Princeton University last year found they could input an image of a panda, subtly altered in ways imperceptible to humans but significant to the chatbot, and cause the chatbot to break its “guardrails.”
“An aligned model can be compelled to heed a wide range of harmful instructions that it otherwise tends to refuse,” the authors wrote, such as producing hate speech or giving tips for committing murder.
Also: The best AI chatbots
What would happen if such models, as they gain greater powers, interact with one another? Could they spread their malfunctioning between each other, like a virus?
Yes, they can, and "exponentially fast," according to a report this month from Xiangming Gu and his colleagues at the National University of Singapore and collaborating institutions. In the theoretical paper, Gu and his colleagues describe how they simulated what happens in a "multi-agent" environment of vision-language models, or VLMs, that have been given "agent" capabilities.
These agents can tap into databases using techniques such as the increasingly popular "retrieval-augmented generation," or RAG, which lets a VLM retrieve an image from a database. A popular example is named LLaVA, for "large language and vision assistant," developed by Microsoft with the help of scholars at the University of Wisconsin and Columbia University.
Gu simulated what happens when a single chatbot agent based on LLaVA, called “Agent Smith,” injects an altered image into a chat with another LLaVA agent. The image can spread throughout the collection of chatbots, causing them all, after several rounds of chatting, to behave erratically.
“We present infectious jailbreak, a new jailbreaking paradigm developed for multi-agent environments,” Gu and team wrote, “in which, analogous to the modeling of infectious diseases, an adversary need only jailbreak a single agent to infect (almost) all other agents exponentially fast.”
Also: I asked Gemini and GPT-4 to explain deep learning AI, and Gemini won hands down
Here's how it works: The authors "injected" an image into Agent Smith by asking it to select from a library of images contained in an image album using RAG. They also injected the chat history with harmful text, such as questions about how to commit murder. They then prompted the agent to ask another agent a question based on the image. The other agent was tasked with taking the image Agent Smith gave it and answering the question Agent Smith posed.
After some time, the adversarial image prompted one agent to retrieve a harmful statement from the chat history and pose it as a question to the other agent. If the other agent responded with a harmful answer, then the adversarial image had done its job.
Their approach is "infectious" because the same malicious, altered image is stored by each answering chatbot, so that the image propagates from one chatbot to another, like a virus.
Also: The safety of OpenAI’s GPT-4 gets lost in translation
Once the mechanics were in place, Gu and his team modeled how fast the tainted image spread among the agents by measuring how many of them produced a harmful question or answer, such as instructions for committing murder.
The attack, of course, has an element of chance: once the altered, malicious image was injected into the system, the virus’ spread depended on how often each chatbot retrieved the image and also asked a harmful question about that image.
The authors compared their method to known methods of infecting multiple agents, such as a "sequential attack," in which each chatbot has to be attacked one at a time, from a blank slate. Their "infectious" approach is superior: they found they were able to spread the malicious image among the chatbots much faster.
“The sequential jailbreak ideally manages to infect 1/8 of almost all agents cumulatively after 32 chat rounds, exhibiting a linear rate of infection,” Gu and his team wrote. “Our method demonstrates efficacy, achieving infection of all agents at an exponential rate, markedly surpassing the baselines.”
“…Without any further intervention from the adversary, the infection ratio […] reaches ∼100% exponentially fast after only 27 – 31 chat rounds, and all infected agents exhibit harmful behaviors,” according to Gu and his team.
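The linear-versus-exponential contrast can be illustrated with a toy simulation. This is a sketch of my own, not the authors' code: the agent count, random pairing scheme, and transmission probability are all illustrative assumptions. Each round, agents pair off at random, and an infected agent passes the adversarial image to a healthy partner with some probability; the sequential baseline, by contrast, infects one new agent per round, which with 256 agents works out to exactly 1/8 of the population after 32 rounds, echoing the quoted figure.

```python
import random

def infectious_spread(n_agents=256, p_transmit=0.8, rounds=32, seed=0):
    """Toy model of the infectious jailbreak dynamic: each round,
    agents are randomly paired, and an infected agent passes the
    adversarial image to a healthy partner with probability
    p_transmit (the chance it retrieves and shows the image)."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True  # the adversary jailbreaks a single agent
    history = []
    for _ in range(rounds):
        order = list(range(n_agents))
        rng.shuffle(order)
        for i in range(0, n_agents, 2):
            a, b = order[i], order[i + 1]
            if infected[a] != infected[b] and rng.random() < p_transmit:
                infected[a] = infected[b] = True  # image is stored
        history.append(sum(infected) / n_agents)
    return history

def sequential_spread(n_agents=256, rounds=32):
    """Baseline: the adversary attacks one new agent per chat round."""
    return [min((r + 1) / n_agents, 1.0) for r in range(rounds)]

print(f"infectious after 32 rounds: {infectious_spread()[-1]:.0%}")
print(f"sequential after 32 rounds: {sequential_spread()[-1]:.0%}")
```

The compounding is the whole story: every newly infected agent becomes a spreader in the next round, so the infected fraction roughly multiplies each round rather than growing by a constant increment.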
From an attacker’s point of view, the infectious route makes attacking systems of agents much easier. “To jailbreak almost all N agents in a multi-agent environment,” Gu and his team wrote, “an infectious jailbreak method enables the adversary to incur a fixed cost for jailbreaking (only needing to initially jailbreak a fraction of agents […], and then waiting for a logarithmic amount of time with no further intervention.”
Such a risk may seem far-fetched. Most human users are accustomed to working with a single chatbot. But Gu and his team warn that chatbot agents such as LLaVA, armed with memory retrieval, are being integrated into AI-infused infrastructure.
Also: What to know about Mistral AI: The company behind the latest GPT-4 rival
“These MLLM [multi-modal large language model] agents are being integrated into robots or virtual assistants, granted memory banks and the ability to use tools, in line with the growing trend of deploying MLLM agents in manufacturing or daily life,” Gu and his team wrote.
There is hope for forestalling the infection, the authors wrote. Because there’s an element of chance around whether a given chatbot agent retrieves the adversarial image in a given round of chat, infection can be stymied by reducing the chances that an agent spreads the malicious image.
“If a defense mechanism can more efficiently recover infected agents or lower down infection rate […] then this defense is provably to decrease the infection rate to zero […]” they wrote.
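That recovery condition can be made concrete with a toy infection simulation, again an illustrative sketch of my own rather than the paper's formal model: give each infected agent a per-round chance of being "recovered" (say, by purging the adversarial image from its memory bank), and the spread dies out once recovery outpaces transmission.

```python
import random

def spread_with_defense(n_agents=256, p_transmit=0.8, p_recover=0.0,
                        rounds=64, seed=0):
    """Toy pairing model of infectious spread, extended with a defense
    that recovers each infected agent with probability p_recover per
    round. All parameters are illustrative assumptions."""
    rng = random.Random(seed)
    infected = [False] * n_agents
    infected[0] = True  # a single agent is jailbroken initially
    for _ in range(rounds):
        order = list(range(n_agents))
        rng.shuffle(order)
        for i in range(0, n_agents, 2):
            a, b = order[i], order[i + 1]
            if infected[a] != infected[b] and rng.random() < p_transmit:
                infected[a] = infected[b] = True
        # The defense acts after each chat round.
        for i in range(n_agents):
            if infected[i] and rng.random() < p_recover:
                infected[i] = False
    return sum(infected) / n_agents

# A weak defense fails to contain the spread; a sufficiently strong
# one can drive the infection ratio toward zero.
print("no defense:    ", spread_with_defense(p_recover=0.0))
print("strong defense:", spread_with_defense(p_recover=0.9))
```

With no recovery the infection saturates the population, while a high recovery rate makes each infection subcritical, so the outbreak fizzles instead of compounding, which is the intuition behind the authors' provable-defense condition.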
However, they also added, “How to design a practical defense for our infectious jailbreak method remains an open and urgent question.”