Such attacks could become a major issue as LLMs become increasingly multimodal or are capable of responding contextually to inputs that combine text, audio, pictures, and even video.
Hiding Instructions in Images and Audio At Black Hat Europe 2023 this week, researchers from Cornell University will demonstrate an attack they developed that uses images and sounds to inject instructions into multimodal LLMs that causes the model to output attacker-specified text and instructions.
The researchers blended an instruction into an audio clip available online that caused PandaGPT to respond with an attacker-specific string.
The researchers blended an instruction in an image of a building, that would have caused LLaVA to chat like Harry Potter if a user had input the image into the chatbot and asked a question about it.
Ben Nassi, a researcher at Cornell University and one of the authors of the report, says one of the goals of their research was to find ways to inject prompts indirectly into a multimodal chatbot in a manner undetectable to the user.
Nassi describes the research as building on studies by others showing how LLMs are vulnerable to prompt injection attacks where an adversary might engineer inputs or prompts in such a manner as to intentionally influence the model's output.
The attack that Nassi and his team will demonstrate at Black Hat is different in that in involves an indirect prompt.
In other words, the user is not so much the attacker - as is the case with regular prompt injection - but rather the victim.
The other two authors are Cornell researchers Tsung-Yin Hsieh and Vitaly Shmatikov.
Indirect Prompt Injection Attacks The new paper is not the first to explore the idea of indirect prompt injection as a way to attack LLMs. In May, researchers at Germany's CISPA Helmholtz Center for Information Security at Saarland University and Sequire Technology published a report that described how an attacker could exploit LLM models by injecting hidden prompts into data that the model would likely retrieve when responding to a user input.
In that case the attack involved strategically placed text prompts.
Bagdasaryan says their attack is different because it shows how an attacker could inject malicious instructions into audio and image inputs as well, making them potentially harder to detect.
One other distinction with the attacks involving manipulated audio and image inputs is that the chatbot will continue to respond in its instructed manner during the entirely of a conversation.
Prompting the chatbot to respond in Harry Potter-like fashion causes it to continue to do so even when the user may have stopped asking about the specific image or audio sample.
Potential ways to direct a user to a weaponized image or audio clip could include a phishing or social engineering lure to a webpage with an interesting image or an email with an audio clip.
The research is significant because many organizations are rushing to integrate LLM capabilities into their applications and operations.
Attackers that devise ways to sneak poisoned text, image, and audio prompts into these environments could cause significant damage.
This Cyber News was published on www.darkreading.com. Publication date: Tue, 05 Dec 2023 22:50:32 +0000