As AI proliferates, so does the discovery and exploitation of AI cybersecurity vulnerabilities.
Prompt injection is one such vulnerability that specifically attacks generative AI. In Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, NIST defines various adversarial machine learning tactics and cyberattacks, like prompt injection, and advises users on how to mitigate and manage them.
AML tactics extract information about how machine learning systems behave to discover how they can be manipulated.
That information is used to attack AI and its large language models to circumvent security, bypass safeguards and open paths to exploit.
NIST defines two prompt injection attack types: direct and indirect.
With direct prompt injection, a user enters a text prompt that causes the LLM to perform unintended or unauthorized actions.
An indirect prompt injection is when an attacker poisons or degrades the data that an LLM draws from.
One of the best-known direct prompt injection methods is DAN, Do Anything Now, a prompt injection used against ChatGPT. DAN uses roleplay to circumvent moderation filters.
In its first iteration, prompts instructed ChatGPT that it was now DAN. DAN could do anything it wanted and should pretend, for example, to help a nefarious person create and detonate explosives.
Indirect prompt injection, as NIST notes, depends on an attacker being able to provide sources that a generative AI model would ingest, like a PDF, document, web page or even audio files used to generate fake voices.
Indirect prompt injection is widely believed to be generative AI's greatest security flaw, without simple ways to find and fix these attacks.
Explore AI cybersecurity solutions How to stop prompt injection attacks.
These attacks tend to be well hidden, which makes them both effective and hard to stop.
For model creators, NIST suggests ensuring training datasets are carefully curated.
They also suggest training the model on what types of inputs signal a prompt injection attempt and training on how to identify adversarial prompts.
For indirect prompt injection, NIST suggests human involvement to fine-tune models, known as reinforcement learning from human feedback.
RLHF helps models align better with human values that prevent unwanted behaviors.
NIST further suggests using LLM moderators to help detect attacks that don't rely on retrieved sources to execute.
Finally, NIST proposes interpretability-based solutions.
Learn more about how IBM Security delivers AI cybersecurity solutions that strengthen security defenses.
This Cyber News was published on securityintelligence.com. Publication date: Tue, 19 Mar 2024 14:13:05 +0000