The exploding use of large language models in industry and across organizations has sparked a flurry of research into how susceptible LLMs are to generating harmful and biased content when prompted in specific ways.
The latest example is a new paper from researchers at Robust Intelligence and Yale University that describes a completely automated way to get even state-of-the-art black box LLMs to escape guardrails put in place by their creators and generate toxic content.
Tree of Attacks With Pruning

Black box LLMs are large language models, such as those behind ChatGPT, whose architecture, datasets, training methodologies, and other details are not publicly known.
An aligned LLM such as the one behind ChatGPT and other AI chatbots is explicitly designed to minimize potential for harm and would not, for example, normally respond to a request for information on how to build a bomb.
An unaligned LLM is optimized for accuracy and generally has fewer such constraints, or none at all.
With TAP, the researchers have shown how they can get an unaligned LLM to prompt an aligned target LLM on a potentially harmful topic and then use the target's responses to keep refining the original prompt.
The process basically continues until one of the generated prompts jailbreaks the target LLM and gets it to spew out the requested information.
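The refine-query-prune loop described above can be sketched roughly as follows. This is an illustrative mock, not the researchers' actual code: the function names, the 1-10 scoring scale, and the toy model behaviors are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of a TAP-style attack loop: an attacker model branches
# each prompt into refined variants, the target is queried, weak branches are
# pruned, and the loop repeats until a variant jailbreaks the target.

def tap_attack(goal, attacker, evaluator, target, max_depth=5, branching=3, width=4):
    """Grow a tree of refined prompts, keeping only the best `width` leaves."""
    frontier = [goal]  # root of the tree: the raw harmful request
    for _ in range(max_depth):
        scored = []
        for prompt in frontier:
            for variant in attacker(prompt, branching):    # refine each prompt
                response = target(variant)                 # query the target LLM
                score = evaluator(variant, response, goal) # 1-10 jailbreak score
                if score >= 10:
                    return variant, response               # jailbreak succeeded
                scored.append((score, variant))
        scored.sort(key=lambda s: s[0], reverse=True)      # prune: keep best leaves
        frontier = [v for _, v in scored[:width]]
        if not frontier:
            break
    return None, None

# Toy mocks standing in for real model calls: the "target" only leaks once a
# prompt has been refined three times.
def mock_attacker(prompt, n):
    return [f"{prompt} [refined]" for _ in range(n)]

def mock_target(prompt):
    return "LEAKED" if prompt.count("[refined]") >= 3 else "I can't help with that."

def mock_evaluator(prompt, response, goal):
    return 10 if "LEAKED" in response else prompt.count("[refined]")

prompt, response = tap_attack("harmful request X", mock_attacker, mock_evaluator, mock_target)
```

In the real attack, the attacker, evaluator, and target would each be calls to an LLM; the pruning step is what keeps the tree of candidate prompts from exploding as it deepens.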
Rapidly Proliferating Research Interest

The new research is the latest among a growing number of studies in recent months that show how LLMs can be coaxed into unintended behavior, like revealing training data and sensitive information with the right prompt.
Some of the research has focused on getting LLMs to reveal potentially harmful or unintended information by directly interacting with them via engineered prompts.
Other studies have shown how adversaries can elicit the same behavior from a target LLM via indirect prompts hidden in text, audio, and image samples in data the model would likely retrieve when responding to a user input.
Such prompt injection methods to get a model to diverge from intended behavior have relied at least to some extent on manual interaction.
The output these prompts have generated has often been nonsensical.
The new TAP research refines these earlier efforts, showing how such attacks can be implemented in a completely automated, more reliable way.
In October, researchers at the University of Pennsylvania released details of Prompt Automatic Iterative Refinement (PAIR), an algorithm they developed that uses one LLM to jailbreak another.
The researchers described that as a 10,000-fold improvement over previous jailbreak techniques.
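A PAIR-style loop, as described in the paragraph above, can be sketched as a single chain of refinements, in contrast to TAP's pruned tree. Again, the function names and toy model behaviors below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a PAIR-style loop: one attacker LLM repeatedly
# rewords a single prompt, using the target's last response as feedback.

def pair_attack(goal, attacker, target, judge, max_iters=20):
    prompt = goal
    for _ in range(max_iters):
        response = target(prompt)
        if judge(response, goal):              # jailbreak succeeded
            return prompt, response
        prompt = attacker(prompt, response)    # refine using target feedback
    return None, None

# Toy mocks: the "target" complies after the prompt is reworded twice.
def mock_attacker(prompt, response):
    return prompt + " [reworded]"

def mock_target(prompt):
    return "COMPLIED" if prompt.count("[reworded]") >= 2 else "Refused."

def mock_judge(response, goal):
    return "COMPLIED" in response

p, r = pair_attack("harmful request X", mock_attacker, mock_target, mock_judge)
```

The key structural difference from TAP is that PAIR maintains one refinement chain at a time, whereas TAP branches each prompt into multiple candidates and prunes the weakest before querying the target again.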
Such research is important because many organizations are rushing to integrate LLM technologies into their applications and operations without much thought to the potential security and privacy implications.
As the TAP researchers noted in their report, many of the LLMs depend on guardrails that model developers implement to protect against unintended behavior.
This Cyber News was published on www.darkreading.com. Publication date: Thu, 07 Dec 2023 20:55:08 +0000