The exploding use of large language models in industry and across organizations has sparked a flurry of research into how susceptible LLMs are to generating harmful and biased content when prompted in specific ways.
The latest example is a new paper from researchers at Robust Intelligence and Yale University that describes a completely automated way to get even state-of-the-art black box LLMs to escape guardrails put in place by their creators and generate toxic content.
Tree of Attacks With Pruning

Black box LLMs are large language models, such as those behind ChatGPT, whose architecture, datasets, training methodologies, and other details are not publicly known.
An aligned LLM such as the one behind ChatGPT and other AI chatbots is explicitly designed to minimize potential for harm and would not, for example, normally respond to a request for information on how to build a bomb.
An unaligned LLM is optimized for accuracy and generally has fewer such constraints, or none at all.
With TAP, the researchers have shown how they can get an unaligned LLM to prompt an aligned target LLM on a potentially harmful topic and then use the target's responses to keep refining the original prompt.
The process basically continues until one of the generated prompts jailbreaks the target LLM and gets it to spew out the requested information.
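The refine-query-prune loop described above can be sketched roughly as follows. This is an illustrative mock, not the researchers' actual code: the function names, the 1-10 scoring scale, and the toy model behaviors are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of a TAP-style attack loop: an attacker model branches
# each prompt into refined variants, the target is queried, weak branches are
# pruned, and the loop repeats until a variant jailbreaks the target.

def tap_attack(goal, attacker, evaluator, target, max_depth=5, branching=3, width=4):
    """Grow a tree of refined prompts, keeping only the best `width` leaves."""
    frontier = [goal]  # root of the tree: the raw harmful request
    for _ in range(max_depth):
        scored = []
        for prompt in frontier:
            for variant in attacker(prompt, branching):    # refine each prompt
                response = target(variant)                 # query the target LLM
                score = evaluator(variant, response, goal) # 1-10 jailbreak score
                if score >= 10:
                    return variant, response               # jailbreak succeeded
                scored.append((score, variant))
        scored.sort(key=lambda s: s[0], reverse=True)      # prune: keep best leaves
        frontier = [v for _, v in scored[:width]]
        if not frontier:
            break
    return None, None

# Toy mocks standing in for real model calls: the "target" only leaks once a
# prompt has been refined three times.
def mock_attacker(prompt, n):
    return [f"{prompt} [refined]" for _ in range(n)]

def mock_target(prompt):
    return "LEAKED" if prompt.count("[refined]") >= 3 else "I can't help with that."

def mock_evaluator(prompt, response, goal):
    return 10 if "LEAKED" in response else prompt.count("[refined]")

prompt, response = tap_attack("harmful request X", mock_attacker, mock_evaluator, mock_target)
```

In the real attack, the attacker, evaluator, and target would each be calls to an LLM; the pruning step is what keeps the tree of candidate prompts from exploding as it deepens.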
Rapidly Proliferating Research Interest

The new research is the latest among a growing number of studies in recent months that show how LLMs can be coaxed into unintended behavior, like revealing training data and sensitive information with the right prompt.
Some of the research has focused on getting LLMs to reveal potentially harmful or unintended information by directly interacting with them via engineered prompts.
Other studies have shown how adversaries can elicit the same behavior from a target LLM via indirect prompts hidden in text, audio, and image samples in data the model would likely retrieve when responding to a user input.
Such prompt injection methods to get a model to diverge from intended behavior have relied at least to some extent on manual interaction.
The output these prompts have generated has often been nonsensical.
The new TAP research refines these earlier efforts, showing how such attacks can be implemented in a completely automated, more reliable way.
In October, researchers at the University of Pennsylvania released details of Prompt Automatic Iterative Refinement (PAIR), an algorithm they developed that uses one LLM to jailbreak another.
The researchers described that as a 10,000-fold improvement over previous jailbreak techniques.
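A PAIR-style loop, as described in the paragraph above, can be sketched as a single chain of refinements, in contrast to TAP's pruned tree. Again, the function names and toy model behaviors below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a PAIR-style loop: one attacker LLM repeatedly
# rewords a single prompt, using the target's last response as feedback.

def pair_attack(goal, attacker, target, judge, max_iters=20):
    prompt = goal
    for _ in range(max_iters):
        response = target(prompt)
        if judge(response, goal):              # jailbreak succeeded
            return prompt, response
        prompt = attacker(prompt, response)    # refine using target feedback
    return None, None

# Toy mocks: the "target" complies after the prompt is reworded twice.
def mock_attacker(prompt, response):
    return prompt + " [reworded]"

def mock_target(prompt):
    return "COMPLIED" if prompt.count("[reworded]") >= 2 else "Refused."

def mock_judge(response, goal):
    return "COMPLIED" in response

p, r = pair_attack("harmful request X", mock_attacker, mock_target, mock_judge)
```

The key structural difference from TAP is that PAIR maintains one refinement chain at a time, whereas TAP branches each prompt into multiple candidates and prunes the weakest before querying the target again.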
Such research is important because many organizations are rushing to integrate LLM technologies into their applications and operations without much thought to the potential security and privacy implications.
As the TAP researchers noted in their report, many of the LLMs depend on guardrails that model developers implement to protect against unintended behavior.
This Cyber News was published on www.darkreading.com. Publication date: Thu, 07 Dec 2023 20:55:08 +0000