Can getting ChatGPT to repeat the same word over and over again cause it to regurgitate large amounts of its training data, including personally identifiable information and other data scraped from the Web? The answer is an emphatic yes, according to a team of researchers at Google DeepMind, Cornell University, and four other universities who tested the hugely popular generative AI chatbot's susceptibility to leaking data when prompted in a specific way. 'Poem' as a Trigger Word In a report this week, the researchers described how they got ChatGPT to spew out memorized portions of its training data merely by prompting it to repeat words like "Poem," "Company," "Send," "Make," and "Part" forever. After a few hundred times, ChatGPT began generating "Often nonsensical" output, a small fraction of which included memorized training data such as an individual's email signature and personal contact information. The researchers discovered that some words were better at getting the generative AI model to spill memorized data than others. Prompting the chatbot to repeat the word "Company" caused it to emit training data 164 times more often than other words, such as "Know." Data that the researchers were able to extract from ChatGPT in this manner included personally identifiable information on dozens of individuals; explicit content; verbatim paragraphs from books and poems; and URLs, unique user identifiers, bitcoin addresses, and programming code. A Potentially Big Privacy Issue? "Using only $200 USD worth of queries to ChatGPT, we are able to extract over 10,000 unique verbatim memorized training examples," the researchers wrote in their paper titled "Scalable Extraction of Training Data from Language Models." "Our extrapolation to larger budgets suggests that dedicated adversaries could extract far more data," they wrote. The researchers estimated an adversary could extract 10 times more data with more queries. The tendency for such memorization increases with the size of the training data. Researchers have shown how such memorized data is often discoverable in a model's output. Other researchers have shown how adversaries can use so-called divergence attacks to extract training data from an LLM. A divergence attack is one in which an adversary uses intentionally crafted prompts or inputs to get an LLM to generate outputs that diverge significantly from what it would typically produce. In many of these studies, researchers have used open source models - where the training datasets and algorithms are known - to test the susceptibility of LLM to data memorization and leaks. The studies have also typically involved base AI models that have not been aligned to operate in a manner like an AI chatbot such as ChatGPT. A Divergence Attack on ChatGPT The latest study is an attempt to show how a divergence attack can work on a sophisticated closed, generative AI chatbot whose training data and algorithms remain mostly unknown. The study involved the researchers developing a way to get ChatGPT "To 'escape' out of its alignment training" and getting it to "Behave like a base language model, outputting text in a typical Internet-text style." The prompting strategy they discovered caused precisely such an outcome, resulting in the model spewing out memorized data. To verify that the data the model was generating was indeed training data, the researchers first built an auxiliary dataset containing some 9 terabytes of data from four of the largest LLM pre-training datasets - The Pile, RefinedWeb, RedPajama, and Dolma. They then compared the output data from ChatGPT against the auxiliary dataset and found numerous matches. The researchers figured they were likely underestimating the extent of data memorization in ChatGPT because they were comparing the outputs of their prompting only against the 9-terabyte auxiliary dataset. "Our paper suggests that training data can easily be extracted from the best language models of the past few years through simple techniques."
This Cyber News was published on www.darkreading.com. Publication date: Fri, 01 Dec 2023 14:00:24 +0000