The LLMs are evolving rapidly with continuous advancements in their research and applications. Recently, cybersecurity researchers at Google discovered how threat actors can exploit ChatGPT queries to collect personal data. StorageGuard scans, detects, and fixes security misconfigurations and vulnerabilities across hundreds of storage and backup devices. Cybersecurity analysts developed a scalable method that detects memorization in trillions of tokens, analyzing open-source and semi-open models. Researchers identified that the larger and more capable models are vulnerable to data extraction attacks. GPT-3.5-turbo shows minimal memorization due to alignment as a helpful chat assistant. Using a new prompting strategy, the model diverges from chatbot-style responses, resembling a base language model. Researchers test its output against a nine-terabyte web-scale dataset, recovering over ten thousand training examples at a $200 query cost, with the potential for extracting 10× more data. Security analysts assess past extraction attacks in a controlled setting, focusing on open-source models with publicly available training data. 's method, they downloaded 108 bytes from Wikipedia, generating prompts by sampling continuous 5-token blocks. Unlike prior methods, they directly query the model's open-source training data to evaluate attack efficacy, eliminating the need for manual internet searches. Researchers tested their attack on 9 open-source models tailored for scientific research, providing access to their complete training, pipeline, and dataset for study. Semi-closed models have downloadable parameters but undisclosed training datasets and algorithms. Despite generating outputs similarly, establishing 'ground truth' for extractable memorization requires experts due to inaccessible training datasets. While extracting the data from ChatGPT, researchers found two major challenges, and here below, we have mentioned those challenges:-. Researchers extract training data from ChatGPT through a divergent attack, but it lacks generalizability to other models. Despite limitations in testing for memorization, they use known samples from the extracted training set to measure discoverable memorization. For the 1,000 longest memorized examples, they prompt ChatGPT with the first N−50 tokens and generate a 50-token completion to assess discoverable memorization. ChatGPT is highly susceptible to data extraction attacks due to over-training for extreme-scale, high-speed inference. The trend of over-training on vast amounts of data poses a trade-off between privacy and inference efficiency. Speculation arises about ChatGPT's multiple-epoch training, potentially amplifying memorization and allowing easy extraction of training data. Experience how StorageGuard eliminates the security blind spots in your storage systems by trying a 14-day free trial.
This Cyber News was published on cybersecuritynews.com. Publication date: Thu, 30 Nov 2023 21:55:08 +0000