A recent analysis uncovered 11,908 live DeepSeek API keys, passwords, and authentication tokens embedded in publicly scraped web data. According to cybersecurity firm Truffle Security, the study highlights how AI models trained on unfiltered internet snapshots risk internalizing, and potentially reproducing, insecure coding patterns.

Truffle Security scanned 400 terabytes of Common Crawl's December 2024 dataset, comprising 2.67 billion web pages from 47.5 million hosts. The dataset, stored in 90,000 WARC files, preserves raw HTML, JavaScript, and server responses from crawled sites. To process the archive, the firm deployed a 20-node AWS cluster, splitting files using awk and scanning each segment with TruffleHog's verification engine. Crucially, the tool differentiated live secrets (those that still authenticated against their services) from inert strings, a step LLMs themselves cannot perform during training.

Notably, the findings included high-risk exposures such as AWS root keys in front-end HTML and 17 unique Slack webhooks hardcoded into a single webpage's chat feature. Truffle Security warns that developers who reuse API keys across client projects face heightened risks. Despite the scale of the exposure, the team prioritized ethical disclosure, collaborating with vendors like Mailchimp to revoke thousands of keys rather than sending spam-like outreach to individual website owners.

The study underscores a growing dilemma: LLMs trained on publicly accessible data inherit its security flaws. While models like DeepSeek employ additional safeguards (fine-tuning, alignment techniques, and prompt constraints), the prevalence of hardcoded secrets in training corpora risks normalizing unsafe practices. Even non-functional credentials, such as placeholder tokens, contribute to the problem, because LLMs cannot contextually evaluate a credential's validity during code generation. The findings follow earlier revelations that LLMs frequently suggest hardcoding credentials in codebases, raising questions about the role of training data in reinforcing these behaviors.

The report recommends:

- Expanding secret-scanning programs to include archived web data, as historical leaks resurface in training datasets.
- Integrating security guardrails into AI coding tools via platforms like GitHub Copilot's Custom Instructions, which can enforce policies against hardcoding secrets.
- Adopting Constitutional AI techniques to align models with security best practices, reducing inadvertent exposure of sensitive patterns.

With LLMs increasingly shaping software development, securing their training data is no longer optional: it is foundational to building a safer digital future.

Gurubaran is a co-founder of Cyber Security News and GBHackers On Security. He has 10+ years of experience as a Security Consultant, Editor, and Analyst in cybersecurity, technology, and communications.
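One concrete guardrail of the kind the report describes: GitHub Copilot supports repository-level Custom Instructions that steer generated code. A sketch of such a file (the path `.github/copilot-instructions.md` reflects GitHub's convention for repository instructions; the policy wording is illustrative):

```markdown
<!-- .github/copilot-instructions.md (illustrative policy sketch) -->
- Never hardcode API keys, passwords, tokens, or webhook URLs in source code.
- Read all credentials from environment variables or a secrets manager.
- When generating example code, use obvious placeholders (e.g., YOUR_API_KEY)
  rather than realistic-looking key strings.
```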
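The split-and-scan approach described above can be sketched in Python. This is a minimal illustration of the idea only, not TruffleHog's actual engine: real scanners ship hundreds of service-specific detectors, and the two regexes and segment size here are simplified placeholders.

```python
import re

# Toy detector patterns. AWS access key IDs do follow the AKIA-prefixed
# format; the Slack webhook pattern is a simplified approximation.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}

def split_segments(text, size=4096):
    """Split a large document into fixed-size segments, mimicking the
    divide-and-scan approach used on the WARC archive. A production
    scanner would overlap segment boundaries so a secret cannot be
    split in half and missed."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def scan_segment(segment):
    """Return (kind, match) pairs for every candidate secret in a segment."""
    hits = []
    for kind, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(segment):
            hits.append((kind, match))
    return hits

def scan_document(text):
    """Scan an entire document segment by segment."""
    hits = []
    for segment in split_segments(text):
        hits.extend(scan_segment(segment))
    return hits
```

Pattern matching alone only finds candidates; as the article notes, the decisive step is verifying which candidates still authenticate.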
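Differentiating live secrets from inert strings means attempting an authenticated call and inspecting the result. A hedged sketch of that classification step, with an injected verifier standing in for the real network request (for AWS keys a verifier might call STS GetCallerIdentity; for Slack webhooks, a test POST):

```python
def classify(kind, value, verifier):
    """Return 'live' if `verifier` confirms the credential authenticates
    against its issuing service, 'inert' otherwise. The verifier callable
    is injected so the logic can be tested without touching real services."""
    try:
        return "live" if verifier(kind, value) else "inert"
    except Exception:
        # Treat network failures as unverified rather than assuming live.
        return "inert"

def classify_all(candidates, verifier):
    """Map each (kind, value) candidate to its classification."""
    return {value: classify(kind, value, verifier) for kind, value in candidates}
```

Injecting the verifier also makes it easy to rate-limit or disable live checks, which matters when scanning at the scale of 400 TB.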
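The remediation implied throughout the article (never embed keys in shipped HTML or committed source) usually means loading credentials from the environment at runtime. A minimal sketch; the variable name DEEPSEEK_API_KEY is illustrative, not a documented requirement:

```python
import os

def load_api_key(var_name="DEEPSEEK_API_KEY"):
    """Read an API key from the environment instead of hardcoding it.
    Failing fast when the variable is absent prevents silently shipping
    a placeholder or, worse, a real key committed to source control."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; refusing to start")
    return key
```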
This Cyber News was published on cybersecuritynews.com. Publication date: Fri, 28 Feb 2025 03:40:22 +0000