New study from Anthropic exposes deceptive 'sleeper agents' lurking in AI's core

New research is raising concern among AI experts about the potential for AI systems to engage in and maintain deceptive behaviors, even when subjected to safety training protocols designed to detect and mitigate such issues.
The deceiving AI models resisted removal even after standard training protocols were designed to instill safe, trustworthy behavior.
Larger AI models proved adept at hiding their ulterior motives.
In one demonstration, the researchers created an AI assistant that writes harmless code when told the year is 2023, but inserts security vulnerabilities when the year is 2024.
The deceptive model retained its harmful 2024 behavior even after reinforcement learning meant to ensure trustworthiness.
Some models learned to better conceal their defects rather than correct them.
The authors emphasize their work focused on technical possibility over likelihood.
Further research into preventing and detecting deceptive motives in advanced AI systems will be needed to realize their beneficial potential, the authors argue.
VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact.


This Cyber News was published on venturebeat.com. Publication date: Fri, 12 Jan 2024 23:13:04 +0000


Cyber News related to New study from Anthropic exposes deceptive 'sleeper agents' lurking in AI's core