At Google, we maintain a Vulnerability Reward Program to honor cutting-edge external contributions addressing issues in Google-owned and Alphabet-subsidiary Web properties.
To keep up with rapid advances in AI technologies and ensure we're prepared to address the security challenges in a responsible way, we recently expanded our existing Bug Hunters program to foster third-party discovery and reporting of issues and vulnerabilities specific to our AI systems.
In our recent AI red team report, which is based on Google's AI Red Team exercises, we identified common tactics, techniques, and procedures that we consider most relevant and realistic for real-world adversaries to use against AI systems.
The following table incorporates what we learned to help the research community understand our criteria for AI bug reports and what's in scope for our reward program.
It's important to note that reward amounts are dependent on severity of the attack scenario and the type of target affected.
Category Attack scenario Guidance Prompt Attacks: Crafting adversarial prompts that allow an adversary to influence the behavior of the model and the output, in ways that were not intended by the application.
Prompt or preamble extraction in which a user is able to extract the initial prompt used to prime the model only when sensitive information is present in the extracted preamble.
Google's generative AI products already have a dedicated reporting channel for these types of content issues.
Training Data Extraction: Attacks that are able to successfully reconstruct verbatim training examples that contain sensitive information.
Manipulating Models: An attacker able to covertly change the behavior of a model such that they can trigger pre-defined adversarial behaviors.
Adversarial output or behavior that an attacker can reliably trigger via specific input in a model owned and operated by Google.
Only in scope when a model's output is used to change the state of a victim's account or data.
Attacks in which an attacker manipulates the training data of the model to influence the model's output in a victim's session according to the attacker's preference.
Adversarial Perturbation: Inputs that are provided to a model that results in a deterministic, but highly unexpected output from the model.
Contexts in which a model's incorrect output or classification does not pose a compelling attack scenario or feasible path to Google or user harm.
Model Theft/Exfiltration: AI models often include sensitive intellectual property, so we place a high priority on protecting these assets.
Exfiltration attacks allow attackers to steal details about a model such as its architecture or weights.
Attacks in which the exact architecture or weights of a confidential/proprietary model are extracted.
Attacks in which the architecture and weights are not extracted precisely, or when they're extracted from a non-confidential model.
As consistent with our program, issues that we already know about are not eligible for reward.
This Cyber News was published on www.darkreading.com. Publication date: Fri, 15 Dec 2023 23:15:05 +0000