Anthropic Launches Petri Tool To Automate AI Safety Audits

0
71
Anthropic Keeps Bun Open Source After Acquisition To Build An Open Foundation For AI Agents And Developer Tools
Anthropic Keeps Bun Open Source After Acquisition To Build An Open Foundation For AI Agents And Developer Tools

Anthropic’s new open source tool, Petri, uses autonomous agents to test and flag risky behaviours in leading AI models, aiming to make AI safety research more collaborative and standardised.

Anthropic PBC has unveiled Petri (Parallel Exploration Tool for Risky Interactions), an open-source AI safety auditing tool designed to automatically test large language models (LLMs) for risky behaviours. Using autonomous agents, Petri identifies tendencies such as deception, whistleblowing, cooperation with misuse, and facilitating terrorism.

The company has already audited 14 leading models, including its own Claude Sonnet 4.5, OpenAI GPT-5, Google Gemini 2.5 Pro, and xAI Corp. Grok-4, finding problematic behaviours in all of them. Models were tested across 111 risky tasks in four safety categories: deception, power-seeking, sycophancy, and refusal failure. Claude Sonnet 4.5 performed best overall, although misalignment issues were detected in every model.

Petri launches auditor agents that interact with models in varied ways, while a judge model ranks outputs across honesty and refusal metrics, flagging risky responses for human review. Developers can use included prompts, evaluation code, and guidance to extend Petri’s capabilities, significantly reducing manual testing effort.

On whistleblowing behaviour, Anthropic researchers noted: “While running Petri across our diverse set of seed instructions, we observed multiple instances of models attempting to whistleblow — autonomously disclosing information about perceived organisational wrongdoing … While this in principle could play an important role in preventing certain large-scale harms, it is not generally appropriate behavior for current AI systems: There are serious privacy considerations to contend with, and the potential for leaks stemming from confused attempts at whistleblowing is substantial.”

While Petri has limitations, judge models may inherit biases, and some agents may inadvertently alert models being tested. Anthropic hopes that open sourcing the tool will make alignment research more transparent, collaborative, and standardised. By shifting AI safety testing from static benchmarks to automated, continuous audits, Petri enables the community to collectively monitor and improve LLM behaviour.

 

LEAVE A REPLY

Please enter your comment!
Please enter your name here