Making third-party auditing of frontier AI systems effective and universal.

Vision

AVERI aims to bring about a world in which the most capable AI systems—whether built by OpenAI or DeepSeek, the DoD or the PLA—are continuously, independently, and credibly evaluated against common safety and security standards. If we are successful, independent assessment of AI systems will have gone from optional, time-limited, narrow studies to an expected, always‑on verification layer, assuring all parties that severe risks are being addressed. We consider third-party auditing a complement to, not a substitute for, transparency: if we are successful, AI developers will share public, well-reasoned safety cases for their most capable systems, and auditors will verify the parts of those cases that involve the most sensitive IP.

Problem

As frontier AI systems become more capable, the safety and security safeguards applied to them are increasingly critical. These safeguards prevent AI systems from being misused to carry out terrorist attacks, protect the sensitive data used by AI agents, and ensure AI systems understand and follow human intent. Companies generally “check their own homework” on these safeguards, which is concerning because even well-intentioned teams are susceptible to blind spots, groupthink, and corner-cutting under competitive pressure.

Many in industry, government, and academia agree that self-evaluation is not enough: rigorous, truly independent assessment is needed to surface risks that internal teams might otherwise miss, validate the claims that companies make to regulators and the public, help distill best practices from across the industry, and prevent a race to the bottom by applying a shared standard across organizations. But we are far from achieving that in practice.

Today’s external assessments of AI systems don’t involve the rigor and access that the term “audit” implies. They are typically done in a black-box fashion, ignore many key aspects of AI development such as platform-level safeguards and internal deployment, and provide only a “snapshot in time” rather than continuous verification of safety and security.

Our Approach

Gold Standard Articulation

Articulate what AI assessment should look like in the long term (the “gold standard”) and why achieving that standard matters.

Helps inform the actions and investments of a range of stakeholders, including policymakers, companies, and researchers.

Audit Research & Engineering

Conduct (or partner with other organizations to conduct) pilot assessments of frontier AI systems, pushing the envelope in terms of access, rigor, and scope. Produce open-source tools and training materials to make audits cheaper, faster, and more standardized.

Raises the bar for what counts as a good external assessment (improving quality on the supply side of auditing) and lowers the cost of achieving a given level of quality.

Policy & Advocacy

Advocate for public and private policies that incentivize higher tiers of auditing through mechanisms such as regulation, procurement criteria, investor due diligence, and insurance premiums.

Builds the demand side, making rigorous audits more economically appealing to all parties and ultimately paving the way to universality.

  • AVERI works closely with other organizations in the AI assessment space, but has different priorities. Our emphasis on the process side of auditing – defining what audits should cover and how they should be performed, and enabling others to provide them at scale – is complementary to organizations that focus on specific risk domains like biosecurity or misalignment. We invest heavily in building capacity and demand for the AI audit ecosystem as a whole and are especially committed to distilling and applying lessons from safety-critical systems engineering, financial auditing, and other fields beyond AI.

  • We place an especially high premium on providing credible policy input and articulating a gold standard, and on avoiding any incentive to maximize assessment throughput.

  • Research vs. auditing: we will conduct assessments only to the extent that they teach the world something new, not for the sake of being the provider.