James Odene 1/15/26

Exclusive: Former OpenAI policy chief creates nonprofit institute, calls for independent safety audits of frontier AI models

By Jeremy Kahn

Featured on FORTUNE.COM | January 15, 2026, 12:01 PM ET

Former OpenAI policy chief Miles Brundage has just founded AVERI, a new nonprofit institute that advocates for independent AI safety auditing of the top AI labs.

Miles Brundage, a well-known former policy researcher at OpenAI, is launching an institute dedicated to a simple idea: AI companies shouldn’t be allowed to grade their own homework.

Today Brundage formally announced the AI Verification and Evaluation Research Institute (AVERI), a new nonprofit aimed at pushing the idea that frontier AI models should be subject to external auditing. AVERI is also working to establish AI auditing standards.

The launch coincides with the publication of a research paper, coauthored by Brundage and more than 30 AI safety researchers and governance experts, that lays out a detailed framework for how independent audits of the companies building the world’s most powerful AI systems could work.

Brundage spent seven years at OpenAI, as a policy researcher and an advisor on how the company should prepare for the advent of human-like artificial general intelligence. He left the company in October 2024.

“One of the things I learned while working at OpenAI is that companies are figuring out the norms of this kind of thing on their own,” Brundage told Fortune. “There’s no one forcing them to work with third-party experts to make sure that things are safe and secure. They kind of write their own rules.”

That creates risks. Although the leading AI labs conduct safety and security testing and publish technical reports on the results of many of these evaluations, some of which they conduct with the help of external “red team” organizations, right now consumers, businesses, and governments simply have to trust what the AI labs say about these tests. No one is forcing them to conduct these evaluations or report them according to any particular set of standards.

Brundage said that in other industries, auditing is used to provide the public—including consumers, business partners, and to some degree regulators—assurance that products are safe and have been tested in a rigorous way.

“If you go out and buy a vacuum cleaner, you know, there will be components in it, like batteries, that have been tested by independent laboratories according to rigorous safety standards to make sure it isn’t going to catch on fire,” he said.

New institute will push for policies and standards

Brundage said that AVERI was interested in policies that would encourage the AI labs to move to a system of rigorous external auditing, as well as researching what the standards should be for those audits, but was not interested in conducting audits itself.

“We’re a think tank. We’re trying to understand and shape this transition,” he said. “We’re not trying to get all the Fortune 500 companies as customers.”

He said existing public accounting, auditing, assurance, and testing firms could move into the business of auditing AI safety, or that startups would be established to take on this role.

AVERI said it has raised $7.5 million toward a goal of $13 million to cover 14 staff and two years of operations. Its funders so far include Halcyon Futures, Fathom, Coefficient Giving, former Y Combinator president Geoff Ralston, Craig Falls, Good Forever Foundation, Sympatico Ventures, and the AI Underwriting Company.

The organization says it has also received donations from current and former non-executive employees of frontier AI companies. “These are people who know where the bodies are buried” and “would love to see more accountability,” Brundage said.

Insurance companies or investors could force AI safety audits

Brundage said that there could be several mechanisms that would encourage AI firms to begin to hire independent auditors. One is that big businesses that are buying AI models may demand audits in order to have some assurance that the AI models they are buying will function as promised and don’t pose hidden risks.

Insurance companies may also push for the establishment of AI auditing. For instance, insurers offering business continuity insurance to large companies that use AI models for key business processes could require auditing as a condition of underwriting. The insurance industry may also require audits in order to write policies for the leading AI companies, such as OpenAI, Anthropic, and Google.

“Insurance is certainly moving quickly,” Brundage said. “We have a lot of conversations with insurers.” He noted that one specialized AI insurance company, the AI Underwriting Company, has provided a donation to AVERI because “they see the value of auditing in kind of checking compliance with the standards that they’re writing.”

Investors may also demand AI safety audits to be sure they aren’t taking on unknown risks, Brundage said. Given the multi-million and multi-billion dollar checks that investment firms are now writing to fund AI companies, it would make sense for these investors to demand independent auditing of the safety and security of the products these fast-growing startups are building. If any of the leading labs go public—as OpenAI and Anthropic have reportedly been preparing to do in the coming year or two—a failure to employ auditors to assess the risks of AI models could open these companies up to shareholder lawsuits or SEC prosecutions if something were to later go wrong that contributed to a significant fall in their share prices.

Brundage also said that regulation or international agreements could force AI labs to employ independent auditors. The U.S. currently has no federal regulation of AI and it is unclear whether any will be created. President Donald Trump has signed an executive order meant to crack down on U.S. states that pass their own AI regulations. The administration has said this is because it believes a single, federal standard would be easier for businesses to navigate than multiple state laws. But, while moving to punish states for enacting AI regulation, the administration has not yet proposed a national standard of its own.

In other geographies, however, the groundwork for auditing may already be taking shape. The EU AI Act, which recently came into force, does not explicitly call for audits of AI companies’ evaluation procedures. But its “Code of Practice for General Purpose AI,” which is a kind of blueprint for how frontier AI labs can comply with the Act, does say that labs building models that could pose “systemic risks” need to provide external evaluators with complimentary access to test the models. The text of the Act itself also says that when organizations deploy AI in “high-risk” use cases, such as underwriting loans, determining eligibility for social benefits, or determining medical care, the AI system must undergo an external “conformity assessment” before being placed on the market. Some have interpreted these sections of the Act and the Code as implying a need for what are essentially independent auditors.

Establishing ‘assurance levels,’ finding enough qualified auditors

The research paper published alongside AVERI’s launch outlines a comprehensive vision for what frontier AI auditing should look like. It proposes a framework of “AI Assurance Levels” ranging from Level 1—which involves some third-party testing but limited access and is similar to the kinds of external evaluations that the AI labs currently employ companies to conduct—all the way to Level 4, which would provide “treaty grade” assurance sufficient for international agreements on AI safety.

Building a cadre of qualified AI auditors presents its own difficulties. AI auditing requires a mix of technical expertise and governance knowledge that few possess—and those who do are often lured by lucrative offers from the very companies that would be audited.

Brundage acknowledged the challenge but said it’s surmountable. He talked of mixing people with different backgrounds to build “dream teams” that in combination have the right skill sets. “You might have some people from an existing audit firm, plus some people from a penetration testing firm from cybersecurity, plus some people from one of the AI safety nonprofits, plus maybe an academic,” he said.

In other industries, from nuclear power to food safety, it has often been catastrophes, or at least close calls, that provided the impetus for standards and independent evaluations. Brundage said his hope is that with AI, auditing infrastructure and norms could be established before a crisis occurs.

“The goal, from my perspective, is to get to a level of scrutiny that is proportional to the actual impacts and risks of the technology, as smoothly as possible, as quickly as possible, without overstepping,” he said.


Miles Brundage 1/15/26

Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies

A comprehensive framework for independent evaluation of frontier AI systems, mapping access requirements to systemic risks.

Miles Brundage¹*, Noemi Dreksler², Aidan Homewood², Sean McGregor¹, Patricia Paskov³, Conrad Stosz⁴, Girish Sastry⁵, A. Feder Cooper¹, George Balston¹, Steven Adler⁶, Stephen Casper⁷, Markus Anderljung², Grace Werner¹, Sören Mindermann⁵, Vasilios Mavroudis⁸, Ben Bucknall⁹, Charlotte Stix¹⁰, Jonas Freund², Lorenzo Pacchiardi¹¹, José Hernández-Orallo¹¹, Matteo Pistillo¹⁰, Michael Chen¹², Chris Painter¹², Dean W. Ball¹³, Cullen O’Keefe¹⁴, Gabriel Weil¹⁵, Ben Harack³, Graeme Finley⁵, Ryan Hassan¹⁶, Scott Emmons⁵, Charles Foster¹², Anka Reuel¹⁷, Bri Treece¹⁸, Yoshua Bengio¹⁹, Daniel Reti²⁰, Rishi Bommasani¹⁷, Cristian Trout²¹, Ali Shahin Shamsabadi²², Rajiv Dattani²¹, Adrian Weller¹¹, Robert Trager³, Jaime Sevilla²³, Lauren Wagner²⁴, Lisa Soder²⁵, Ketan Ramakrishnan²⁶, Henry Papadatos²⁷, Malcolm Murray²⁷, Ryan Tovcimak²⁸

¹AVERI ²GovAI ³Oxford Martin AI Governance Initiative ⁴Transluce ⁵Independent ⁶Clear-Eyed AI ⁷MIT CSAIL ⁸Alan Turing Institute ⁹University of Oxford ¹⁰Apollo Research ¹¹University of Cambridge ¹²METR ¹³Foundation for American Innovation ¹⁴Institute for Law and AI ¹⁵Touro University Law Center ¹⁶New Science ¹⁷Stanford University ¹⁸Fathom
¹⁹Mila, Université de Montréal ²⁰Exona Lab ²¹AI Underwriting Company ²²Brave Software ²³Epoch AI ²⁴Abundance Institute ²⁵interface ²⁶Yale University ²⁷SaferAI ²⁸UL Solutions

January 2026

Listed authors contributed significant writing, research, and/or review for one or more sections. The sections cover a wide range of empirical and normative topics, so with the exception of the corresponding author (Miles Brundage, miles.brundage@averi.org), inclusion as an author does not entail endorsement of all claims in the paper, nor does authorship imply an endorsement on the part of any individual’s organization.

Executive Summary

Key paper takeaways

Despite their rapidly growing importance, AI systems are subject to less rigorous third-party scrutiny than many of the other social and technological systems that we rely on daily such as consumer products, corporate financial statements, and food supply chains. This gap is becoming increasingly untenable as AI becomes more capable and widely deployed, and it inhibits confident deployment of AI in high-stakes contexts.
Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential and require expert interpretation, and third parties are right to be skeptical of companies “checking their own homework” given the track record of that approach in other industries.
We outline a vision for frontier AI auditing, which we define as rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.
Frontier AI audits should not be limited to a company’s publicly deployed products, but should instead consider the full range of organization-level safety and security risks, including internal deployment of AI systems, information security practices, and safety decision-making processes.
We describe four AI Assurance Levels (AALs), the higher levels of which provide greater confidence in audit findings. We recommend AAL-1 as a baseline for frontier AI generally, and AAL-2 as a near-term goal for the most advanced subset of frontier AI developers.
Achieving the vision we outline will require (1) ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the industry; (2) growing the ecosystem of audit providers at a rapid pace without compromising quality; (3) accelerating adoption of frontier AI auditing by clarifying and strengthening incentives; and (4) achieving technical readiness for high AI Assurance Levels so they can be applied when needed.

Frontier AI auditing motivations

Artificial intelligence (AI) is rapidly becoming critical societal infrastructure. Every day, AI systems inform decisions that affect billions of people. Increasingly, they also make consequential decisions autonomously. Although these technologies hold incredible promise, the pace of development and deployment has outpaced the creation of institutions that ensure AI works safely and as advertised.

This institutional gap is especially important for the most capable (“frontier”) systems — general-purpose AI models and systems whose performance is no more than a year behind the state-of-the-art — which many experts expect to exceed human performance across most tasks within the coming years. Already, developers of frontier AI systems need to prevent harmful system failures (e.g., outputting false medical information or buggy code), weaponization by malicious parties (e.g., to carry out cyberattacks), and theft of or tampering with sensitive data. The magnitude of risks that need to be managed is growing rapidly.

AI users, policymakers, investors, and insurers need reliable ways to verify that promised technical safeguards exist and to detect when they do not. This is challenging because the technology is complex, fast-moving, and often proprietary. Public transparency alone cannot solve this problem since many key details are — and often should remain — confidential, and require expert judgment to interpret. Many industries outside of AI already address similar challenges through independent auditors who review sensitive, non-public information and publish trustworthy conclusions that outsiders can rely on. We argue that similar practices are needed in the AI industry: broad, sustainable adoption of AI over time requires a solid foundation of trust built on credible scrutiny by independent experts.

Toward this end, we propose institutions designed to give stakeholders — including those who are uncertain about or even strongly skeptical of frontier AI companies — justified confidence that this critical technology is being developed safely and securely. Specifically, we describe and advocate for frontier AI auditing: rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.

An ecosystem of private sector frontier AI auditors (both for-profit and non-profit) would enable widespread confidence that frontier AI systems can be adopted broadly and would avoid reliance on companies “grading their own homework,” an approach with a checkered track record in many industries. It would also avoid relying entirely on governments to have the technical expertise, capacity, and agility to ensure high standards for frontier AI safety and security. If well-executed and scaled, frontier AI auditing would improve safety and security outcomes for users of AI systems and other affected parties, create a system to learn and update standards based on real-world outcomes, and enable more confident investment in and deployment of frontier AI, especially in high-stakes sectors of the economy.

Summary of the proposal

Drawing on our analysis of current practices in AI and lessons from other industries with more mature assurance regimes, we recommend eight interlinked design principles for a long-term vision for frontier AI auditing. This vision is deliberately ambitious to match the rising stakes as frontier AI capabilities advance:

Scope of risks: Comprehensive coverage of four key risk categories. Frontier AI auditing should focus on four risk categories: risks from (1) intentional misuse of frontier AI systems (e.g., for cyberattacks); (2) unintended frontier AI system behavior (e.g., errors harming the user, their property, or third parties due to pursuing the wrong goal or having an unreliable performance profile); (3) information security (e.g., theft of an AI model or user data); and (4) emergent social phenomena (e.g., addiction to AI or facilitation of self-harm). For each category of risks, auditors should (a) verify company claims and (b) evaluate the company’s systems and practices against its stated safety and security policies, applicable regulations, and industry best practices.

Organizational perspective: Auditing companies’ safety and security practices as a whole, not just individual models and systems. Auditors should use an organization-level perspective to avoid abstraction errors (i.e., forming the wrong conclusion by treating a partial or simplified unit of analysis, such as evaluating a specific component in isolation, as if it were sufficient to assess overall system and organizational risk). Risk does not come from AI models alone; it emerges from the interaction of three overarching components: digital systems, computing hardware, and governance practices, and harm can arise even when a model is never deployed in external-facing systems. Rigorous, but isolated, model and system evaluations are therefore insufficient to evaluate all safety and security claims on their own. And while individual audits may focus on particular domains depending on their goals, the ecosystem as a whole should ensure comprehensive coverage across all three components in assessing safety and security claims.

*Figure 1: Four AI Assurance Levels (AALs) for different frontier AI audits.*

Levels of assurance: A framework for calibrating and communicating confidence in audit conclusions. Not all audits provide the same level of certainty, and stakeholders need to understand these differences. We propose AI Assurance Levels (AALs) as a means of clarifying what kind of assurance particular frontier AI audits provide (Figure 1). At lower levels, auditors and other stakeholders rely more heavily on information provided by the company and can primarily speak to a particular system’s properties. At higher levels, auditors take fewer assumptions for granted, and assess the full range of relevant company systems, organizational processes, and risks. At the highest level, auditors can rule out the possibility of materially significant deception by the auditee. Determining the appropriate AAL for different contexts and purposes is complex, but we recommend AAL-1 (the peak of current practices in AI) as a starting point for frontier AI generally, and AAL-2 as a near-term goal for the companies closest to the state-of-the-art. AAL-2 involves greater access to non-public information, less reliance on companies’ statements, and a more holistic assessment of company-level risks. The two highest assurance levels (AAL-3 and AAL-4) are not yet technically and organizationally feasible, but we outline research directions to change this.
Access: Deep enough to assure auditors and other stakeholders, secure enough to reassure auditees. Frontier AI auditors should receive deep, secure access to non-public information of various kinds — including model internals, training processes, compute allocation, governance records, and staff interviews — proportional to the audit’s scope and the level of assurance being sought for the audit. Access arrangements should protect intellectual property and security-sensitive information using mechanisms imported from other domains (e.g., sharing certain information with a subset of the auditing team on-site under a restrictive nondisclosure agreement) and newly-developed techniques (e.g., AI-powered summarization or analyses of information that is too sensitive to be directly shared).
Continuous monitoring: Living assessments, not stale PDFs. AI systems change constantly, including through adjustments to the underlying model(s), surrounding software, and shifts in user behavior. An audit conclusion that was accurate at the time of the assessment may become misleading in some respects within days or weeks. Audit findings should therefore carry explicit assumptions and validity conditions, and should be automatically deprecated when key underlying assumptions no longer hold. A mature auditing ecosystem will combine periodic deep assessments of slower-moving elements (e.g., governance, safety culture) with event-triggered reviews of major changes (e.g., new releases, serious incidents) and continuous automated monitoring of fast-changing surfaces (e.g., API behavior, configuration drift), enabling timely detection of changes that could invalidate prior conclusions.
Independent experts: Trustworthy results through rigorous independence safeguards and deep expertise. Auditors must be genuinely independent third parties, free from commercial or political influence, and have deep expertise across AI evaluation, safety, security, and governance. Safeguarding independence requires mandatory disclosure of financial relationships, standardized terms of engagement that prevent companies from shopping for favorable auditors, and cooling-off periods when moving, in both directions, between industry and audit roles. Alternative payment models that reduce auditor dependence on auditees should also be urgently explored. Where single auditing organizations lack sufficient expertise, subcontracting and consortia models can enable the necessary breadth across AI evaluation, safety, security, and governance.
Rigor: Processes that are methodologically rigorous, traceable, and adaptive. Audits should follow a standardized process while giving auditors the autonomy to flexibly determine specific methods and adjust scope as issues emerge. Auditors should be able to define evaluation metrics and criteria rather than simply validating companies’ preselected approaches. Wherever feasible, audit procedures should be automated, transparent, and reproducible to support consistent application across engagements and enable continuous monitoring as systems evolve. Auditors need to safeguard evaluation construct and ecological validity, and audit criteria should be protected against gaming. Finally, audits should incorporate procedural fairness, giving companies structured opportunities to correct factual errors while preventing undue influence on conclusions.
Clarity: Clear communication of audit results. Stakeholders must be able to understand the audit results. These should be communicated in audit reports with a standardized structure, covering the audit’s scope, level of assurance, conclusions, reasoning, and recommendations. Results should be communicated appropriately to different stakeholders: to protect sensitive information, auditors and companies can publish summarized or redacted versions for external stakeholders while sharing full, unredacted audit reports with boards, company executives, and, in some cases, regulatory bodies.
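
As a minimal sketch of the continuous-monitoring principle above, the Python below attaches explicit validity conditions to an audit finding and reports the finding as deprecated once any condition stops holding. The class names, snapshot keys, and example values are hypothetical illustrations and are not part of the framework described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical snapshot of monitored facts about the audited system.
Snapshot = Dict[str, str]


@dataclass
class ValidityCondition:
    """One explicit assumption that an audit finding depends on."""
    description: str
    holds: Callable[[Snapshot], bool]


@dataclass
class AuditFinding:
    """An audit conclusion plus the conditions under which it stays valid."""
    claim: str
    assurance_level: int  # 1 to 4, mirroring AAL-1 through AAL-4
    conditions: List[ValidityCondition] = field(default_factory=list)

    def broken_conditions(self, snapshot: Snapshot) -> List[str]:
        """Return descriptions of the conditions that no longer hold."""
        return [c.description for c in self.conditions if not c.holds(snapshot)]

    def status(self, snapshot: Snapshot) -> str:
        """A finding is deprecated as soon as any condition fails."""
        return "deprecated" if self.broken_conditions(snapshot) else "valid"


# Illustrative finding; the claim text and conditions are made up.
finding = AuditFinding(
    claim="Misuse safeguards block the tested cyberattack prompts",
    assurance_level=1,
    conditions=[
        ValidityCondition(
            "model version unchanged since audit",
            lambda s: s.get("model_version") == "v1",
        ),
        ValidityCondition(
            "safety filter enabled in the API configuration",
            lambda s: s.get("safety_filter") == "on",
        ),
    ],
)

at_audit = {"model_version": "v1", "safety_filter": "on"}
after_release = {"model_version": "v2", "safety_filter": "on"}

print(finding.status(at_audit))                  # valid
print(finding.status(after_release))             # deprecated
print(finding.broken_conditions(after_release))  # ['model version unchanged since audit']
```

In this sketch a new model release invalidates the finding automatically, which corresponds to an event-triggered review; an automated monitor could re-evaluate the same conditions on a schedule for fast-changing surfaces such as API behavior.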

Challenges and next steps

Our long-term vision will require concrete efforts by several categories of stakeholders to both achieve and maintain. The most urgent challenges are:

Ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the AI industry.
Growing the ecosystem of audit providers at a rapid pace without compromising quality.
Accelerating adoption of frontier AI auditing by clarifying and strengthening incentives.
Achieving technical readiness for high AI Assurance Levels so they can be applied when needed.

These challenges are substantial but not unprecedented. Companies routinely share sensitive information with financial auditors, potential acquirers, penetration testers, and consumer product testing laboratories under carefully controlled terms. We believe similar practices for AI safety and security are both achievable and urgently needed. For each of the challenges we describe, we recommend specific next steps:

*Figure 2: Recommendations for next steps across four challenges in frontier AI auditing.*

Keeping up with the rapid pace of AI progress and deployment requires quickly importing best practices from more mature industries and immediate investment in auditing pilots, technical research, and policy research. Moving with urgency is essential if frontier AI auditing is to reach maturation and scale alongside AI development.


James Odene 1/15/26

Measuring Progress on AI Safety Practices

Without reliable measurement of AI systems, we cannot conclude a measured system is safe

AVERI’s purpose (i.e., to make third-party auditing of frontier AI effective and universal) takes inspiration from, among others, the financial industry. However, financial auditors benefit from 700+ years of bookkeeping history. AI safety does not have comparable forms of safety expression against which audits might be conducted. In short, there is no “balance sheet for AI” where frontier AI companies could fill in numbers to arrive at a bottom line finding of safe or unsafe.

Neither AVERI nor the rest of the AI assurance ecosystem has a solution to the “balance sheet problem,” but we can — at least to some extent — measure our progress towards a reliable accounting of AI risk.

"A.I. Has a Measurement Problem," an article from the New York Times.

Measuring the Reliability of Claims

In financial audits, the task is to verify claims of financial condition, but for AI systems, an auditor is tasked with verifying safety and security claims. Consider, for example, the claims below:

  
| | Finance | AI Safety |
|---|---|---|
| Claim | Liquid cash balance: $12,186,633 | Cybersecurity risk: negligible |
| Examined Evidence | Bank statement | Safety case, benchmarks, evaluations, internal process control documents, red team results, … |

A financial auditor can quickly verify cash on hand by checking bank statements, but safety claims quantitatively and qualitatively integrate many forms of evidence. Without the simplicity of receipts, AI safety claim verification is an exercise in weighing “how likely is this risk model to be wrong?”

In short, verification is an exercise in measuring the risk that safety claims are wrong, and reliable evaluations serve to reduce that risk. Strong measures of evaluation reliability therefore enable better evaluations.

Risk Management for Benchmark Evidence: BenchRisk

To enable the measurement of evaluation reliability, we examined benchmarks as artifacts that make claims and measured the risk that these claims might mislead people about the properties of a system. We bundled our process, a specialized form of risk management, into a dataset at BenchRisk.ai and presented the results at NeurIPS 2025.

McGregor, Sean, et al. Risk Management for Mitigating Benchmark Failure Modes: BenchRisk. Proceedings of the Neural Information Processing Systems Conference (NeurIPS), 2025. arXiv, arXiv:2510.21460.

BenchRisk proceeds by collecting failure modes (57 to date) and mitigations (196 to date). Each mitigation that a benchmark author affirms increases the benchmark’s reliability and its BenchRisk score.

*Figure: Pre- and post-mitigation risk as a benchmark. The workflow runs from collecting failure modes to calculating the risk they pose with and without mitigation; the mitigations selected by one hypothetical benchmark score points on BenchRisk, illustrated by the points scored against Failure Mode #025.*

We applied this process to 26 leading benchmarks and found that all of them present significant risk of misleading people about the properties of frontier AI systems. For a more complete account of how BenchRisk is calculated, see the NeurIPS presentation.
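
To make the idea of pre- and post-mitigation risk concrete, here is a minimal Python sketch. It assumes a simple model that is not taken from the BenchRisk paper: likelihood and severity each lie on a 0 to 1 scale, risk is their product, each adopted mitigation multiplies likelihood and/or severity by a reduction factor, and the points scored are the resulting drop in risk. The failure mode, mitigations, and numbers are hypothetical; the actual scoring rules are defined in the paper and at BenchRisk.ai.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mitigation:
    """A mitigation a benchmark author can affirm (hypothetical model)."""
    name: str
    likelihood_factor: float = 1.0  # multiplier in (0, 1]; 1.0 means no effect
    severity_factor: float = 1.0    # multiplier in (0, 1]; 1.0 means no effect


@dataclass(frozen=True)
class FailureMode:
    """A way a benchmark could mislead its users (hypothetical model)."""
    name: str
    likelihood: float  # assumed scale: 0 (never) to 1 (certain)
    severity: float    # assumed scale: 0 (harmless) to 1 (worst case)


def risk(likelihood: float, severity: float) -> float:
    """Assumed risk model: likelihood times severity."""
    return likelihood * severity


def mitigated_risk(mode: FailureMode, adopted: List[Mitigation]) -> float:
    """Risk remaining after applying every adopted mitigation."""
    likelihood, severity = mode.likelihood, mode.severity
    for m in adopted:
        likelihood *= m.likelihood_factor
        severity *= m.severity_factor
    return risk(likelihood, severity)


def points_scored(mode: FailureMode, adopted: List[Mitigation]) -> float:
    """Assumed scoring rule: points equal the reduction in risk."""
    return risk(mode.likelihood, mode.severity) - mitigated_risk(mode, adopted)


# Hypothetical failure mode and mitigations, for illustration only.
mode = FailureMode("test items leak into training data", likelihood=0.5, severity=0.5)
adopted = [
    Mitigation("keep a private held-out test set", likelihood_factor=0.5),
    Mitigation("publish contamination checks with results", severity_factor=0.5),
]

print(risk(mode.likelihood, mode.severity))  # 0.25
print(mitigated_risk(mode, adopted))         # 0.0625
print(points_scored(mode, adopted))          # 0.1875
```

Under these assumptions a benchmark that affirms no mitigations scores zero points for the failure mode, and each additional mitigation can only lower the remaining risk.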

Takeaway: although benchmarks are regularly put forward to describe frontier AI for real-world purposes, they are often not produced with the intention or capacity to support real-world decision making. A practice of benchmarking, complete with organizations and methods dedicated to real-world decisions, is required to make benchmarks reliable for consequential decision making. BenchRisk is a tool for advancing toward that reliable safety information ecosystem.

A preview of the results that are available in depth at benchrisk.ai


What’s Next

BenchRisk is a tool in our toolbox for assessing the reliability of claims. It can be applied to new contexts and used to measure the reliability of any safety claim resulting from benchmarks or, more broadly, evaluations. The only requirement is the development of a list of ways the safety claim might be unreliable. We invite you to critique, improve, and apply BenchRisk-ChatBot-v1.0, and to fork the project for your own purposes. AVERI itself will periodically apply BenchRisk through its pilot evaluation programs.


James Odene 1/14/26

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk

The BenchRisk workflow allows for comparison between benchmarks; as an open-source tool, it also facilitates the identification and sharing of risks and their mitigations.

Sean McGregor,¹,²,∗ Victor Lu,³,† Vassil Tashev,³,† Armstrong Foundjem,⁴,‡ Aishwarya Ramasethu,⁵,‡ Mahdi Kazemi,¹⁰,‡ Chris Knotz,³,‡ Kongtao Chen,⁶,‡ Alicia Parrish,⁷,◦ Anka Reuel,⁸,¶ Heather Frase⁹,²,¶

¹AI Verification and Evaluation Research Institute, ²Responsible AI Collaborative, ³Independent, ⁴Polytechnique Montreal, ⁵Prediction Guard, ⁶Google, ⁷Google DeepMind, ⁸Stanford University, ⁹Veraitech, ¹⁰University of Houston

Contribution equivalence classes (∗, †, ‡, ◦, ¶) detailed in acknowledgments

Abstract

Large language model (LLM) benchmarks inform LLM use decisions (e.g., “is this LLM safe to deploy for my use case and context?”). However, benchmarks may be rendered unreliable by various failure modes that impact benchmark bias, variance, coverage, or people’s capacity to understand benchmark evidence. Using the National Institute of Standards and Technology’s risk management process as a foundation, this research iteratively analyzed 26 popular benchmarks, identifying 57 potential failure modes and 196 corresponding mitigation strategies. The mitigations reduce failure likelihood and/or severity, providing a frame for evaluating “benchmark risk,” which is scored to provide a metaevaluation benchmark: BenchRisk. Higher scores indicate that benchmark users are less likely to reach an incorrect or unsupported conclusion about an LLM. All 26 scored benchmarks present significant risk within one or more of the five scored dimensions (comprehensiveness, intelligibility, consistency, correctness, and longevity), which points to important open research directions for the field of LLM benchmarking. The BenchRisk workflow allows for comparison between benchmarks; as an open-source tool, it also facilitates the identification and sharing of risks and their mitigations.
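
The five scored dimensions named in the abstract can be treated as a per-benchmark profile. The Python below is a small sketch under stated assumptions: the 0 to 100 scale, the threshold, and the example scores are hypothetical and are not BenchRisk results. It lists the dimensions where a benchmark falls below a chosen threshold, weakest first, following the abstract's convention that higher scores mean lower risk of an unsupported conclusion.

```python
from typing import Dict, List

# Dimension names as listed in the abstract.
DIMENSIONS = (
    "comprehensiveness",
    "intelligibility",
    "consistency",
    "correctness",
    "longevity",
)


def weak_dimensions(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return dimensions scoring below the threshold, lowest score first.

    Assumes higher scores are better; raises if a dimension is missing.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    below = [d for d in DIMENSIONS if scores[d] < threshold]
    return sorted(below, key=lambda d: scores[d])


# Hypothetical scores for one benchmark on an assumed 0-100 scale.
example_scores = {
    "comprehensiveness": 40.0,
    "intelligibility": 75.0,
    "consistency": 55.0,
    "correctness": 80.0,
    "longevity": 20.0,
}

print(weak_dimensions(example_scores, threshold=60.0))
# ['longevity', 'comprehensiveness', 'consistency']
```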


A 5-minute presentation prepared for NeurIPS explaining BenchRisk.
