What Access is Needed for Effective Auditing? Insights from Near-Verbatim Extraction on Open-Weight Models
A new method for studying near-verbatim memorization in language models that has implications for access requirements in frontier AI auditing
AVERI’s mission is to make frontier AI auditing effective and universal. By this, we mean that we’d like to see rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.
A. Feder Cooper, Research Scientist at AVERI, recently posted a preprint that advances scientific understanding of training data memorization by language models, and in the process, sheds light on what kind of access auditors will likely need in practice to better understand how much frontier models memorize.
The Importance of Memorization
Language models memorize a portion of their training data. The extent of such memorization can signal important things about model behavior and capabilities, including overfitting to the training data. How much memorization is happening, and whether it’s feasible to extract memorized data in outputs, matters for additional safety and security reasons: extraction can expose private data, leak information that could be abused for malicious purposes, or reproduce copyrighted material present in the training data.
Memorization is not the only thing that matters in AI safety and security, but it’s a key building block to fully understanding an AI system’s risk profile. Consider an analogy to assessing security at a bank. A bank needs better security if it has more stored in its vault. If the bank manager hires a consultant to advise on improving security, the consultant couldn’t do their job effectively without at least a rough sense of how much is in the vault, and therefore how hard a thief might try to break in.
Likewise, in order to know the stakes of many AI safety and security interventions (such as increasing robustness to “jailbreaks”), it’s important to know how often language models are actually generating new insights on the fly versus nearly exactly reproducing something they read somewhere, and how much knowledge they have about a range of topics.
This analogy also helps clarify why it’s important to study base models (models that have been “pre-trained” on largely Internet-derived data, but not “post-trained” to behave as a helpful assistant). The post-training process can conceal or reduce the likelihood of undesirable knowledge or behaviors emerging (such as memorized training data), but it often doesn’t actually remove that knowledge entirely or make the behavior impossible. Studying base models can give a sense of worst-case degrees of potential extraction risk.
Verbatim and Near-Verbatim Extraction
On a technical level, memorization means that a language model assigns a very high probability to a particular sequence of “tokens” (words or chunks of words) in its training data. The way that researchers typically quantify this behavior is through extraction: reproducing memorized training data at generation time through prompting the model in specific ways.
The standard way that researchers measure extraction has (at least) two blind spots. First, it typically counts only exact (verbatim) matches: prompt a model with a “prefix” from a known training sequence, generate a continuation, and check whether it matches the rest of that training sequence (the “target suffix”) exactly, token for token. If even one token is off (a punctuation mark, a space), it doesn't get counted as extraction. Prior work shows this misses memorized sequences that the model reproduces with near-perfect fidelity, which can similarly pose privacy, safety, or copyright issues.
Fig. 1. Illustrating extraction of a sequence of training data from The Great Gatsby. Figure reproduced from Cooper et al. (2025).
Second, the standard approach typically uses greedy decoding, which picks the single highest-probability token at each step of the output. This is deterministic (you always get the same generation for the same prompt) and produces a binary outcome: extracted or not (Fig. 1). But in practice, deployed LLMs typically use non-deterministic decoding schemes, so the relevant question isn't whether a sequence can be extracted, but how likely extraction is on any given generation. A sequence the model reproduces 15% of the time poses very different leakage risk than one it reproduces 0.1% of the time. As prior work shows for verbatim extraction, that probability (the extraction risk) provides a lot more information than a yes/no determination about extraction.
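To make the idea of an extraction probability concrete, here is a minimal Python sketch of how the probability of one specific suffix could be computed under top-k decoding at temperature 1. It assumes we already have the model's raw logits at each suffix position (for instance from one forward pass over prefix plus suffix). The function names and the three-token toy vocabulary are illustrative assumptions, not the preprint's code.

```python
import math

def top_k_logprob(logits, target, k):
    """Log-probability of token id `target` under top-k sampling (temperature 1)."""
    # Keep the k highest-logit token ids; every other token gets probability 0.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    if target not in kept:
        return float("-inf")
    # Renormalize over the kept tokens with a numerically stable log-sum-exp.
    m = max(logits[i] for i in kept)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in kept))
    return logits[target] - log_z

def suffix_probability(step_logits, suffix, k):
    """Probability of generating `suffix` token by token under top-k."""
    total = 0.0
    for logits, token in zip(step_logits, suffix):
        lp = top_k_logprob(logits, token, k)
        if lp == float("-inf"):
            return 0.0  # some suffix token falls outside the top k
        total += lp
    return math.exp(total)

# Toy example (hypothetical numbers): the same 3-token distribution at each step.
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
risk = suffix_probability([toy_logits, toy_logits], suffix=[0, 1], k=2)
```

With k = 2, the third token is cut off and the remaining two are renormalized, so the toy suffix has probability (0.5/0.8) × (0.3/0.8). A greedy check on the same input would only report whether the suffix is the single most likely path.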
Our new paper addresses both of the above limitations together. We introduce a tractable method for estimating near-verbatim extraction risk: the probability that a model, when prompted with a prefix, generates a continuation that is within a small edit distance of the suffix.
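As a rough illustration of the "within a small edit distance" condition, the following Python sketch computes token-level Levenshtein distance and applies a threshold. The helper names and the example tokenization are assumptions made for illustration; the preprint defines the exact distance and tokenization it uses.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if tokens match)
            ))
        prev = curr
    return prev[-1]

def is_near_verbatim(continuation, target_suffix, max_distance):
    """True if the continuation is within `max_distance` edits of the target."""
    return edit_distance(continuation, target_suffix) <= max_distance

# Hypothetical tokenization of a hyphenation variant.
d = edit_distance(["honey", "comb"], ["honey", "-", "comb"])
```

Here the hyphenated variant is one token insertion away from the unhyphenated one, so it would count as near-verbatim at any maximum distance of 1 or more, while a verbatim check (maximum distance 0) would reject it.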
Experiments on open-weight models show that accounting for near-verbatim extraction reveals far more memorized training data and much higher extraction risk than verbatim methods suggest.
We find that models can produce near-exact copies of their training data that differ by trivial edits such as spacing and punctuation ("money— that" vs. "money—that") and hyphenation ("honeycomb" vs. "honey-comb"). The memorization is clearly real, but the model encoded slightly different forms of the same text, instead of putting all (or, sometimes, even most of) the probability on the verbatim target suffix being tested. This means that just looking at verbatim risk can conceal how much meaningful (near-verbatim) extraction risk there actually is.
These are all cases where verbatim extraction tests say "not extracted," but the model has clearly reproduced the training data in its outputs. And because each of these near-verbatim variants carries its own probability of generation, the total extraction risk – the chance of producing any near-copy – can be far larger than the probability of producing the exact original.
Fig. 2. For Llama 1 13B and the same training sequence as in Fig. 1 from The Great Gatsby, we show the training prefix, three continuations generated using top-k (temperature = 1, k = 40), and the probability of generating each continuation given the prefix under top-k. We visually diff the characters in each continuation with the target suffix’s characters. Blue shows text in the generation that isn’t present in the target suffix (additions), and red shows text missing from the generation that’s present in the target suffix (deletions). Quantitatively, we compute the edit distance, comparing each tokenized continuation to the tokenized target suffix. We highlight in yellow the case of verbatim extraction of the target suffix, which is not the greedy continuation that the traditional extraction method returns (the top row). All three continuations are near-verbatim matches to the target suffix. Figure reproduced from Cooper et al. (2026).
In Fig. 2, we show what this looks like for Llama 1 13B and the same quote from The Great Gatsby that’s in Fig. 1. Traditional greedy-decoded extraction would return the continuation in the top row, which would fail a verbatim check (but pass even a very stringent near-verbatim one). Verbatim extraction risk would capture the middle row: this is the verbatim suffix (edit distance = 0). A probability of 0.1431 means that, when prompted with the prefix, Llama 1 13B outputs the suffix verbatim about 1 out of every 7 times (under top-k decoding; 0.1431 ≈ 1/7). But all three variations are clearly almost identical to the verbatim target suffix. Outputting any one of them would indicate extraction risk. So the relevant risk, the near-verbatim risk, is actually the sum of the probabilities over all of the near-verbatim continuations. In this case, the near-verbatim risk associated with just these three continuations is 0.1477 + 0.1431 + 0.0671 = 0.3579, which is 2.5x the verbatim risk!
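The arithmetic for this example can be checked with a few lines of Python. The three probabilities are the ones quoted for Fig. 2; the row labels are an assumption based on the order in which the text lists them.

```python
# Probabilities of the three Fig. 2 continuations under top-k (k = 40).
# Row labels are assumed from the order of the sum in the text.
continuation_probs = {
    "top row (greedy continuation)": 0.1477,
    "middle row (verbatim suffix)": 0.1431,
    "bottom row": 0.0671,
}

verbatim_risk = continuation_probs["middle row (verbatim suffix)"]

# Distinct continuations are mutually exclusive outcomes, so their
# probabilities add up to the chance of producing any one of them.
near_verbatim_risk = sum(continuation_probs.values())

ratio = near_verbatim_risk / verbatim_risk
```

Because the sum covers only three of the many possible near-verbatim continuations, it is a lower bound on the full near-verbatim risk for this sequence.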
Why is near-verbatim extraction risk hard to estimate?
One can compute verbatim extraction risk exactly for a cost comparable to doing greedy decoding.¹ But for near-verbatim, it’s not so simple. To compute it exactly, you need the total probability mass on all continuations within some maximum edit distance of the target suffix. And that set is enormous. For a 50-token sequence, allowing for just two token edits means that there are over a trillion near-verbatim suffixes to evaluate.²
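A back-of-the-envelope count shows where a number like "over a trillion" can come from. The Python sketch below counts only the suffixes reachable by substituting tokens at exactly two positions, and assumes a vocabulary of 32,000 tokens (the size of the Llama 1 tokenizer). Both choices are assumptions for illustration; the preprint's own count may be derived differently.

```python
import math

suffix_length = 50   # tokens in the target suffix
vocab_size = 32_000  # assumed vocabulary size (Llama 1 tokenizer)

# Choose which 2 of the 50 positions to change, then pick any *other*
# token at each of those positions.
two_substitutions = math.comb(suffix_length, 2) * (vocab_size - 1) ** 2

# Single substitutions, for comparison.
one_substitution = suffix_length * (vocab_size - 1)
```

This already exceeds 10^12 before counting insertions and deletions, which only make the set larger.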
An alternative is to actually sample from the model with the chosen decoding scheme: prompt the model a bunch of times and count how often the output lands close enough to the target suffix. This works, and is cheaper than brute-forcing every possible near-verbatim sequence, but is still wildly expensive. Detecting a sequence with a 1% near-verbatim extraction risk (which is really high for a language model) requires roughly 300 samples; reliably estimating it requires around 10,000.³ So at scale this is still impractical.
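The sample counts above can be sanity-checked with standard binomial arithmetic. The sketch below assumes "detecting" means seeing at least one hit with 95% confidence, and that "reliably estimating" means a relative standard error of about 10%. These criteria are assumptions for illustration; the preprint may use different ones.

```python
import math

def samples_to_detect(risk, confidence=0.95):
    """Samples needed to observe at least one hit with the given confidence."""
    # P(no hit in n samples) = (1 - risk)^n, which must be <= 1 - confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - risk))

def relative_standard_error(risk, n):
    """Standard error of the sampled hit rate, as a fraction of the true risk."""
    return math.sqrt(risk * (1 - risk) / n) / risk

n_detect = samples_to_detect(0.01)               # about 300
rel_err = relative_standard_error(0.01, 10_000)  # about 0.1
```

Under these assumptions, roughly 300 samples are needed just to see a 1%-risk sequence once, and 10,000 samples bring the estimate's standard error down to about a tenth of the risk itself. Both numbers grow as the risk shrinks.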
Our deterministic and tractable estimation approach
We introduce decoding-constrained beam search, which exploits a key property of memorized sequences: they are high-probability under the model. Beam search is a decoding algorithm that deterministically explores a high-probability region of the output space, so we intuited that, for memorized sequences, it should surface near-verbatim continuations.
We modify beam search to operate under a chosen decoding scheme (e.g., top-k with k = 40), so each candidate continuation comes with its exact probability under that scheme. We then filter the outputs of the search for continuations within a chosen edit distance of the training-data suffix, and sum their probabilities to produce a deterministic lower bound on near-verbatim extraction risk. The bound is guaranteed to be valid (the true risk can only be equal or higher), requires no repeated sampling, and costs roughly the same as 20 samples (instead of thousands).⁴
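The general shape of this procedure can be sketched in a few dozen lines of Python. This is a simplified illustration under stated assumptions, not the preprint's implementation: it uses a toy stand-in for the model (`toy_logits`), fixed-length continuations, top-k at temperature 1, and a plain post-hoc distance filter. All function and parameter names are hypothetical.

```python
import heapq
import math

def top_k_logprobs(logits, k):
    """Renormalized log-probabilities of the k highest-logit tokens."""
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in kept)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in kept))
    return {i: logits[i] - log_z for i in kept}

def edit_distance(a, b):
    """Token-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def near_verbatim_lower_bound(next_logits, prefix, target, k, beam_width, max_distance):
    """Sum the top-k probabilities of beam continuations close to `target`."""
    beams = [(0.0, [])]  # (cumulative log-probability, continuation tokens)
    for _ in range(len(target)):
        candidates = []
        for logp, cont in beams:
            # Only tokens that top-k could actually sample are expanded.
            for tok, tok_logp in top_k_logprobs(next_logits(prefix + cont), k).items():
                candidates.append((logp + tok_logp, cont + [tok]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    # Beams are distinct sequences, so their probabilities can be added.
    return sum(math.exp(logp) for logp, cont in beams
               if edit_distance(cont, target) <= max_distance)

def toy_logits(tokens):
    """Toy 4-token 'model': strongly favors one token per position."""
    favorite = len(tokens) % 4
    logits = [0.0] * 4
    logits[favorite] = 3.0
    logits[(favorite + 1) % 4] = 1.0
    return logits

prefix, target = [0, 1, 2, 3], [0, 1, 2, 3]
verbatim = near_verbatim_lower_bound(toy_logits, prefix, target, k=2, beam_width=8, max_distance=0)
near = near_verbatim_lower_bound(toy_logits, prefix, target, k=2, beam_width=8, max_distance=1)
```

The result is a lower bound because the beam holds only some of the possible continuations: any near-verbatim continuation that falls outside the beam contributes probability that is simply not counted. In the toy example, allowing one edit raises the bound from about 0.60 to about 0.93.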
What we find
We ran experiments across three different open-weight model families (OLMo 2, Llama 1, Pythia), covering different model sizes and different types of text data from their respective training datasets (Wikipedia, public domain books, Enron emails, respectively). We also ran a series of negative controls: experiments on non-training data, where we expect to see no extraction (which would support the validity of our extraction procedure).
Verbatim methods substantially undercount extraction (and memorization). For one example (Fig. 3), our experiments on OLMo 2 32B and 10,000 training sequences from Wikipedia find that 2.57% of sequences are near-verbatim extractable with our method, compared to 1.42% for verbatim probabilistic extraction and 0.61% for the standard greedy method. The sequences that verbatim methods miss have very high risk – on average, over 0.08 (a high number for an LLM).
Fig. 3. Evaluating extraction for the OLMo 2 model family, which has 7B, 13B and 32B model sizes. Each plot shows extraction rates for a different model size, for verbatim (maximum distance ε = 0) and near-verbatim extraction (k-CBS) for edit distance (Levenshtein) with maximum distances ε ∈ {1, 2, 3, 4, 5}. For greedy near-verbatim extraction, one generates the single greedy continuation and checks if the distance with the target suffix doesn’t exceed ε. We use a sample of 10,000 sequences from Wikipedia from OLMo 2’s training data. To assess validity, we also run analogous negative controls on 5,000 held-out sequences scraped from Wikipedia that post-date OLMo 2’s training cutoff. The greedy extraction rates are all exact. The probabilistic extraction rates use decoding-constrained beam search, and so these rates should be interpreted as lower bounds on the true extraction rates. Figure reproduced from Cooper et al. (2026).
Near-verbatim extraction risk is much larger for individual sequences. The example sequence from The Great Gatsby (Fig. 2) already suggests this: near-verbatim extraction risk is often significantly larger than the verbatim risk. For that sequence, decoding-constrained beam search returns a lower bound of 0.7155, which is 5x the verbatim risk of 0.1431! The model outputs a near-verbatim copy of the suffix over 7 out of every 10 times it’s prompted with the prefix. In many cases, sequences with no verbatim risk have enormous near-verbatim risk, in extreme cases going from 0 to over 0.85. This degree and variation of extraction risk is entirely invisible to verbatim and greedy methods.
The undercount grows with model size. As expected, we can extract more sequences from larger models. But, similar to the original work on verbatim probabilistic extraction, we also find that as model size increases, greedy extraction and verbatim extraction become worse undercounts of total extraction. (You can see this in Fig. 3: the gaps between rates widen.)
Implications for Audit Access
These results are on open-weight models, but the implications extend well beyond them.
For open-weight models, researchers automatically have full access to output probabilities. This is what makes decoding-constrained beam search possible: we can score every candidate continuation exactly. Furthermore, open-weight model releases often include base models.
Extending our technique to closed models would require access to log-probabilities or logits – not the full weights, but at least the output distributions for base models, which our method needs for producing and scoring continuations. Even limited access, such as returning log-probabilities for a set of provided continuations and chosen decoding scheme, could enable far more informative extraction audits than black box sampling.
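To illustrate what such limited access might look like from the auditor's side, here is a purely hypothetical Python sketch. No existing provider API is being described: `ScoringEndpoint`, its `score` method, the decoding-scheme string, and the in-memory `FakeEndpoint` are all invented for illustration.

```python
import math
from typing import Protocol, Sequence

class ScoringEndpoint(Protocol):
    """Hypothetical audit interface: scores continuations, returns no weights."""
    def score(self, prefix: str, continuations: Sequence[str], decoding: str) -> list[float]:
        """Return one log-probability per continuation under `decoding`."""
        ...

def lower_bound_from_scores(endpoint, prefix, near_verbatim_candidates,
                            decoding="top-k(k=40, T=1)"):
    """Lower-bound near-verbatim risk from auditor-supplied candidates."""
    # Deduplicate so that no continuation's probability is counted twice.
    distinct = list(dict.fromkeys(near_verbatim_candidates))
    logprobs = endpoint.score(prefix, distinct, decoding)
    return sum(math.exp(lp) for lp in logprobs)

class FakeEndpoint:
    """In-memory stand-in used only to exercise the sketch."""
    def __init__(self, table):
        self.table = table
    def score(self, prefix, continuations, decoding):
        # Unknown continuations get log-probability -inf (probability 0).
        return [self.table.get(c, float("-inf")) for c in continuations]

endpoint = FakeEndpoint({"variant A": math.log(0.1431), "variant B": math.log(0.0671)})
bound = lower_bound_from_scores(
    endpoint, "some prefix", ["variant A", "variant B", "variant A", "variant C"]
)
```

Under this kind of interface the auditor has to propose the candidate continuations themselves (already filtered to be near-verbatim), so the resulting bound is only as good as the candidate set. Access to full output distributions would let the search find candidates directly.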
Since this access is rare today, our research implies that frontier AI auditors would need the frontier AI company to specifically provide privileged access for any audit where memorization is an important consideration. We believe there are likely to be several such audits (e.g., measuring overfitting on evaluation tasks, or estimating near-verbatim knowledge of facts with significant misuse risk, such as scientific facts from virology). More generally, the limitations of relying on publicly available information and interfaces are growing over time, not shrinking, as we argued in our launch paper.
These findings have immediate relevance for policy implementation. For example, the EU AI Act’s Code of Practice requires “adequate access” to be provided to third-party evaluators. And while there are not yet frontier AI auditing requirements in the US, proposals for such requirements would have to either specify access requirements, or use general terms like adequate access that will then need to be interpreted based on the available scientific evidence. Notably, our findings suggest the necessity of base model access for many purposes, but not the sufficiency – frontier models are increasingly post-trained with significant amounts of data and computing power, which can add new knowledge and behaviors. Understanding what is learned at this stage would require different techniques than those discussed here.
This preprint is not the only line of evidence supporting the need for deep access in order to conduct effective audits. Indeed, much of the evidence for this comes from companies themselves, such as when they describe safety interventions that are only possible with access to model internals and training data as well as model chains-of-thought. Deep access for auditors need not mean unrestricted access: in many cases, the right standard is secure, claim-scoped access that gives auditors only the privileged visibility needed to analyze a specific safety or security claim, using time-bound, rate-limited model interfaces and data query tools.
We are optimistic that deep, secure access is ultimately a solvable problem. Frontier AI auditing is challenging primarily because the technology itself is being developed and deployed rapidly, not because there is no precedent for such deep access in other contexts, or because there aren’t promising techniques for providing access in a way that protects sensitive intellectual property (see our launch paper for related discussion). But if we are to achieve effective and universal frontier AI auditing in a timely fashion, frontier AI companies will need to provide deep access to systems and information about company practices. We hope to share more in the coming months about the practical aspects of providing this access under mutually agreeable terms.
Notes
¹ You can run the whole verbatim sequence through the LLM, and use the logits to compute the exact probability of the target suffix given the prefix, according to a chosen decoding scheme.
² It’s intractable to run them all through the verbatim procedure, and then tally up their probabilities. But it also shouldn’t be necessary to do that. Many near-verbatim suffixes will have 0 or extremely low probability. Consider replacing the token for “ the” with the token for “ jazz” or “ github” near the end of the target suffix from The Great Gatsby.
³ For lower risk sequences, these numbers are even higher.
⁴ We also develop variants that integrate the distance check directly into the search, often yielding tighter bounds at lower cost. The preprint covers in detail why the different variants of decoding-constrained beam search produce a valid lower bound on extraction risk for a given sequence, why those bounds are also useful in practice, and why our algorithm is cheaper than sampling.
Frontier AI Auditing: Toward Rigorous Third-Party Assessment of Safety and Security Practices at Leading AI Companies
A comprehensive framework for independent evaluation of frontier AI systems, mapping access requirements to systemic risks.
Miles Brundage¹*, Noemi Dreksler², Aidan Homewood², Sean McGregor¹, Patricia Paskov³, Conrad Stosz⁴, Girish Sastry⁵, A. Feder Cooper¹, George Balston¹, Steven Adler⁶, Stephen Casper⁷, Markus Anderljung², Grace Werner¹, Sören Mindermann⁵, Vasilios Mavroudis⁸, Ben Bucknall⁹, Charlotte Stix¹⁰, Jonas Freund², Lorenzo Pacchiardi¹¹, José Hernández-Orallo¹¹, Matteo Pistillo¹⁰, Michael Chen¹², Chris Painter¹², Dean W. Ball¹³, Cullen O’Keefe¹⁴, Gabriel Weil¹⁵, Ben Harack³, Graeme Finley⁵, Ryan Hassan¹⁶, Scott Emmons⁵, Charles Foster¹², Anka Reuel¹⁷, Bri Treece¹⁸, Yoshua Bengio¹⁹, Daniel Reti²⁰, Rishi Bommasani¹⁷, Cristian Trout²¹, Ali Shahin Shamsabadi²², Rajiv Dattani²¹, Adrian Weller¹¹, Robert Trager³, Jaime Sevilla²³, Lauren Wagner²⁴, Lisa Soder²⁵, Ketan Ramakrishnan²⁶, Henry Papadatos²⁷, Malcolm Murray²⁷, Ryan Tovcimak²⁸
¹AVERI ²GovAI ³Oxford Martin AI Governance Initiative ⁴Transluce ⁵Independent ⁶Clear-Eyed AI ⁷MIT CSAIL ⁸Alan Turing Institute ⁹University of Oxford ¹⁰Apollo Research ¹¹University of Cambridge ¹²METR ¹³Foundation for American Innovation ¹⁴Institute for Law and AI ¹⁵Touro University Law Center ¹⁶New Science ¹⁷Stanford University ¹⁸Fathom ¹⁹Mila, Université de Montréal ²⁰Exona Lab ²¹AI Underwriting Company ²²Brave Software ²³Epoch AI ²⁴Abundance Institute ²⁵interface ²⁶Yale University ²⁷SaferAI ²⁸UL Solutions
January 2026
Listed authors contributed significant writing, research, and/or review for one or more sections. The sections cover a wide range of empirical and normative topics, so with the exception of the corresponding author (Miles Brundage, miles.brundage@averi.org), inclusion as an author does not entail endorsement of all claims in the paper, nor does authorship imply an endorsement on the part of any individual’s organization.
Executive Summary
Key paper takeaways
Despite their rapidly growing importance, AI systems are subject to less rigorous third-party scrutiny than many of the other social and technological systems that we rely on daily such as consumer products, corporate financial statements, and food supply chains. This gap is becoming increasingly untenable as AI becomes more capable and widely deployed, and it inhibits confident deployment of AI in high-stakes contexts.
Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential and require expert interpretation, and third parties are right to be skeptical of companies "checking their own homework" given the track record of that approach in other industries.
We outline a vision for frontier AI auditing, which we define as rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.
Frontier AI audits should not be limited to a company’s publicly deployed products, but should instead consider the full range of organization-level safety and security risks, including internal deployment of AI systems, information security practices, and safety decision-making processes.
We describe four AI Assurance Levels (AALs), the higher levels of which provide greater confidence in audit findings. We recommend AAL-1 as a baseline for frontier AI generally, and AAL-2 as a near-term goal for the most advanced subset of frontier AI developers.
Achieving the vision we outline will require (1) ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the industry; (2) growing the ecosystem of audit providers at a rapid pace without compromising quality; (3) accelerating adoption of frontier AI auditing by clarifying and strengthening incentives; and (4) achieving technical readiness for high AI Assurance Levels so they can be applied when needed.
Frontier AI auditing motivations
Artificial intelligence (AI) is rapidly becoming critical societal infrastructure. Every day, AI systems inform decisions that affect billions of people. Increasingly, they also make consequential decisions autonomously. Although these technologies hold incredible promise, the pace of development and deployment has outpaced the creation of institutions that ensure AI works safely and as advertised.
This institutional gap is especially important for the most capable (“frontier”) systems — general-purpose AI models and systems whose performance is no more than a year behind the state-of-the-art — which many experts expect to exceed human performance across most tasks within the coming years. Already, developers of frontier AI systems need to prevent harmful system failures (e.g., outputting false medical information or buggy code), weaponization by malicious parties (e.g., to carry out cyberattacks), and theft of or tampering with sensitive data. The magnitude of risks that need to be managed is growing rapidly.
AI users, policymakers, investors, and insurers need reliable ways to verify that promised technical safeguards exist and to detect when they do not. This is challenging because the technology is complex, fast-moving, and often proprietary. Public transparency alone cannot solve this problem since many key details are — and often should remain — confidential, and require expert judgment to interpret. Many industries outside of AI already address similar challenges through independent auditors who review sensitive, non-public information and publish trustworthy conclusions that outsiders can rely on. We argue that similar practices are needed in the AI industry: broad, sustainable adoption of AI over time requires a solid foundation of trust built on credible scrutiny by independent experts.
Toward this end, we propose institutions designed to give stakeholders — including those who are uncertain about or even strongly skeptical of frontier AI companies — justified confidence that this critical technology is being developed safely and securely. Specifically, we describe and advocate for frontier AI auditing: rigorous third-party verification of frontier AI developers’ safety and security claims, and evaluation of their systems and practices against relevant standards, based on deep, secure access to non-public information.
An ecosystem of private sector frontier AI auditors (both for-profit and non-profit) would enable widespread confidence that frontier AI systems can be adopted broadly and would avoid reliance on companies “grading their own homework,” an approach with a checkered track record in many industries. It would also avoid relying entirely on governments to have the technical expertise, capacity, and agility to ensure high standards for frontier AI safety and security. If well-executed and scaled, frontier AI auditing would improve safety and security outcomes for users of AI systems and other affected parties, create a system to learn and update standards based on real-world outcomes, and enable more confident investment in and deployment of frontier AI, especially in high-stakes sectors of the economy.
Summary of the proposal
Drawing on our analysis of current practices in AI and lessons from other industries with more mature assurance regimes, we recommend eight interlinked design principles for a long-term vision for frontier AI auditing. This vision is deliberately ambitious to match the rising stakes as frontier AI capabilities advance:
Scope of risks: Comprehensive coverage of four key risk categories. Frontier AI auditing should focus on four risk categories: risks from (1) intentional misuse of frontier AI systems (e.g., for cyberattacks); (2) unintended frontier AI system behavior (e.g., errors harming the user, their property, or third parties due to pursuing the wrong goal or having an unreliable performance profile); (3) information security (e.g., theft of an AI model or user data); and (4) emergent social phenomena (e.g., addiction to AI or facilitation of self-harm). For each category of risks, auditors should (a) verify company claims and (b) evaluate the company’s systems and practices against its stated safety and security policies, applicable regulations, and industry best practices.
Organizational perspective: Auditing companies’ safety and security practices as a whole, not just individual models and systems. Auditors should use an organization-level perspective to avoid abstraction errors (i.e., forming the wrong conclusion by treating a partial or simplified unit of analysis, such as evaluating a specific component in isolation, as if it were sufficient to assess overall system and organizational risk). Risk does not come from AI models alone; it emerges from the interaction of three overarching components: digital systems, computing hardware, and governance practices, and harm can arise even when a model is never deployed in external-facing systems. Rigorous, but isolated, model and system evaluations are therefore insufficient to evaluate all safety and security claims on their own. And while individual audits may focus on particular domains depending on their goals, the ecosystem as a whole should ensure comprehensive coverage across all three components in assessing safety and security claims.
Figure 1: Four AI Assurance Levels (AALs) for different frontier AI audits.
Levels of assurance: A framework for calibrating and communicating confidence in audit conclusions. Not all audits provide the same level of certainty, and stakeholders need to understand these differences. We propose AI Assurance Levels (AALs) as a means of clarifying what kind of assurance particular frontier AI audits provide (Figure 1). At lower levels, auditors and other stakeholders rely more heavily on information provided by the company and can primarily speak to a particular system’s properties. At higher levels, auditors take fewer assumptions for granted, and assess the full range of relevant company systems, organizational processes, and risks. At the highest level, auditors can rule out the possibility of materially significant deception by the auditee. Determining the appropriate AAL for different contexts and purposes is complex, but we recommend AAL-1 (the peak of current practices in AI) as a starting point for frontier AI generally, and AAL-2 as a near-term goal for the companies closest to the state-of-the-art. AAL-2 involves greater access to non-public information, less reliance on companies’ statements, and a more holistic assessment of company-level risks. The two highest assurance levels (AAL-3 and AAL-4) are not yet technically and organizationally feasible, but we outline research directions to change this.
Access: Deep enough to assure auditors and other stakeholders, secure enough to reassure auditees. Frontier AI auditors should receive deep, secure access to non-public information of various kinds — including model internals, training processes, compute allocation, governance records, and staff interviews — proportional to the audit’s scope and the level of assurance being sought for the audit. Access arrangements should protect intellectual property and security-sensitive information using mechanisms imported from other domains (e.g., sharing certain information with a subset of the auditing team on-site under a restrictive nondisclosure agreement) and newly-developed techniques (e.g., AI-powered summarization or analyses of information that is too sensitive to be directly shared).
Continuous monitoring: Living assessments, not stale PDFs. AI systems change constantly, including through adjustments to the underlying model(s), surrounding software, and shifts in user behavior. An audit conclusion that was accurate at the time of the assessment may become misleading in some respects within days or weeks. Audit findings should therefore carry explicit assumptions and validity conditions, and should be automatically deprecated when key underlying assumptions no longer hold. A mature auditing ecosystem will combine periodic deep assessments of slower-moving elements (e.g., governance, safety culture) with event-triggered reviews of major changes (e.g., new releases, serious incidents) and continuous automated monitoring of fast-changing surfaces (e.g., API behavior, configuration drift), enabling timely detection of changes that could invalidate prior conclusions.
Independent experts: Trustworthy results through rigorous independence safeguards and deep expertise. Auditors must be genuinely independent third parties, free from commercial or political influence, and have deep expertise across AI evaluation, safety, security, and governance. Safeguarding independence requires mandatory disclosure of financial relationships, standardized terms of engagement that prevent companies from shopping for favorable auditors, and cooling-off periods when moving, in both directions, between industry and audit roles. Alternative payment models that reduce auditor dependence on auditees should also be urgently explored. Where single auditing organizations lack sufficient expertise, subcontracting and consortia models can enable the necessary breadth across AI evaluation, safety, security, and governance.
Rigor: Processes that are methodologically rigorous, traceable, and adaptive. Audits should follow a standardized process while giving auditors the autonomy to flexibly determine specific methods and adjust scope as issues emerge. Auditors should be able to define evaluation metrics and criteria rather than simply validating companies’ preselected approaches. Wherever feasible, audit procedures should be automated, transparent, and reproducible to support consistent application across engagements and enable continuous monitoring as systems evolve. Auditors need to safeguard evaluation construct and ecological validity, and audit criteria should be protected against gaming. Finally, audits should incorporate procedural fairness, giving companies structured opportunities to correct factual errors while preventing undue influence on conclusions.
Clarity: Clear communication of audit results. Stakeholders must be able to understand the audit results. These should be communicated in audit reports with a standardized structure, covering the audit’s scope, level of assurance, conclusions, reasoning, and recommendations. Results should be communicated appropriately to different stakeholders: to protect sensitive information, auditors and companies can publish summarized or redacted versions for external stakeholders while sharing full, unredacted audit reports with boards, company executives, and, in some cases, regulatory bodies.
Challenges and next steps
Achieving and maintaining our long-term vision will require concrete efforts by several categories of stakeholders. The most urgent challenges are:
Ensuring high quality standards for frontier AI auditing, so it does not devolve into a checkbox exercise or lag behind changes in the AI industry.
Growing the ecosystem of audit providers at a rapid pace without compromising quality.
Accelerating adoption of frontier AI auditing by clarifying and strengthening incentives.
Achieving technical readiness for high AI Assurance Levels so they can be applied when needed.
These challenges are substantial but not unprecedented. Companies routinely share sensitive information with financial auditors, potential acquirers, penetration testers, and consumer product testing laboratories under carefully controlled terms. We believe similar practices for AI safety and security are both achievable and urgently needed. For each of the challenges we describe, we recommend specific next steps:
Figure 2: Recommendations for next steps across four challenges in frontier AI auditing.
Keeping up with the rapid pace of AI progress and deployment requires quickly importing best practices from more mature industries and immediate investment in auditing pilots, technical research, and policy research. Moving with urgency is essential if frontier AI auditing is to reach maturation and scale alongside AI development.