Calculated Risks: Redefining AI Safety Testing

May 8, 2025

The Limits of Current Testing Approaches

AI capabilities are changing, and our methods of testing must change with them - yet we still rely on deterministic methods to test generative systems.

As AI systems grow increasingly powerful, the inadequacies of our standard safety approaches become more concerning. Current practices rely on two primary methodologies, both with critical limitations:

Evaluation Frameworks: these attempt to estimate safety by testing models against a finite set of predetermined scenarios. While intuitive, the approach suffers from fundamental shortcomings:

  • Finite Sampling: Even with thousands of test cases, evals examine only an infinitesimal fraction of the possible inputs an AI might encounter.

  • No Universal Guarantees: Passing every test in an evaluation suite offers little mathematical assurance about behaviour on untested inputs.

  • Distribution Blindness: Test sets inevitably fail to represent the true distribution of scenarios in deployment environments.

  • Gaming the System: Models can inadvertently learn to pass specific evaluation patterns without developing genuine safety properties.

Red-Teaming: this employs skilled adversaries to probe for vulnerabilities and safety failures. While more dynamic than static evaluations, red-teaming still faces severe constraints:

  • Human Limitations: Even the most creative red-teamers cannot explore more than a tiny subset of possible system behaviours.

  • Resource Intensity: The process scales poorly, requiring significant expert time for each testing iteration.

  • Qualitative Results: Red-team findings typically provide qualitative insights rather than quantifiable safety bounds.

  • Moving Target Problem: As models become more capable, they develop increasingly subtle failure modes that human red-teamers struggle to anticipate.

These approaches share a common flaw: they provide merely anecdotal evidence of safety rather than rigorous guarantees. When a model passes evaluations and red-team exercises, we gain some confidence—but critically, no certainty—about its behaviour in the infinitely large space of untested scenarios.

Mathematical Risk Modelling

The fundamental shift needed in AI safety is moving from statements like "we tested 10,000 inputs and found no problems" to "we can mathematically prove that the probability of harmful behaviour is less than 0.001% across all possible inputs in the deployment environment."

This transformation represents a leap from inductive reasoning (generalising from examples) to deductive reasoning (proving properties across entire domains). It's the difference between estimating safety through sampling and establishing safety through verification.
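To see how weak the sampling-based statement really is, consider a minimal sketch in Python. Assuming the test inputs are drawn independently from the deployment distribution and none of them fails, the strongest distribution-free claim available is a confidence bound on the failure rate (the classical "rule of three", with a Hoeffding-style bound for the general case); the helper function and numbers below are purely illustrative.

```python
import math

def failure_rate_upper_bound(num_tests: int, num_failures: int = 0,
                             confidence: float = 0.95) -> float:
    """Distribution-free upper bound on the true failure rate after observing
    `num_failures` in `num_tests` independent samples from one distribution."""
    if num_failures == 0:
        # Classical "rule of three": p <= -ln(1 - confidence) / n.
        return -math.log(1.0 - confidence) / num_tests
    # Conservative Hoeffding-style bound for the general case.
    p_hat = num_failures / num_tests
    slack = math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2 * num_tests))
    return min(1.0, p_hat + slack)

# "We tested 10,000 inputs and found no problems" only licenses roughly:
print(failure_rate_upper_bound(10_000))  # ~0.0003, i.e. 0.03% at 95% confidence
```

Even a perfectly clean run of 10,000 tests therefore leaves us well short of the 0.001% bound quoted above, and the guarantee applies only to the distribution the tests were sampled from, not to all possible inputs.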

Why Risk Modelling Provides Superior Testing

Rather than testing individual scenarios, mathematical verification allows us to make provable statements about all possible scenarios within defined parameters. This addresses the fundamental limitation of sampling-based approaches: their inability to generalise safety from a finite set of tests to the infinite space of possible AI behaviours.
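As a toy illustration of what "provable statements about all possible scenarios within defined parameters" can look like, the sketch below uses interval arithmetic to bound the output of a hypothetical linear controller over an entire input region in one step. This is a deliberately simplified stand-in for real verification tooling; the controller gains and safety limits are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """A closed interval [lo, hi], used to reason about all values at once."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

def verified_within(output: Interval, limit: Interval) -> bool:
    """True only if every possible output lies inside the safe limit."""
    return limit.lo <= output.lo and output.hi <= limit.hi

# Hypothetical linear controller: command = 0.8 * error + 0.1 * error_rate.
error = Interval(-2.0, 2.0)       # every error value the specification allows
error_rate = Interval(-1.0, 1.0)
command = Interval(0.8, 0.8) * error + Interval(0.1, 0.1) * error_rate

# One check covering the entire input region, not a sample of points.
print(verified_within(command, limit=Interval(-2.0, 2.0)))  # True: |command| <= 1.7
```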

Mathematical risk modelling forces explicit consideration of different types of uncertainty:

  • Aleatory uncertainty: Inherent randomness in the system

  • Epistemic uncertainty: Limitations in our knowledge

  • Distributional uncertainty: Uncertainty about which probability distributions accurately model the world

By representing these uncertainties formally, we can derive bounds that account for multiple types of unknowns. Instead of the binary "safe/unsafe" classification that traditional testing provides, mathematical risk modelling offers quantitative measurements of safety. This allows for more nuanced decision-making about deployment based on specific risk tolerances for different applications.
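The sketch below illustrates, with invented numbers, how these kinds of uncertainty can be folded into one conservative bound: per-scenario failure probabilities are given as ranges (epistemic), each scenario carries its own inherent randomness (aleatory), and the bound is taken over several candidate deployment mixes (distributional). The scenario names and figures are purely illustrative.

```python
# Hypothetical per-scenario probabilities of unsafe behaviour, given as
# (low, high) ranges to reflect epistemic uncertainty about each estimate.
failure_prob_range = {
    "nominal":     (1e-6, 5e-6),
    "degraded":    (1e-4, 4e-4),
    "adversarial": (1e-3, 3e-3),
}

# Candidate deployment mixes we cannot distinguish between (distributional).
candidate_mixes = [
    {"nominal": 0.97, "degraded": 0.025, "adversarial": 0.005},
    {"nominal": 0.90, "degraded": 0.08,  "adversarial": 0.02},
]

def worst_case_risk(prob_ranges, mixes) -> float:
    """Upper bound on P(unsafe) that holds for every candidate mix and every
    parameter value inside the epistemic ranges."""
    return max(
        sum(weight * prob_ranges[scenario][1] for scenario, weight in mix.items())
        for mix in mixes
    )

print(f"P(unsafe) <= {worst_case_risk(failure_prob_range, candidate_mixes):.2e}")
```

A bound derived this way can then be compared directly against the risk tolerance of a specific application, which is exactly the kind of nuanced deployment decision that a binary pass/fail test cannot support.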

The Gatekeeper Architecture

Frontier research has introduced the "gatekeeper" approach, which implements mathematical risk modelling through three critical components:

  1. World Model Development: Using frontier AI to help domain experts build mathematical models of real-world environments where AI will operate, capturing the complex dynamics and safety-critical aspects.

  2. Probabilistic Verification: Leveraging AI to analyse these models and establish quantitative bounds on the probability of unsafe outcomes, accompanied by formal proof certificates.

  3. Verifiable AI Systems: Training specialised AI systems that can be mathematically proven to operate within safety specifications, with backup controllers that activate when verification fails.

This architecture enables the crucial shift from sampling-based testing to formal verification across the entire domain of operation.
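A highly simplified sketch of how the three components might fit together at runtime is shown below. The interfaces (ProofCertificate, WorldModel, gatekeeper) are hypothetical stand-ins for the machinery described in the Safeguarded AI programme thesis, not its actual implementation; the point is only the control flow: an action is admitted when it carries a certificate whose verified risk bound meets the risk tolerance, and the backup controller takes over otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class WorldModel:
    """Mathematical model of the deployment environment (component 1)."""
    name: str

@dataclass
class ProofCertificate:
    """Hypothetical stand-in for a formal certificate from the probabilistic
    verification step (component 2): a checkable bound on P(unsafe)."""
    risk_bound: float

    def check(self, world_model: WorldModel, action: str) -> bool:
        # A real system would replay a machine-checkable proof against the
        # world model; here we simply assume the certificate validates.
        return True

def gatekeeper(world_model: WorldModel,
               propose: Callable[[], Tuple[str, Optional[ProofCertificate]]],
               backup: Callable[[], str],
               risk_tolerance: float) -> str:
    """Admit the proposed action only if it carries a certificate whose
    verified risk bound meets the tolerance; otherwise use the backup."""
    action, certificate = propose()
    if (certificate is not None
            and certificate.check(world_model, action)
            and certificate.risk_bound <= risk_tolerance):
        return action
    return backup()  # verification failed: hand over to the backup controller

# Hypothetical usage for a grid-balancing task.
model = WorldModel("electricity-grid")
chosen = gatekeeper(model,
                    propose=lambda: ("dispatch-reserve", ProofCertificate(1e-5)),
                    backup=lambda: "hold-current-setpoints",
                    risk_tolerance=1e-4)
print(chosen)  # dispatch-reserve
```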

From Theory to Practice

Even organisations not ready to implement the full gatekeeper architecture can begin incorporating elements of mathematical risk modelling into their testing frameworks:

  1. Develop formal specifications of safety properties that go beyond test cases to express invariants that should hold across all inputs.

  2. Incorporate bounded verification for critical components, even if verifying the entire system is not yet feasible.

  3. Use runtime monitoring based on formally specified properties to detect when systems operate outside verified parameters (see the sketch after this list).

  4. Build world models of deployment environments that capture the dynamics and constraints relevant to safety.

  5. Implement backup controllers that can take over when primary systems encounter situations outside their verified operating parameters.
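As a minimal illustration of items 1, 3 and 5, the sketch below expresses safety properties as explicit predicates over the system state and checks them on every step, handing control to a backup action whenever an invariant is violated. The invariants, thresholds and action names are hypothetical.

```python
from typing import Callable, Dict

# Formally specified safety properties (item 1), written here as predicates
# over the system state; names and thresholds are invented for the example.
INVARIANTS: Dict[str, Callable[[dict], bool]] = {
    "frequency_in_band": lambda s: 49.5 <= s["grid_frequency_hz"] <= 50.5,
    "reserve_margin":    lambda s: s["reserve_mw"] >= 200.0,
}

def monitor_step(state: dict, primary_action: str, backup_action: str) -> str:
    """Runtime monitor (item 3): apply the primary system's action only while
    every invariant holds; otherwise fall back to the backup (item 5)."""
    violated = [name for name, holds in INVARIANTS.items() if not holds(state)]
    if violated:
        print(f"invariant(s) violated: {violated} -> switching to backup")
        return backup_action
    return primary_action

# A state outside the verified operating envelope triggers the backup.
state = {"grid_frequency_hz": 50.7, "reserve_mw": 350.0}
print(monitor_step(state, primary_action="follow-ai-setpoint",
                   backup_action="revert-to-conservative-schedule"))
```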

Economic Benefits Beyond Safety

Importantly, this approach isn't just about reducing catastrophic risks. Mathematical risk modelling could unlock significant economic value in applications where reliability is paramount - from electricity grid management and healthcare triage to air traffic control and supply chain optimisation.

These systems would enable deployment of powerful AI in contexts where current uncertainty about behaviour makes it too risky, potentially delivering billions in economic value while simultaneously building resilience against more extreme risks.

The Path Forward

Achieving this vision requires integration across several technical domains:

  • Advanced mathematical frameworks for representing real-world systems

  • Fine-tuned frontier AI models that can produce verifiable certificates

  • Training methods that optimise for both performance and formal verifiability

  • Sociotechnical processes for collective deliberation about acceptable risks


Conclusion

As AI capabilities continue to advance, mathematical risk modelling represents not just an incremental improvement in safety measures, but a fundamental reimagining of how we ensure AI systems behave as intended. By moving from statistical sampling to mathematical verification, we may finally bridge the gap between the immense potential of AI and the certainty required for its most critical applications.

The question facing organisations is no longer whether their AI systems pass a finite set of tests, but whether they can provide quantifiable guarantees about safety across their entire operational domain. Those who make this transition will not only create safer systems but unlock applications in domains where current methods simply cannot provide sufficient assurance.

For more detail, see the ARIA Safeguarded AI programme thesis: https://www.aria.org.uk/media/3nhijno4/aria-safeguarded-ai-programme-thesis-v1.pdf