Small, like-minded countries (the UK, Japan, Canada, and Singapore) could build trust and shared capacity on frontier AI safety through a modest but practical cooperative step. Rather than duplicating the work of the UN or G7, I propose a quarterly evaluation sprint programme.
These four countries already have established AI safety institutes, democratic governance structures, and an interest in coordinating without US or Chinese leadership. Each country would test frontier models that are deployed in its own jurisdiction against an agreed set of safety benchmarks.
Each sprint would focus on one specific risk domain: biosecurity misuse, offensive cyber capabilities, deceptive alignment, or autonomous replication and resource acquisition. Member countries would evaluate their domestic models using standardised protocols developed collectively by the group.
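To make this concrete, here is a minimal sketch of what one sprint's shared protocol specification might look like, written in Python; every field name, task identifier, and threshold value below is an illustrative placeholder rather than an agreed format:

```python
from dataclasses import dataclass


@dataclass
class EvaluationProtocol:
    """Illustrative spec for the protocol all four countries run in one sprint."""
    risk_domain: str        # e.g. "offensive_cyber_capabilities"
    version: str            # agreed revision, so results are comparable
    task_ids: list[str]     # standardised task battery run by every country
    scoring_rubric: str     # how graders map model outputs to scores
    risk_threshold: float   # aggregate score above which a model is flagged


# Hypothetical protocol for one sprint; all values are placeholders.
cyber_protocol = EvaluationProtocol(
    risk_domain="offensive_cyber_capabilities",
    version="2026-Q1",
    task_ids=["vuln_discovery_01", "exploit_chain_02", "lateral_movement_03"],
    scoring_rubric="expert-graded 0-10 scale per task, averaged",
    risk_threshold=6.0,
)
```

The value of fixing such a spec jointly is that a flagged result in Tokyo would mean the same thing as a flagged result in London.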
This design can reduce risk by addressing three coordination challenges. First, it builds shared technical language and standards. Currently, when regulators discuss "dangerous capabilities," they may mean different things operationally. By conducting parallel evaluations with standardised protocols, countries would develop common metrics and risk thresholds, enabling meaningful cross-border policy dialogue. If one country proposes restrictions based on evaluation results, the others can assess whether similar restrictions would apply in their own jurisdictions.
Second, it builds trust without requiring labs to share model weights. This matters because labs tend to resist international cooperation for fear of exposing their IP and losing competitive advantage. Because each country conducts evaluations on its own infrastructure, participants share only three categories of information: evaluation methodologies and prompting strategies, results showing whether models passed or failed the agreed safety thresholds, and lessons about evaluation design.
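As a rough illustration, the record a country shares after a sprint could be as small as the following sketch; the field names are hypothetical, and the point is that model weights and raw transcripts never appear in it:

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SprintReport:
    """Illustrative record of what one country shares after a sprint.
    It carries only the three agreed categories of information:
    methodology, threshold results, and lessons learned."""
    country: str
    risk_domain: str
    methodology_notes: str              # evaluation methods and prompting strategies
    threshold_results: dict[str, bool]  # model identifier -> passed agreed threshold
    lessons_learned: list[str]          # what worked or failed in evaluation design


report = SprintReport(
    country="UK",
    risk_domain="biosecurity_misuse",
    methodology_notes="Red-team prompting graded against a shared rubric by three reviewers.",
    threshold_results={"frontier-model-a": True, "frontier-model-b": False},
    lessons_learned=["Graders disagreed most on dual-use literature retrieval tasks."],
)

# The JSON below is the only artefact that crosses borders: no weights, no raw outputs.
print(json.dumps(asdict(report), indent=2))
```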
Third, done well, the programme builds cumulative institutional knowledge. Each sprint would produce documented failure modes and lessons about which evaluation approaches actually work in practice, preventing the wasteful duplication of each country independently discovering the same problems. It also builds regulatory capacity: agencies develop in-house expertise in frontier model testing rather than remaining structurally dependent on labs' self-assessments. The proposal may also support responsible scaling by reducing labs' uncertainty about international regulatory expectations.
A rotating team of 2-3 researchers from the participating AI safety institutes would be sufficient to coordinate methodology development, manage the publication schedule, and synthesise findings. After a year covering all four risk domains, the coalition would jointly publish a frontier evaluation playbook documenting where evaluation approaches converged, where they diverged and why, and what this reveals about the maturity and reliability of different testing paradigms. The core participants would be the national AI safety agencies in each country (UK AISI, Japan AIST, Canada ISED, Singapore AISG) and domestic frontier labs participating voluntarily. Governments could create incentives through procurement preferences or liability safe harbours for companies that submit to cooperative evaluation.
This evaluation framework could eventually extend to if-then commitments, where countries agree in advance on rules of the form 'if models demonstrate capability X in our standardised tests, then we collectively implement restriction Y.' Such conditional commitments could reduce arms-race dynamics by creating predictable, coordinated responses to capability thresholds.
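A toy sketch of how such a commitment could be checked against standardised sprint results follows; the capability names, thresholds, and responses are invented purely for illustration:

```python
from dataclasses import dataclass


@dataclass
class IfThenCommitment:
    """Illustrative conditional commitment agreed in advance by the coalition."""
    capability: str        # capability measured by the standardised tests
    trigger_score: float   # threshold that activates the commitment
    response: str          # pre-agreed collective restriction


commitments = [
    IfThenCommitment("autonomous_replication", 0.5,
                     "require pre-deployment review in all member jurisdictions"),
    IfThenCommitment("offensive_cyber_capabilities", 0.7,
                     "restrict API access pending joint re-evaluation"),
]


def triggered_responses(scores: dict[str, float]) -> list[str]:
    """Return the pre-agreed responses activated by one sprint's scores."""
    return [c.response for c in commitments
            if scores.get(c.capability, 0.0) >= c.trigger_score]


# Hypothetical sprint results for one model.
print(triggered_responses({"autonomous_replication": 0.62}))
```

The design choice is that the restriction is decided before the capability appears, so no country has to negotiate a response under competitive pressure.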
After two sprints (six months), participant agencies would report whether the protocols proved feasible and whether cross-country comparisons yielded useful insights; this mid-year checkpoint would determine whether the programme continues. After a full year, success would mean completing all four evaluation sprints (one per risk domain) and publishing the joint frontier evaluation playbook. If member countries commit to a second year and at least one additional country joins, that would indicate the project is building trust. Success also requires that labs in participating countries report the process was less burdensome than they expected and that the protocols are technically sound.