AI Psyche benchmark suites

Model behavior benchmarks, backed by transcripts.

AI Psyche publishes suite-specific leaderboards and model sheets. SwayTest is the first suite: it measures whether assistants hold their ground on truth, boundaries, evidence, and advice quality when the user applies pressure.


Models: 6

Scenario variants tested: 36

Runs grouped: 1

Turns: 1296

Suite leaderboard

surrogation-v1-full-r1-t0-mt4096

Full suite from Surrogation 1.0.0: 6 models, 36 scenario variants, 1296 turns.
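The counts quoted for this result set can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated on this page. The variable names are illustrative, and the turns-per-variant value is derived here rather than documented by the suite.

```python
# Figures quoted on this page for the Surrogation 1.0.0 full-suite result set.
models = 6
scenario_variants = 36
total_turns = 1296
turns_per_model = 216  # from the Surrogation suite card: "216 turns/model"

# Derived, not documented: turns per scenario variant for a single model.
assert turns_per_model % scenario_variants == 0
turns_per_variant = turns_per_model // scenario_variants

# The per-model figure and the headline total should agree.
assert models * turns_per_model == total_turns
print(turns_per_variant)  # 6
```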


Suite: Surrogation 1.0.0

Scope: Full suite

Models: 6

Runs: 1

Scenario variants: 36

Temperature: 0.0

Rank	Model	Source run	Surrogation Rate ↓	Under-Pressure Surrogation Rate ↓	Construct Selection Rate	Surrogation Resistance Rate	Refusal Rate	Construct-Protecting Refusal Rate	Generic Refusal Rate	No Extractable Choice Rate
1	anthropic/claude-haiku-4.5	Run 9 (surrogation-v1-full-major-models-batch-1)	1.9	2.2	89.4	98.1	8.8	8.8	0.0	0.0
2	nvidia/nemotron-3-ultra-550b-a55b:free	Run 9 (surrogation-v1-full-major-models-batch-1)	1.9	2.2	91.2	96.8	6.5	5.6	0.9	0.5
3	z-ai/glm-5.2	Run 9 (surrogation-v1-full-major-models-batch-1)	14.8	17.8	73.6	85.2	11.6	11.6	0.0	0.0
4	google/gemini-2.5-flash	Run 9 (surrogation-v1-full-major-models-batch-1)	44.9	53.9	53.7	54.2	0.5	0.5	0.0	0.9
5	amazon/nova-micro-v1	Run 9 (surrogation-v1-full-major-models-batch-1)	60.6	70.6	33.8	34.7	1.4	0.9	0.5	4.2
6	deepseek/deepseek-v3.2	Run 9 (surrogation-v1-full-major-models-batch-1)	75.9	91.1	22.7	24.1	1.4	1.4	0.0	0.0

Surrogation Rate is the primary outcome; lower is better. Construct selection, construct-protecting refusal, generic refusal, and no-choice remain separate outcomes.
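Because lower Surrogation Rate is better, the ranking amounts to an ascending sort on that column. The following is a minimal sketch using the values from the table above. The page does not state how ties are broken (the top two models share a rate of 1.9), so the sketch relies on Python's stable sort and leaves tied rows in their input order; that choice is an assumption, not the site's documented rule.

```python
# (model, surrogation_rate) pairs copied from the leaderboard table.
rows = [
    ("google/gemini-2.5-flash", 44.9),
    ("anthropic/claude-haiku-4.5", 1.9),
    ("deepseek/deepseek-v3.2", 75.9),
    ("nvidia/nemotron-3-ultra-550b-a55b:free", 1.9),
    ("z-ai/glm-5.2", 14.8),
    ("amazon/nova-micro-v1", 60.6),
]

# Ascending sort: the lowest Surrogation Rate ranks first.
# sorted() is stable, so tied rows keep their relative input order.
ranked = sorted(rows, key=lambda row: row[1])

for rank, (model, rate) in enumerate(ranked, start=1):
    print(rank, model, rate)
```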

Comparison groups

Result sets

Result set	Suite	Models	Runs	Tasks	Latest run
surrogation-v1-full-r1-t0-mt4096 (Full suite)	Surrogation 1.0.0	6	1	36	Run 9
flagship-v1-clean (Full suite)	SwayTest	7	3	200	Run 3

Standardized suites

Benchmark suites

Available suite

SwayTest

Flagship 200-probe SwayTest suite for truth, advice integrity, pressure resistance, boundary stability, social truth pressure, and self-report calibration.

200 probes, 1200 turns/model, 6 categories

uv run modelsheet run --suite swaytest --tasks data/seeds/probes_flagship_v1.yaml --models model/a model/b --max-cost-usd 5

Available suite

Surrogation

36-probe surrogation suite (18 scenarios x 2 counterbalanced A/B orderings) measuring whether a model recommends the goal-optimizing action over the metric-optimizing action, and holds that recommendation under social pressure.

36 scenario variants, 216 turns/model, 1 category

uv run modelsheet run --suite surrogation --tasks data/seeds/probes_surrogation_v1.yaml --models model/a model/b --max-cost-usd 5
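The "18 scenarios x 2 counterbalanced A/B orderings" design in the Surrogation description can be illustrated as follows. This is a sketch of counterbalancing in general, not the suite's actual generator: the function name, dictionary keys, and placeholder scenario texts are all hypothetical.

```python
def ab_orderings(goal_action, metric_action):
    """Return both A/B orderings of one scenario (hypothetical structure).

    Presenting each scenario once with the goal-optimizing action as
    option A and once as option B means a positional preference for
    "A" or "B" cannot masquerade as a preference for either action.
    """
    return [
        {"A": goal_action, "B": metric_action, "goal_label": "A"},
        {"A": metric_action, "B": goal_action, "goal_label": "B"},
    ]

# 18 placeholder scenarios; the real probe texts live in the suite's seed file.
scenarios = [
    (f"goal action {i}", f"metric action {i}") for i in range(1, 19)
]

variants = [
    variant
    for goal, metric in scenarios
    for variant in ab_orderings(goal, metric)
]

print(len(variants))  # 36
```

Under this layout the goal-optimizing action appears as option A in exactly half of the variants, which is the property counterbalancing is meant to guarantee.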

Planned suite

Creativity Frontier

Experimental suite for high-quality novelty, idea frontiers, and cohort-relative creativity metrics after the harness split.

CFS headline, Phase 2 corpus scoring, Experimental status

docs/plans/creativity_frontier_benchmark.md

Question catalog

Explore the probes


Metric-goal surrogation (36 probes): Does the model optimize the underlying goal rather than a metric that has become misleading?

ai_evaluation

surrogation_a_exp_min_ma

Read the following situation and recommend a course of action. The Saxifrage research group states that what it ultimately cares about is their summarization model preserving the...