Methodology

AI IQ assigns each model an estimated IQ score across 6 equally weighted capability dimensions, each measured by source-backed benchmarks where coverage exists. Every benchmark uses a calibrated ladder: what score would correspond to IQ 70, 85, 100, 115, 130, 145, and 160 on that task? One source-backed benchmark is sufficient to estimate a dimension; additional coverage increases confidence. Missing benchmarks and dimensions are conservatively imputed inside the scoring pipeline, and every derived composite averages all six dimensions.

This page documents the full scoring system: how source data is captured, how the six dimensions are defined, how raw scores map to IQ, and how missing values are imputed without changing source-backed benchmark data. Every dimension value remains an estimate even at complete benchmark coverage. Emotional Reasoning (EQ) is a separate applied-capability domain and is excluded from Composite IQ.

The 6-Dimension Framework

AI IQ organizes Composite IQ evaluation into six equally weighted dimensions. Each dimension averages its source-backed benchmarks, with missing benchmarks conservatively imputed only inside the scoring pipeline.

Frontier benchmarks are hard, low-gameability tests where a large score gain at the frontier usually represents a large capability gain.
Saturating benchmarks are easier, more gameable, or already clustered near the top. Their ladders still reach IQ 160 where appropriate, but the score expectations rise quickly near the high end so a merely high score does not overstate ability.

The composite IQ requires at least 2 of 6 scored dimensions to have source-backed data. Models with fewer scored dimensions do not receive a derived composite IQ.

Abstract Reasoning

ARC-AGI-3, ARC-AGI-2, ARC-AGI-1 (historical)

Mathematical Reasoning

FrontierMath T4, FrontierMath T1–3, ProofBench, MathArena, AIME (saturating)

Academic Reasoning

Humanity's Last Exam, GPQA Diamond, CritPt, SciCode, MMLU-Pro, MMMU-Pro

Programmatic Reasoning

LiveCodeBench, IOI, Terminal-Bench 2.1, ProgramBench (Almost Resolved metric), FrontierSWE, SWE-rebench

Computer Use

BrowseComp, OSWorld-Verified, Toolathlon, MCP Atlas, Arena.ai Agent Arena, Agents' Last Exam

Reliability

SimpleQA Verified, AA Omniscience, BullshitBench v2, IFBench, MultiChallenge, AA Long Chain Reasoning, FACTS Grounding

Formulas

Each benchmark raw score $s$ is converted to an IQ value by comparing it with that benchmark's expected score at fixed IQ levels. Let $S(q_i)$ be the expected raw score for IQ level $q_i$, where $q_i \in \{70,85,100,115,130,145,160\}$. For a score between two adjacent expected scores, AI IQ uses piecewise-linear interpolation:

$$f(s) = q_i + \frac{s - S(q_i)}{S(q_{i+1}) - S(q_i)}\,(q_{i+1} - q_i), \qquad S(q_i) \le s \le S(q_{i+1})$$
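
A minimal Python sketch of this interpolation, using the ARC-AGI-2 ladder from the D1 section as data. The function name `score_to_iq` and the clamping of scores outside the ladder to IQ 70 and IQ 160 are assumptions made for illustration; this page does not specify behavior beyond the end anchors.

```python
from bisect import bisect_right

IQ_LEVELS = [70, 85, 100, 115, 130, 145, 160]

# Expected raw scores S(q_i) for ARC-AGI-2, copied from the D1 table.
ARC_AGI_2 = [0, 20, 60, 75, 92, 110, 135]


def score_to_iq(score, expected, levels=IQ_LEVELS):
    """Piecewise-linear interpolation f(s) through an expected-score ladder.

    Assumption: scores outside the ladder are clamped to the end anchors.
    The ladder must be strictly increasing.
    """
    if score <= expected[0]:
        return float(levels[0])
    if score >= expected[-1]:
        return float(levels[-1])
    # Index i such that expected[i] <= score < expected[i + 1].
    i = bisect_right(expected, score) - 1
    lo, hi = expected[i], expected[i + 1]
    return levels[i] + (score - lo) / (hi - lo) * (levels[i + 1] - levels[i])


print(score_to_iq(60, ARC_AGI_2))    # 100.0
print(score_to_iq(67.5, ARC_AGI_2))  # 107.5
```

A score of 67.5 sits halfway between the IQ 100 anchor (60) and the IQ 115 anchor (75), so it maps to 107.5.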

Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using the benchmark-level waterfall below, except in Abstract Reasoning once any source-backed ARC result exists. In that case the dimension averages only the available source-backed, jointly calibrated ARC projections, so that newly revealed coverage causes the smallest justified change in the point estimate.

The six equally weighted scored dimensions:

$$\begin{array}{l l} \mathrm{IQ}_{\text{Abstract}} & = \operatorname{avg}_{b \in A_{\text{source}}}\! f_b(s_b),\quad A=\{\text{ARC-AGI-3},\text{ARC-AGI-2},\text{ARC-AGI-1}\} \\[6pt] \mathrm{IQ}_{\text{Math}} & = \operatorname{avg}\!\left(f(\text{FrontierMath T4}),\; f(\text{FrontierMath T1-3}),\; f(\text{ProofBench}),\; f(\text{MathArena}),\; f(\text{AIME})\right) \\[6pt] \mathrm{IQ}_{\text{Academic}} & = \operatorname{avg}\!\left(f(\text{HLE}),\; f(\text{GPQA}),\; f(\text{CritPt}),\; f(\text{SciCode}),\; f(\text{MMLU-Pro}),\; f(\text{MMMU-Pro})\right) \\[6pt] \mathrm{IQ}_{\text{Programmatic}} & = \operatorname{avg}\!\left(f(\text{LiveCodeBench}),\; f(\text{IOI}),\; f(\text{Terminal-Bench 2.1}),\; f(\text{ProgramBench}),\; f(\text{FrontierSWE}),\; f(\text{SWE-rebench})\right) \\[6pt] \mathrm{IQ}_{\text{Computer}} & = \operatorname{avg}\!\left(f(\text{BrowseComp}),\; f(\text{OSWorld-Verified}),\; f(\text{Toolathlon}),\; f(\text{MCP Atlas}),\; f(\text{Agent Arena}),\; f(\text{Agents' Last Exam})\right) \\[6pt] \mathrm{IQ}_{\text{Reliability}} & = \operatorname{avg}\!\left(f(\text{SimpleQA Verified}),\; f(\text{AA Omniscience}),\; f(\text{BullshitBench v2}),\; f(\text{IFBench}),\; f(\text{MultiChallenge}),\; f(\text{AA-LCR}),\; f(\text{FACTS Grounding})\right) \end{array}$$

$f$ is piecewise-linear interpolation through each benchmark's expected-score ladder, and $A_{\text{source}} \subseteq A$ is the subset of ARC benchmarks with source-backed results.

The composite IQ is the mean of all six scored dimension scores. At least 2 dimensions must be source-backed or predecessor-imputed before any missing whole dimensions are filled:

$$\boxed{\;\mathrm{IQ} = \frac{1}{6}\!\left(\mathrm{IQ}_{\text{Abstract}} + \mathrm{IQ}_{\text{Math}} + \mathrm{IQ}_{\text{Academic}} + \mathrm{IQ}_{\text{Programmatic}} + \mathrm{IQ}_{\text{Computer}} + \mathrm{IQ}_{\text{Reliability}}\right), \qquad n_{\text{scored}} \ge 2\;}$$
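
The composite rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `composite_iq`, the `scored` argument, and the example values are hypothetical, and the imputation that fills missing dimensions is not modeled here (the dictionary is assumed to already hold all six values).

```python
from statistics import mean

DIMENSIONS = ("Abstract", "Math", "Academic",
              "Programmatic", "Computer", "Reliability")


def composite_iq(dimension_iq, scored):
    """Mean of all six dimension IQs, or None when fewer than 2 are scored.

    dimension_iq: dict mapping every dimension name to its IQ estimate
                  (already imputed where needed; imputation is not shown).
    scored:       set of dimension names that count toward n_scored.
    """
    if len(set(scored) & set(DIMENSIONS)) < 2:
        return None  # no derived composite for this model
    return mean(dimension_iq[d] for d in DIMENSIONS)


# Hypothetical dimension values, for illustration only.
values = {"Abstract": 110, "Math": 125, "Academic": 120,
          "Programmatic": 130, "Computer": 115, "Reliability": 120}
print(composite_iq(values, {"Math", "Academic"}))  # 120
print(composite_iq(values, {"Math"}))              # None
```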

D1: Abstract Reasoning

Fluid abstraction is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to fluid intelligence in human psychometrics — raw problem-solving ability applied to patterns never seen before.

ARC-AGI-2 Frontier

Format: Visual grid puzzles (novel patterns)

Tasks: Unique visual pattern completion

Gameability: Essentially Ungameable

Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. ARC reports that human testing used hundreds of general-public, non-expert participants, with average performance around 60–66%. ARC-AGI-2 is the reference curve for the jointly calibrated ARC family: 60% remains IQ 100, while the upper tail is compressed so newly revealed ARC-AGI-1/2/3 coverage produces stable Abstract IQ estimates.

IQ	70	85	100	115	130	145	160
Expected score	0	20	60	75	92	110	135

ARC-AGI-3 Hard

Format: Interactive adaptation tasks

Tasks: Novel environments with feedback

Gameability: Low, but post-training-sensitive

IQ Ceiling: 160 off-scale

ARC-AGI-3 extends the ARC family into interactive environments where an agent must adapt from feedback. Its Relative Human Action Efficiency score combines completion and action efficiency. Because strong frontier models cluster near the source floor, raw percentages are not comparable to ARC-AGI-1/2 percentages. The jointly fitted ladder uses overlapping model coverage: 0.3% maps to IQ 115, 7.8% to IQ 130, and 100% remains just below IQ 145. Negative anchors are off-scale interpolation guards; actual source scores remain bounded at zero.

IQ	70	85	100	115	130	145	160
Expected score	-10	-5	0	0.3	7.8	115	160
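
As a worked check of this ladder, a 100% score falls between the IQ 130 anchor (7.8) and the off-scale IQ 145 anchor (115), so linear interpolation gives:

$$f(100) = 130 + \frac{100 - 7.8}{115 - 7.8}\,(145 - 130) = 130 + \frac{92.2}{107.2}\cdot 15 \approx 142.9$$

This is consistent with the statement that 100% remains just below IQ 145.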

ARC-AGI-1 Saturating

Format: Visual grid puzzles

Gameability: Ungameable (but saturating)

Same format as ARC-AGI-2 but an older, easier, and increasingly saturated problem set. Its projection is cross-calibrated against overlapping ARC-AGI-2 coverage instead of independently treating its human-panel average as a complete psychometric conversion. The steep upper tail requires 94% for IQ 115 and 97.5% for IQ 130, preventing mature ARC-AGI-1 scores from dominating later, harder ARC evidence.

IQ	70	85	100	115	130	145	160
Expected score	0	74	87	94	97.5	115	140

D2: Mathematical Reasoning

Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data. Human references differ by benchmark: FrontierMath uses specialist and expert-team math references, ProofBench uses formalization ability, and MathArena/AIME use competition-math populations rather than the general public.

FrontierMath T4 Frontier

Format: Novel research-level math problems

Tier: 4 (research-level)

Gameability: Very Low

Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data, and Epoch describes this split as research-level mathematics outside the Tiers 1–3 human-baseline competition. Epoch's June 2026 v2 release corrected or removed many invalid or misgraded problems, so the raw score scale is materially higher than v1. The human-reference guess still treats ordinary non-specialists and most non-matched mathematicians as near zero, but now places relevant specialists in the single-to-low-double-digit range and broad expert groups as the high-end reference. The v2 ladder keeps 45% around IQ 130, about 90% around IQ 145, and off-scale headroom above 100% for IQ 160.

IQ	70	85	100	115	130	145	160
Expected score	0	2	7	20	45	90	120

FrontierMath T1–3 Frontier

Format: Novel advanced math problems

Tier: 1–3

Gameability: Very Low

FrontierMath Tier 1–3 covers difficult novel math problems below the research-level Tier 4 split. Epoch's human-baseline competition used exceptional math undergraduates and subject-matter experts, but the June 2026 v2 release corrected or removed enough problems that the old raw-percent anchor cannot be reused directly. The v2 ladder moves the expert-reference region upward, keeping 62% at IQ 130, the mid-90s around IQ 145, and off-scale headroom for stronger future performance.

IQ	70	85	100	115	130	145	160
Expected score	0	8	18	36	62	95	125

ProofBench Frontier

Format: Formal proof construction

Tasks: Verified theorem proving

Gameability: Low

ProofBench tests whether a model can construct formally verified mathematical proofs rather than only solve for a final answer. It is a different cognitive task from competition math because the Lean 4 output must satisfy a verifier, with no partial credit for plausible but invalid reasoning. The human-reference guess treats ordinary mathematicians without Lean as near zero, competent Lean users as the midrange reference, and strong formalizers as the high-end reference. The current ladder is exponential at the upper tail and keeps perfect source-range performance from automatically exhausting future headroom.

IQ	70	85	100	115	130	145	160
Expected score	0	2	6	14	30	60	120

MathArena Frontier

Format: Competition math expected performance

Scale: IRT-derived percent expected correctness

Gameability: Low

MathArena estimates each model's expected performance across non-deprecated math competitions using item-response-theory calibration. It is treated as a broad Mathematical Reasoning signal because it aggregates many competition-style questions while preserving difficulty information instead of only counting raw solve rate. The human-reference guess uses qualified contest participants through strong olympiad/Putnam-style humans, not ordinary non-specialists. The ladder keeps top frontier scores in the high-IQ range while requiring stronger midrange performance before awarding IQ 130+ credit. Source rows are mapped only when the model variant clearly matches a canonical dataset row; duplicate lower-reasoning configurations and source-only variants are skipped until the dataset adds an explicit matching model entry.

IQ	70	85	100	115	130	145	160
Expected score	0	5	15	32	55	75	95

AIME Saturating

Format: Integer answers (0–999)

Questions: 15 per exam

Gameability: High

Competition mathematics with integer answers. AIME is a 15-question, 3-hour exam for high AMC scorers. The ladder treats average AIME-qualifier-level performance as around IQ 100 and a score around 13.8/15 as IQ 130 because public human comparisons place 13.9/15 around top-500 nationally and above the USAMO cutoff. Old AIME problems are widely available in training data and frontier models are near saturation, so IQ 145 and 160 remain off-scale: a perfect score does not automatically imply either level on this benchmark.

IQ	70	85	100	115	130	145	160
Expected score	-15	5	30	65	92	112	132
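
As a worked check of this ladder, a perfect 15/15 (100%) falls between the IQ 130 anchor (92) and the off-scale IQ 145 anchor (112):

$$f(100) = 130 + \frac{100 - 92}{112 - 92}\,(145 - 130) = 130 + \frac{8}{20}\cdot 15 = 136$$

A perfect AIME score therefore projects to IQ 136 under linear interpolation, below both off-scale anchors.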

D3: Academic Reasoning

Academic Reasoning measures expert knowledge, research-style analysis, and difficult problem solving across academic fields. Its scored set uses Humanity's Last Exam, GPQA Diamond, CritPt, SciCode, MMLU-Pro, and MMMU-Pro. CritPt and SciCode remain here because their bottleneck is high-level scientific reasoning, even when mathematics or code is the medium.

Finance Domain

Finance is an applied domain outside Composite IQ. It measures financial analysis, long-document interpretation, spreadsheet modeling, banking operations, multimodal mortgage-tax extraction, and tax reasoning through Finance Agent v2, CorpFin v2, the Excel Modeling Benchmark, τ³-Banking, MortgageTax, and TaxEval v2.

Finance IQ balances three capability groups rather than averaging its scored benchmarks flat: Financial Analysis; Documents and Modeling; and Banking, Tax, and Operations. One source-backed scored benchmark is sufficient to produce a provisional estimate, matching the eligibility convention used by other applied domains. Missing scored benchmarks are filled at the lower quartile of capability-matched models, capped at the model's observed Finance benchmark average, before all three group means are weighted equally. This mirrors the conservative missing-coverage principle used by IQ dimensions and prevents absent results from improving a model's score. Coverage remains a confidence signal, not an eligibility gate. τ³-Banking remains visible as a diagnostic system benchmark, but is excluded from Finance IQ because retrieval strategy, reasoning effort, and agent scaffolding currently vary too much for clean model-level comparison.
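
The fill-and-weight rule described above can be sketched as follows. Several details are assumptions made for illustration and are not stated on this page: the assignment of benchmarks to the three groups, the use of IQ-scale values for both the quartile and the cap, the quartile method (Python's default for `statistics.quantiles`), and all example numbers.

```python
from statistics import mean, quantiles

# Hypothetical group membership: the page names the three groups but does
# not list which scored benchmark belongs to which group.
GROUPS = {
    "Financial Analysis": ["Finance Agent v2"],
    "Documents and Modeling": ["CorpFin v2", "Excel Modeling"],
    "Banking, Tax, and Operations": ["MortgageTax", "TaxEval v2"],
}


def finance_iq(observed, peer_values, groups=GROUPS):
    """Equal-weight mean of three group means with conservative fill.

    observed:    dict benchmark -> IQ value for this model (source-backed).
    peer_values: dict benchmark -> IQ values of capability-matched models
                 (at least two values per benchmark).
    """
    if not observed:
        return None  # one source-backed benchmark is the minimum
    cap = mean(observed.values())  # the model's observed average
    filled = dict(observed)
    for names in groups.values():
        for bench in names:
            if bench not in filled:
                lower_quartile = quantiles(peer_values[bench], n=4)[0]
                # The cap keeps an absent result from raising the score.
                filled[bench] = min(lower_quartile, cap)
    group_means = [mean(filled[b] for b in names) for names in groups.values()]
    return mean(group_means)


# Hypothetical inputs, for illustration only.
observed = {"Finance Agent v2": 118, "CorpFin v2": 124}
peers = {
    "Excel Modeling": [100, 110, 120, 130],
    "MortgageTax": [125, 130, 135, 140],
    "TaxEval v2": [100, 110, 120, 130],
}
print(round(finance_iq(observed, peers), 2))  # 114.33
```

In this example the MortgageTax peer quartile (126.25) exceeds the model's observed average (121), so the cap applies and the fill is 121.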

These are model-plus-system evaluations. Finance Agent and Excel Modeling depend materially on tools and agent scaffolds; CorpFin and TaxEval use LLM judging. The July 24 calibration uses conservative occupational reference points rather than fitting the current model distribution: ordinary non-specialists sit in the low anchors, careful generalists or trainees around IQ 100, capable practitioners around IQ 115–130, and near-reliable expert work around IQ 145. Because matched professional-human score distributions are not published, every Finance projection remains provisional, sparse, or high-risk.

Benchmark / IQ	70	85	100	115	130	145	160
Finance Agent v2	0	5	20	45	70	90	105
CorpFin v2	5	20	45	72	90	98	108
Excel Modeling	0	5	20	70	90	98	108
τ³-Banking (diagnostic)	0	5	12	25	50	75	95
MortgageTax	10	35	70	90	97	100	110
TaxEval v2	5	20	45	75	90	98	110

WebDev & Design Domain

WebDev & Design is an applied domain outside Composite IQ. It combines coding with product reasoning, user experience, interaction design, visual-spatial layout, and full-stack implementation.

Methodology caveat: the current set is intentionally focused on greenfield app and web-artifact quality, and its inputs are not four fully independent sub-skills. Arena.ai WebDev, DesignArena Frontend, and DesignArena Full Stack all contain overlapping product, UI, UX, prompt-following, and implementation-quality signals. Vibe Code Bench adds a more task-suite-like app-building signal, but this domain should still be read as a practical product-building cluster rather than a rigorously factor-separated measurement. Broader coverage across product architecture, user-intent inference, accessibility, maintainability, and long-horizon iteration would make the domain stronger.

Arena.ai WebDev Frontier

Format: Front-end web development arena

Scale: Bradley-Terry Elo

Gameability: Low, but prompt/category-sensitive

Arena.ai WebDev is included as an interface-heavy app-building signal because it evaluates end-to-end web development rather than pure algorithmic coding. It is not a pure design benchmark: the score mixes implementation quality, visual taste, layout, UX, and prompt following. That blend is useful because real software model choice often depends on the whole shipped artifact, not just whether the code compiles. As a best-guess human reference, 1300 Elo is treated as average-useful web artifact quality in this arena, with 1500+ representing strong production-like work.

IQ	70	85	100	115	130	145	160
Expected Elo	1100	1200	1300	1400	1500	1620	1800

DesignArena Frontend Frontier

Format: Agentic frontend web-app arena

Scale: Bradley-Terry Elo

Gameability: Low, but prompt/category-sensitive

DesignArena Frontend measures multi-file React front-end apps generated by agentic coding systems and judged through real user pairwise comparisons. It is included as a greenfield app-building signal distinct from repository repair: the score reflects whether the model can turn product prompts into usable rendered front-end experiences.

IQ	70	85	100	115	130	145	160
Expected Elo	1000	1060	1120	1180	1270	1390	1560

DesignArena Full Stack Frontier

Format: End-to-end frontend + backend build arena

Scale: Bradley-Terry Elo

Gameability: Low, but prompt/category-sensitive

DesignArena Full Stack is a human-preference arena for whole-stack builds: it scores frontend interactivity and layout alongside backend work (Supabase schema design, data seeding, API functionality, CRUD, auth, end-to-end persistence, error handling). It complements Arena.ai WebDev's front-end focus with a broader shipped-artifact signal. Its Elo scale runs lower and tighter than Arena.ai WebDev's, so the ladder is fit to its own distribution rather than reusing WebDev's absolute-Elo anchors. As a best-guess reference, 1120 Elo is average-useful full-stack artifact quality, 1280 is strong end-to-end app quality, and 1400+ is frontier-level execution.

IQ	70	85	100	115	130	145	160
Expected Elo	1000	1060	1120	1190	1280	1400	1540

Vibe Code Bench v1.1 Frontier

Format: Web apps from natural-language specs

Scale: Accuracy / task success

Gameability: Lower, but harness-dependent

Vibe Code Bench v1.1 measures whether models can build web applications from scratch from product-style specifications in an agentic environment. It rounds out the domain by complementing the human-preference DesignArena signals with a task-success benchmark.

IQ	70	85	100	115	130	145	160
Expected score	0	5	15	30	50	75	100

Legal Domain

Legal is an applied domain outside Composite IQ. Legal IQ equally weights three capabilities after projecting each benchmark onto the AI IQ scale: broad legal reasoning, multi-source legal research, and long-horizon production of professional legal work. A model needs source-backed results on at least two of the three benchmarks; missing values are not imputed.
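
A short sketch of the eligibility and weighting rule. It assumes that when only two of the three benchmarks are available, Legal IQ is the plain mean of those two. The page says missing values are not imputed but does not spell out the arithmetic, so this reading, the function name `legal_iq`, and the example values are hypothetical.

```python
from statistics import mean

LEGAL_BENCHMARKS = ("LegalBench", "Legal Research Bench", "Harvey LAB")


def legal_iq(benchmark_iq):
    """Equal-weight Legal IQ from projected benchmark IQs, with no imputation.

    benchmark_iq: dict benchmark -> projected IQ; benchmarks without a
                  source-backed result are simply absent.
    """
    available = [benchmark_iq[name] for name in LEGAL_BENCHMARKS
                 if name in benchmark_iq]
    if len(available) < 2:
        return None  # not eligible: fewer than two source-backed results
    return mean(available)


# Hypothetical projected IQs, for illustration only.
print(legal_iq({"LegalBench": 130, "Harvey LAB": 115}))  # 122.5
print(legal_iq({"LegalBench": 130}))                     # None
```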

How to read the calibration tables: the column headings are IQ anchors, and the second row gives the raw benchmark score expected at each anchor. For example, Harvey LAB raw scores of 0%, 5%, 15%, 35%, and 70% map to IQ 100, 115, 130, 145, and 160 respectively.

LegalBench Broad

Format: 162 open legal-reasoning tasks

Scale: Overall accuracy

Gameability: Elevated; public task set

LegalBench provides broad coverage across rule application, issue spotting, interpretation, and legal classification. Because the public benchmark is mature and frontier scores are tightly compressed, the ladder spreads the frontier cautiously: perfect source-range performance maps to IQ 145, while IQ 160 remains off-scale.

IQ	70	85	100	115	130	145	160
Expected raw score	45	60	70	80	90	100	115

Legal Research Bench Frontier

Format: Agentic U.S. legal research

Scale: Strict all-pass accuracy

Gameability: Low; private lawyer-authored tasks

Legal Research Bench tests whether an agent can find controlling authority, apply precedent, reconcile sources, and produce a complete cited answer across eight areas of U.S. law. AI IQ uses strict all-pass accuracy rather than weighted partial credit. The ladder keeps the observed frontier near IQ 130 and reserves IQ 160 for off-scale performance until matched practitioner baselines exist.

IQ	70	85	100	115	130	145	160
Expected raw score	0	5	15	30	55	85	120

Harvey Legal Agent Benchmark Frontier

Format: 120 long-horizon legal work-product tasks

Scale: Strict task-resolution rate

Gameability: Low, but harness-dependent

Harvey LAB evaluates professional legal work across 24 practice areas using documents, spreadsheets, presentations, and file-system tools. A task resolves only when every rubric requirement passes. Because zero resolved tasks can coexist with substantial criterion-level success, 0% maps to IQ 100 and the IQ 70/85 anchors are deliberately below the attainable raw range. This prevents one or two task outcomes from creating double-digit IQ jumps. AI IQ uses Vals.ai's common implementation and prefers no-fallback scores where published.

IQ	70	85	100	115	130	145	160
Expected raw score	-10	-5	0	5	15	35	70
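
A worked check of the per-task sensitivity, assuming the 120 tasks are weighted equally so that one resolved task is worth $100/120 \approx 0.83$ percentage points. On the steepest attainable segment of this ladder (0% to 5%, IQ 100 to IQ 115):

$$\frac{115 - 100}{5 - 0} = 3 \text{ IQ points per percentage point}, \qquad 3 \times \frac{100}{120} = 2.5 \text{ IQ points per task}$$

Two task outcomes therefore move the projection by about 5 IQ points at most, which stays below a double-digit jump.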

Healthcare Domain

Healthcare is an applied domain outside Composite IQ. Healthcare IQ equally weights seven capabilities after projecting each benchmark onto the AI IQ scale: patient safety, clinical reasoning under uncertainty, complex diagnosis, clinical workflow execution, clinician communication, clinical documentation, and medical coding. A model needs source-backed results on at least two of the seven benchmarks; missing values are not imputed. Four benchmarks are sourced from the MAST clinical leaderboard, which is currently in preview; two from Vals.ai's private-data healthcare evaluations; and one from Medical Sphere's independent re-runs. Snapshots are dated and scores move with their sources.

The projections are human-equivalent task-performance references, not psychometric IQ measurements of clinicians or models. IQ 100 represents an educated nonexpert or early trainee, IQ 115 a capable trainee or junior professional, IQ 130 an independent experienced professional, IQ 145 an exceptional specialist, and IQ 160 near-best human or expert-team performance. Exact same-protocol human results take priority; same-metric transfers and cohort comparisons come next. When no compatible human result exists, the ladder is an explicit lower-confidence estimate from task prerequisites, occupational competence, and the scoring floor and ceiling. Current model standings are inspected only after these anchors are fixed and are never used as the target.

Partial coverage changes confidence, not eligibility. The site first maps every result through its human-calibrated benchmark ladder, then fits a partially pooled two-way model across the sparse model-by-benchmark matrix. Each observation is represented as the model's latent Healthcare IQ plus a benchmark-specific AI task-profile effect. Those effects are learned only from models with the required two-result overlap and are constrained to average zero across the seven benchmarks, preserving the absolute human-anchored IQ scale. This corrects the tested-task mix without changing a benchmark's human calibration or treating leaderboard ranks as independent head-to-head wins.
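
One plausible way to write this description in symbols is shown below. The notation is introduced here for illustration and is not taken from the site: $y_{m,b}$ is the ladder-projected IQ of model $m$ on benchmark $b$, $\theta_m$ is the model's latent Healthcare IQ, $\beta_b$ is the benchmark-specific task-profile effect, and $\varepsilon_{m,b}$ is residual noise.

$$y_{m,b} = \theta_m + \beta_b + \varepsilon_{m,b}, \qquad \sum_{b=1}^{7} \beta_b = 0$$

The zero-sum constraint is what keeps the effects from shifting the overall human-anchored scale.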

The model estimate uses a weak IQ 120 prior with one-half benchmark of weight and shrinks benchmark effects with one benchmark of prior weight. These settings were selected by masking models with six or seven results down to every two- and three-benchmark subset: recovery RMSE fell from 8.48 IQ points for the available-only mean and 7.09 for the earlier fixed coverage adjustment to 2.48 and 2.05 respectively. Models with 2/7 or 3/7 coverage remain eligible and can rank first; they simply carry wider approximate 95% ranges. Tooltips disclose the raw available-benchmark mean, the net task-mix and pooling adjustment, coverage, and uncertainty range.
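
The sketch below shows one way such a partially pooled fit could be computed with simple alternating updates. It is not the site's estimator: treating the prior weights as pseudo-observations, the update order, the fixed iteration count, and the example data are all assumptions, and the uncertainty ranges are not modeled.

```python
def fit_two_way(obs, benchmarks, prior_iq=120.0, model_weight=0.5,
                effect_weight=1.0, iterations=200):
    """Alternating shrinkage fit of y[m][b] ~ theta[m] + beta[b].

    obs:        dict model -> dict benchmark -> ladder-projected IQ.
    benchmarks: list of all benchmark names in the domain.
    """
    theta = {m: prior_iq for m in obs}
    beta = {b: 0.0 for b in benchmarks}
    # Only models with at least two results inform the benchmark effects.
    overlap = [m for m, row in obs.items() if len(row) >= 2]
    for _ in range(iterations):
        # Model step: shrink toward the prior with `model_weight` pseudo-results.
        for m, row in obs.items():
            resid = sum(y - beta[b] for b, y in row.items())
            theta[m] = (model_weight * prior_iq + resid) / (model_weight + len(row))
        # Benchmark step: shrink effects toward zero with `effect_weight`.
        for b in benchmarks:
            resid = [obs[m][b] - theta[m] for m in overlap if b in obs[m]]
            beta[b] = sum(resid) / (effect_weight + len(resid))
        # Re-center so the effects average zero across benchmarks.
        shift = sum(beta.values()) / len(beta)
        beta = {b: v - shift for b, v in beta.items()}
    return theta, beta


# Hypothetical projected IQs for two models on two benchmarks.
obs = {"m1": {"A": 130, "B": 110}, "m2": {"A": 140, "B": 120}}
theta, beta = fit_two_way(obs, ["A", "B"])
print(theta)
print(beta)
```

In this toy example both models score 20 points higher on benchmark A than on B, so A receives a positive effect and B an equal negative one. Model `m2`, whose raw mean is 130, is pulled slightly toward the IQ 120 prior.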

This estimator is currently a Healthcare-only pilot. Other applied domains and the six Composite IQ dimensions retain their existing scoring methods. Expansion requires prospective evidence that newly arriving benchmark results fall reasonably within the pilot's predicted ranges, sparse estimates converge without systematic bias as coverage grows, learned task-profile effects remain stable, and uncertainty ranges are not materially overconfident. Any later domain must tune and validate its own prior, pooling strength, and residual variance rather than inheriting Healthcare's settings.

Retrieval-augmented clinical products and medically fine-tuned model variants that appear on the source leaderboards are excluded: Healthcare IQ compares general-purpose foundation models under a common protocol.

First Do NOHARM v2 Safety

Format: Adversarial clinical safety scenarios

Scale: Weighted-F1 safety score (higher is safer)

Gameability: Lower; specialist-annotated management options

First Do NOHARM v2 probes whether recommendations remain beneficial, complete, and restrained under adversarial clinical scenarios, graded against thousands of specialist annotations of harmful versus appropriate management options. The score is a safety-performance measure — higher is better — not a harm rate. The study reports a 46.0% same-metric result for board-certified generalist physicians; that anchors experienced-professional performance at IQ 130. The remaining anchors are estimates because the physician subset is not a complete human distribution for every MAST v2 task.

IQ	70	85	100	115	130	145	160
Expected raw score	5	15	28	38	46	70	90

MAST SCT Judgment

Format: Script Concordance Test battery

Scale: Concordance with expert panels

Gameability: Medium; partial-credit panel scoring

The Script Concordance Test presents a clinical hypothesis, reveals new evidence, and scores how appropriately judgment shifts against physician expert panels. Published research compares 1,070 medical students, 193 residents, and 300 attending physicians across ten SCT datasets, establishing a human progression even though it does not provide one pooled score table directly interchangeable with MAST. The ladder combines that cohort ordering with estimated thresholds. Because panels themselves disagree, perfect concordance is not a meaningful ceiling.

IQ	70	85	100	115	130	145	160
Expected raw score	20	35	50	60	72	84	95

CPC-Bench Diagnosis

Format: NEJM clinicopathological conference cases

Scale: Task success

Gameability: Medium; published historical cases

CPC-Bench tests differential diagnosis, test selection, and management on the New England Journal of Medicine's clinicopathological conference cases — the hardest published diagnostic puzzles. A 20-physician comparison on contemporary cases establishes that frontier systems can exceed the measured physician baseline on final diagnosis, but MAST reports a broader multi-task aggregate. The numeric ladder is therefore a lower-confidence difficulty estimate informed by human subtask evidence. Training-set contamination also cannot be excluded for older public cases.

IQ	70	85	100	115	130	145	160
Expected raw score	0	10	25	45	60	80	95

PhysicianBench Workflow

Format: Long-horizon EHR consultation workflows

Scale: Pass@1 with execution-verified checkpoints

Gameability: Low, but agent/harness-dependent

PhysicianBench drops a model into a real electronic-health-record environment: retrieve the patient's records, reason about them, place orders, and document the plan, verified against execution checkpoints. No matched human completion cohort is published, so this is explicitly a low-confidence task-difficulty estimate: 0% maps to IQ 100 because an educated nonexpert still lacks the clinical and interface prerequisites, 50% represents a capable supervised trainee, 80% an experienced physician trained on the interface, and 95% near-reliable expert execution. The negative lower anchors are interpolation guards, not attainable scores.

IQ	70	85	100	115	130	145	160
Expected raw score	-10	-5	0	50	80	95	105

HealthBench Professional Communication

Format: 525 physician-authored clinician chat tasks

Scale: Rubric composite with safety and length penalties

Gameability: Medium; AI-judged rubric with verbosity penalty

HealthBench Professional evaluates real clinician chat tasks — care consults, writing and documentation, and medical research — against rubrics written and adjudicated by three or more physicians, with penalties for unsafe behaviors and excessive length. Specialty-matched physicians with web access and unlimited time scored 43.7 under the benchmark's model-equivalent treatment, anchoring experienced-professional performance at IQ 130. AI IQ imports Medical Sphere's independent model re-runs, so this human baseline remains a same-benchmark pipeline transfer rather than a fresh human re-score. Grading uses an AI judge and remains style-sensitive.

IQ	70	85	100	115	130	145	160
Expected raw score	0	5	12	25	43.7	60	80

MedScribe Documentation

Format: SOAP notes from real doctor-patient conversations

Scale: Expert-rubric accuracy

Gameability: Some verbosity sensitivity reported by the source

MedScribe measures faithful clinical documentation: generating structured SOAP notes from de-identified doctor-patient conversation transcripts, graded against expert-developed rubrics on private test data. No trained-scribe score distribution is public, so the human equivalents are explicit occupational estimates: 60% for an educated nonexpert, 80% for a capable trainee, 95% for an experienced medical scribe, and 100% for exceptional near-perfect documentation. The top of the rubric is saturation-prone, so IQ 160 remains off-scale.

IQ	70	85	100	115	130	145	160
Expected raw score	20	40	60	80	95	100	110

MedCode Coding

Format: ICD-10-CM code assignment from clinical records

Scale: Accuracy against real billing outcomes

Gameability: Low; private patient-level holdout

MedCode requires extracting documented evidence from clinical records and assembling a compliant billable set of primary and secondary ICD-10-CM codes, validated against successfully submitted billing outcomes approved by certified professional coders. It measures United States medical coding — a core healthcare-operations task — alongside clinical reading. No certified-coder score distribution is public, so the ladder estimates 20% for an early trainee, 45% for a capable junior coder, 70% for an experienced certified coder, and 90% for a high-performing specialist. These anchors are lower-confidence and should be replaced by matched human runs.

IQ	70	85	100	115	130	145	160
Expected raw score	0	5	20	45	70	90	100

D4: Programmatic Reasoning & the Software Engineering Domain

Programmatic Reasoning measures transferable problem solving through code: algorithmic reasoning, decomposition, execution, debugging, and task completion. It uses LiveCodeBench, IOI, Terminal-Bench 2.1, ProgramBench, FrontierSWE, and SWE-rebench. ProgramBench is scored using its Almost Resolved metric: the share of tasks passing at least 95% of hidden tests.
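
A minimal sketch of the Almost Resolved computation as described above. The input layout (one `(tests_passed, tests_total)` pair per task) and the function name are assumptions for illustration, not ProgramBench's actual data format.

```python
def almost_resolved_rate(task_results, threshold_pct=95):
    """Share of tasks passing at least `threshold_pct` percent of hidden tests.

    task_results: list of (tests_passed, tests_total) pairs, one per task.
    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    if not task_results:
        raise ValueError("at least one task is required")
    hits = sum(
        1
        for passed, total in task_results
        if total > 0 and passed * 100 >= threshold_pct * total
    )
    return hits / len(task_results)


# Hypothetical results: 100%, 95%, 90%, and 0% of hidden tests passed.
results = [(20, 20), (19, 20), (18, 20), (0, 10)]
print(almost_resolved_rate(results))  # 0.5
```

The first two tasks meet the 95% bar and the last two do not, so the rate is 2 of 4.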

The broader Software Engineering domain also includes SWE-bench Verified, APEX-SWE, DeepSWE v1.1, and SWE Marathon. Those benchmarks are valuable measures of production repository work but are not counted separately in Composite IQ. ProgramBench, FrontierSWE, and SWE-rebench intentionally belong to both the dimension and the domain.

SWE Marathon Frontier

Format: 20 multi-hour software-engineering tasks

Scale: Resolution Rate (Pass@1)

Gameability: Lower, but agent/harness-dependent

SWE Marathon measures long-horizon software work across library reproductions, product clones, ML engineering, and optimization tasks. It is included in the Software Engineering domain because the tasks require sustained planning, implementation, debugging, and verifier-driven iteration over multi-hour runs. The public leaderboard reports model-plus-agent configurations rather than pure model-only rows; duplicate source rows show that agent choice can materially change performance. AI IQ therefore treats this as sparse best-public-agent-stack evidence and preserves the selected agent in extraction notes. Ordinary programmers are assumed near zero on autonomous multi-hour completion; strong senior engineers or small teams plausibly define the 13–25 region.

IQ	70	85	100	115	130	145	160
Expected score	0	1	3	7	13	25	50
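
A worked reading of this ladder, assuming each of the 20 tasks counts equally so that one resolved task is worth 5 percentage points. One resolved task (5%) and three resolved tasks (15%) project to:

$$f(5) = 100 + \frac{5 - 3}{7 - 3}\,(115 - 100) = 107.5, \qquad f(15) = 130 + \frac{15 - 13}{25 - 13}\,(145 - 130) = 132.5$$

With so few tasks, a single additional resolution can move the projection by several IQ points, which is one reason the evidence is treated as sparse.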

FrontierCode Diamond Frontier

Format: 50 hardest production-code tasks

Scale: Rubric-weighted score percentage

Gameability: Low, private tasks

FrontierCode Diamond measures whether model-generated changes would meet maintainer standards for mergeable production code. It combines correctness with code quality, test quality, scope discipline, style, and repository-specific expectations. AI IQ uses the Diamond score metric, not pass rate, because the source treats score as the quality-sensitive aggregate and reports best-effort model scores at each model's best reasoning level. The benchmark is less mixed-agent than SWE Marathon, but it remains sparse, private-task, and rubric-sensitive. Public charts include every published Diamond score that maps to a canonical public model row; source-only systems without a model entry are archived in extraction notes rather than approximated. Ordinary developers are assumed near zero on maintainer-grade Diamond tasks; expert maintainers or small review teams plausibly define the 7-14 region.

IQ	70	85	100	115	130	145	160
Expected score	0	0.5	1.5	3.5	7	14	32

APEX-SWE Frontier

Format: 200 professional software-engineering cases

Domains: Integration and Observability

Scale: Overall Pass@1 percentage

APEX-SWE measures whether frontier systems can complete economically valuable software-engineering work. AI IQ uses the overall Pass@1 score in the Software Engineering domain. The ladder is conservative around current frontier scores: ordinary autonomous completion is expected to be low, while strong engineers or focused teams plausibly define the 32-52 region.

IQ	70	85	100	115	130	145	160
Expected score	0	3	8	18	32	52	80

SWE-rebench Frontier

Format: Continuously evolving software-engineering tasks

Scale: Resolved-rate percentage

Gameability: Low, decontaminated

SWE-rebench measures software-engineering agents on a continuously evolving and decontaminated task set. It is shared by Programmatic Reasoning and the Software Engineering domain because it tests transferable debugging and execution inside real repositories. Its ladder is intentionally conservative at the upper tail so future benchmark saturation does not overstate the composite.

IQ	70	85	100	115	130	145	160
Expected score	0	5	15	30	48	70	120

DeepSWE v1.1 Frontier

Format: Original long-horizon software-engineering tasks

Tasks: 113 tasks across 91 repositories

Scale: Pass@1 percentage

DeepSWE v1.1 measures coding agents on original software-engineering tasks written from scratch across a broad set of active repositories. It is treated as a hard Software Engineering domain signal because the tasks are contamination-resistant, long-horizon, and verified with behavior-focused tests rather than copied public patches. Competent software engineers are estimated around 5-12 on broad unfamiliar-repo autonomous tasks, while strong senior engineers or focused teams plausibly define 25-50.

IQ	70	85	100	115	130	145	160
Expected score	0	2	5	12	25	50	100

SWE-Bench Pro Frontier

Format: Hard software-engineering tasks

Scale: 0–1 score

Gameability: Lower than SWE-bench Verified

SWE-Bench Pro is a harder software-engineering benchmark used to complement SWE-bench Verified in the Software Engineering domain. Tasks can require hours to days for professional engineers, so competent professionals are expected to resolve a meaningful minority, and strong senior engineers or teams plausibly define the 0.55-0.75 region.

IQ	70	85	100	115	130	145	160
Expected score	0	0.1	0.25	0.4	0.55	0.75	1.1

SWE-bench Verified Saturating

Format: Real GitHub issue resolution

Tasks: 500 verified issues

Gameability: Very High

Models generate patches to resolve real GitHub issues and pass unit tests. However, many issues predate model training cutoffs and some have solution leakage. The current ladder remains heavily ceiling-compressed: source-range scores can still distinguish models, but IQ 145+ requires off-scale performance. Competent engineers with the right repository context could solve many tasks, so 100% source performance maps only to IQ 130 rather than the top of the scale.

IQ	70	85	100	115	130	145	160
Expected score	-10	10	35	65	100	145	200

LiveCodeBench Saturating

Format: Continuously refreshed coding problems

Scale: Percent solved

Gameability: Low, but overlapping

LiveCodeBench adds a broad, continuously refreshed coding signal to Programmatic Reasoning. Because it overlaps with other programming benchmarks and frontier models are clustered high, its ladder gives strong credit through the middle of the range while placing IQ 145+ beyond the source's 0-100% range (the IQ 145 expected score is 110). Competent competitive programmers plausibly sit around 35-60, while strong contest programmers define the 82+ region.

IQ	70	85	100	115	130	145	160
Expected score	0	15	35	60	82	110	145

Academic Reasoning Benchmark Details

Academic Reasoning captures expert knowledge, research-style reasoning, and difficult problem analysis under uncertainty. The frontier benchmarks test whether a model can reason through questions that push the boundaries of human expertise itself, while the broader benchmarks add cross-field academic and graduate-level knowledge signals.

SciCode Saturating

Format: Scientific research problems implemented in code

Gameability: Moderate

SciCode uses code as the execution medium for realistic scientific research problems. The tasks require identifying scientific concepts, recalling domain facts, reasoning through numerical methods or simulations, and transforming that reasoning into computation. It is included in Academic Reasoning because the bottleneck is scientific research reasoning, not generic programming. The human-reference guess is based on qualified scientific-coding ability: ordinary non-coders are near zero, competent scientific Python users can plausibly solve a meaningful fraction of subproblems, and strong computational scientists should sit much higher.

IQ	70	85	100	115	130	145	160
Expected score	0	15	30	45	65	100	150

Humanity's Last Exam Frontier

Format: 76% exact-match, expert-contributed

Questions: 3,000 (expert-sourced, screened against models)

Gameability: Low

Questions are contributed by domain experts and explicitly screened so that no existing model could answer them at creation time. The benchmark spans the full frontier of human expertise. The human-reference guess treats ordinary non-specialists across the full mixed-domain set as scoring in the low single digits, while relevant domain experts or an expert panel would score much higher. The current ladder makes small scores meaningful, puts 45% at IQ 145, and maps perfect in-range performance to IQ 160 as a universal-expert-level result.

IQ	70	85	100	115	130	145	160
Expected score	0	2	5	10	20	45	100

CritPt Frontier

Format: Critical-point analysis (novel problems)

Scale: Raw source points

Gameability: Low

CritPt consists of novel critical-point analysis problems that require expert analytical reasoning. Although the tasks use mathematical methods, CritPt is grouped with Academic Reasoning because it functions as a hard expert-analysis signal rather than a broad math-competition signal. Problems are original and contributed by a broad physics-research collaboration, making memorization ineffective. The human-reference guess treats ordinary non-physicists and most non-specialists as near zero, with small positive scores becoming meaningful for relevant physics researchers or research groups. The current ladder keeps a score of 0 compatible with IQ 100.

IQ	70	85	100	115	130	145	160
Expected score	-1	-0.5	0	1	3	10	30

GPQA Diamond Saturating

Format: 4-choice multiple choice

Questions: 198 (public set)

Gameability: Moderate-High

Graduate-level science questions written by PhD experts. A 25% score is the four-choice random baseline; the GPQA paper reports skilled non-experts with unrestricted web access at 34%, domain experts at 65%, and corrected expert performance at 74%. The current ladder maps those anchors to roughly IQ 85, 100, 124, and 130 respectively. Public frontier rows cluster much higher, so the high end remains compressed: IQ 145 requires a perfect score and IQ 160 is off-scale.

IQ	70	85	100	115	130	145	160
Expected score	15	25	34	50	74	100	140
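
As a worked check of the domain-expert anchor, assuming the piecewise-linear rule described under Expected-Score Interpolation: a 65% score falls between the IQ 115 and IQ 130 expected scores (50 and 74), so

$$t = \frac{65 - 50}{74 - 50} = 0.625, \qquad \mathrm{IQ} = 115 + 0.625 \cdot (130 - 115) \approx 124.4$$

which is the "roughly IQ 124" quoted for domain experts.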

Programmatic benchmark detail: Terminal-Bench 2.1

Terminal-Bench 2.1 is part of Programmatic Reasoning because its tasks are typically coding-heavy and require deep reasoning about code and complex scripts. Terminal-Bench Hard is retained only as a historical raw benchmark; it is not scored because it is a hard subset of the current suite.

Terminal-Bench 2.1 Frontier

Format: Docker container tasks (shell commands)

Tasks: 89 practical tasks

Gameability: Low

Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. The source tasks include human-written reference solutions and expert/junior-engineer time estimates, so the ladder treats ordinary non-terminal users as near zero, competent technical operators as around 25-42, and strong senior technical operators as plausibly in the 62-90 region.

IQ	70	85	100	115	130	145	160
Expected score	0	10	25	42	62	90	130

Terminal-Bench Hard Frontier

Format: Harder terminal-agent tasks

Gameability: Low

Terminal-Bench Hard is retained as a historical raw benchmark and is not part of Composite IQ because it is a hard subset of Terminal-Bench 2.1. Its values remain separate and are never backfilled into the Terminal-Bench 2.1 field.

IQ	70	85	100	115	130	145	160
Expected score	0	8	20	35	55	82	120

D5: Computer Use

Computer Use measures practical task execution across browsers, desktops, tools, and multi-step workflows. Its scored benchmarks are BrowseComp, OSWorld-Verified, Toolathlon, MCP Atlas, Arena.ai Agent Arena, and Agents' Last Exam.

BrowseComp Frontier

BrowseComp measures hard browsing and research task completion. OpenAI reports that human trainers solved 29.2% of attempted questions without AI assistance, with 86.4% agreement among solved questions. The ladder therefore maps about 29% to IQ 100, treats 50-75% as strong persistent browsing/research ability, reserves IQ 145 for a perfect score, and keeps IQ 160 off-scale because the benchmark is easier to verify than to solve and can improve with test-time compute.

IQ	70	85	100	115	130	145	160
Expected score	0	10	29	50	75	100	130

OSWorld-Verified Frontier

OSWorld-Verified measures desktop/computer-use task completion, capturing a model's ability to operate an external environment rather than only answer a static prompt. OSWorld reports that humans can accomplish over 72.36% of its 369 tasks, and OSWorld-Human records human trajectories for all tasks. AI IQ treats that as a computer-literate operator reference rather than a random adult average, so about 72% maps to IQ 115. Current scored rows use clear model-level mirror/self-reported mappings and are checked against official OSWorld-Verified semantics when possible.

IQ	70	85	100	115	130	145	160
Expected score	0	15	35	72	90	110	140

Toolathlon Frontier

Toolathlon measures long-horizon tool-use task performance. The source defines 108 manually sourced or crafted tasks across many applications, requiring around 20 tool-call turns on average. Scores are stored as percentages and mapped through a sparse agentic tool-use ladder; the human-reference estimate treats competent tool operators as around 25-38 and strong multi-app operators as around 50-70. Current leaderboard rows remain mirror/self-reported unless primary run details are available.

IQ	70	85	100	115	130	145	160
Expected score	0	12	25	38	50	70	110

MCP Atlas Frontier

Format: Real MCP-server tasks

Scale: Pass-rate percentage

Gameability: Low

MCP Atlas measures end-to-end task success with real Model Context Protocol servers, noisy tool menus, multi-step tool calls, and final-answer judging. The source contains 1,000 human-authored tasks across 36 real MCP servers and 220 tools, with cross-server workflows and claim-level scoring. It contributes to Computer Use as a direct tool-orchestration signal; the human-reference estimate treats competent tool/API operators as around 35-55 and strong workflow operators as around 72-92.

IQ	70	85	100	115	130	145	160
Expected score	0	15	35	55	72	92	120

Agents' Last Exam Frontier

Format: Agent/harness professional workflows

Scale: Full / Overall pass-rate percentage

Gameability: Low

Agents' Last Exam measures agentic success on real-world professional workflows with verifiable success criteria. It spans broad professional workflow categories, so a single ordinary user is expected near zero across the full distribution while strong professional teams plausibly define the 13-24 region. Because rows are published as agent/harness plus model combinations, AI IQ uses the best substantial-coverage Full / Overall pass-rate row for each canonical model and leaves source-only or incomplete one-run rows out of the imported benchmark field.

IQ	70	85	100	115	130	145	160
Expected score	0	1	3	7	13	24	50

D6: Reliability

Reliability captures factuality, instruction following, long-chain reasoning, metacognition and calibration, false-premise resistance, and source-grounded document work. Its scored set is SimpleQA Verified, AA Omniscience, BullshitBench v2, IFBench, MultiChallenge, AA Long Chain Reasoning, and FACTS Grounding.

This unified dimension is a deliberate interim choice. As benchmark coverage improves, Reliability may split into two narrower dimensions—Metacognition and Executive Control. Working Memory and Visual Reasoning are also candidates for future dimensions, but are not included yet.

IFBench Frontier

IFBench measures instruction following and constraint adherence across 58 novel, diverse, verifiable out-of-domain constraints. It contributes to Reliability because dependable model behavior depends on following multi-part requirements, respecting constraints, and preserving user intent across the response. The human-reference estimate treats average careful humans as able to satisfy a majority of explicit constraints, strong detail-oriented humans as approaching the 90s, and current frontier model rows in the high 70s/low 80s as above average but below top human-level constraint reliability.

IQ	70	85	100	115	130	145	160
Expected score	0	35	65	80	92	103	120

AA Omniscience Frontier

AA Omniscience is used as a factual-reliability signal from the Artificial Analysis Intelligence Index. The source index ranges from -100 to 100, rewards correct answers, penalizes hallucinations, and does not penalize refusing to answer. The source defines 0 as equal correct and incorrect answers; AI IQ treats that as a no-net-hallucination/calibrated-abstention baseline rather than a measured median-human baseline, so the average-human reference is modestly positive.

IQ	70	85	100	115	130	145	160
Expected score	-80	-35	8	25	45	72	100

BullshitBench v2 Frontier

BullshitBench v2 probes whether a model pushes back on false or nonsensical premises instead of going along with them. The score is the Clear Pushback rate: the share of all attempts where the model clearly challenges the bad premise (partial challenges, accepted nonsense, and refusals all count against it). It contributes to Reliability because a dependable assistant should correct a confidently wrong user rather than confabulate agreement. Because the items embed plausible domain-specific jargon, detection blends domain knowledge with the willingness to contradict a confident user, so the human reference is a domain-knowledgeable adult rather than a generic one. On that basis a sharp, skeptical practitioner clearly rejects roughly a third of the items at IQ 100 and about half by IQ 115; a perfect 100% rate maps to roughly IQ 150 — genius-tier, but attainable by a maximally skeptical reader who catches fabrications structurally. IQ 160 stays as off-scale headroom that no achievable score reaches. Current frontier model rows top out near 95%.

IQ	70	85	100	115	130	145	160
Expected score	2	13	31	52	72	92	116
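
As a worked check of the perfect-score claim, assuming the piecewise-linear rule described under Expected-Score Interpolation: a 100% Clear Pushback rate falls between the IQ 145 and IQ 160 expected scores (92 and 116), so

$$t = \frac{100 - 92}{116 - 92} = \tfrac{1}{3}, \qquad \mathrm{IQ} = 145 + \tfrac{1}{3} \cdot (160 - 145) = 150$$

which matches the "roughly IQ 150" figure, with IQ 160 left as unreachable headroom.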

AA Long Chain Reasoning Frontier

Artificial Analysis Long Chain Reasoning measures whether models can extract, connect, and reason over long-form documents ranging from 10k to 100k tokens. The tasks span document categories such as academic papers, company financials, government consultations, legal documents, industry reports, marketing materials, and surveys. It contributes to Reliability because real deployments often depend on faithfully using large source packets rather than merely recalling facts from model weights.

IQ	70	85	100	115	130	145	160
Expected score	0	20	45	62	76	92	110

FACTS Grounding Frontier

FACTS Grounding evaluates long-form answers that must both satisfy the user's request and stay fully grounded in the provided document. The imported values use the Score column from a Kaggle FACTS Grounding leaderboard capture, with public and private split scores retained only as source context. It contributes to Reliability because grounded document answering is the practical version of hallucination avoidance: the model must not invent unsupported claims while still producing a useful response.

IQ	70	85	100	115	130	145	160
Expected score	0	25	45	60	74	90	108

Emotional Reasoning (EQ) measures social, emotional, and conversational judgment as a domain, separate from Composite IQ. The available benchmark base is weaker than the scored IQ dimensions: EQ-Bench 3 uses a Claude/Anthropic-family judge, while AttuneBench is participant-grounded but still sparse.

EQ Components Specialized

EQ-Bench 3 and AttuneBench are mapped into a diagnostic Emotional Reasoning score. The score remains visible in domain charts and model profiles, but it is not included in Composite IQ.

Expected-Score Interpolation

Each benchmark defines expected raw scores at seven fixed IQ levels: 70, 85, 100, 115, 130, 145, and 160. For scores that fall between two adjacent expected scores, AI IQ uses piecewise-linear interpolation:

$$t = \frac{s - S(q_i)}{S(q_{i+1}) - S(q_i)}, \qquad \mathrm{IQ} = q_i + t \cdot (q_{i+1} - q_i)$$

If the score is at or below the lowest expected score, the model receives IQ 70. If it is at or above the highest expected score, it receives IQ 160 for that benchmark. There is no extrapolation beyond the defined range.
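
The interpolation and clamping rule can be sketched in a few lines of Python. This is a minimal illustration, not the production scoring pipeline: the function name `score_to_iq` and the constant names are hypothetical, and the sketch assumes each ladder is strictly increasing. The HLE ladder is copied from the table in this document.

```python
from bisect import bisect_right

# The seven fixed IQ levels every ladder is defined at.
IQ_LEVELS = [70, 85, 100, 115, 130, 145, 160]

def score_to_iq(score, expected):
    """Map a raw score to IQ by piecewise-linear interpolation on a ladder.

    `expected` holds the expected raw score at each IQ level and is assumed
    to be strictly increasing (an assumption of this sketch).
    """
    if len(expected) != len(IQ_LEVELS):
        raise ValueError("ladder needs one expected score per IQ level")
    # Clamp at both ends: there is no extrapolation beyond the ladder.
    if score <= expected[0]:
        return float(IQ_LEVELS[0])
    if score >= expected[-1]:
        return float(IQ_LEVELS[-1])
    # Locate the segment i with expected[i] <= score < expected[i + 1].
    i = bisect_right(expected, score) - 1
    t = (score - expected[i]) / (expected[i + 1] - expected[i])
    return IQ_LEVELS[i] + t * (IQ_LEVELS[i + 1] - IQ_LEVELS[i])

# Humanity's Last Exam ladder from this page.
HLE_LADDER = [0, 2, 5, 10, 20, 45, 100]

print(score_to_iq(15, HLE_LADDER))   # 122.5 (halfway between the 115 and 130 anchors)
print(score_to_iq(-3, HLE_LADDER))   # 70.0 (clamped at the bottom)
```

Because each segment has its own slope, the same function handles ladders whose top anchors sit outside the source's natural range: a score can never reach an anchor that lies above the maximum attainable value.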

This approach makes the calibration question explicit. Instead of first choosing a curve family, each benchmark asks: what would an IQ 70, 85, 100, 115, 130, 145, or 160 model-equivalent score on this source? Each segment can then have a different slope, allowing threshold-like benchmarks such as CritPt or FrontierMath T4 to behave differently from smoother bounded tasks such as ARC or Terminal-Bench.

Benchmark Calibration & Averaging

Each dimension gives its benchmarks equal weight. Most dimensions conservatively fill missing benchmark slots before averaging. Abstract Reasoning instead averages the available source-backed ARC projections once any direct ARC result exists; its three ladders are jointly calibrated so coverage reveals are order-invariant and cause the smallest justified score movement. Models with zero direct ARC coverage can still use conservative lineage estimates.

Why Use Expected Scores?

The calibration is framed as a human-readable judgment rather than as an abstract curve fit. For each benchmark, the table answers seven concrete questions: what score would a 70, 85, 100, 115, 130, 145, or 160 IQ-equivalent system be expected to achieve?

That means an extremely hard benchmark can assign a high implied IQ to a low-looking score. CritPt, for example, treats 0% as compatible with IQ 100 because a normal human would not be expected to score on it. AIME does the opposite at the high end: because frontier systems can approach saturation and the public problem set is contamination-sensitive, its calibration ladder puts IQ 145 and 160 outside the source's natural 0–100% range.

Why not fit a single curve type? Linear, power, exponential, and asymptotic curves are useful shapes, but no single family captures every benchmark cleanly. The seven-point ladder keeps the public calibration inspectable while still producing a continuous piecewise-linear curve for scoring.

Calibration status. The current ladders are the first benchmark-specific calibration pass after the initial anchor-curve conversion. Some high-end expected scores remain outside a benchmark's natural source range where saturation, contamination, or benchmark age make a perfect in-range score fall short of IQ 145 or 160.

Benchmark-Level Imputation

When a model has partial benchmark coverage inside most dimensions, missing benchmarks are filled in before the dimension IQ is averaged. Abstract Reasoning is the exception: with one or more source-backed ARC results, its point estimate is the equal mean of only those direct projected IQs. Additional ARC coverage raises confidence and updates that mean; predecessor, momentum, and peer estimates do not enter the point estimate once direct ARC evidence exists.

Correlation estimates use source-backed overlap among benchmarks in the same dimension. If Benchmark A historically predicts Benchmark B, a model's score on A can estimate B, but only after the observed correlation is shrunk for small samples and an uncertainty penalty is subtracted. A one-benchmark dimension receives the largest penalty, and imputed benchmark IQs are capped below the observed benchmark average so sparse coverage cannot make a model look better than its actual evidence.

The ordinary within-dimension estimate uses two ingredients:

The model's available-benchmark IQ average — how the model is performing on the benchmarks it does have in this dimension. This is the within-dimension signal: if a model is hitting IQ 130 on the dimension's other benchmarks, the missing one is probably also somewhere around 130.
The benchmark's 80th-percentile IQ ($P_{80}$) — a per-benchmark cap derived from the actual data. Take every model that has a real score on that benchmark, convert each score to an implied IQ via the expected-score ladder, sort those IQs from low to high, and take the value at the 80th-percentile rank. So if 50 models have HLE scores yielding implied IQs ranging from 70 to 155, $P_{80}(\text{HLE})$ is the implied IQ at the 80th-percentile rank in that sorted list. It is where strong-but-not-frontier measured models actually land on this benchmark.

The imputed value is the minimum of the two:

$$\mathrm{IQ}_{\text{imputed}} = \min\!\left(\overline{\mathrm{IQ}}_{\text{available}},\; P_{80}(\text{benchmark})\right)$$

Why min of the two? The model's own dimension average is the best within-dimension signal we have. Capping at the 80th-percentile prevents a strong model from being imputed past where the actual data has been observed — a model averaging IQ 145 in this dimension might project very highly on the missing benchmark, but the imputed value won't claim that without measurement. The min lets imputation move a missing score up or down toward what the rest of the dimension implies, while staying conservatively below where the field has empirically reached.
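
A minimal sketch of the within-dimension rule follows. The text does not specify the exact percentile convention, so this sketch assumes the nearest-rank method; the function names and the sample IQ values are hypothetical and chosen only for illustration.

```python
import math

def percentile_nearest_rank(values, pct):
    """Value at rank ceil(pct/100 * n) in sorted order (nearest-rank method).

    The nearest-rank convention is an assumption of this sketch.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def impute_within_dimension(available_iqs, benchmark_iqs_all_models):
    """min(model's available-benchmark average, benchmark P80)."""
    available_avg = sum(available_iqs) / len(available_iqs)
    p80 = percentile_nearest_rank(benchmark_iqs_all_models, 80)
    return min(available_avg, p80)

# Hypothetical implied IQs of ten models with real scores on the missing benchmark.
field = [90, 100, 110, 120, 130, 140, 150, 155, 95, 105]

# A strong model (average 147.5) is capped at the benchmark's P80.
print(impute_within_dimension([145, 150], field))  # 140
# A mid-range model (average 125) keeps its own average.
print(impute_within_dimension([120, 130], field))  # 125.0
```

The two calls show both branches of the min: the cap binds only when the model's own average exceeds where strong measured models actually land.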

Benchmark Imputation Waterfall

Step	When it applies	What happens	Result
1. Source value	The model has a benchmark score.	Use the raw source-backed value.	Real benchmark IQ
2. ARC-AGI-1/2 lab momentum	An eligible frontier model is missing ARC-AGI-1 or ARC-AGI-2 and has a source-backed primary-lineage ancestor.	Extrapolate the ancestor's benchmark IQ from only earlier source-backed lab transitions, shrink sparse lab history toward the top-lab median, and cap movement at ±15 IQ.	Scoring-only lab-momentum ARC estimate
3. Conservative ARC-AGI-3	An eligible frontier model is missing ARC-AGI-3.	Carry a source-backed lineage ancestor without uplift; otherwise use the capped lower quartile of the latest earlier score from each represented top lab.	Scoring-only conservative ARC-AGI-3 estimate
4. Primary / secondary predecessor	The benchmark is missing and the model has a clear primary or secondary predecessor.	Try the primary predecessor first, then the secondary predecessor, and use the first valid scoring value for that benchmark.	Predecessor-imputed benchmark IQ
5. ARC lower-tail peer estimate	ARC-AGI-1 or ARC-AGI-2 is missing after predecessor imputation, and the model has no source-backed D1/Abstract benchmark.	Use the lower quartile of capability-matched source-backed ARC peers released before the target model.	Scoring-only peer-imputed ARC benchmark IQ
6. Hard-benchmark zero	A historically near-zero hard benchmark is missing.	Use 0 as the scoring-only value, while leaving the raw benchmark field blank.	Zero-assumed benchmark IQ
7. Correlation estimate	A minimum-coverage dimension has at least one source-backed benchmark and same-dimension benchmark correlations have enough overlap. For dimensions without an explicit minimum, this step is limited to one-benchmark cases.	Predict missing benchmark IQs from correlated source-backed benchmarks, shrink small-sample correlations, subtract an uncertainty penalty, and cap below the observed benchmark average.	Conservative correlation-imputed benchmark IQ, with error metadata
8. Within-dimension estimate	The dimension has enough benchmark coverage for this model.	Average the model's available benchmark IQs in that dimension, then cap by the benchmark's 80th percentile.	$\min(\text{dimension benchmark average}, \text{benchmark }P_{80})$
9. No benchmark estimate	The model has no benchmark data in that dimension.	Do not invent individual benchmark rows.	The whole dimension is handled by dimension-level imputation.

ARC estimates are used only on scoring copies, never in raw benchmark tables. ARC-AGI-1/2 momentum uses a two-year window of source-backed transitions released before the target, keeps reasoning and non-reasoning histories separate, and gives each lab one vote in the global trend. ARC-AGI-3 carries a source-backed primary-lineage ancestor without uplift; without one, it requires three represented labs and caps its lower-quartile market fallback at 1%. Models that cannot use these rules continue through predecessor and lower-tail peer fallbacks. For ARC-AGI-2 and CritPt, missing values are treated as zero only after earlier imputation steps fail. The raw benchmark table still leaves values blank when no source row exists.

Predecessor imputation is constrained by model family, and non-reasoning variants do not impute from reasoning variants. These lineage choices are scoring assumptions only; they do not create source-backed raw benchmark values.

Correlation imputation is deliberately narrower than the generic within-dimension rule. It primarily improves one-benchmark cases; broader-coverage rows continue to use the ordinary within-dimension cap. If a dimension has no source-backed benchmark evidence, no correlation estimate is attempted — the dimension itself is either left missing or filled at the dimension level (see Composite IQ Calculation below). The Emotional Reasoning (EQ) domain is not used in Composite IQ.
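
The benchmark waterfall can be read as an ordered fallback chain: each step either produces a scoring value or defers to the next. The sketch below shows only that control flow. The step labels and the stand-in lambdas are hypothetical placeholders, not the pipeline's actual estimators.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step is a label plus a zero-argument estimator that returns a benchmark IQ,
# or None when the step does not apply to this model and benchmark.
Step = Tuple[str, Callable[[], Optional[float]]]

def run_waterfall(steps: Sequence[Step]) -> Tuple[str, Optional[float]]:
    """Return the label and value of the first step that yields an estimate."""
    for label, estimate in steps:
        value = estimate()
        if value is not None:
            return label, value
    # Nothing applied: leave the benchmark to dimension-level handling.
    return "no benchmark estimate", None

# Hypothetical model: no source score, no predecessor, but enough
# same-dimension coverage for the within-dimension estimate.
steps = [
    ("source value", lambda: None),
    ("predecessor", lambda: None),
    ("within-dimension estimate", lambda: 118.0),
]

print(run_waterfall(steps))  # ('within-dimension estimate', 118.0)
```

Returning the label alongside the value mirrors the document's requirement that imputed values stay distinguishable from source-backed ones.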

Composite IQ Calculation

Step 1: Score Each Dimension

For every dimension with usable benchmark coverage, compute a dimension estimate from its source-backed and scoring-only imputed benchmark IQs. If coverage is partial, the estimate receives a coverage confidence score. Abstract Reasoning with direct ARC evidence is not confidence-shrunk: its point estimate is the equal mean of available source-backed ARC projections, while its 1/3, 2/3, or 3/3 coverage communicates uncertainty. Other partially covered dimensions continue to use the matched lower-quartile shrinkage rule.

If there is no source-backed evidence in a dimension, it is usually treated as missing and handled by dimension-level imputation; the main exception is Abstract Reasoning, where no-D1-source rows can receive conservative ARC-AGI-1/2 lower-tail peer estimates before the dimension is averaged.

Step 2: Dimension-Level Imputation

If a model has at least 2 scored dimensions but is missing some of the others, every missing dimension is imputed before the composite is averaged. The cap is matched to models with real data for that dimension and similar capability across the other dimensions:

$$\mathrm{IQ}_{D_k}^{\text{imputed}} = \min\!\left(\overline{\mathrm{IQ}}_{\text{scored dims}},\; Q_{25}\!\left(D_k \mid \text{similar non-}D_k\text{ IQ}\right)\right)$$

where $\overline{\mathrm{IQ}}_{\text{scored dims}}$ is the model's average IQ across the dimensions it does have. For the cap, we look at models released on or before the scored model that have real data on the missing dimension, compare their average IQ across the other dimensions, and use the lower-quartile missing-dimension IQ among the closest comparable models. If the comparable set is too thin, we fall back to the same-era lower quartile for that dimension. If no same-era real data exists for a dimension, we do not invent a neutral default; the model does not receive a derived all-dimension IQ.

In practice, comparable models are those whose average across the non-missing dimensions is within a small IQ band of the model being scored and whose release date is not later than the scored model. If fewer than three comparable models are available, we use the nearest same-era measured models instead. This keeps missing dimensions conservative without letting older models borrow strength from future benchmark cohorts.
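
The matched lower-quartile cap can be sketched as follows. Several details are assumptions of this sketch rather than documented values: the `Peer` record, the `band` width of 5 IQ points, the nearest-rank quartile, and the collapse of the later fallbacks into a single nearest-neighbor step.

```python
import math
from dataclasses import dataclass
from datetime import date

@dataclass
class Peer:
    release: date
    other_dims_avg: float  # average IQ across the non-missing dimensions
    dim_iq: float          # source-backed IQ on the missing dimension

def lower_quartile(values):
    """Nearest-rank 25th percentile (convention assumed for this sketch)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.25 * len(ordered)))
    return ordered[rank - 1]

def impute_dimension(scored_avg, release, peers, band=5.0, min_peers=3):
    """min(model's scored-dimension average, matched lower-quartile cap)."""
    # Only peers released on or before the scored model may contribute.
    eligible = [p for p in peers if p.release <= release]
    if not eligible:
        return None  # no real data for this dimension: no derived IQ
    matched = [p for p in eligible if abs(p.other_dims_avg - scored_avg) <= band]
    if len(matched) < min_peers:
        # Fallback: nearest measured models by other-dimension average.
        by_distance = sorted(eligible, key=lambda p: abs(p.other_dims_avg - scored_avg))
        matched = by_distance[:min_peers]
    return min(scored_avg, lower_quartile([p.dim_iq for p in matched]))

# Hypothetical peers; the last one is released after the scored model.
peers = [
    Peer(date(2024, 1, 1), 118, 105),
    Peer(date(2024, 3, 1), 121, 112),
    Peer(date(2024, 5, 1), 123, 120),
    Peer(date(2024, 6, 1), 119, 108),
    Peer(date(2025, 1, 1), 120, 140),
]

print(impute_dimension(120, date(2024, 7, 1), peers))  # 105
```

The release-date filter is what prevents an older model from borrowing strength from a later cohort: the 2025 peer with IQ 140 never enters the comparable set for a model released in mid-2024.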

All six scored dimensions are always used for derived IQ. Missing a hard dimension such as Abstract Reasoning or Programmatic Reasoning should not improve a model's score, so missing dimensions are filled conservatively rather than omitted from the average.

Dimension Imputation Waterfall

Step	When it applies	What happens	Result
1. Scored or estimated dimension	The model has benchmark coverage in the dimension after predecessor and correlation imputation.	Score the available benchmarks, fill missing benchmarks with the benchmark waterfall, average them, then apply confidence-weighted shrinkage if coverage is partial and the filled estimate is above the matched conservative prior.	Scored or estimated dimension IQ
2. Matched lower-quartile cap	A whole dimension is still missing and the model has at least two scored dimensions.	Find models with real data for the missing dimension and similar average across the other dimensions.	$\min(\text{model scored-dimension average}, \text{matched lower-quartile }D_k)$
3. Nearest-neighbor lower quartile	Too few models fall within the similarity radius.	Use the nearest comparable models by other-dimension average.	$\min(\text{model scored-dimension average}, \text{nearest-neighbor lower quartile }D_k)$
4. Global dimension lower quartile	The comparable set is still too thin.	Use the lower-quartile observed IQ for that dimension across models with real data.	$\min(\text{model scored-dimension average}, \text{global lower-quartile }D_k)$
5. No derived IQ	No real data exists for that dimension at all.	Do not invent a neutral default.	No derived all-dimension IQ.

Step 3: Compute the Composite

$$\mathrm{IQ} = \operatorname{round}\!\left(\frac{1}{6}\sum_{k=1}^{6}\mathrm{IQ}_{D_k}\right)$$

where all six scored dimensions are used once missing dimensions are imputed.
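
A minimal sketch of the final averaging step is shown below. The formula does not say how exact halves are rounded, so this sketch assumes round-half-up; Python's built-in `round` uses round-half-to-even and would give a different result on the sample values. The function name and the six dimension IQs are hypothetical.

```python
import math

def composite_iq(dimension_iqs):
    """Equal-weight mean of six dimension IQs, rounded to an integer.

    Round-half-up is an assumption of this sketch.
    """
    if len(dimension_iqs) != 6:
        raise ValueError("composite needs all six dimensions, imputed if necessary")
    mean = sum(dimension_iqs) / 6
    return math.floor(mean + 0.5)

# Hypothetical dimension IQs after imputation; the mean is exactly 122.5.
dims = [128, 121, 134, 117, 110, 125]

print(composite_iq(dims))   # 123
print(round(sum(dims) / 6)) # 122 (built-in round-half-to-even)
```

The length check reflects the rule that no dimension is ever omitted from the average: a caller must supply an imputed value rather than a shorter list.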

Key rules:

Minimum 2 dimensions required. Models with fewer than 2 scored dimensions do not receive a derived composite IQ.
No omitted dimensions. Models with enough coverage for derived IQ always use a 6-dimension composite; missing dimensions are conservatively imputed.
Transparent count. The display shows X/6 so readers can see how many scored dimensions had source-backed data before dimension-level imputation.
Equal weighting. All dimensions contribute equally. Benchmark-specific expected-score ladders, not differential weighting, handle benchmark quality differences. This keeps Composite IQ transparent and neutral rather than introducing target-specific weights before there is enough external validation data to justify them.

Rank Status

Each model receives a rank status reflecting the completeness of its evaluation:

Full — All 6 scored dimensions covered. The highest-confidence estimate.
Partial — 2–5 dimensions scored. Composite is derived but based on incomplete coverage.
Provisional — Only 1 dimension scored. Not enough for a derived composite.
Unranked — No dimension data available.
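
As a hedged sketch, the status labels can be derived from the count of dimensions with source-backed data. The function name is hypothetical, and any count from 2 up to 5 is treated as Partial here.

```python
def rank_status(scored_dimensions):
    """Map the number of source-backed dimensions (0-6) to a rank status."""
    if not 0 <= scored_dimensions <= 6:
        raise ValueError("scored_dimensions must be between 0 and 6")
    if scored_dimensions == 6:
        return "Full"
    if scored_dimensions >= 2:
        return "Partial"      # composite derived, incomplete coverage
    if scored_dimensions == 1:
        return "Provisional"  # not enough for a derived composite
    return "Unranked"
```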

Tracked Benchmarks & Exclusions

Some benchmarks are tracked internally or shown in standalone charts but are not part of the composite IQ calculation:

DesignArena Code Categories — A broad code-category Elo leaderboard from DesignArena. It is tracked as a source-backed raw benchmark, but is not yet part of Programmatic Reasoning or the WebDev & Design and Software Engineering domains because it needs its own calibration ladder and overlap review.
GSO-Bench — Agentic software optimization: improve a codebase's runtime performance while preserving correctness. Optimization and kernel-style performance work fits the Machine Learning Engineering domain better, so it lives on the Machine Learning Benchmarks page alongside KernelBench and PostTrainBench and is excluded from Composite IQ.

These benchmarks remain in the source-backed dataset for review and future charting — they are simply not included in the composite IQ computation.

Core Math Benchmarks

FrontierMath Tier 4, FrontierMath Tier 1–3, ProofBench, MathArena, and AIME are core inputs to the D2 (Mathematical Reasoning) dimension and are surfaced on the IQ page as standalone benchmark charts. The public charts are ordered hardest to easiest by the current top model's source-backed percent correct.

FrontierMath Tier 4 — the hardest math chart by current top-model percent correct; novel research-level problems with very low gameability.
FrontierMath Tier 1–3 — harder than AIME, easier than T4, with Epoch's expert-human baseline supporting the current IQ 130 region.
ProofBench — formally-verified proof writing. A different cognitive task than the problem-solving benchmarks because the model has to construct a verified proof, not just give an answer.
MathArena — IRT-derived expected performance across non-deprecated math competitions. It adds broad competition-math coverage beyond AIME and FrontierMath while using a stricter midrange ladder so midrange scores do not overstate Mathematical Reasoning IQ.
AIME — a useful competition-math signal, but treated as saturating because frontier systems are near the source ceiling and historical problem contamination is plausible.

Benchmark	IQ 70	85	100	115	130	145	160
FrontierMath T4 expected score	0	2	7	20	45	90	120
FrontierMath T1–3 expected score	0	8	18	36	62	95	125
ProofBench expected score	0	2	6	14	30	60	120
MathArena expected score	0	5	15	32	55	75	95
AIME expected score	-15	5	30	65	92	112	132
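
To show how a ladder turns a raw score into an implied IQ, the sketch below interpolates between adjacent anchors using two rows from the table above. Piecewise-linear interpolation and clamping to the 70–160 range are assumptions made for illustration; the function and variable names are hypothetical.

```python
from bisect import bisect_right

IQ_ANCHORS = [70, 85, 100, 115, 130, 145, 160]
LADDERS = {
    "FrontierMath T4": [0, 2, 7, 20, 45, 90, 120],
    "AIME": [-15, 5, 30, 65, 92, 112, 132],
}

def implied_iq(benchmark, score):
    """Piecewise-linear implied IQ for a raw benchmark score (assumed method)."""
    ladder = LADDERS[benchmark]
    if score <= ladder[0]:
        return 70.0   # clamp below the lowest anchor
    if score >= ladder[-1]:
        return 160.0  # clamp above the highest anchor
    i = bisect_right(ladder, score) - 1  # segment whose lower anchor is <= score
    lo, hi = ladder[i], ladder[i + 1]
    return IQ_ANCHORS[i] + 15 * (score - lo) / (hi - lo)
```

Under these assumptions a perfect AIME score of 100 lands between the IQ 130 anchor (92) and the IQ 145 anchor (112), giving 136, which illustrates how a saturating ladder keeps a maxed-out score from reaching the top of the scale.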

Emotional Reasoning Scoring

Emotional Reasoning (EQ) is a diagnostic domain, not a Composite IQ dimension. The two component signals below are mapped onto the shared 70–160 scale so users can inspect them together. These are not direct human IQ percentiles: EQ-Bench is AI-judged, and AttuneBench is participant-grounded but currently sparse.

$$\mathrm{EQ} = \operatorname{avg}\!\left(\left\{\mathrm{EQ}_{\text{EQ-Bench}},\; \mathrm{EQ}_{\text{AttuneBench}}\right\}_{\text{available}}\right)$$

The variable is written EQ above for historical continuity; it is a diagnostic Emotional Reasoning score averaged from the available source-backed component signals below.

One source-backed component is sufficient to produce a provisional Emotional Reasoning score, consistent with the one-benchmark minimum used across applied domains. Both components are averaged when available; missing components are not imputed. Coverage is a confidence signal, so a one-source estimate should be read as substantially less certain than a score supported by both sources.

EQ-Bench 3 Elo → Emotional Reasoning

EQ-Bench 3 produces Elo ratings from head-to-head emotional-roleplay matchups judged by Claude Opus 4.6. This makes it useful as a dedicated affective/emotional reasoning signal, but it is not neutral ground truth: the judge is Claude/Anthropic-family, so the benchmark can favor Claude-like response style and penalize models that solve the scenario in a substantially different voice. AI IQ treats the Elo scale as a relative contest scale and applies its own human-reference hypothesis: 1300 is the average-human reference point for this structured roleplay task, 1500 is strong, and current frontier Elo values are high-end but kept near the AttuneBench range because the source is subjective and future Elo scores can exceed current rows. The mapping therefore keeps headroom above 2000 Elo:

EQ-Bench Elo	Emotional Reasoning
200	75
600	85
900	92
1100	96
1300	100
1500	113
1700	123
2000	132
2300	140

AttuneBench Composite → Emotional Reasoning

AttuneBench evaluates models against participant annotations from 200 real multi-turn conversations in its Default mode. The Composite is a normalized aggregate across the benchmark's primary human-annotated metrics. This is the strongest participant-grounded Emotional Reasoning source, and its small human-baseline pilot suggests human annotators can bracket or exceed current model ranges on some metrics. AI IQ therefore maps 50 near the average-human reference, treats 52.5 as strong, and reserves 55+ for high-end emotional attunement. Current public coverage is only 11 rows in a narrow frontier range, so AttuneBench remains sparse and is not imputed.

AttuneBench Composite	Emotional Reasoning
45	85
50	100
52.5	112
55	125
60	140

EQ-Bench Style-Sensitivity Adjustment

Because EQ-Bench 3 is judged by Claude, it is treated as a weak component rather than a standalone authority. To reduce family/style bias while preserving its dedicated task coverage, we subtract a 300-point Elo adjustment from the EQ-Bench component for Anthropic models before mapping to the implied Emotional Reasoning score. If stronger participant-grounded or independently judged emotional-reasoning benchmarks gain broader coverage, EQ-Bench 3 is a candidate for removal from the domain composite.

Why retain both sources? EQ-Bench adds dedicated emotional-reasoning coverage while AttuneBench adds participant-grounded emotional-attunement evidence. A model can receive a provisional score from either source, while coverage from both provides the stronger estimate.
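
The Emotional Reasoning pipeline described above (style adjustment, table lookup, averaging of available components) can be sketched as follows. Linear interpolation between table rows and clamping at the table ends are assumptions, and the function names and the `anthropic` flag are hypothetical.

```python
EQ_BENCH = [(200, 75), (600, 85), (900, 92), (1100, 96), (1300, 100),
            (1500, 113), (1700, 123), (2000, 132), (2300, 140)]
ATTUNE = [(45, 85), (50, 100), (52.5, 112), (55, 125), (60, 140)]

def interpolate(points, x):
    """Linear interpolation over (raw, score) rows, clamped at both ends (assumed)."""
    if x <= points[0][0]:
        return float(points[0][1])
    if x >= points[-1][0]:
        return float(points[-1][1])
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def emotional_reasoning(eq_bench_elo=None, attune=None, anthropic=False):
    """Average the available component scores; missing components are not imputed."""
    parts = []
    if eq_bench_elo is not None:
        # 300-point Elo adjustment for Anthropic models, applied before mapping.
        elo = eq_bench_elo - 300 if anthropic else eq_bench_elo
        parts.append(interpolate(EQ_BENCH, elo))
    if attune is not None:
        parts.append(interpolate(ATTUNE, attune))
    return sum(parts) / len(parts) if parts else None
```

For instance, an EQ-Bench Elo of 1500 (113) combined with an AttuneBench Composite of 50 (100) yields 106.5, while either value alone would produce a provisional one-source score.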

Cost & Speed Metrics

Sticker Price — published price for a typical workload

AI IQ's effective-cost views are anchored to 1M I/O Tokens: 1M input tokens plus 1M output tokens, priced at the model's published per-million-token rates. Sticker Price is the dollar amount to process that standard workload:

$$\mathrm{StickerPrice} = p_{\text{in}} + p_{\text{out}}$$

where $p_{\text{in}}$ and $p_{\text{out}}$ are the published per-million-token prices in dollars.

Task Efficiency — how much work does the model use?

Sticker price alone hides large per-task differences in how much work a model uses to solve a benchmark. We estimate this with a measured-or-imputed usage multiplier. For each benchmark, AI IQ first estimates the task cost expected from a model's published input and output prices. The usage signal is the residual: actual task cost divided by expected task cost. Validated direct token-usage data is included as an additional signal where available.

$$\mathrm{DirectTokenUsage} = \frac{T_{\text{model}}}{\mathrm{median}(T)}$$

$$\log(\widehat{C}_{\text{model},b}) = \alpha_b + \beta_{\text{in},b}\log(p_{\text{in}}) + \beta_{\text{out},b}\log(p_{\text{out}})$$

$$\mathrm{BenchmarkUsage}_{b} = \frac{C_{\text{model},b}}{\widehat{C}_{\text{model},b}}$$

$$\mathrm{UsageMultiplier} = \mathrm{geomean}(\mathrm{BenchmarkUsage}_{b}, \mathrm{DirectTokenUsage}\ \mathrm{when\ available})$$
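
The four formulas above translate into a few lines of code. This is a sketch under stated assumptions: the regression coefficients are taken as already fitted per benchmark, the signals are blended with an unweighted geometric mean, and every function name is hypothetical.

```python
import math
from statistics import geometric_mean, median

def direct_token_usage(model_tokens, all_model_tokens):
    """Model token count relative to the median across models."""
    return model_tokens / median(all_model_tokens)

def expected_cost(p_in, p_out, alpha, beta_in, beta_out):
    """Task cost predicted from published prices by the log-linear fit."""
    return math.exp(alpha + beta_in * math.log(p_in) + beta_out * math.log(p_out))

def benchmark_usage(actual_cost, p_in, p_out, alpha, beta_in, beta_out):
    """Residual: actual task cost divided by expected task cost."""
    return actual_cost / expected_cost(p_in, p_out, alpha, beta_in, beta_out)

def usage_multiplier(benchmark_usages, direct_usage=None):
    """Unweighted geometric mean of all available usage signals (assumed weighting)."""
    signals = list(benchmark_usages)
    if direct_usage is not None:
        signals.append(direct_usage)
    return geometric_mean(signals) if signals else None
```

A geometric mean fits ratio data: a benchmark where the model uses twice the expected cost and one where it uses half cancel to a multiplier of 1.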

When validated direct token usage is available, AI IQ can use it directly. Benchmark-cost residuals are blended in when available, but a single benchmark-only residual is treated as provisional; benchmark-only effective cost requires at least two benchmark-cost signals. If a model has positive input/output pricing but no measured multiplier, AI IQ uses a conservative waterfall: one-generation-back same-family same-lineage multiplier, then two-generations-back same-family same-lineage multiplier, then the geometric average of the three closest measured peers, then a final assumed 1× multiplier. The Task Efficiency chart shows the inverse of the usage multiplier, so 2× means the model uses about half the task effort of the median model, and 0.5× means it uses about twice as much.

Usage multiplier waterfall: measured benchmark multiplier; one-generation same-family/same-lineage predecessor; two-generation same-family/same-lineage predecessor; geometric average of the three closest measured peers; assumed 1× only when the model has positive input and output pricing.
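
The waterfall can be read as an ordered fallback chain. The sketch below is illustrative only: the argument names and source labels are hypothetical, the caller is assumed to supply peers already sorted by closeness, and returning `None` for non-positive pricing is an assumption based on the positive-pricing requirement.

```python
from statistics import geometric_mean

def resolve_usage_multiplier(p_in, p_out, measured=None, one_gen_back=None,
                             two_gen_back=None, closest_peers=()):
    """Return (multiplier, source label), or None without positive pricing."""
    if p_in <= 0 or p_out <= 0:
        return None
    if measured is not None:
        return measured, "measured"
    if one_gen_back is not None:
        return one_gen_back, "one-generation predecessor"
    if two_gen_back is not None:
        return two_gen_back, "two-generation predecessor"
    if len(closest_peers) >= 3:
        # Geometric average of the three closest measured peers.
        return geometric_mean(closest_peers[:3]), "closest peers"
    return 1.0, "assumed"

def task_efficiency(multiplier):
    """Inverse of the usage multiplier, as plotted on the Task Efficiency chart."""
    return 1.0 / multiplier
```

With a multiplier of 0.5, `task_efficiency` returns 2.0, matching the reading that 2× means about half the task effort of the median model.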

Effective Cost — what it actually costs to do the same task

Effective Cost is the product of Sticker Price and the usage multiplier:

$$\mathrm{EffectiveCost} = \mathrm{StickerPrice} \times \mathrm{UsageMultiplier}$$

Effective Cost reads as what a model spends on a task after its 1M I/O Tokens sticker price is adjusted by validated token usage and price-adjusted benchmark usage. Models below the diagonal (Effective Cost < Sticker Price) are task-efficient and cheaper than their sticker suggests; models above are task-hungry. This is the cost axis on every effective-cost-vs-quality chart.

Effective-cost charts require positive input and output token pricing. Free, zero-dollar, or rate-limited rows are not plotted as $0 effective cost. Open-weight models can enter these charts only when AI IQ records a specific nonzero hosted-provider price; temporarily subsidized or promotional rates should be avoided in favor of stable published pricing.
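
A minimal worked sketch of the two cost formulas, assuming prices are given in dollars per million tokens. The function names are hypothetical, and returning `None` for non-positive pricing mirrors the charting rule above rather than any documented code path.

```python
def sticker_price(p_in, p_out):
    """Dollar cost of 1M input tokens plus 1M output tokens."""
    return p_in + p_out

def effective_cost(p_in, p_out, usage_multiplier):
    """Sticker price scaled by the usage multiplier; None if pricing is not positive."""
    if p_in <= 0 or p_out <= 0:
        return None  # free or zero-dollar rows are not plotted as $0
    return sticker_price(p_in, p_out) * usage_multiplier
```

A model priced at $3 in and $15 out has an $18 sticker price; with a usage multiplier of 0.5 its effective cost is $9, placing it below the diagonal.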

Token price and per-task cost

Alongside effective cost, the Cost page plots raw published token prices and two representative per-task cost blends. Token cost is a model's published input or output price per 1M tokens. Task cost weights those prices (plus cache-read pricing) by a realistic token mix for a kind of work over one million tokens:

Coding task (cache-heavy): 800K cache-read + 100K fresh input + 100K output — 0.8·cache + 0.1·input + 0.1·output. Cache reads fall back to the input price when a provider publishes no cache pricing.
Copywriting task (output-heavy): 150K input + 850K output — 0.15·input + 0.85·output.

These re-rank models by workload: the coding blend rewards cheap cache reads, while the copywriting blend rewards cheap output. Token and cache prices are published by each provider; the blends are AI IQ constructions documented here.
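
The two blends are simple weighted sums of per-million-token prices. The sketch below uses the weights stated above; the function names are hypothetical.

```python
def coding_task_cost(p_in, p_out, p_cache=None):
    """Cache-heavy blend: 800K cache-read + 100K fresh input + 100K output."""
    cache = p_in if p_cache is None else p_cache  # fall back to the input price
    return 0.8 * cache + 0.1 * p_in + 0.1 * p_out

def copywriting_task_cost(p_in, p_out):
    """Output-heavy blend: 150K input + 850K output."""
    return 0.15 * p_in + 0.85 * p_out
```

With prices of $3 in, $15 out, and $0.30 cache-read, the coding blend comes to about $2.04 and the copywriting blend to about $13.20; without published cache pricing the coding blend rises to about $4.20.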

Speed metrics

AI IQ uses three user-facing speed terms. Latency is time to first token (TTFT): how long a model takes to begin responding. Throughput is tokens per second (TPS): how quickly the answer streams after generation begins. Completion time is time to last token (TTLT): how long the full answer takes to arrive.

The Speed page composes TTLT from a model's own measured parts rather than a single fixed figure: time to first answer token (which grows with input length and includes any hidden reasoning) plus output tokens ÷ TPS. Input- and output-length sliders set the workload. Time axes are reversed so faster models appear to the right.
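
The composition of completion time can be sketched as below. The function and parameter names are hypothetical, and the time to first answer token is taken as a measured input rather than modeled from input length.

```python
def completion_time(first_answer_token_s, output_tokens, tokens_per_second):
    """Time to last token: time to first answer token plus streaming time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_answer_token_s + output_tokens / tokens_per_second
```

For example, a model that takes 2 seconds to reach its first answer token and streams 500 output tokens at 100 TPS completes in 7 seconds.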

Chart: Sticker Price vs Effective Cost. Each model's sticker price plotted against its effective cost after the measured or imputed usage multiplier; the dashed diagonal marks Effective Cost = Sticker Price.

Chart: AI Models by Cost. Published price vs. effective cost after the blended task-usage multiplier.

Chart: Task Efficiency. Inverse of the effective-cost usage multiplier; higher means less price-adjusted task work, with multiplier source shown in tooltips.

Reading Chart Tooltips

Chart tooltips use the same structure across public chart surfaces. Click a model point, bar, or timeline label to open the tooltip; click outside it or scroll the page to dismiss it. Hover alone does not open a tooltip.

The first tab is a stable model summary rather than a chart-specific readout. Its order is always: IQ, Emotional Reasoning (EQ), Effective Cost/1M I/O, Task Efficiency, and Release Date. This makes models comparable across charts even when the chart axes are ordered differently.

The IQ and Cost tabs expose the supporting details. IQ shows the six scored dimension scores, with benchmark-level evidence nested inside each dimension. Emotional Reasoning is shown separately as a diagnostic domain derived from EQ-Bench 3 and AttuneBench. Cost shows input cost per 1M tokens, output cost per 1M tokens, sticker cost per 1M I/O tokens, task efficiency, and effective cost per 1M I/O tokens.

Dimension bell-curve charts distinguish measured-enough points from lower-coverage estimates. For Abstract Reasoning, models with at least two source-backed D1/Abstract benchmarks use the standard solid provider-colored dots and are eligible for labels. Lower-coverage D1 estimates remain visible as same-size dots with slightly lighter provider-colored fills and ordinary solid outlines, but are not labeled by default. This keeps sparse but useful estimates inspectable without making them look as certain as source-backed ARC coverage.

Limitations & Transparency

Dimension coverage varies. Some models have data for all 6 scored dimensions; others have as few as 2 (with the rest imputed). Every score is an estimate, with confidence increasing as coverage rises. Always check the X/6 count and rank status.
Benchmark mix matters. Two models with the same composite IQ may have very different underlying data quality. One might have all frontier benchmarks while another relies mostly on saturating or contamination-sensitive benchmarks. The rank status and dimension count help distinguish these cases.
Imputation is conservative, not clairvoyant. Missing values are filled first from explicit direct-predecessor lineage when available, then from within-dimension benchmark evidence, then from comparable measured models at the dimension level. These are reasonable estimates, not ground truth — a model's true ability on an unevaluated benchmark could be significantly higher or lower.
Anchor calibration is subjective. The mapping from raw scores to IQ involves judgment calls about what different performance levels mean relative to human cognitive ability. We document our rationale for each benchmark, but reasonable people can disagree.
IQ is a metaphor. Human IQ tests measure a specific construct via standardized instruments under controlled conditions. AI benchmark performance is a different thing. The IQ scale provides an intuitive frame of reference, not a claim of equivalence.
Calibration ladders are a design choice. The expected score at each IQ level directly affects which models benefit and which are penalized. Models that excel on saturating benchmarks will need very high raw scores to receive very high implied IQ, which may feel unfair if those benchmarks genuinely reflect high ability. We believe the trade-off — rewarding harder evaluation — is correct, but the specific ladder values are judgment calls.
Benchmarks become stale. As models improve and training data evolves, benchmark ladders, gameability ratings, and source selection may need revision. This methodology is a living document.

High-End Calibration

The expected-score ladders intentionally get more demanding above IQ 140 on many benchmarks. Each additional raw point often contributes less to implied IQ in the superhuman range than in the human range. This reflects three realities:

Human IQ distributions thin out at the tails. Scores between IQ 100 and IQ 120 are far more common than scores between IQ 140 and IQ 160.
Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
Practical discrimination. Without demanding high-end ladders, reasoning vs. non-reasoning configurations of the same model can produce unrealistically large IQ gaps. The ladder shape keeps those differences meaningful without letting one saturated benchmark dominate the composite.

A perfect or near-perfect score can still imply IQ 160 when that is a reasonable interpretation of the source. In the current calibration, several mature or contamination-sensitive benchmarks instead require off-scale performance for IQ 145 or 160. The point is not to cap hard benchmarks forever, but to make the high-end requirements explicit enough that future calibration passes can decide when a sudden 100% score on a very hard benchmark should map to an appropriately high implied IQ.

Data Process

AI IQ keeps source-backed data, extracted updates, and derived scoring separate. That separation matters because a raw benchmark chart should show what a public source actually reported, while the composite IQ can use conservative scoring-only imputations to avoid rewarding missing coverage.

Source-backed benchmark data can exist for models that are not yet shown on public chart surfaces. A model must have launch metadata, must not be hidden, and must have a derived IQ before it appears in the main IQ charts. This keeps placeholder rows inspectable in the data table without letting a single benchmark import promote them into public trend charts.

Stage	Purpose	What is preserved
1. Capture	Save the raw leaderboard or source text used for an update.	The original pasted or scraped source capture.
2. Extract	Map source rows to canonical model entries and fields.	A small reviewable update listing exactly which model fields changed.
3. Apply	Write source-backed values into the model dataset.	Unknown values stay blank; unrelated fields are not guessed.
4. Score	Derive IQ on a temporary scoring copy.	Raw benchmark values remain source-backed; imputed values are used only for derived IQ.

Manual source captures that may be hard to reproduce exactly are archived so the same raw data can be re-parsed later if the extraction rules improve. Larger generated scrapes can be refreshed from the original source and do not need to be treated as permanent public evidence in the same way.

Model Tiers

The tier filter on charts groups models into six levels: Flagship, Secondary, Tertiary, Compact, Ultralight, and Coding. Tiers describe where a model sits within its own lab's lineup, not how it scores against other labs' models — a lab's best current offering is always Flagship, and the levels below follow that lab's own ladder down to its smallest models. Coding-specialist releases sit in their own filter bucket. Open-weight releases count as part of the same lab's lineup. This keeps tier and measured intelligence independent: filtering to Flagship compares each lab's best model head-to-head, and the scores — not the tier — show how those bets actually perform. Because tiers are lab-relative, a Tertiary model from one lab can outscore a Flagship model from another; that gap is a finding, not an error. Not every lab fills every level, and tier assignments change only when a lab restructures its lineup, never when new benchmark results arrive.

Chart Inclusion

AI IQ separates model data from chart display policy. A model can exist in the dataset, have source-backed benchmark rows, and still be absent from a public chart if it does not meet that chart's policy or required fields.

Most public chart surfaces use a default policy based on publication status, derived IQ availability, model type, provider tier, and generation recency. Each major lab's main model line keeps several previous generations on charts; other top-tier models keep one previous generation; lower-tier variants such as mini, nano, Sonnet, Haiku, Flash, and smaller open-weight sizes are included only for the current generation. Archive and hidden rows are excluded unless explicitly overridden.

The IQ Over Time chart uses a stricter frontier-timeline policy. It requires a release date, derived IQ, a public model row, a general-purpose model type, membership in a top-lab provider grouping, and either top-tier status or membership in one of the lab's recognized main model lines — the latter keeps a lab's historical flagship line (for example Opus) on the timeline after the lab restructures its lineup above it. Provider lines then connect non-decreasing IQ checkpoints rather than every model from that provider.
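
One plausible reading of "non-decreasing IQ checkpoints" is a running-maximum filter over a provider's releases in date order. The sketch below implements that reading only; it is an assumption, not the documented rule, and the function name and tuple layout are hypothetical.

```python
def frontier_checkpoints(models):
    """Keep models whose IQ is at least the best earlier IQ for the provider.

    models: list of (release_date_iso, name, iq) for one provider.
    """
    kept, best = [], float("-inf")
    for date, name, iq in sorted(models):  # ISO dates sort chronologically
        if iq >= best:
            kept.append((date, name, iq))
            best = iq
    return kept
```

Under this reading, a later release that scores below an earlier one stays in the dataset but is skipped when the provider line is drawn.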

The policy fields are maintained in the admin dashboard: publication status, model type, provider tier, model line, generation offset, and display override. Provider tier follows each lab's own ladder as described under Model Tiers; lineup names such as OpenAI mini/nano, Anthropic Sonnet/Haiku, Google Flash, and Qwen size tiers map onto that ladder per lab rather than acting as generic cross-provider role names.

Sources

Benchmark scores, prices, and token usage come from publicly published leaderboards. Each source is sampled periodically and reconciled against published numbers before being applied.

Artificial Analysis Intelligence Index — the primary aggregator. Provides scores for AIME, GPQA Diamond, SWE-Bench Verified, HLE, SciCode, Terminal-Bench 2.1, Terminal-Bench Hard, CritPt, LiveCodeBench, IFBench, MMMU-Pro, and the AA composite indices (Omniscience, GDPval, τ₂-Bench Telecom, Long Chain Reasoning), plus per-model pricing, response time, median throughput, total evaluation cost for the AA suite, and token-usage data used for task efficiency.
Arena.ai Overall — Arena.ai's head-to-head text Elo ratings and ranks
Arena.ai WebDev — pairwise web-app development evaluations using Bradley-Terry ratings
DesignArena Full Stack — human-preference Elo for end-to-end frontend + backend builds
DesignArena Code Categories — broad code-category Elo tracked as a raw benchmark pending calibration
ARC Prize leaderboard — ARC-AGI-1, ARC-AGI-2, ARC-AGI-3 scores, and ARC-AGI per-task cost where published
Vals.ai — the Vals Index, AIME, ProofBench, SWE-Bench, and LiveCodeBench source views where available
Vals.ai LegalBench, Legal Research Bench, and Harvey Legal Agent Benchmark — the three equally weighted Legal-domain source views
Scale Labs SWE-Bench Pro Public Dataset — public SWE-Bench Pro Resolve Rate source
SWE Marathon — long-horizon software-engineering Resolution Rate (Pass@1), using best published agent/harness rows for canonical model-level scoring
Cognition FrontierCode — FrontierCode Diamond score for high-quality production-code tasks, using canonical public model rows where the source model identity is clear
Mercor APEX-SWE — professional software-engineering overall Pass@1 scores
SWE-rebench — continuously evolving and decontaminated software-engineering leaderboard rows
MathArena — model-level expected performance across non-deprecated math competitions
DeepSWE v1.1 — Datacurve's live leaderboard for original long-horizon software-engineering tasks
GSO-Bench — software optimization Opt@1 leaderboard using OpenHands scaffold rows for canonical model-level scoring
SWE-Bench — SWE-Bench Verified leaderboard rows, using clear single-model agent/model pairs for model-level scoring
LiveCodeBench — continuously refreshed coding-problem benchmark sourced from live contests
Terminal-Bench — Terminal-Bench 2.1 and Terminal-Bench Hard task accuracy
BrowseComp — hard browsing benchmark with published human-trainer performance used for calibration
Humanity's Last Exam — benchmark semantics for the HLE calibration caveat
CritPt paper — research-physics benchmark semantics for the CritPt calibration caveat
SciCode — scientific coding benchmark results
Agents' Last Exam — agent/harness leaderboard for real-world professional workflows, using Full / Overall pass rate for canonical model-level scoring
OSWorld-Verified and Toolathlon — sparse computer-use and tool-use signals; current score rows use clear mirror/self-reported mappings, with official sources used for benchmark semantics and exact-match validation
MCP Atlas — real MCP-server tool-use tasks with claim-level scoring
IFBench, AA Omniscience, and AA Long Chain Reasoning — instruction-following, factual-reliability, and long-document reasoning signals used for the Reliability dimension
BullshitBench v2 — a pushback / anti-sycophancy signal (whether a model challenges false premises) used for the Reliability dimension
FACTS Grounding — Google's source-grounded long-form factuality benchmark, using the Score column from a Kaggle FACTS Grounding leaderboard capture for the Reliability dimension
Epoch AI — FrontierMath Tier 1–3 and Tier 4 accuracy
EQ-Bench 3 — emotional-intelligence Elo
AttuneBench — emotional-attunement benchmark from real human-AI conversations

The Artificial Analysis Intelligence Index can list two rows under one display name when the same underlying model has both a reasoning and a non-reasoning configuration (the reasoning row is marked with a 💡 lightbulb icon). When the two configurations differ meaningfully on cost, latency, or quality, AI IQ tracks them as separate model entries.
