Methodology

AI IQ assigns each model an estimated IQ score by evaluating performance across 7 scored capability dimensions, each measured by source-backed benchmarks where coverage exists. Every scored benchmark uses a calibrated ladder: what score would correspond to IQ 70, 85, 100, 115, 130, 145, and 160 on that task? Missing benchmarks and dimensions are conservatively imputed inside the scoring pipeline, and every derived composite IQ averages all seven scored dimension scores.

This page documents the full scoring system: how source data is captured, how the seven scored dimensions are defined, how raw scores map to IQ by interpolating through those expected-score ladders, how saturated or gameable benchmarks are constrained by their calibration shape, and how missing values are imputed without changing the source-backed benchmark table. Emotional Reasoning (EQ) is retained as an experimental diagnostic metric, but excluded from Composite IQ.

The 7-Dimension Framework

AI IQ organizes Composite IQ evaluation into seven scored dimensions. Each dimension uses source-backed benchmarks that are averaged together, with missing benchmarks conservatively imputed only inside the scoring pipeline. Benchmarks fall into two calibration classes:

  • Frontier benchmarks are hard, low-gameability tests where a large score gain at the frontier usually represents a large capability gain.
  • Saturating benchmarks are easier, more gameable, or already clustered near the top. Their ladders still reach IQ 160 where appropriate, but the score expectations rise quickly near the high end so a merely high score does not overstate ability.

The composite IQ requires at least 2 of 7 scored dimensions to have data. Models with fewer scored dimensions do not receive a derived IQ and are skipped by IQ-ranked chart surfaces.

  • D1, Mathematical Reasoning: FrontierMath T4, FrontierMath T1–3, ProofBench, MathArena, AIME (saturating)
  • D2, Scientific Reasoning: Humanity's Last Exam, CritPt, SciCode (saturating), GPQA Diamond (saturating)
  • D3, Abstract Reasoning: ARC-AGI-2, ARC-AGI-1 (saturating), ARC-AGI-3
  • D4, App Building: Arena.ai WebDev, DesignArena Frontend, DesignArena Full Stack, Vibe Code Bench
  • D5, Production Engineering: LiveCodeBench (saturating), FrontierCode Diamond, SWE-Bench Verified (saturating), SWE-Bench Pro, DeepSWE, SWE-rebench, SWE Marathon
  • D6, Computer Use: Terminal-Bench 2.0, Terminal-Bench Hard, BrowseComp, OSWorld-Verified, Toolathlon, MCP Atlas, Agents' Last Exam
  • D7, Reliability: IFBench, AA Omniscience

Formulas

Each benchmark raw score \(s\) is converted to an IQ value by comparing it with that benchmark's expected score at fixed IQ levels. Let \(S(q_i)\) be the expected raw score for IQ level \(q_i\), where \(q_i \in \{70,85,100,115,130,145,160\}\). For a score between two adjacent expected scores, AI IQ uses piecewise-linear interpolation:

$$f(s) = q_i + \frac{s - S(q_i)}{S(q_{i+1}) - S(q_i)}\,(q_{i+1} - q_i), \qquad S(q_i) \le s \le S(q_{i+1})$$

Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using the benchmark-level waterfall below (see Benchmark-Level Imputation).

The seven scored dimensions:

$$\begin{array}{l l} \mathrm{IQ}_{\text{Math}} & = \operatorname{avg}\!\left(f(\text{FrontierMath T4}),\; f(\text{FrontierMath T1-3}),\; f(\text{ProofBench}),\; f(\text{MathArena}),\; f(\text{AIME})\right) \\[6pt] \mathrm{IQ}_{\text{Science}} & = \operatorname{avg}\!\left(f(\text{HLE}),\; f(\text{CritPt}),\; f(\text{SciCode}),\; f(\text{GPQA})\right) \\[6pt] \mathrm{IQ}_{\text{Abstract}} & = \operatorname{avg}\!\left(f(\text{ARC-AGI-2}),\; f(\text{ARC-AGI-1}),\; f(\text{ARC-AGI-3})\right) \\[6pt] \mathrm{IQ}_{\text{AppBuild}} & = \operatorname{avg}\!\left(f(\text{Arena.ai WebDev}),\; f(\text{DesignArena Frontend}),\; f(\text{DesignArena Full Stack}),\; f(\text{Vibe Code Bench})\right) \\[6pt] \mathrm{IQ}_{\text{ProdEng}} & = \operatorname{avg}\!\left(f(\text{LiveCodeBench}),\; f(\text{FrontierCode Diamond}),\; f(\text{SWE-Bench Verified}),\; f(\text{SWE-Bench Pro}),\; f(\text{DeepSWE}),\; f(\text{SWE-rebench}),\; f(\text{SWE Marathon})\right) \\[6pt] \mathrm{IQ}_{\text{Computer}} & = \operatorname{avg}\!\left(f(\text{Terminal-Bench 2.0}),\; f(\text{Terminal-Bench Hard}),\; f(\text{BrowseComp}),\; f(\text{OSWorld-Verified}),\; f(\text{Toolathlon}),\; f(\text{MCP Atlas}),\; f(\text{Agents' Last Exam})\right) \\[6pt] \mathrm{IQ}_{\text{Reliability}} & = \operatorname{avg}\!\left(f(\text{IFBench}),\; f(\text{AA Omniscience})\right) \end{array}$$

\(f\) is piecewise-linear interpolation through each benchmark's expected-score ladder.

The composite IQ is the mean of all seven scored dimension scores. At least 2 dimensions must be source-backed or predecessor-imputed before any missing whole dimensions are filled:

$$\boxed{\;\mathrm{IQ} = \frac{1}{7}\!\left(\mathrm{IQ}_{\text{Math}} + \mathrm{IQ}_{\text{Science}} + \mathrm{IQ}_{\text{Abstract}} + \mathrm{IQ}_{\text{AppBuild}} + \mathrm{IQ}_{\text{ProdEng}} + \mathrm{IQ}_{\text{Computer}} + \mathrm{IQ}_{\text{Reliability}}\right), \qquad n_{\text{scored}} \ge 2\;}$$
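The boxed formula can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `composite_iq`, the dictionary input, and the `n_scored` argument are assumptions made for the example, and the seven dimension scores are taken as already computed (including any imputation).

```python
DIMENSIONS = (
    "Math", "Science", "Abstract", "AppBuild",
    "ProdEng", "Computer", "Reliability",
)
MIN_SCORED_DIMENSIONS = 2  # the n_scored >= 2 condition in the boxed formula


def composite_iq(dimension_iq, n_scored):
    """Return the mean of all seven dimension scores, or None if ineligible.

    dimension_iq: dict mapping each dimension name to its IQ score
                  (hypothetical input shape, assumed already imputed).
    n_scored:     how many dimensions had data before whole-dimension filling.
    """
    if n_scored < MIN_SCORED_DIMENSIONS:
        return None  # no derived IQ for models with too little coverage
    missing = [name for name in DIMENSIONS if name not in dimension_iq]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(dimension_iq[name] for name in DIMENSIONS) / len(DIMENSIONS)


# Made-up dimension scores, for illustration only.
example = {
    "Math": 120, "Science": 125, "Abstract": 110, "AppBuild": 130,
    "ProdEng": 128, "Computer": 122, "Reliability": 119,
}
print(composite_iq(example, n_scored=7))  # 122.0
print(composite_iq(example, n_scored=1))  # None
```

Returning `None` instead of a number mirrors the rule that under-covered models receive no derived IQ.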

D1: Mathematical Reasoning

Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data. Human references differ by benchmark: FrontierMath uses specialist and expert-team math references, ProofBench uses formalization ability, and MathArena/AIME use competition-math populations rather than the general public.

FrontierMath T4 Frontier

Format: Novel research-level math problems
Tier: 4 (research-level)
Gameability: Very Low

Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data, and Epoch describes this split as research-level mathematics outside the Tiers 1–3 human-baseline competition. The human-reference guess treats ordinary non-specialists and most non-matched mathematicians as near zero, relevant specialists as able to score a small but meaningful amount, and broad expert groups as the high-end reference. The ladder keeps 25% around IQ 130, the mid-50s around IQ 145, and off-scale headroom above 100% for IQ 160.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 1.5 | 5 | 12 | 25 | 55 | 120

FrontierMath T1–3 Frontier

Format: Novel advanced math problems
Tier: 1–3
Gameability: Very Low

FrontierMath Tier 1–3 covers difficult novel math problems below the research-level Tier 4 split. Epoch's human-baseline competition used exceptional math undergraduates and subject-matter experts, and its adjusted discussion places the relevant human baseline around 30–50%. The ladder therefore keeps 38% at IQ 130, current top scores in the low-to-mid 140s, and off-scale headroom for stronger future performance.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 4 | 10 | 22 | 38 | 60 | 120

ProofBench Frontier

Format: Formal proof construction
Tasks: Verified theorem proving
Gameability: Low

ProofBench tests whether a model can construct formally verified mathematical proofs rather than only solve for a final answer. It is a different cognitive task from competition math because the Lean 4 output must satisfy a verifier, with no partial credit for plausible but invalid reasoning. The human-reference guess treats ordinary mathematicians without Lean as near zero, competent Lean users as the midrange reference, and strong formalizers as the high-end reference. The current ladder is exponential at the upper tail and keeps perfect source-range performance from automatically exhausting future headroom.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 2 | 6 | 14 | 30 | 60 | 120

MathArena Frontier

Format: Competition math expected performance
Scale: IRT-derived percent expected correctness
Gameability: Low

MathArena estimates each model's expected performance across non-deprecated math competitions using item-response-theory calibration. It is treated as a broad Mathematical Reasoning signal because it aggregates many competition-style questions while preserving difficulty information instead of only counting raw solve rate. The human-reference guess uses qualified contest participants through strong olympiad/Putnam-style humans, not ordinary non-specialists. The ladder keeps top frontier scores in the high-IQ range while requiring stronger midrange performance before awarding IQ 130+ credit. Source rows are mapped only when the model variant clearly matches a canonical dataset row; duplicate lower-reasoning configurations and source-only variants are skipped until the dataset adds an explicit matching model entry.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 5 | 15 | 32 | 55 | 75 | 95

AIME Saturating

Format: Integer answers (0–999)
Questions: 15 per exam
Gameability: High

Competition mathematics with integer answers. AIME is a 15-question, 3-hour exam for high AMC scorers. The ladder treats average AIME-qualifier-level performance as around IQ 100 and a score around 13.8/15 as IQ 130 because public human comparisons place 13.9/15 around top-500 nationally and above the USAMO cutoff. Old AIME problems are widely available in training data and frontier models are near saturation, so IQ 145 and 160 remain off-scale: a perfect score does not automatically imply either level on this benchmark.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | -15 | 5 | 30 | 65 | 92 | 112 | 132
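As a worked check of the saturation claim, assuming the ladder's expected scores are percent correct (13.8/15 = 92%), the IQ 130 rung at 92 and the IQ 145 rung at 112 give the following for a perfect 15/15:

$$\frac{13.8}{15} = 0.92 \;\Rightarrow\; f(92) = 130, \qquad f(100) = 130 + \frac{100 - 92}{112 - 92}\,(145 - 130) = 136$$

Under that assumption, a perfect AIME score lands at about IQ 136, short of IQ 145.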

D2: Scientific Reasoning

Scientific Reasoning measures expert scientific knowledge, research-style analysis, and difficult problem solving under uncertainty. Its scored benchmark set currently uses Humanity's Last Exam, CritPt, SciCode, and GPQA Diamond. GPQA has the clearest published human anchors; the other benchmarks use best-guess human references informed by task prerequisites and relative hardness. Detailed calibration notes for those benchmark cards appear below after the Production Engineering section.

D3: Abstract Reasoning

Fluid abstraction is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to fluid intelligence in human psychometrics — raw problem-solving ability applied to patterns never seen before.

ARC-AGI-2 Frontier

Format: Visual grid puzzles (novel patterns)
Tasks: Unique visual pattern completion
Gameability: Essentially Ungameable

Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized. This is the purest test of abstract reasoning in the benchmark set — no prior knowledge helps, only the ability to infer abstract rules from examples. ARC reports that human testing used hundreds of general-public, non-expert participants, with average human performance around 60–66%. The ladder therefore anchors 60% near IQ 100 while keeping perfect source-range performance exceptional and IQ 160 off-scale.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 20 | 60 | 75 | 85 | 100 | 125

ARC-AGI-3 Hard

Format: Interactive adaptation tasks
Tasks: Novel environments with feedback
Gameability: Low, but post-training-sensitive
IQ Ceiling: 160 off-scale

ARC-AGI-3 extends the ARC family from passive puzzle solving into interactive environments where an agent must adapt from feedback. Its score is Relative Human Action Efficiency: 100% means completing all games at or above the human action-efficiency baseline, not merely solving 100% of static tasks. ARC reports broad general-public human testing, 342 public-demo plays with 145 solves, and 100% environment solvability by at least two independent participants. The ladder therefore maps the average public-demo completion region near IQ 100 while reserving full human-baseline efficiency for IQ 145.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 15 | 45 | 70 | 90 | 100 | 115

ARC-AGI-1 Saturating

Format: Visual grid puzzles
Gameability: Ungameable (but saturating)

Same format as ARC-AGI-2 but an older and easier problem set. Large-scale human studies and ARC summaries put average crowd-worker performance in the roughly 64–77% range, while STEM-grad and human-panel rows are near perfect. The ladder therefore maps about 77% to IQ 100 and treats near-perfect ARC-AGI-1 performance as saturation-prone rather than as IQ 145+ evidence.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 35 | 77 | 90 | 98 | 112 | 130

D4: App Building

App Building measures whether a model can turn product and design prompts into usable apps, front-end experiences, and full-stack prototypes. It covers prompt following, sensible architecture choices, UI and UX judgment, implementation completeness, and the polish of the shipped artifact.

App Building is a top-level Composite IQ dimension. It requires at least two usable benchmark signals before missing benchmarks in the dimension are filled.

Methodology caveat: the current App Building set is intentionally focused on greenfield app and web-artifact quality, and its inputs are not four fully independent sub-skills. Arena.ai WebDev, DesignArena Frontend, and DesignArena Full Stack all contain overlapping product, UI, UX, prompt-following, and implementation-quality signals. Vibe Code Bench adds a more task-suite-like app-building signal, but this dimension should still be read as a practical app-building cluster rather than a rigorously factor-separated measurement. Broader coverage across product architecture, user-intent inference, accessibility, maintainability, and long-horizon iteration would make the dimension stronger.

Arena.ai WebDev Frontier

Format: Front-end web development arena
Scale: Bradley-Terry Elo
Gameability: Low, but prompt/category-sensitive

Arena.ai WebDev is included as an interface-heavy app-building signal because it evaluates end-to-end web development rather than pure algorithmic coding. It is not a pure design benchmark: the score mixes implementation quality, visual taste, layout, UX, and prompt following. That blend is useful because real software model choice often depends on the whole shipped artifact, not just whether the code compiles. As a best-guess human reference, 1300 Elo is treated as average-useful web artifact quality in this arena, with 1500+ representing strong production-like work.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected Elo | 1100 | 1200 | 1300 | 1400 | 1500 | 1620 | 1800
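For intuition about what a gap on an Elo ladder means, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the conventional base-10 logistic with a 400-point scale. This page does not state the arena's actual scaling, so the function `elo_win_probability` and its `scale` parameter are illustrative assumptions, not Arena.ai's published method.

```python
def elo_win_probability(rating_a, rating_b, scale=400.0):
    """Expected win rate of A over B under the conventional Elo logistic.

    Assumption: base-10 logistic with a 400-point scale. An arena that
    rescales its Bradley-Terry coefficients differently would need a
    different `scale` value.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# The IQ 130 rung (1500) against the IQ 100 rung (1300) on this ladder.
print(round(elo_win_probability(1500, 1300), 3))  # 0.76
# Equal ratings give an even matchup.
print(elo_win_probability(1300, 1300))  # 0.5
```

Under that assumption, the 200-point gap between the IQ 100 and IQ 130 rungs corresponds to the stronger artifact being preferred in roughly three of four pairwise comparisons.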

DesignArena Frontend Frontier

Format: Agentic frontend web-app arena
Scale: Bradley-Terry Elo
Gameability: Low, but prompt/category-sensitive

DesignArena Frontend measures multi-file React front-end apps generated by agentic coding systems and judged through real user pairwise comparisons. It is included as a greenfield app-building signal distinct from repository repair: the score reflects whether the model can turn product prompts into usable rendered front-end experiences.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected Elo | 1000 | 1060 | 1120 | 1180 | 1270 | 1390 | 1560

DesignArena Full Stack Frontier

Format: End-to-end frontend + backend build arena
Scale: Bradley-Terry Elo
Gameability: Low, but prompt/category-sensitive

DesignArena Full Stack is a human-preference arena for whole-stack builds: it scores frontend interactivity and layout alongside backend work (Supabase schema design, data seeding, API functionality, CRUD, auth, end-to-end persistence, error handling). It complements Arena.ai WebDev's front-end focus with a broader shipped-artifact signal. Its Elo scale runs lower and tighter than Arena.ai WebDev's, so the ladder is fit to its own distribution rather than reusing WebDev's absolute-Elo anchors. As a best-guess reference, 1120 Elo is average-useful full-stack artifact quality, 1280 is strong end-to-end app quality, and 1400+ is frontier-level execution.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected Elo | 1000 | 1060 | 1120 | 1190 | 1280 | 1400 | 1540

Vibe Code Bench Frontier

Format: Web apps from natural-language specs
Scale: Accuracy / task success
Gameability: Lower, but harness-dependent

Vibe Code Bench v1.1 measures whether models can build web applications from scratch from product-style specifications in an agentic environment. It rounds out App Building by complementing the human-preference DesignArena signals with a task-success benchmark.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 5 | 15 | 30 | 50 | 75 | 100

D5: Production Engineering

Production Engineering measures coding fluency, code quality, repository repair, testing, debugging, and long-horizon engineering execution. It is separate from App Building because competitive-programming correctness and repository-maintenance skill do not necessarily imply good product architecture, UI, or UX judgment.

Production Engineering is a top-level Composite IQ dimension. It requires at least two usable benchmark signals before missing benchmarks in the dimension are filled.

SWE Marathon Frontier

Format: 20 multi-hour software-engineering tasks
Scale: Resolution Rate (Pass@1)
Gameability: Lower, but agent/harness-dependent

SWE Marathon measures long-horizon software work across library reproductions, product clones, ML engineering, and optimization tasks. It is included as a hard Production Engineering signal because the tasks require sustained planning, implementation, debugging, and verifier-driven iteration over multi-hour runs. The public leaderboard reports model-plus-agent configurations rather than pure model-only rows; duplicate source rows show that agent choice can materially change performance. AI IQ therefore treats this as sparse best-public-agent-stack evidence and preserves the selected agent in extraction notes. Ordinary programmers are assumed near zero on autonomous multi-hour completion; strong senior engineers or small teams plausibly define the 13-25 region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 1 | 3 | 7 | 13 | 25 | 50

FrontierCode Diamond Frontier

Format: 50 hardest production-code tasks
Scale: Rubric-weighted score percentage
Gameability: Low, private tasks

FrontierCode Diamond measures whether model-generated changes would meet maintainer standards for mergeable production code. It combines correctness with code quality, test quality, scope discipline, style, and repository-specific expectations. AI IQ uses the Diamond score metric, not pass rate, because the source treats score as the quality-sensitive aggregate and reports best-effort model scores at each model's best reasoning level. The benchmark is less mixed-agent than SWE Marathon, but it remains sparse, private-task, and rubric-sensitive. Public charts include every published Diamond score that maps to a canonical public model row; source-only systems without a model entry are archived in extraction notes rather than approximated. Ordinary developers are assumed near zero on maintainer-grade Diamond tasks; expert maintainers or small review teams plausibly define the 7-14 region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 0.5 | 1.5 | 3.5 | 7 | 14 | 32

SWE-rebench Frontier

Format: Continuously evolving software-engineering tasks
Scale: Resolved-rate percentage
Gameability: Low, decontaminated

SWE-rebench measures software-engineering agents on a continuously evolving and decontaminated task set. It is included as a hard Production Engineering signal, but its ladder is intentionally conservative while coverage is still narrow: current frontier rows around the low-60s map to low-140s IQ, and the upper tail is deliberately flat so future benchmark saturation does not overstate the composite. Competent repo-debugging engineers are estimated around 15-30, while strong senior engineers or agent-assisted teams plausibly define the 48-70 region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 5 | 15 | 30 | 48 | 70 | 120

DeepSWE Frontier

Format: Original long-horizon software-engineering tasks
Tasks: 113 tasks across 91 repositories
Scale: Pass@1 percentage

DeepSWE measures coding agents on original software-engineering tasks written from scratch across a broad set of active repositories. It is treated as a hard Production Engineering signal because the tasks are contamination-resistant, long-horizon, and verified with behavior-focused tests rather than copied public patches. Competent software engineers are estimated around 5-12 on broad unfamiliar-repo autonomous tasks, while strong senior engineers or focused teams plausibly define 25-50.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 2 | 5 | 12 | 25 | 50 | 100

SWE-Bench Pro Frontier

Format: Hard software-engineering tasks
Scale: 0–1 score
Gameability: Lower than SWE-Bench Verified

A harder software-engineering benchmark used to complement SWE-Bench Verified. It keeps code repair in the Production Engineering dimension while reducing reliance on the older verified set. Tasks can require hours to days for professional engineers, so competent professionals are expected to resolve a meaningful minority and strong senior engineers or teams plausibly define the 0.55-0.75 region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 0.1 | 0.25 | 0.4 | 0.55 | 0.75 | 1.1

SWE-bench Verified Saturating

Format: Real GitHub issue resolution
Tasks: 500 verified issues
Gameability: Very High

Models generate patches to resolve real GitHub issues and pass unit tests. However, many issues predate model training cutoffs and some have solution leakage. The current ladder remains heavily ceiling-compressed: source-range scores can still distinguish models, but IQ 145+ requires off-scale performance. Competent engineers with the right repository context could solve many tasks, so 100% source performance maps only to IQ 130 rather than the top of the scale.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | -10 | 10 | 35 | 65 | 100 | 145 | 200

LiveCodeBench Saturating

Format: Continuously refreshed coding problems
Scale: Percent solved
Gameability: Low, but overlapping

LiveCodeBench adds a broad, continuously refreshed coding signal to Production Engineering. Because it overlaps with other programming benchmarks and frontier models are clustered high, its ladder gives strong credit through the middle of the range while reserving IQ 130+ for scores of 82% and above and keeping IQ 145+ off-scale (the ladder places IQ 145 at 110). Competent competitive programmers plausibly sit around 35-60, while strong contest programmers define the 82+ region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 15 | 35 | 60 | 82 | 110 | 145

Scientific Reasoning Benchmark Details

Scientific reasoning captures expert scientific knowledge, research-style reasoning, and difficult problem analysis under uncertainty. The frontier benchmarks test whether a model can reason through questions that push the boundaries of human expertise itself, while the saturating benchmarks add broader scientific and graduate-level knowledge signals.

SciCode Saturating

Format: Scientific research problems implemented in code
Gameability: Moderate

SciCode uses code as the execution medium for realistic scientific research problems. The tasks require identifying scientific concepts, recalling domain facts, reasoning through numerical methods or simulations, and transforming that reasoning into computation. It is included in Scientific Reasoning because the bottleneck is scientific research reasoning, not generic programming. The human-reference guess is based on qualified scientific-coding ability: ordinary non-coders are near zero, competent scientific Python users can plausibly solve a meaningful fraction of subproblems, and strong computational scientists should sit much higher.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 15 | 30 | 45 | 65 | 100 | 150

Humanity's Last Exam Frontier

Format: 76% exact-match, expert-contributed
Questions: 3,000 (expert-sourced, screened against models)
Gameability: Low

Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. The human-reference guess treats ordinary non-specialists across the full mixed-domain set as near the low single digits, while relevant domain experts or an expert panel would be much higher. The current ladder makes small scores meaningful, puts 45% near IQ 145, and maps perfect in-range performance to IQ 160 as a universal-expert-level result.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 2 | 5 | 10 | 20 | 45 | 100

CritPt Frontier

Format: Critical-point analysis (novel problems)
Scale: Raw source points
Gameability: Low

Novel critical-point analysis problems that require expert analytical reasoning. Although the tasks use mathematical methods, CritPt is grouped with Scientific Reasoning because it functions as a hard expert-analysis signal rather than a broad math-competition signal. Problems are original and contributed by a broad physics-research collaboration, making memorization ineffective. The human-reference guess treats ordinary non-physicists and most non-specialists as near zero, with small positive scores becoming meaningful for relevant physics researchers or research groups. The current ladder keeps 0% compatible with IQ 100.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | -1 | -0.5 | 0 | 1 | 3 | 10 | 30

GPQA Diamond Saturating

Format: 4-choice multiple choice
Questions: 198 (public set)
Gameability: Moderate-High

Graduate-level science questions written by PhD experts. A 25% score is the four-choice random baseline; the GPQA paper reports skilled non-experts with unrestricted web access at 34%, domain experts at 65%, and corrected expert performance at 74%. The current ladder maps those anchors to roughly IQ 85, 100, 124, and 130 respectively. Public frontier rows cluster much higher, so the high end remains compressed and IQ 145+ is off-scale.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 15 | 25 | 34 | 50 | 74 | 100 | 140
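As a check on the anchors quoted above: the 25%, 34%, and 74% scores sit exactly on the IQ 85, 100, and 130 rungs, while the 65% domain-expert score falls between the IQ 115 rung (50) and the IQ 130 rung (74). Applying the interpolation formula to that score gives:

$$f(65) = 115 + \frac{65 - 50}{74 - 50}\,(130 - 115) = 115 + 9.375 \approx 124.4$$

This matches the "roughly IQ 124" figure for domain experts.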

D6: Computer Use

Computer Use measures practical task execution across external environments: terminal operation, browsing, visual/GUI operation, tool use, and recovering enough from intermediate state to finish the task. Terminal-Bench lives here because it is primarily about operating a shell environment to complete tasks, even when those tasks are engineering-adjacent. Human references are benchmark-specific: terminal tasks use technical operators, BrowseComp uses published human-trainer browsing performance, OSWorld uses computer-literate human operators, and broad tool/workflow benchmarks use professional operators or teams.

Terminal-Bench 2.0 Frontier

Format: Docker container tasks (shell commands)
Tasks: 89 practical tasks
Gameability: Low

Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. The source tasks include human-written reference solutions and expert/junior-engineer time estimates, so the ladder treats ordinary non-terminal users as near zero, competent technical operators as around 25-42, and strong senior technical operators as plausibly in the 62-90 region.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 10 | 25 | 42 | 62 | 90 | 130

Terminal-Bench Hard Frontier

Format: Harder terminal-agent tasks
Gameability: Low

Terminal-Bench Hard is tracked separately from Terminal-Bench 2.0 and contributes an additional source-backed computer-use signal when available. It uses a harder ladder than Terminal-Bench 2.0, and raw values remain separate and are never backfilled into the Terminal-Bench 2.0 field. The human-reference estimate treats competent terminal users as around 20-35 and strong senior technical operators as around 55-82, with IQ 160 kept off-scale.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 8 | 20 | 35 | 55 | 82 | 120

BrowseComp Frontier

BrowseComp measures hard browsing and research task completion. OpenAI reports that human trainers solved 29.2% of attempted questions without AI assistance, with 86.4% agreement among solved questions. The ladder therefore maps about 29% to IQ 100, treats 50-75% as strong persistent browsing/research ability, reserves IQ 145 for a perfect score, and keeps IQ 160 off-scale because the benchmark is easier to verify than to solve and can improve with test-time compute.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 10 | 29 | 50 | 75 | 100 | 130

OSWorld-Verified Frontier

OSWorld-Verified measures desktop/computer-use task completion, capturing a model's ability to operate an external environment rather than only answer a static prompt. OSWorld reports that humans can accomplish over 72.36% of its 369 tasks, and OSWorld-Human records human trajectories for all tasks. AI IQ treats that as a computer-literate operator reference rather than a random adult average, so about 72% maps to IQ 115. Current scored rows use clear model-level mirror/self-reported mappings and are checked against official OSWorld-Verified semantics when possible.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 15 | 35 | 72 | 90 | 110 | 140

Toolathlon Frontier

Toolathlon measures long-horizon tool-use task performance. The source defines 108 manually sourced or crafted tasks across many applications, requiring around 20 tool-call turns on average. Scores are stored as percentages and mapped through a sparse agentic tool-use ladder; the human-reference estimate treats competent tool operators as around 25-38 and strong multi-app operators as around 50-70. Current leaderboard rows remain mirror/self-reported unless primary run details are available.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 12 | 25 | 38 | 50 | 70 | 110

MCP Atlas Frontier

Format: Real MCP-server tasks
Scale: Pass-rate percentage
Gameability: Low

MCP Atlas measures end-to-end task success with real Model Context Protocol servers, noisy tool menus, multi-step tool calls, and final-answer judging. The source contains 1,000 human-authored tasks across 36 real MCP servers and 220 tools, with cross-server workflows and claim-level scoring. It contributes to Computer Use as a direct tool-orchestration signal; the human-reference estimate treats competent tool/API operators as around 35-55 and strong workflow operators as around 72-92.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 15 | 35 | 55 | 72 | 92 | 120

Agents' Last Exam Frontier

Format: Agent/harness professional workflows
Scale: Full / Overall pass-rate percentage
Gameability: Low

Agents' Last Exam measures agentic success on real-world professional workflows with verifiable success criteria. It spans broad professional workflow categories, so a single ordinary user is expected near zero across the full distribution while strong professional teams plausibly define the 13-24 region. Because rows are published as agent/harness plus model combinations, AI IQ uses the best substantial-coverage Full / Overall pass-rate row for each canonical model and leaves source-only or incomplete one-run rows out of the imported benchmark field.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 1 | 3 | 7 | 13 | 24 | 50

D7: Reliability

Reliability captures instruction following, constraint adherence, and factual robustness. It is separated from App Building, Production Engineering, and Computer Use so general task trustworthiness does not disappear inside practical software or tool-use benchmarks. Human references are stricter here because careful humans can often satisfy explicit constraints and can abstain rather than hallucinate when they do not know an answer.

IFBench Frontier

IFBench measures instruction following and constraint adherence across 58 novel, diverse, verifiable out-of-domain constraints. It contributes to Reliability because dependable model behavior depends on following multi-part requirements, respecting constraints, and preserving user intent across the response. The human-reference estimate treats average careful humans as able to satisfy a majority of explicit constraints, strong detail-oriented humans as approaching the 90s, and current frontier model rows in the high 70s/low 80s as above average but below top human-level constraint reliability.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | 0 | 35 | 65 | 80 | 92 | 103 | 120

AA Omniscience Frontier

AA Omniscience is used as a factual-reliability signal from the Artificial Analysis Intelligence Index. The source index ranges from -100 to 100, rewards correct answers, penalizes hallucinations, and does not penalize refusing to answer. The source defines 0 as equal correct and incorrect answers; AI IQ treats that as a no-net-hallucination/calibrated-abstention baseline rather than a measured median-human baseline, so the average-human reference is modestly positive.

IQ | 70 | 85 | 100 | 115 | 130 | 145 | 160
Expected score | -80 | -35 | 8 | 25 | 45 | 72 | 100
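To make the index's stated properties concrete, the sketch below shows one formula that is consistent with the description above: it ranges from -100 to 100, equals 0 when correct and incorrect answers are equal, and leaves refusals unpenalized. The function name and the exact formula are assumptions for illustration; the source's own definition may differ in detail.

```python
def omniscience_style_index(correct, incorrect, abstained):
    """Hypothetical index: +1 per correct answer, -1 per incorrect, 0 per refusal.

    Assumption: normalised by the total number of questions and scaled to
    the -100..100 range described for the source index.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Made-up counts, for illustration only.
print(omniscience_style_index(correct=50, incorrect=20, abstained=30))  # 30.0
# Guessing wrong on the 30 abstained questions would lower the index.
print(omniscience_style_index(correct=50, incorrect=50, abstained=0))   # 0.0
```

Under this reading, a model that abstains when unsure scores higher than one that answers the same questions incorrectly.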

Emotional Reasoning (EQ)

Emotional Reasoning (EQ) measures social, emotional, and conversational judgment, but it is currently excluded from Composite IQ. The available benchmark base is weaker than the other scored dimensions: EQ-Bench 3 uses a Claude/Anthropic-family judge, Arena.ai Overall is broad conversational preference rather than a dedicated EQ benchmark, and AttuneBench is the strongest participant-grounded source but still sparse.

EQ Components Specialized

EQ-Bench 3, Arena.ai Overall, and AttuneBench are mapped into a diagnostic Emotional Reasoning score. The score remains visible in charts and model profiles, but it is not included in Composite IQ until the benchmark set has stronger independent, human-grounded coverage.

Expected-Score Interpolation

Each benchmark defines expected raw scores at seven fixed IQ levels: 70, 85, 100, 115, 130, 145, and 160. For scores that fall between two adjacent expected scores, AI IQ uses piecewise-linear interpolation:

$$t = \frac{s - S(q_i)}{S(q_{i+1}) - S(q_i)}, \qquad \mathrm{IQ} = q_i + t \cdot (q_{i+1} - q_i)$$

If the score is at or below the lowest expected score, the model receives IQ 70. If it is at or above the highest expected score, it receives IQ 160 for that benchmark. There is no extrapolation beyond the defined range.
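The interpolation and clamping rules above can be sketched in a few lines. This is a minimal illustration, not the AI IQ implementation: the names `score_to_iq`, `IQ_LEVELS`, and `IFBENCH_LADDER` are hypothetical, and the ladder values are the IFBench expected scores listed on this page.

```python
from bisect import bisect_right

IQ_LEVELS = [70, 85, 100, 115, 130, 145, 160]

# IFBench expected scores at each IQ level, as listed on this page.
IFBENCH_LADDER = [0, 35, 65, 80, 92, 103, 120]


def score_to_iq(score, ladder, iq_levels=IQ_LEVELS):
    """Piecewise-linear interpolation through an expected-score ladder.

    Assumes the ladder is strictly increasing. Scores outside the ladder
    are clamped to the lowest or highest IQ level (no extrapolation).
    """
    if score <= ladder[0]:
        return float(iq_levels[0])
    if score >= ladder[-1]:
        return float(iq_levels[-1])
    # Index i of the segment with ladder[i] <= score < ladder[i + 1].
    i = bisect_right(ladder, score) - 1
    t = (score - ladder[i]) / (ladder[i + 1] - ladder[i])
    return iq_levels[i] + t * (iq_levels[i + 1] - iq_levels[i])
```

For example, an IFBench score of 72.5 sits halfway between the IQ 100 and IQ 115 expected scores (65 and 80), so the sketch returns 107.5.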

This approach makes the calibration question explicit. Instead of first choosing a curve family, each benchmark asks: what score would an IQ 70, 85, 100, 115, 130, 145, or 160 equivalent system achieve on this source? Each segment can then have a different slope, allowing threshold-like benchmarks such as CritPt or FrontierMath T4 to behave differently from smoother bounded tasks such as ARC or Terminal-Bench.

Benchmark Calibration & Averaging

Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. The expected-score ladders handle benchmark quality directly: harder benchmarks can assign high IQ to modest raw scores, while saturated or gameable benchmarks can require near-perfect scores before reaching the 145–160 range.

Why Use Expected Scores?

The calibration is framed as a human-readable judgment rather than as an abstract curve fit. For each benchmark, the table answers seven concrete questions: what score would a 70, 85, 100, 115, 130, 145, or 160 IQ-equivalent system be expected to achieve?

That means an extremely hard benchmark can assign a high implied IQ to a low-looking score. CritPt, for example, treats 0% as compatible with IQ 100 because a normal human would not be expected to score on it. AIME does the opposite at the high end: because frontier systems can approach saturation and the public problem set is contamination-sensitive, its calibration ladder puts IQ 145 and 160 outside the source's natural 0–100% range.

Why not fit a single curve type? Linear, power, exponential, and asymptotic curves are useful shapes, but no single family captures every benchmark cleanly. The seven-point ladder keeps the public calibration inspectable while still producing a smooth piecewise-linear curve for scoring.

Calibration status. The current ladders are the first benchmark-specific calibration pass after the initial anchor-curve conversion. Some high-end expected scores remain outside a benchmark's natural source range where saturation, contamination, or benchmark age make a perfect in-range score fall short of IQ 145 or 160.

Benchmark-Level Imputation

When a model has partial benchmark coverage inside a dimension, missing benchmarks are filled in before the dimension IQ is averaged. For low-coverage dimensions, we first try a conservative correlation estimate; otherwise we use the long-standing within-dimension average capped by each benchmark's 80th percentile.

Correlation estimates use source-backed overlap among benchmarks in the same dimension. If Benchmark A historically predicts Benchmark B, a model's score on A can estimate B, but only after the observed correlation is shrunk for small samples and an uncertainty penalty is subtracted. A one-benchmark dimension receives the largest penalty, and imputed benchmark IQs are capped below the observed benchmark average so sparse coverage cannot make a model look better than its actual evidence.

The ordinary within-dimension estimate uses two ingredients:

  • The model's available-benchmark IQ average — how the model is performing on the benchmarks it does have in this dimension. This is the within-dimension signal: if a model is hitting IQ 130 on the dimension's other benchmarks, the missing one is probably also somewhere around 130.
  • The benchmark's 80th-percentile IQ (\(P_{80}\)) — a per-benchmark cap derived from the actual data. Take every model that has a real score on that benchmark, convert each score to an implied IQ via the expected-score ladder, sort those IQs from low to high, and take the value at the 80th-percentile rank. So if 50 models have HLE scores yielding implied IQs ranging from 70 to 155, \(P_{80}(\text{HLE})\) is the implied IQ at the 80th-percentile rank in that sorted list. It is where strong-but-not-frontier measured models actually land on this benchmark.

The imputed value is the minimum of the two:

$$\mathrm{IQ}_{\text{imputed}} = \min\!\left(\overline{\mathrm{IQ}}_{\text{available}},\; P_{80}(\text{benchmark})\right)$$

Why min of the two? The model's own dimension average is the best within-dimension signal we have. Capping at the 80th-percentile prevents a strong model from being imputed past where the actual data has been observed — a model averaging IQ 145 in this dimension might project very highly on the missing benchmark, but the imputed value won't claim that without measurement. The min lets imputation move a missing score up or down toward what the rest of the dimension implies, while staying conservatively below where the field has empirically reached.
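A minimal sketch of the within-dimension rule, under stated assumptions: the page does not specify how the 80th-percentile rank is computed, so this example uses the nearest-rank method, and the function names `p80_cap` and `impute_benchmark_iq` are hypothetical.

```python
import math


def p80_cap(measured_iqs):
    """Implied IQ at the 80th-percentile rank of measured models.

    Uses the nearest-rank method, which is an assumption; the page only
    says to sort the implied IQs and take the 80th-percentile value.
    """
    ordered = sorted(measured_iqs)
    rank = math.ceil(80 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def impute_benchmark_iq(available_iqs, measured_iqs):
    """min(model's available-benchmark IQ average, benchmark P80)."""
    dimension_average = sum(available_iqs) / len(available_iqs)
    return min(dimension_average, p80_cap(measured_iqs))


# Made-up implied IQs of ten models that have a real score on the benchmark.
measured = [70, 90, 100, 110, 120, 125, 130, 135, 140, 155]
strong_model = impute_benchmark_iq([150, 140], measured)  # capped at P80
weaker_model = impute_benchmark_iq([105, 115], measured)  # own average wins
```

With these illustrative numbers the cap is 135, so a model averaging 145 is imputed at 135 while a model averaging 110 keeps its own average.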

Benchmark Imputation Waterfall

| Step | When it applies | What happens | Result |
| --- | --- | --- | --- |
| 1. Source value | The model has a benchmark score. | Use the raw source-backed value. | Real benchmark IQ |
| 2. Primary / secondary predecessor | The benchmark is missing and the model has a clear primary or secondary predecessor. | Try the primary predecessor first, then the secondary predecessor, and use the first valid scoring value for that benchmark. | Predecessor-imputed benchmark IQ |
| 3. ARC lower-tail peer estimate | ARC-AGI-1 or ARC-AGI-2 is missing after predecessor imputation, and the model has no source-backed D3/Abstract benchmark. | Use the lower quartile of capability-matched source-backed ARC peers released before the target model. | Scoring-only peer-imputed ARC benchmark IQ |
| 4. Hard-benchmark zero | A historically near-zero hard benchmark is missing. | Use 0 as the scoring-only value, while leaving the raw benchmark field blank. | Zero-assumed benchmark IQ |
| 5. Correlation estimate | A low-coverage dimension has at least one source-backed benchmark and same-dimension benchmark correlations have enough overlap. | Predict missing benchmark IQs from correlated source-backed benchmarks, shrink small-sample correlations, subtract an uncertainty penalty, and cap below the observed benchmark average. | Conservative correlation-imputed benchmark IQ, with error metadata |
| 6. Within-dimension estimate | The dimension has enough benchmark coverage for this model. | Average the model's available benchmark IQs in that dimension, then cap by the benchmark's 80th percentile. | \(\min(\text{dimension benchmark average}, \text{benchmark }P_{80})\) |
| 7. No benchmark estimate | The model has no benchmark data in that dimension. | Do not invent individual benchmark rows. | The whole dimension is handled by dimension-level imputation. |

ARC lower-tail peer estimates are used only on scoring copies, never in raw benchmark tables. The peer search excludes later releases, keeps reasoning and non-reasoning models separate, prefers closer provider/product-line/tier buckets, then ranks candidates by non-Abstract capability. For ARC-AGI-2 and CritPt, missing values are treated as zero in the scoring pipeline only after earlier imputation steps fail, because historical frontier models generally scored at or near zero until directly shown otherwise. ARC-AGI-3 does not use this zero assumption; its current coverage is too sparse, so missing values use predecessor or within-dimension imputation instead. For models released before April 2025, the same scoring-only zero assumption is also applied to FrontierMath T4, ProofBench, and Terminal-Bench 2.0. The raw benchmark table still leaves values blank when no source row exists. Other missing benchmarks use the ordinary imputation waterfall.

Predecessor imputation is constrained by model family, and non-reasoning variants do not impute from reasoning variants. These lineage choices are scoring assumptions only; they do not create source-backed raw benchmark values.

Correlation imputation is deliberately narrower than the generic within-dimension rule. For App Building, Production Engineering, and Computer Use, it can turn a single source-backed benchmark into an estimated dimension score by filling missing benchmarks with conservative correlation estimates. For dimensions without an explicit minimum, it only changes one-benchmark cases. Dimensions with broader coverage continue to use the ordinary within-dimension cap. If a dimension has no source-backed benchmark evidence, no correlation estimate is attempted — the dimension itself is either left missing or filled at the dimension level (see Composite IQ Calculation below). The experimental Emotional Reasoning (EQ) diagnostic is not used in Composite IQ.

Composite IQ Calculation

Step 1: Score Each Dimension

For every dimension with usable benchmark coverage, compute the dimension IQ as the average of its source-backed and scoring-only imputed benchmark IQs. If the dimension has only one source-backed benchmark, missing benchmarks may be filled by conservative correlation estimates and the dimension is marked as estimated. If there is no source-backed evidence in a dimension, it is usually treated as missing and handled by dimension-level imputation; the main exception is Abstract Reasoning, where no-D3-source rows can receive conservative ARC-AGI-1/2 lower-tail peer estimates before the dimension is averaged.

Step 2: Dimension-Level Imputation

If a model has at least 2 scored dimensions but is missing some of the others, every missing dimension is imputed before the composite is averaged. The cap is matched to models with real data for that dimension and similar capability across the other dimensions:

$$\mathrm{IQ}_{D_k}^{\text{imputed}} = \min\!\left(\overline{\mathrm{IQ}}_{\text{scored dims}},\; Q_{25}\!\left(D_k \mid \text{similar non-}D_k\text{ IQ}\right)\right)$$

where \(\overline{\mathrm{IQ}}_{\text{scored dims}}\) is the model's average IQ across the dimensions it does have. For the cap, we look at models released on or before the scored model that have real data on the missing dimension, compare their average IQ across the other dimensions, and use the lower-quartile missing-dimension IQ among the closest comparable models. If the comparable set is too thin, we fall back to the same-era lower quartile for that dimension. If no same-era real data exists for a dimension, we do not invent a neutral default; the model does not receive a derived all-dimension IQ.

In practice, comparable models are those whose average across the non-missing dimensions is within a small IQ band of the model being scored and whose release date is not later than the scored model. If fewer than three comparable models are available, we use the nearest same-era measured models instead. This keeps missing dimensions conservative without letting older models borrow strength from future benchmark cohorts.

All seven scored dimensions are always used for derived IQ. Missing a hard dimension such as Abstract Reasoning or Production Engineering should not improve a model's score, so missing dimensions are filled conservatively rather than omitted from the average.
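The matched lower-quartile cap can be sketched as follows. This is an illustration under several assumptions: the page says only "a small IQ band" and "fewer than three", so `band=5.0` and `min_peers=3` are placeholder parameters, the lower quartile uses the nearest-rank method, and the names `lower_quartile` and `impute_dimension_iq` are hypothetical.

```python
import math
from datetime import date


def lower_quartile(values):
    """25th percentile by nearest rank (the exact method is an assumption)."""
    ordered = sorted(values)
    return ordered[math.ceil(len(ordered) / 4) - 1]


def impute_dimension_iq(scored_avg, release, peers, band=5.0, min_peers=3):
    """min(model's scored-dimension average, matched lower-quartile IQ on D_k).

    peers: (release_date, other_dims_avg, dk_iq) tuples for models that have
    real data on the missing dimension D_k.
    """
    # Never borrow from models released after the model being scored.
    same_era = [p for p in peers if p[0] <= release]
    if not same_era:
        return None  # no same-era real data: no derived all-dimension IQ
    matched = [p for p in same_era if abs(p[1] - scored_avg) <= band]
    if len(matched) < min_peers:
        # Too few in the band: use the nearest same-era measured models.
        nearest = sorted(same_era, key=lambda p: abs(p[1] - scored_avg))
        matched = nearest[:min_peers]
    cap = lower_quartile([p[2] for p in matched])
    return min(scored_avg, cap)


# Made-up peer data for illustration only.
peers = [
    (date(2025, 1, 1), 118, 105),
    (date(2025, 2, 1), 121, 112),
    (date(2025, 3, 1), 123, 120),
    (date(2025, 4, 1), 119, 98),
    (date(2026, 1, 1), 120, 150),  # later release, always excluded below
]
capped = impute_dimension_iq(120, date(2025, 6, 1), peers)
uncapped = impute_dimension_iq(90, date(2025, 6, 1), peers)
no_data = impute_dimension_iq(120, date(2024, 1, 1), peers)
```

In this example the later, stronger peer is ignored, the matched lower quartile is 98, and a model averaging 120 on its scored dimensions is imputed at 98 on the missing one.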

Dimension Imputation Waterfall

| Step | When it applies | What happens | Result |
| --- | --- | --- | --- |
| 1. Scored dimension | The model has enough benchmark coverage in the dimension after predecessor imputation. | Score the available benchmarks, fill missing benchmarks with the benchmark waterfall, then average. | Real/scored dimension IQ |
| 2. Matched lower-quartile cap | A whole dimension is still missing and the model has at least two scored dimensions. | Find models with real data for the missing dimension and similar average across the other dimensions. | \(\min(\text{model scored-dimension average}, \text{matched lower-quartile }D_k)\) |
| 3. Nearest-neighbor lower quartile | Too few models fall within the similarity radius. | Use the nearest comparable models by other-dimension average. | \(\min(\text{model scored-dimension average}, \text{nearest-neighbor lower quartile }D_k)\) |
| 4. Global dimension lower quartile | The comparable set is still too thin. | Use the lower-quartile observed IQ for that dimension across models with real data. | \(\min(\text{model scored-dimension average}, \text{global lower-quartile }D_k)\) |
| 5. No derived IQ | No real data exists for that dimension at all. | Do not invent a neutral default. | No derived all-dimension IQ. |

Step 3: Compute the Composite

$$\mathrm{IQ} = \operatorname{round}\!\left(\frac{1}{7}\sum_{k=1}^{7}\mathrm{IQ}_{D_k}\right)$$

where all seven scored dimensions are used once missing dimensions are imputed.

Key rules:

  • Minimum 2 dimensions required. Models with fewer than 2 scored dimensions do not receive a derived composite IQ.
  • No omitted dimensions. Models with enough coverage for derived IQ always use a 7-dimension composite; missing dimensions are conservatively imputed.
  • Transparent count. The display shows X/7 so readers can see how many scored dimensions had source-backed data before dimension-level imputation.
  • Equal weighting. All dimensions contribute equally. Benchmark-specific expected-score ladders, not differential weighting, handle benchmark quality differences. This keeps Composite IQ transparent and neutral rather than introducing target-specific weights before there is enough external validation data to justify them.
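The key rules above can be illustrated with a short sketch. The function name `composite_iq` and the dimension labels are hypothetical, the IQ values are made up, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for whatever rounding the production pipeline uses.

```python
DIMENSIONS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]


def composite_iq(scored, imputed):
    """Return (composite IQ or None, 'X/7' coverage label).

    scored:  dimension -> IQ for dimensions with source-backed data
    imputed: dimension -> IQ filled by dimension-level imputation
    """
    label = f"{len(scored)}/7"
    if len(scored) < 2:
        return None, label  # minimum 2 scored dimensions required
    merged = {**imputed, **scored}  # source-backed values take precedence
    if any(d not in merged for d in DIMENSIONS):
        return None, label  # a dimension could not be imputed at all
    mean = sum(merged[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return round(mean), label  # equal weighting across all seven


scored = {"D1": 130, "D2": 124, "D3": 110}
imputed = {"D4": 112, "D5": 108, "D6": 115, "D7": 118}
result = composite_iq(scored, imputed)
```

Here the seven values sum to 817, the mean is about 116.7, and the model is shown with a composite of 117 and a 3/7 coverage count.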

Rank Status

Each model receives a rank status reflecting the completeness of its evaluation:

  • Full — All 7 scored dimensions covered. The most reliable composite.
  • Partial — 2–6 dimensions scored. Composite is derived but based on incomplete coverage.
  • Provisional — Only 1 dimension scored. Not enough for a derived composite.
  • Unranked — No dimension data available.

Tracked Benchmarks & Exclusions

Some benchmarks are tracked internally or shown in standalone charts but are not part of the composite IQ calculation:

  • MMLU-Pro — A 10-choice multiple-choice knowledge test. Overlaps with the Scientific Reasoning dimension (GPQA/HLE) and adds limited discrimination at the frontier. Models have converged to similar high scores.
  • MMMU-Pro — Multimodal academic questions. While the vision component is interesting, most frontier model evaluation focuses on text-based reasoning. This benchmark is tracked in the data but excluded from the IQ composite.

These benchmarks remain in the source-backed dataset for review and future charting — they are simply not included in the composite IQ computation.

Core Math Benchmarks

FrontierMath Tier 4, FrontierMath Tier 1–3, ProofBench, MathArena, and AIME are core inputs to the D1 (Mathematical Reasoning) dimension and are surfaced on the IQ page as standalone benchmark charts. The public charts are ordered hardest to easiest by the current top model's source-backed percent correct.

  • FrontierMath Tier 4 — the hardest math chart by current top-model percent correct; novel research-level problems with very low gameability.
  • FrontierMath Tier 1–3 — harder than AIME, easier than T4, with Epoch's expert-human baseline supporting the current IQ 130 region.
  • ProofBench — formally-verified proof writing. A different cognitive task than the problem-solving benchmarks because the model has to construct a verified proof, not just give an answer.
  • MathArena — IRT-derived expected performance across non-deprecated math competitions. It adds broad competition-math coverage beyond AIME and FrontierMath while using a stricter midrange ladder so midrange scores do not overstate Mathematical Reasoning IQ.
  • AIME — a useful competition-math signal, but treated as saturating because frontier systems are near the source ceiling and historical problem contamination is plausible.
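
| Benchmark | IQ 70 | IQ 85 | IQ 100 | IQ 115 | IQ 130 | IQ 145 | IQ 160 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FrontierMath T4 expected score | 0 | 1.5 | 5 | 12 | 25 | 55 | 120 |
| FrontierMath T1–3 expected score | 0 | 4 | 10 | 22 | 38 | 60 | 120 |
| ProofBench expected score | 0 | 2 | 6 | 14 | 30 | 60 | 120 |
| MathArena expected score | 0 | 5 | 15 | 32 | 55 | 75 | 95 |
| AIME expected score | -15 | 5 | 30 | 65 | 92 | 112 | 132 |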

Emotional Reasoning Scoring

Emotional Reasoning (EQ) is currently an experimental diagnostic metric, not a Composite IQ dimension. It was previously presented as a standalone "EQ" score and briefly as an IQ dimension, but it is now excluded from Composite IQ because the available inputs are not rigorous enough for the main score. The component signals below are still mapped onto the shared 70–160 scale so users can inspect them separately. These are not direct human IQ percentiles: EQ-Bench is AI-judged, Arena.ai Overall is broad human preference, and AttuneBench is participant-grounded but currently sparse.

$$\mathrm{EQ} = \operatorname{avg}\!\left(\mathrm{EQ}_{\text{EQ-Bench}},\; \mathrm{EQ}_{\text{Arena.ai}},\; \mathrm{EQ}_{\text{AttuneBench}}\right)$$

The variable is written EQ above for historical continuity; it is a diagnostic Emotional Reasoning score, averaged over whichever of the three component signals below are available for the model.

If only one source is available, the model remains eligible for that component's benchmark chart, but it does not receive an Emotional Reasoning diagnostic score. This keeps single-benchmark coverage from outranking models with broader interaction-quality evidence.

EQ-Bench 3 Elo → Emotional Reasoning

EQ-Bench 3 produces Elo ratings from head-to-head emotional-roleplay matchups judged by Claude Opus 4.6. This makes it useful as a dedicated affective/emotional reasoning signal, but it is not neutral ground truth: the judge is Claude/Anthropic-family, so the benchmark can favor Claude-like response style and penalize models that solve the scenario in a substantially different voice. AI IQ treats the Elo scale as a relative contest scale and applies its own human-reference hypothesis: 1300 is the average-human reference point for this structured roleplay task, 1500 is strong, and current frontier Elo values are high-end but kept near the AttuneBench range because the source is subjective and future Elo scores can exceed current rows. The mapping therefore keeps headroom above 2000 Elo:

| EQ-Bench Elo | Emotional Reasoning |
| --- | --- |
| 200 | 75 |
| 600 | 85 |
| 900 | 92 |
| 1100 | 96 |
| 1300 | 100 |
| 1500 | 113 |
| 1700 | 123 |
| 2000 | 132 |
| 2300 | 140 |

Arena.ai Overall Elo → Emotional Reasoning

Arena.ai Overall Elo reflects broad conversational quality as judged by human voters in head-to-head text matchups. It is not a dedicated emotional-intelligence benchmark: many wins come from general helpfulness, reasoning, coding, formatting, and user preference rather than pure emotional judgment. AI IQ therefore uses a softer diagnostic mapping than for EQ-Bench. The human-reference hypothesis treats 1350 as roughly average social usability in this arena, 1450 as strong, and 1500+ as top-frontier conversational preference but keeps the current upper range close to AttuneBench rather than letting broad preference dominate the diagnostic score. The observed Elo range is tighter (~1100–1520), so the anchor curve is calibrated separately:

| Arena.ai Overall Elo | Emotional Reasoning |
| --- | --- |
| 1100 | 70 |
| 1200 | 80 |
| 1300 | 90 |
| 1350 | 100 |
| 1400 | 107 |
| 1450 | 114 |
| 1500 | 122 |
| 1520 | 126 |

AttuneBench Composite → Emotional Reasoning

AttuneBench evaluates models against participant annotations from 200 real multi-turn conversations in its Default mode. The Composite is a normalized aggregate across the benchmark's primary human-annotated metrics. This is the strongest participant-grounded Emotional Reasoning source, and its small human-baseline pilot suggests human annotators can bracket or exceed current model ranges on some metrics. AI IQ therefore maps 50 near the average-human reference, treats 52.5 as strong, and reserves 55+ for high-end emotional attunement. Current public coverage is only 11 rows in a narrow frontier range, so AttuneBench remains sparse and is not imputed.

| AttuneBench Composite | Emotional Reasoning |
| --- | --- |
| 45 | 85 |
| 50 | 100 |
| 52.5 | 112 |
| 55 | 125 |
| 60 | 140 |

EQ-Bench Style-Sensitivity Adjustment

Because EQ-Bench 3 is judged by Claude, it is treated as a weak component rather than a standalone authority. To reduce family/style bias while preserving its dedicated task coverage, we subtract 300 Elo points from the EQ-Bench rating of Anthropic models before mapping it to the implied Emotional Reasoning score. Arena.ai Overall is unaffected. If stronger participant-grounded or independently judged emotional-reasoning benchmarks gain broader coverage, EQ-Bench 3 is a candidate for demotion to a standalone benchmark chart or removal from the Emotional Reasoning diagnostic score.

Why multiple sources? Arena.ai Overall is human-judged and captures broad conversational preference; EQ-Bench adds dedicated emotional reasoning coverage; AttuneBench adds participant-grounded emotional-attunement coverage. Requiring at least two components balances coverage with judgment-source diversity.
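The component mappings, the Anthropic adjustment, and the two-component requirement can be combined in one sketch. Two things here are assumptions rather than documented behavior: values between anchors are interpolated linearly, and values outside the anchor range are clamped to the end anchors. The function and constant names are hypothetical; the anchor values are the ones tabulated above.

```python
EQ_BENCH_ANCHORS = [(200, 75), (600, 85), (900, 92), (1100, 96), (1300, 100),
                    (1500, 113), (1700, 123), (2000, 132), (2300, 140)]
ARENA_ANCHORS = [(1100, 70), (1200, 80), (1300, 90), (1350, 100),
                 (1400, 107), (1450, 114), (1500, 122), (1520, 126)]
ATTUNE_ANCHORS = [(45, 85), (50, 100), (52.5, 112), (55, 125), (60, 140)]
ANTHROPIC_EQ_BENCH_ADJUSTMENT = 300  # Elo points subtracted before mapping


def interpolate(x, anchors):
    """Linear interpolation between anchors, clamped at both ends (assumed)."""
    if x <= anchors[0][0]:
        return float(anchors[0][1])
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (x - x0) / (x1 - x0) * (y1 - y0)
    return float(anchors[-1][1])


def emotional_reasoning(eq_bench_elo=None, arena_elo=None, attune=None,
                        is_anthropic=False):
    """Average the available component scores; None with fewer than two."""
    components = []
    if eq_bench_elo is not None:
        if is_anthropic:
            eq_bench_elo -= ANTHROPIC_EQ_BENCH_ADJUSTMENT
        components.append(interpolate(eq_bench_elo, EQ_BENCH_ANCHORS))
    if arena_elo is not None:
        components.append(interpolate(arena_elo, ARENA_ANCHORS))
    if attune is not None:
        components.append(interpolate(attune, ATTUNE_ANCHORS))
    if len(components) < 2:
        return None
    return sum(components) / len(components)
```

For a hypothetical model with an EQ-Bench Elo of 1500 and an Arena.ai Elo of 1450, the sketch gives 113.5 for a non-Anthropic model and 106.0 for an Anthropic model, because the adjusted EQ-Bench rating of 1200 maps to 98 instead of 113.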

Cost & Speed Metrics

Sticker Price — published price for a typical workload

AI IQ's effective-cost views are anchored to 1M I/O Tokens: 1M input tokens plus 1M output tokens, priced at the model's published per-million-token rates. Sticker Price is the dollar amount to process that standard workload:

$$\mathrm{StickerPrice} = p_{\text{in}} + p_{\text{out}}$$

where \(p_{\text{in}}\) and \(p_{\text{out}}\) are the published per-million-token prices in dollars.

Task Efficiency — how much work does the model use?

Sticker price alone hides large per-task differences in how much work a model uses to solve a benchmark. We estimate this with a blended usage multiplier. For each benchmark, AI IQ first estimates the task cost expected from a model's published input and output prices. The usage signal is the residual: actual task cost divided by expected task cost. Validated direct token-usage data is included as an additional signal where available.

$$\mathrm{DirectTokenUsage} = \frac{T_{\text{model}}}{\mathrm{median}(T)}$$

$$\log(\widehat{C}_{\text{model},b}) = \alpha_b + \beta_{\text{in},b}\log(p_{\text{in}}) + \beta_{\text{out},b}\log(p_{\text{out}})$$

$$\mathrm{BenchmarkUsage}_{b} = \frac{C_{\text{model},b}}{\widehat{C}_{\text{model},b}}$$

$$\mathrm{UsageMultiplier} = \mathrm{geomean}(\mathrm{BenchmarkUsage}_{b}, \mathrm{DirectTokenUsage}\ \mathrm{when\ available})$$

When validated direct token usage is available, AI IQ can use it directly. Benchmark-cost residuals are blended in when available, but a single benchmark-only residual is treated as provisional; benchmark-only effective cost requires at least two benchmark-cost signals. The Task Efficiency chart shows the inverse of the usage multiplier, so 2× means the model uses about half the task effort of the median model, and 0.5× means it uses about twice as much.
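A minimal sketch of the blending step, assuming the expected per-benchmark costs have already been produced by the price regression (the regression fit itself is not reproduced here). The function names and the example costs are hypothetical.

```python
from statistics import geometric_mean


def usage_multiplier(benchmark_costs, direct_token_usage=None):
    """Blend benchmark-cost residuals with direct token usage.

    benchmark_costs: (actual_cost, expected_cost) pairs, one per benchmark.
    direct_token_usage: model tokens / median tokens, when validated.
    """
    signals = [actual / expected for actual, expected in benchmark_costs]
    if direct_token_usage is None and len(signals) < 2:
        # Benchmark-only estimates need at least two benchmark-cost signals.
        return None
    if direct_token_usage is not None:
        signals.append(direct_token_usage)
    return geometric_mean(signals)


def task_efficiency(multiplier):
    """Inverse of the usage multiplier, as shown on the Task Efficiency chart."""
    return 1.0 / multiplier


# Two made-up benchmarks where the model cost 2x and 8x the expected amount.
heavy = usage_multiplier([(2.0, 1.0), (8.0, 1.0)])
```

The geometric mean of 2 and 8 is 4, so this hypothetical model would appear at about 0.25× on the Task Efficiency chart.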

Effective Cost — what it actually costs to do the same task

The product of the two:

$$\mathrm{EffectiveCost} = \mathrm{StickerPrice} \times \mathrm{UsageMultiplier}$$

Reads as: what this model spends on a task after adjusting its 1M I/O Tokens sticker price by validated token usage and price-adjusted benchmark usage. Models below the diagonal (Effective Cost < Sticker Price) are task-efficient and cheaper than their sticker suggests; models above are task-hungry. This is the cost axis on every effective-cost-vs-quality chart.
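As a worked illustration of the two formulas, the sketch below uses made-up prices of $3 per 1M input tokens and $15 per 1M output tokens; the function names are hypothetical.

```python
def sticker_price(p_in, p_out):
    """Dollar cost of 1M input tokens plus 1M output tokens."""
    return p_in + p_out


def effective_cost(p_in, p_out, usage_multiplier):
    """Sticker price adjusted by the blended usage multiplier."""
    return sticker_price(p_in, p_out) * usage_multiplier


sticker = sticker_price(3.0, 15.0)
efficient = effective_cost(3.0, 15.0, 0.5)  # below the diagonal
hungry = effective_cost(3.0, 15.0, 2.0)     # above the diagonal
```

With an $18 sticker price, a usage multiplier of 0.5 gives an effective cost of $9, while a multiplier of 2.0 gives $36.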

Response Time

Response time is the median seconds to a complete answer (lower is better), shown on a logarithmic scale. The IQ vs Response Time chart reverses the X axis so the upper-right corner represents the ideal — high intelligence at low latency.

[Chart: Sticker Price vs Effective Cost. Each model's sticker price plotted against its effective cost; the dashed diagonal marks Effective Cost = Sticker Price.]
[Chart: AI Models by Cost. Published price vs. effective cost after the blended task-usage multiplier.]
[Chart: Task Efficiency. Inverse of the effective-cost usage multiplier; higher means less price-adjusted task work.]

Reading Chart Tooltips

Chart tooltips use the same structure across public chart surfaces. Click a model point, bar, or timeline label to open the tooltip; click outside it or scroll the page to dismiss it. Hover alone does not open a tooltip.

The first tab is a stable model summary rather than a chart-specific readout. Its order is always: IQ, Emotional Reasoning (EQ), Effective Cost/1M I/O, Task Efficiency, and Release Date. This makes models comparable across charts even when the chart axes are ordered differently.

The IQ and Cost tabs expose the supporting details. IQ shows the seven scored dimension scores, with benchmark-level evidence nested inside each dimension. Emotional Reasoning is shown separately as an experimental diagnostic metric from Arena.ai Overall, EQ-Bench 3, and AttuneBench. Cost shows input cost per 1M tokens, output cost per 1M tokens, sticker cost per 1M I/O tokens, task efficiency, and effective cost per 1M I/O tokens.

Dimension bell-curve charts distinguish measured-enough points from lower-coverage estimates. For Abstract Reasoning, models with at least two source-backed D3/Abstract benchmarks use the standard solid provider-colored dots and are eligible for labels. Lower-coverage D3 estimates remain visible as same-size dots with slightly lighter provider-colored fills and ordinary solid outlines, but are not labeled by default. This keeps sparse but useful estimates inspectable without making them look as certain as source-backed ARC coverage.

Limitations & Transparency

  • Dimension coverage varies. Some models have data for all 7 scored dimensions; others have as few as 2 (with the rest imputed). A model's composite IQ is most reliable when all scored dimensions are covered. Always check the X/7 count and rank status.
  • Benchmark mix matters. Two models with the same composite IQ may have very different underlying data quality. One might have all frontier benchmarks while another relies mostly on saturating or contamination-sensitive benchmarks. The rank status and dimension count help distinguish these cases.
  • Imputation is conservative, not clairvoyant. Missing values are filled first from explicit direct-predecessor lineage when available, then from within-dimension benchmark evidence, then from comparable measured models at the dimension level. These are reasonable estimates, not ground truth — a model's true ability on an unevaluated benchmark could be significantly higher or lower.
  • Anchor calibration is subjective. The mapping from raw scores to IQ involves judgment calls about what different performance levels mean relative to human cognitive ability. We document our rationale for each benchmark, but reasonable people can disagree.
  • IQ is a metaphor. Human IQ tests measure a specific construct via standardized instruments under controlled conditions. AI benchmark performance is a different thing. The IQ scale provides an intuitive frame of reference, not a claim of equivalence.
  • Calibration ladders are a design choice. The expected score at each IQ level directly affects which models benefit and which are penalized. Models that excel on saturating benchmarks will need very high raw scores to receive very high implied IQ, which may feel unfair if those benchmarks genuinely reflect high ability. We believe the trade-off — rewarding harder evaluation — is correct, but the specific ladder values are judgment calls.
  • Benchmarks become stale. As models improve and training data evolves, benchmark ladders, gameability ratings, and source selection may need revision. This methodology is a living document.

High-End Calibration

The expected-score ladders intentionally get more demanding above IQ 140 on many benchmarks. Each additional raw point often contributes less to implied IQ in the superhuman range than in the human range. This reflects three realities:

  1. Human IQ distributions thin out at the tails. Far more people fall between IQ 100 and IQ 120 than between IQ 140 and IQ 160.
  2. Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
  3. Practical discrimination. Without demanding high-end ladders, reasoning vs. non-reasoning configurations of the same model can produce unrealistically large IQ gaps. The ladder shape keeps those differences meaningful without letting one saturated benchmark dominate the composite.

A perfect or near-perfect score can still imply IQ 160 when that is a reasonable interpretation of the source. In the current calibration, several mature or contamination-sensitive benchmarks instead require off-scale performance for IQ 145 or 160. The point is not to cap hard benchmarks forever, but to make the high-end requirements explicit enough that future calibration passes can decide when a sudden 100% score on a very hard benchmark should map to an appropriately high implied IQ.
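To make the flattening concrete, the sketch below computes how many IQ points one raw point is worth on each segment of the FrontierMath T4 ladder listed on this page. The helper name `iq_points_per_raw_point` is hypothetical.

```python
IQ_LEVELS = [70, 85, 100, 115, 130, 145, 160]

# FrontierMath T4 expected scores at each IQ level, as listed on this page.
FRONTIERMATH_T4_LADDER = [0, 1.5, 5, 12, 25, 55, 120]


def iq_points_per_raw_point(ladder, iq_levels=IQ_LEVELS):
    """Slope of each ladder segment: IQ gained per raw benchmark point."""
    score_segments = zip(ladder, ladder[1:])
    iq_segments = zip(iq_levels, iq_levels[1:])
    return [
        (q1 - q0) / (s1 - s0)
        for (s0, s1), (q0, q1) in zip(score_segments, iq_segments)
    ]


slopes = iq_points_per_raw_point(FRONTIERMATH_T4_LADDER)
for low, high, slope in zip(IQ_LEVELS, IQ_LEVELS[1:], slopes):
    print(f"IQ {low}-{high}: {slope:.2f} IQ points per raw point")
```

On this ladder the first segment is worth 10 IQ points per raw point, while the IQ 130 to 145 segment is worth 0.5, so moving from 25% to 50% adds 12.5 implied IQ points rather than doubling anything.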

Data Process

AI IQ keeps source-backed data, extracted updates, and derived scoring separate. That separation matters because a raw benchmark chart should show what a public source actually reported, while the composite IQ can use conservative scoring-only imputations to avoid rewarding missing coverage.

Source-backed benchmark data can exist for models that are not yet shown on public chart surfaces. A model must have launch metadata, must not be hidden, and must have a derived IQ before it appears in the main IQ charts. This keeps placeholder rows inspectable in the data table without letting a single benchmark import promote them into public trend charts.

| Stage | Purpose | What is preserved |
| --- | --- | --- |
| 1. Capture | Save the raw leaderboard or source text used for an update. | The original pasted or scraped source capture. |
| 2. Extract | Map source rows to canonical model entries and fields. | A small reviewable update listing exactly which model fields changed. |
| 3. Apply | Write source-backed values into the model dataset. | Unknown values stay blank; unrelated fields are not guessed. |
| 4. Score | Derive IQ on a temporary scoring copy. | Raw benchmark values remain source-backed; imputed values are used only for derived IQ. |

Manual source captures that may be hard to reproduce exactly are archived so the same raw data can be re-parsed later if the extraction rules improve. Larger generated scrapes can be refreshed from the original source and do not need to be treated as permanent public evidence in the same way.

Chart Inclusion

AI IQ separates model data from chart display policy. A model can exist in the dataset, have source-backed benchmark rows, and still be absent from a public chart if it does not meet that chart's policy or required fields.

Most public chart surfaces use a default policy based on publication status, derived IQ availability, model type, provider-specific tier, and generation recency. Current and previous generations of provider-tier 0 or higher models are included; lower-tier variants such as mini, nano, Sonnet, Haiku, Flash, and smaller open-weight sizes are included only for the current generation. Archive and hidden rows are excluded unless explicitly overridden.

The IQ Over Time chart uses a stricter frontier-timeline policy. It requires a release date, derived IQ, a public model row, a general-purpose model type, provider tier 0 or higher, and membership in a top-lab provider grouping. Provider lines then connect non-decreasing IQ checkpoints rather than every model from that provider.

The policy fields are maintained in the admin dashboard: publication status, model type, provider tier, model line, generation offset, and display override. Provider tier is provider-specific: labels such as OpenAI mini/nano, Anthropic Sonnet/Haiku, Google Flash, NVIDIA Nano, and Qwen size tiers are not treated as generic cross-provider role names.

Sources

Benchmark scores, prices, and token usage come from public leaderboards. Each source is sampled periodically and reconciled against published numbers before being applied.

  • Artificial Analysis Intelligence Index — the primary aggregator. Provides scores for AIME, GPQA Diamond, SWE-Bench Verified, HLE, SciCode, Terminal-Bench 2.0, CritPt, LiveCodeBench, IFBench, MMMU-Pro, and the AA composite indices (Omniscience, GDPval, τ2-Bench Telecom, LCR), plus per-model pricing, response time, median throughput, total evaluation cost for the AA suite, and token-usage data used for task efficiency.
  • Arena.ai Overall — Arena.ai's head-to-head text Elo ratings and ranks
  • Arena.ai WebDev — pairwise web-app development evaluations using Bradley-Terry ratings
  • DesignArena Full Stack — human-preference Elo for end-to-end frontend + backend builds
  • ARC Prize leaderboard — ARC-AGI-1, ARC-AGI-2, ARC-AGI-3 scores, and ARC-AGI per-task cost where published
  • Vals.ai — the Vals Index, AIME, ProofBench, SWE-Bench, and LiveCodeBench source views where available
  • Scale Labs SWE-Bench Pro Public Dataset — public SWE-Bench Pro Resolve Rate source
  • SWE Marathon — long-horizon software-engineering Resolution Rate (Pass@1), using best published agent/harness rows for canonical model-level scoring
  • Cognition FrontierCode — FrontierCode Diamond score for high-quality production-code tasks, using canonical public model rows where the source model identity is clear
  • SWE-rebench — continuously evolving and decontaminated software-engineering leaderboard rows
  • MathArena — model-level expected performance across non-deprecated math competitions
  • DeepSWE — Datacurve's live leaderboard for original long-horizon software-engineering tasks
  • SWE-Bench — SWE-Bench Verified leaderboard rows, using clear single-model agent/model pairs for model-level scoring
  • LiveCodeBench — continuously refreshed coding-problem benchmark sourced from live contests
  • Terminal-Bench — Terminal-Bench 2.0 and Terminal-Bench Hard task accuracy
  • BrowseComp — hard browsing benchmark with published human-trainer performance used for calibration
  • Humanity's Last Exam — benchmark semantics for the HLE calibration caveat
  • CritPt paper — research-physics benchmark semantics for the CritPt calibration caveat
  • SciCode — scientific coding benchmark results
  • Agents' Last Exam — agent/harness leaderboard for real-world professional workflows, using Full / Overall pass rate for canonical model-level scoring
  • OSWorld-Verified and Toolathlon — sparse computer-use and tool-use signals; current score rows use clear mirror/self-reported mappings, with official sources used for benchmark semantics and exact-match validation
  • MCP Atlas — real MCP-server tool-use tasks with claim-level scoring
  • IFBench and AA Omniscience — instruction-following and factual-reliability signals used for the Reliability dimension
  • Epoch AI — FrontierMath Tier 1–3 and Tier 4 accuracy
  • EQ-Bench 3 — emotional-intelligence Elo
  • AttuneBench — emotional-attunement benchmark from real human-AI conversations

The Artificial Analysis Intelligence Index can list two rows under one display name when the same underlying model has both a reasoning and a non-reasoning configuration (the reasoning row is marked with a 💡 lightbulb icon). When the two configurations differ meaningfully on cost, latency, or quality, they are tracked as separate model entries.