Methodology
AI IQ assigns each model an estimated IQ score by evaluating performance across 5 cognitive dimensions, each measured by multiple benchmarks. Hard, ungameable benchmarks retain full IQ curves, while easier or gameable benchmarks have compressed ceilings that limit their influence. Missing benchmarks and dimensions are conservatively imputed inside the scoring pipeline, and every derived composite IQ averages all five dimension scores.
This page documents the full scoring system: how source data is captured, how the five dimensions are defined, how raw scores map to IQ via piecewise-linear interpolation, how benchmark ceilings are compressed for gameability, and how missing values are imputed without changing the source-backed benchmark table.
The 5-Dimension Framework
AI IQ organizes evaluation into five cognitive dimensions. Each dimension uses multiple benchmarks that are averaged together, with missing benchmarks conservatively imputed:
- Hard benchmarks are frontier-discriminating tests with low gameability. They retain full IQ curves with ceilings of 143–158 and can differentiate between the strongest models.
- Compressed benchmarks are easier or more gameable tests. Their anchor curves are compressed to lower ceilings (128–140), limiting how much a high score on a gameable benchmark can inflate the composite.
The composite IQ requires at least 2 of 5 dimensions to have data. Models with fewer scored dimensions do not receive a derived IQ and are skipped by IQ-ranked chart surfaces.
Formulas
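Each benchmark raw score \(s\) is converted to an IQ value via piecewise-linear interpolation over that benchmark's anchor points \(\mathbf{A} = [(s_0, a_0),\, (s_1, a_1), \ldots]\):
\[
\mathrm{IQ}(s) = a_i + (s - s_i)\,\frac{a_{i+1} - a_i}{s_{i+1} - s_i}, \qquad s_i \le s \le s_{i+1}
\]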
Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using the benchmark-level waterfall below (see Benchmark-Level Imputation).
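The five dimensions:
\[
\begin{aligned}
D_1 &= \operatorname{mean}\bigl(f(\text{ARC-AGI-2}),\ f^{135}(\text{ARC-AGI-1})\bigr) \\
D_2 &= \operatorname{mean}\bigl(f(\text{FrontierMath T4}),\ f(\text{FrontierMath T1–3}),\ f(\text{ProofBench}),\ f^{135}(\text{AIME})\bigr) \\
D_3 &= \operatorname{mean}\bigl(f(\text{SWE-Bench Pro}),\ f^{128}(\text{SWE-bench Verified}),\ f^{140}(\text{LiveCodeBench})\bigr) \\
D_4 &= \operatorname{mean}\bigl(f(\text{HLE}),\ f(\text{CritPt}),\ f^{140}(\text{SciCode}),\ f^{135}(\text{GPQA Diamond})\bigr) \\
D_5 &= \operatorname{mean}\bigl(f(\text{Terminal-Bench 2.0}),\ f(\text{Terminal-Bench Hard}),\ f(\text{BrowseComp}),\ f(\text{OSWorld-Verified}),\ f(\text{Toolathlon})\bigr)
\end{aligned}
\]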
Superscripts denote compressed ceilings. \(f\) is piecewise-linear interpolation over each benchmark's anchor curve.
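The composite IQ is the mean of all five dimension scores. At least 2 dimensions must be source-backed or predecessor-imputed before any missing whole dimensions are filled:
\[
\mathrm{IQ}_{\text{composite}} = \frac{1}{5}\sum_{k=1}^{5} D_k
\]
where \(D_k\) is the score of dimension \(k\).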
D1: Fluid Abstraction
Fluid abstraction is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to fluid intelligence in human psychometrics — raw problem-solving ability applied to patterns never seen before.
ARC-AGI-2 Hard
Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized. This is the purest test of abstract reasoning in the benchmark set — no prior knowledge helps, only the ability to infer abstract rules from examples. Far from saturation (top models ~85%). The curve compresses above IQ 140 to reflect diminishing returns in the superhuman range.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 20 | 85 |
| 40 | 95 |
| 60 | 100 |
| 75 | 115 |
| 85 | 125 |
| 95 | 140 |
| 100 | 143 |
ARC-AGI-1 Compressed · ceil 135
Same format as ARC-AGI-2 but an easier problem set. Top models now score ~96%, so it no longer discriminates at the frontier. The anchor curve is compressed from a ceiling of 152 down to 135 to limit the influence of saturated scores.
| Score % | IQ |
|---|---|
| 0 | 78 |
| 15 | 92 |
| 30 | 102 |
| 50 | 111 |
| 70 | 119 |
| 85 | 127 |
| 95 | 132 |
| 100 | 135 |
D2: Mathematical Reasoning
Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data.
FrontierMath T4 Hard
Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data. Top models currently score ~25%. T4 is averaged with FrontierMath Tier 1–3, AIME, and ProofBench to form the D2 (Mathematical Reasoning) dimension score. The curve compresses above IQ 140.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 100 |
| 15 | 120 |
| 30 | 135 |
| 50 | 142 |
| 70 | 148 |
| 100 | 155 |
AIME Compressed · ceil 135
Competition mathematics with integer answers. Old AIME problems are widely available in training data, with studies detecting 10–20 point contamination boosts. Top models now score ~98%. The original anchor curve was reshaped to be flatter in the mid-range and steeper at the top (so 80–100% scores spread out instead of bunching), then compressed from a ceiling of 146 to 135 to limit the influence of contamination-driven scores.
| Score % | IQ |
|---|---|
| 0 | 82 |
| 20 | 95 |
| 40 | 103 |
| 60 | 109 |
| 80 | 117 |
| 90 | 124 |
| 100 | 135 |
D3: Programmatic Reasoning
Practical engineering ability — the capacity to solve real-world technical problems in code and software systems. Programmatic Reasoning focuses on code repair, implementation, and algorithmic coding, while scientific research problems that use code as an execution medium live in Critical Reasoning.
SWE-Bench Pro Hard
A harder software-engineering benchmark used to complement SWE-Bench Verified. It keeps code repair in the Programmatic dimension while reducing reliance on the older verified set.
| Score (fraction, 0–1) | IQ |
|---|---|
| 0 | 70 |
| 0.20 | 105 |
| 0.40 | 125 |
| 0.55 | 138 |
| 0.65 | 145 |
| 0.80 | 152 |
| 1.00 | 158 |
SWE-bench Verified Compressed · ceil 128
Models generate patches to resolve real GitHub issues and pass unit tests. However, 94% of issues predate model training cutoffs and ~30% have solution leakage. This makes it one of the most gameable benchmarks, resulting in the most aggressive compression — from a ceiling of 144 down to 128.
| Score % | IQ |
|---|---|
| 0 | 80 |
| 15 | 92 |
| 30 | 102 |
| 50 | 110 |
| 65 | 117 |
| 80 | 123 |
| 100 | 128 |
LiveCodeBench Compressed · ceil 140
LiveCodeBench adds a broad, continuously refreshed coding signal to Programmatic Reasoning. It is compressed because it overlaps with the other programming benchmarks and current frontier models are already clustered near the top of the observed range.
| Score % | IQ |
|---|---|
| 0 | 78 |
| 20 | 92 |
| 40 | 105 |
| 60 | 116 |
| 75 | 126 |
| 85 | 134 |
| 100 | 140 |
D4: Critical Reasoning
Critical reasoning captures expert judgment, scientific literacy, and difficult problem analysis under uncertainty. The hard benchmarks test whether a model can reason through questions that push the boundaries of human expertise itself, while the compressed benchmarks test graduate-level science knowledge (GPQA Diamond) and scientific research coding (SciCode).
SciCode Compressed · ceil 140
SciCode uses code as the execution medium for realistic scientific research problems. The tasks require identifying scientific concepts, recalling domain facts, reasoning through numerical methods or simulations, and transforming that reasoning into computation. It is included in Critical Reasoning because the bottleneck is scientific research reasoning, not generic programming.
| Score % | IQ |
|---|---|
| 10 | 78 |
| 20 | 88 |
| 30 | 100 |
| 40 | 108 |
| 50 | 117 |
| 60 | 125 |
| 80 | 135 |
| 100 | 140 |
Humanity's Last Exam Hard
Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. Current top scores are in the mid-40s. The curve compresses significantly above IQ 140 — even though 100% would represent superhuman breadth of knowledge, the upper tail is flattened so current frontier scores do not overstate the composite.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 95 |
| 10 | 110 |
| 15 | 120 |
| 20 | 130 |
| 25 | 136 |
| 35 | 141 |
| 50 | 146 |
| 75 | 153 |
| 100 | 158 |
CritPt Hard
Original research-level physics problems that require multi-step analytical reasoning. Problems are original, making memorization ineffective. Current top score is ~27%. Because this benchmark is extremely difficult, a source-backed 0% is treated as weak evidence rather than an IQ floor, so the curve starts at IQ 100. The scoring curve reaches its ceiling at 20%, so higher source-backed percentages do not push the benchmark contribution above IQ 155.
| Score % | IQ |
|---|---|
| 0 | 100 |
| 0.6 | 120 |
| 1.6 | 130 |
| 3 | 135 |
| 5 | 140 |
| 8 | 145 |
| 12 | 150 |
| 20 | 155 |
GPQA Diamond Compressed · ceil 135
Graduate-level science questions written by PhD experts. A 25% score equals random guessing. Domain experts score 65–81%. The public question set is widely available in training data, making contamination a significant concern. The anchor curve is compressed from ceiling 148 to 135.
| Score % | IQ |
|---|---|
| 25 | 85 |
| 35 | 98 |
| 50 | 107 |
| 65 | 115 |
| 80 | 123 |
| 90 | 131 |
| 100 | 135 |
D5: Agentic Reasoning
Agentic reasoning measures practical task execution across environments: navigating tools, using context, taking multi-step actions, and recovering enough from intermediate state to finish the task.
Terminal-Bench 2.0 Hard
Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 10 | 85 |
| 25 | 100 |
| 40 | 115 |
| 55 | 128 |
| 65 | 136 |
| 80 | 142 |
| 100 | 148 |
Terminal-Bench Hard Hard
Terminal-Bench Hard is tracked separately from Terminal-Bench 2.0 and contributes an additional source-backed agentic reasoning signal when available. It uses the same IQ anchor curve as Terminal-Bench 2.0, but raw values remain separate and are never backfilled into the Terminal-Bench 2.0 field.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 10 | 85 |
| 25 | 100 |
| 40 | 115 |
| 55 | 128 |
| 65 | 136 |
| 80 | 142 |
| 100 | 148 |
BrowseComp Hard
BrowseComp measures hard browsing and research task completion. Scores are stored as percentages and mapped through a hard-benchmark curve with a 148 ceiling.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 20 | 85 |
| 40 | 96 |
| 50 | 105 |
| 60 | 116 |
| 75 | 132 |
| 85 | 140 |
| 100 | 148 |
OSWorld-Verified Hard
OSWorld-Verified measures desktop/computer-use task completion, capturing a model's ability to operate an external environment rather than only answer a static prompt.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 20 | 90 |
| 40 | 102 |
| 60 | 122 |
| 75 | 136 |
| 85 | 143 |
| 100 | 150 |
Toolathlon Hard
Toolathlon measures tool-use task performance. Scores are stored as percentages and mapped through an agentic hard-benchmark curve with a 152 ceiling.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 15 | 88 |
| 30 | 105 |
| 40 | 120 |
| 45 | 128 |
| 55 | 140 |
| 70 | 148 |
| 100 | 152 |
Piecewise-Linear Interpolation
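Each benchmark defines a set of anchor points \((s_0, a_0), (s_1, a_1), \ldots\) mapping raw scores to IQ values. For a score \(s\) that falls between two adjacent anchors \(s_i \le s \le s_{i+1}\), we use piecewise-linear interpolation:
\[
\mathrm{IQ}(s) = a_i + (s - s_i)\,\frac{a_{i+1} - a_i}{s_{i+1} - s_i}
\]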
If the score is at or below the lowest anchor, the model receives that anchor's IQ. If at or above the highest, it receives the ceiling IQ. There is no extrapolation beyond the defined range.
This approach avoids assumptions about the distribution shape between anchors. Each segment can have a different slope, allowing the curve to be steeper where small score improvements represent large cognitive leaps (e.g., going from 0% to 5% on HLE) and flatter where additional points reflect diminishing differentiation.
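The interpolation and clamping rules above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `score_to_iq` is hypothetical, and the anchors are copied from the Humanity's Last Exam table on this page.

```python
from bisect import bisect_right

# Anchor points copied from the Humanity's Last Exam table: (score %, IQ).
HLE_ANCHORS = [(0, 70), (5, 95), (10, 110), (15, 120), (20, 130),
               (25, 136), (35, 141), (50, 146), (75, 153), (100, 158)]

def score_to_iq(score, anchors):
    """Piecewise-linear interpolation, clamped at the first and last anchor."""
    xs = [s for s, _ in anchors]
    if score <= xs[0]:
        return float(anchors[0][1])      # no extrapolation below the curve
    if score >= xs[-1]:
        return float(anchors[-1][1])     # no extrapolation above the ceiling
    i = bisect_right(xs, score)          # first anchor strictly above the score
    (s0, a0), (s1, a1) = anchors[i - 1], anchors[i]
    return a0 + (score - s0) * (a1 - a0) / (s1 - s0)

print(score_to_iq(2.5, HLE_ANCHORS))   # halfway between 0% and 5%: 82.5
print(score_to_iq(42.5, HLE_ANCHORS))  # halfway between 35% and 50%: 143.5
```

The two printed values show the changing slope: 2.5 points at the bottom of the HLE curve are worth 12.5 IQ points, while 7.5 points in the 35–50% segment are worth only 2.5.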
Benchmark Averaging & Compression
Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. Rather than separating benchmarks into primary/fallback tiers with a hard cap, we use compressed anchor curves to limit the influence of easier or gameable benchmarks.
How Compression Works
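For compressed benchmarks, the anchor curve is rescaled so that IQ values above 100 are proportionally reduced toward a lower ceiling. With \(a\) an original anchor IQ, \(c_{\text{old}}\) the original ceiling, and \(c_{\text{new}}\) the compressed ceiling:
\[
a' =
\begin{cases}
a, & a \le 100 \\
100 + (a - 100)\,\dfrac{c_{\text{new}} - 100}{c_{\text{old}} - 100}, & a > 100
\end{cases}
\]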
Values at or below IQ 100 are unchanged. This preserves the low-end of the curve (where models genuinely struggle) while compressing the high-end where gameable benchmarks over-reward.
Why compress instead of cap? A hard cap (e.g., IQ 115) discards all discrimination above the cap — a model scoring 80% and one scoring 100% on AIME would both receive 115. Compression preserves the rank ordering while reducing the magnitude of the advantage that gameable benchmarks can confer. A perfect AIME score now yields IQ 135 instead of 146, which still contributes meaningfully but cannot dominate the dimension average.
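A minimal sketch of the rescaling, assuming "proportionally reduced" means a linear rescale of the part of each anchor above IQ 100. The function name `compress_anchors` and the uncompressed example curve are hypothetical, since this page lists only the compressed curves; only the 146 and 135 ceilings come from the AIME description.

```python
def compress_anchors(anchors, new_ceiling, pivot=100.0):
    """Rescale anchor IQs above `pivot` so the top anchor lands on `new_ceiling`."""
    old_ceiling = max(iq for _, iq in anchors)
    if old_ceiling <= pivot:
        raise ValueError("curve never rises above the pivot; nothing to compress")
    compressed = []
    for score, iq in anchors:
        if iq > pivot:
            # Multiply before dividing to keep the arithmetic as exact as possible.
            iq = pivot + (iq - pivot) * (new_ceiling - pivot) / (old_ceiling - pivot)
        compressed.append((score, iq))
    return compressed

# Hypothetical uncompressed curve with a ceiling of 146, compressed to 135.
original = [(0, 80), (50, 100), (75, 123), (100, 146)]
print(compress_anchors(original, 135))
```

Anchors at or below the pivot pass through unchanged, and the ordering of the anchors is preserved, which is the property the comparison with a hard cap relies on.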
Benchmark-Level Imputation
When a model has scores for some but not all of a dimension's benchmarks, the missing benchmarks are filled in before the dimension IQ is averaged. We use two ingredients:
- The model's available-benchmark IQ average — how the model is performing on the benchmarks it does have in this dimension. This is the within-dimension signal: if a model is hitting IQ 130 on the dimension's other benchmarks, the missing one is probably also somewhere around 130.
- The benchmark's 80th-percentile IQ (\(P_{80}\)) — a per-benchmark ceiling derived from the actual data. Take every model that has a real score on that benchmark, convert each score to an implied IQ via the anchor curve, sort those IQs from low to high, and take the value at the 80th-percentile rank. So if 50 models have HLE scores yielding implied IQs ranging from 70 to 155, \(P_{80}(\text{HLE})\) is the implied IQ at the 80th-percentile rank in that sorted list. It is where strong-but-not-frontier measured models actually land on this benchmark.
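The imputed value is the minimum of the two:
\[
\mathrm{IQ}_{\text{imputed}} = \min\bigl(\overline{\mathrm{IQ}}_{\text{available}},\ P_{80}\bigr)
\]
where \(\overline{\mathrm{IQ}}_{\text{available}}\) is the model's available-benchmark IQ average in the dimension.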
Why min of the two? The model's own dimension average is the best within-dimension signal we have. Capping at the 80th-percentile prevents a strong model from being imputed past where the actual data has been observed — a model averaging IQ 145 in this dimension might hit the missing benchmark's ceiling, but the imputed value won't claim that without measurement. The min lets imputation move a missing score up or down toward what the rest of the dimension implies, while staying conservatively below where the field has empirically reached.
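The rule can be illustrated with a short sketch. The page does not say which percentile convention is used, so the nearest-rank definition below is an assumption, as are the function names and the sample field of implied IQs.

```python
import math

def p80(implied_iqs):
    """80th-percentile implied IQ, using the nearest-rank convention (an assumption)."""
    ranked = sorted(implied_iqs)
    rank = math.ceil(80 * len(ranked) / 100)   # 1-based rank
    return ranked[rank - 1]

def impute_benchmark_iq(available_iqs, field_iqs):
    """min(model's within-dimension average, benchmark P80)."""
    dim_avg = sum(available_iqs) / len(available_iqs)
    return min(dim_avg, p80(field_iqs))

# Hypothetical implied IQs of ten models measured on the missing benchmark.
field = [90, 100, 105, 110, 120, 125, 130, 140, 150, 155]

print(impute_benchmark_iq([145, 147], field))  # strong model: capped at P80 = 140
print(impute_benchmark_iq([120, 124], field))  # weaker model: its own average, 122.0
```

In the first call the cap binds and in the second the model's own average binds, which is the two-sided behaviour the paragraph above describes.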
Benchmark Imputation Waterfall
| Step | When it applies | What happens | Result |
|---|---|---|---|
| 1. Source value | The model has a benchmark score. | Use the raw source-backed value. | Real benchmark IQ |
| 2. Primary / secondary predecessor | The benchmark is missing and the model has a clear primary or secondary predecessor. | Try the primary predecessor first, then the secondary predecessor, and use the first valid scoring value for that benchmark. | Predecessor-imputed benchmark IQ |
| 3. Hard-benchmark zero | A historically near-zero hard benchmark is missing. | Use 0 as the scoring-only value, while leaving the raw benchmark field blank. | Zero-assumed benchmark IQ |
| 4. Within-dimension estimate | The dimension has at least one other benchmark for this model. | Average the model's available benchmark IQs in that dimension, then cap by the benchmark's 80th percentile. | \(\min(\text{dimension benchmark average}, \text{benchmark }P_{80})\) |
| 5. No benchmark estimate | The model has no benchmark data in that dimension. | Do not invent individual benchmark rows. | The whole dimension is handled by dimension-level imputation. |
For ARC-AGI-2 and CritPt, missing values are treated as zero in the scoring pipeline because historical frontier models generally scored at or near zero until directly shown otherwise. For models released before April 2025, the same scoring-only zero assumption is also applied to FrontierMath T4, ProofBench, and Terminal-Bench 2.0. The raw benchmark table still leaves those values blank when no source row exists. Other missing benchmarks use the ordinary imputation waterfall.
Predecessor imputation is constrained by model family, and non-reasoning variants do not impute from reasoning variants. These lineage choices are scoring assumptions only; they do not create source-backed raw benchmark values.
Imputation only fires inside a dimension that has at least one real or hard-zero benchmark. If a dimension has no usable benchmark value for a given model, no benchmark-level imputation runs there — the dimension itself is either left missing or filled at the dimension level (see Composite IQ Calculation below).
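The ordering of the waterfall can be summarized as a single function. This is a sketch under stated assumptions: the function and argument names are hypothetical, all inputs are already expressed as IQ values, and the caller decides whether the hard-benchmark zero applies (passing the curve's IQ at a raw score of 0, for example 70 for ARC-AGI-2 or 100 for CritPt).

```python
from typing import Optional, Sequence, Tuple

def resolve_benchmark_iq(
    source_iq: Optional[float],
    predecessor_iqs: Sequence[Optional[float]] = (),
    hard_zero_iq: Optional[float] = None,
    sibling_iqs: Sequence[float] = (),
    p80_iq: Optional[float] = None,
) -> Tuple[Optional[float], str]:
    """Return (scoring IQ, waterfall step) for one benchmark of one model."""
    if source_iq is not None:                      # step 1: source-backed value
        return source_iq, "source"
    for iq in predecessor_iqs:                     # step 2: primary, then secondary
        if iq is not None:
            return iq, "predecessor"
    if hard_zero_iq is not None:                   # step 3: scoring-only zero
        return hard_zero_iq, "hard_zero"
    if sibling_iqs and p80_iq is not None:         # step 4: within-dimension estimate
        avg = sum(sibling_iqs) / len(sibling_iqs)
        return min(avg, p80_iq), "within_dimension"
    return None, "dimension_level"                 # step 5: defer to dimension level

print(resolve_benchmark_iq(None, [None, 111.0]))
print(resolve_benchmark_iq(None, sibling_iqs=[130.0, 140.0], p80_iq=128.0))
```

Returning the step name alongside the value keeps the scoring-only origin of each number visible, in line with the rule that imputed values never become source-backed raw values.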
Composite IQ Calculation
Step 1: Score Each Dimension
For every dimension where the model has at least one real benchmark, compute the dimension IQ as the average of its benchmarks. If some benchmarks within the dimension are missing, fill them via the conservative imputation above before averaging.
Step 2: Dimension-Level Imputation
If a model has at least 2 scored dimensions but is missing some of the others, every missing dimension is imputed before the composite is averaged. The cap is matched to models with real data for that dimension and similar capability across the other dimensions:
\[
\widehat{D}_k = \min\bigl(\overline{\mathrm{IQ}}_{\text{scored dims}},\ Q_1^{\text{matched}}(D_k)\bigr)
\]
where \(\overline{\mathrm{IQ}}_{\text{scored dims}}\) is the model's average IQ across the dimensions it does have and \(Q_1^{\text{matched}}(D_k)\) is the lower-quartile cap for the missing dimension \(D_k\). For the cap, we look at models released on or before the scored model that have real data on the missing dimension, compare their average IQ across the other dimensions, and use the lower-quartile missing-dimension IQ among the closest comparable models. If the comparable set is too thin, we fall back to the same-era lower quartile for that dimension. If no same-era real data exists for a dimension, we do not invent a neutral default; the model does not receive a derived all-dimension IQ.
In practice, comparable models are those whose average across the non-missing dimensions is within a small IQ band of the model being scored and whose release date is not later than the scored model. If fewer than three comparable models are available, we use the nearest same-era measured models instead. This keeps missing dimensions conservative without letting older models borrow strength from future benchmark cohorts.
All five dimensions are always used for derived IQ. Missing a hard dimension such as Fluid Abstraction should not improve a model's score, so missing dimensions are filled conservatively rather than omitted from the average.
Dimension Imputation Waterfall
| Step | When it applies | What happens | Result |
|---|---|---|---|
| 1. Scored dimension | The model has at least one benchmark in the dimension after predecessor imputation. | Score the available benchmarks, fill missing benchmarks with the benchmark waterfall, then average. | Real/scored dimension IQ |
| 2. Matched lower-quartile cap | A whole dimension is still missing and the model has at least two scored dimensions. | Find models with real data for the missing dimension and similar average across the other dimensions. | \(\min(\text{model scored-dimension average}, \text{matched lower-quartile }D_k)\) |
| 3. Nearest-neighbor lower quartile | Too few models fall within the similarity radius. | Use the nearest comparable models by other-dimension average. | \(\min(\text{model scored-dimension average}, \text{nearest-neighbor lower quartile }D_k)\) |
| 4. Global dimension lower quartile | The comparable set is still too thin. | Use the lower-quartile observed IQ for that dimension across models with real data. | \(\min(\text{model scored-dimension average}, \text{global lower-quartile }D_k)\) |
| 5. No derived IQ | No real data exists for that dimension at all. | Do not invent a neutral default. | No derived all-dimension IQ. |
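A rough sketch of the dimension-level waterfall follows. Several details are assumptions rather than documented behaviour: the width of the similarity band (`band=5.0`), the quartile convention (Python's inclusive method), the function names, and the sample data. The caller is assumed to have already restricted `candidates` to models released on or before the scored model.

```python
import statistics
from typing import List, Optional, Tuple

def lower_quartile(values: List[float]) -> float:
    """First quartile; a single value is returned as-is."""
    if len(values) == 1:
        return float(values[0])
    return statistics.quantiles(values, n=4, method="inclusive")[0]

def impute_dimension(
    scored_avg: float,
    candidates: List[Tuple[float, float]],  # (other-dimension average, missing-dimension IQ)
    band: float = 5.0,
    min_matches: int = 3,
) -> Optional[float]:
    if not candidates:
        return None                          # no real data: no derived IQ
    matched = [iq for avg, iq in candidates if abs(avg - scored_avg) <= band]
    if len(matched) < min_matches:
        # Too few models inside the band: take the nearest neighbours instead.
        # With fewer than `min_matches` candidates overall this uses all of them,
        # which plays the role of the global fallback.
        nearest = sorted(candidates, key=lambda c: abs(c[0] - scored_avg))[:min_matches]
        matched = [iq for _, iq in nearest]
    return min(scored_avg, lower_quartile(matched))

cands = [(128.0, 118.0), (130.0, 122.0), (131.0, 126.0), (133.0, 130.0), (150.0, 145.0)]
print(impute_dimension(130.0, cands))  # four models within the band, cap binds: 121.0
print(impute_dimension(110.0, cands))  # nearest-neighbour cap is 120.0, own average binds: 110.0
```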
Step 3: Compute the Composite
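\[
\mathrm{IQ}_{\text{composite}} = \frac{1}{5}\sum_{k=1}^{5} D_k
\]
where \(D_k\) is the IQ of dimension \(k\) and all five dimensions are used once missing dimensions are imputed.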
Key rules:
- Minimum 2 dimensions required. Models with fewer than 2 scored dimensions do not receive a derived composite IQ.
- No omitted dimensions. Models with enough coverage for derived IQ always use a 5-dimension composite; missing dimensions are conservatively imputed.
- Transparent count. The display shows X/5 so readers can see how many dimensions had source-backed data before dimension-level imputation.
- Equal weighting. All dimensions contribute equally. Compressed ceilings (not differential weighting) handle benchmark quality differences.
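The key rules can be expressed as a small function. The names `composite_iq`, `scored`, and `imputed` are hypothetical, and the dimension values in the example are invented for illustration.

```python
from typing import Dict, Optional, Tuple

DIMENSIONS = ("D1", "D2", "D3", "D4", "D5")

def composite_iq(
    scored: Dict[str, float],    # dimensions with data before dimension-level imputation
    imputed: Dict[str, float],   # dimension-level fills for the remaining dimensions
) -> Optional[Tuple[float, str]]:
    """Return (composite IQ, 'X/5' coverage label), or None if no IQ can be derived."""
    if len(scored) < 2:
        return None                               # minimum 2 scored dimensions
    filled = {**imputed, **scored}                # scored values take precedence
    if any(d not in filled for d in DIMENSIONS):
        return None                               # a dimension could not be imputed
    mean = sum(filled[d] for d in DIMENSIONS) / len(DIMENSIONS)   # equal weighting
    return mean, f"{len(scored)}/5"

print(composite_iq({"D2": 130.0, "D3": 126.0, "D4": 134.0},
                   {"D1": 118.0, "D5": 122.0}))   # (126.0, '3/5')
```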
Rank Status
Each model receives a rank status reflecting the completeness of its evaluation:
- Full — All 5 dimensions scored. The most reliable composite.
- Partial — 2–4 dimensions scored. Composite is derived but based on incomplete coverage.
- Provisional — Only 1 dimension scored. Not enough for a derived composite.
- Unranked — No dimension data available.
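As a compact restatement, the status depends only on the number of scored dimensions. The function name below is hypothetical.

```python
def rank_status(scored_dimensions: int) -> str:
    """Map the count of scored dimensions (0-5) to a rank status label."""
    if scored_dimensions >= 5:
        return "Full"
    if scored_dimensions >= 2:
        return "Partial"
    if scored_dimensions == 1:
        return "Provisional"
    return "Unranked"

print([rank_status(n) for n in range(6)])
```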
Tracked Benchmarks & Exclusions
Some benchmarks are tracked in the models table or shown in standalone charts but are not part of the composite IQ calculation:
- MMLU-Pro — A 10-choice multiple-choice knowledge test. Overlaps with the Critical Reasoning dimension (GPQA/HLE) and adds limited discrimination at the frontier. Models have converged to similar high scores.
- MMMU-Pro — Multimodal academic questions. While the vision component is interesting, most frontier model evaluation focuses on text-based reasoning. This benchmark is tracked in the data but excluded from the IQ composite.
These benchmarks remain in the database and are viewable on the models page — they are simply not included in the composite IQ computation.
Core Math Benchmarks
FrontierMath Tier 1–3 and ProofBench are core inputs to the D2 (Mathematical Reasoning) dimension and are surfaced on the IQ page as standalone cost scatter plots.
- FrontierMath Tier 1–3 (ceiling 152) — harder than AIME, easier than T4. Top general models score ~50%. Slots between AIME and T4 in difficulty and gives more discrimination in the middle of the math distribution where T4 is too sparse.
- ProofBench (ceiling 158) — formally-verified proof writing. A different cognitive task than the problem-solving benchmarks (you have to construct a verified proof, not just give an answer). Top general models ~56%; specialized math models ~71%.
EQ Scoring
AI IQ estimates an Emotional Quotient (EQ) for each model from three interaction-quality signals: Arena Elo (broad conversational quality as ranked by human-preference voting), IFBench (instruction-following and constraint adherence), and EQ-Bench 3 Elo (AI-judged emotional/social reasoning in challenging roleplays). EQ-Bench 3 is retained because it is the strongest recent dedicated emotional/social reasoning signal, but it is style-sensitive and judged by Claude, so it is treated as one component rather than as neutral ground truth. Each source score is mapped to an implied EQ via a hand-calibrated anchor curve, then the available components are averaged when at least two source-backed components are present.
If only one source is available, the row remains eligible for that component benchmark chart, but it does not receive a composite EQ. This keeps single-benchmark coverage from outranking models with broader interaction-quality evidence.
EQ-Bench 3 Elo → EQ
EQ-Bench 3 produces Elo ratings from head-to-head emotional-roleplay matchups judged by Claude. This makes it valuable as a dedicated affective/social reasoning signal, but also sensitive to Claude-like response style. The Elo range observed in production runs roughly from 200 (very weak) to 2000 (top frontier). The mapping:
| EQ-Bench Elo | EQ |
|---|---|
| 200 | 78 |
| 600 | 88 |
| 900 | 93 |
| 1100 | 97 |
| 1300 | 105 |
| 1500 | 113 |
| 1700 | 125 |
| 2000 | 140 |
Arena Elo → EQ
LM Arena Elo reflects broad conversational quality as judged by human voters in head-to-head matchups. The observed Elo range is tighter (~1100–1520), so the anchor curve is calibrated separately:
| Arena Elo | EQ |
|---|---|
| 1100 | 70 |
| 1200 | 80 |
| 1300 | 95 |
| 1350 | 105 |
| 1400 | 113 |
| 1450 | 122 |
| 1500 | 132 |
| 1520 | 140 |
IFBench → EQ
IFBench captures whether the model listens to instructions and respects user constraints. This is not emotional intelligence in the narrow sense, but it is part of practical interaction quality.
| IFBench % | EQ |
|---|---|
| 0 | 70 |
| 30 | 85 |
| 50 | 100 |
| 65 | 112 |
| 75 | 125 |
| 85 | 138 |
| 100 | 145 |
EQ-Bench Style-Sensitivity Adjustment
Because EQ-Bench 3 is judged by Claude, it can favor Claude-like response style and penalize models that solve emotional/social scenarios in a substantially different voice. To reduce that family/style bias while preserving EQ-Bench as a useful dedicated signal, we subtract a 100-point Elo adjustment from the EQ-Bench component for Anthropic models before mapping to implied EQ. Arena and IFBench are unaffected.
Why three sources? Arena is human-judged and captures broad conversational preference; IFBench adds a listening/instruction-following signal; EQ-Bench adds dedicated emotional/social reasoning coverage. Requiring at least two components balances specificity, judgment-source diversity, and practical user experience.
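The EQ pipeline can be sketched end to end with the three anchor tables above. Two things are assumed here rather than stated on this page: that the EQ curves use the same piecewise-linear interpolation as the IQ curves, and the function and flag names (`composite_eq`, `is_anthropic`).

```python
EQBENCH = [(200, 78), (600, 88), (900, 93), (1100, 97),
           (1300, 105), (1500, 113), (1700, 125), (2000, 140)]
ARENA = [(1100, 70), (1200, 80), (1300, 95), (1350, 105),
         (1400, 113), (1450, 122), (1500, 132), (1520, 140)]
IFBENCH = [(0, 70), (30, 85), (50, 100), (65, 112),
           (75, 125), (85, 138), (100, 145)]

def interp(x, anchors):
    """Piecewise-linear interpolation, clamped at both ends (assumed for EQ curves)."""
    if x <= anchors[0][0]:
        return float(anchors[0][1])
    if x >= anchors[-1][0]:
        return float(anchors[-1][1])
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def composite_eq(arena=None, ifbench=None, eqbench=None, is_anthropic=False):
    """Average the available implied EQs; require at least two components."""
    parts = []
    if arena is not None:
        parts.append(interp(arena, ARENA))
    if ifbench is not None:
        parts.append(interp(ifbench, IFBENCH))
    if eqbench is not None:
        elo = eqbench - 100 if is_anthropic else eqbench  # style-sensitivity adjustment
        parts.append(interp(elo, EQBENCH))
    return sum(parts) / len(parts) if len(parts) >= 2 else None

print(composite_eq(arena=1400, ifbench=50))   # (113 + 100) / 2 = 106.5
print(composite_eq(eqbench=1500))             # single component: None
```

With the adjustment applied, an EQ-Bench Elo of 1500 for an Anthropic model is mapped as 1400, which lowers that component from 113 to 109 on the table above.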
Cost & Speed Metrics
Sticker Price — published price for a typical workload
AI IQ's effective-cost views are anchored to a 2:1 input-to-output token mix, a deliberately input-heavy workload that reflects most real applications (RAG, long-context reasoning, agent loops). Sticker Price is the dollar amount to process 2M input tokens and generate 1M output tokens at a model's published rates:
\[
\text{Sticker Price} = 2\,p_{\text{in}} + p_{\text{out}}
\]
where \(p_{\text{in}}\) and \(p_{\text{out}}\) are the published per-million-token prices in dollars.
Task Efficiency — how much work does the model use?
Sticker price alone hides large per-task differences in how much work a model uses to solve a benchmark. We estimate this with a blended usage multiplier. For each benchmark, AI IQ first estimates the task cost expected from a model's published input and output prices. The usage signal is the residual: actual task cost divided by expected task cost. Direct token-usage data is included as an additional signal where available.
When only one channel is available, AI IQ uses that channel directly. The Task Efficiency chart shows the inverse of the usage multiplier, so 2× means the model uses about half the task effort of the median model, and 0.5× means it uses about twice as much.
Effective Cost — what it actually costs to do the same task
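Effective Cost is the product of Sticker Price and the usage multiplier:
\[
\text{Effective Cost} = \text{Sticker Price} \times \text{usage multiplier}
\]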
Reads as: what this model spends on a task after adjusting its published price by observed and price-adjusted benchmark usage. Models below the diagonal (Effective Cost < Sticker Price) are task-efficient and cheaper than their sticker suggests; models above are task-hungry. This is the cost axis on every effective-cost-vs-quality chart.
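A short worked example of the three cost quantities, assuming Effective Cost is Sticker Price multiplied by the usage multiplier and that Task Efficiency is the multiplier's inverse. The function names and the prices ($3 and $15 per million tokens) are hypothetical.

```python
def sticker_price(p_in: float, p_out: float) -> float:
    """Dollars for 2M input tokens plus 1M output tokens at per-million prices."""
    return 2 * p_in + 1 * p_out

def task_efficiency(usage_multiplier: float) -> float:
    """Inverse of the usage multiplier: 2.0 means about half the median task effort."""
    return 1 / usage_multiplier

def effective_cost(p_in: float, p_out: float, usage_multiplier: float) -> float:
    """Sticker price adjusted by how much work the model uses per task."""
    return sticker_price(p_in, p_out) * usage_multiplier

print(sticker_price(3.0, 15.0))        # 21.0
print(task_efficiency(0.5))            # 2.0
print(effective_cost(3.0, 15.0, 0.5))  # 10.5, below the sticker price
```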
Response Time
Response time is the median seconds to a complete answer (lower is better), shown on a logarithmic scale. The IQ vs Response Time chart reverses the X axis so the upper-right corner represents the ideal — high intelligence at low latency.
Limitations & Transparency
- Dimension coverage varies. Some models have data for all 5 dimensions; others have as few as 2 (with the rest imputed). A model's composite IQ is most reliable when all dimensions are scored. Always check the X/5 count and rank status.
- Benchmark mix matters. Two models with the same composite IQ may have very different underlying data quality. One might have all hard benchmarks (ungameable tests with full curves) while another relies mostly on compressed benchmarks (with lower ceilings). The rank status and dimension count help distinguish these cases.
- Imputation is conservative, not clairvoyant. Missing values are filled first from explicit direct-predecessor lineage when available, then from within-dimension benchmark evidence, then from comparable measured models at the dimension level. These are reasonable estimates, not ground truth — a model's true ability on an unevaluated benchmark could be significantly higher or lower.
- Anchor calibration is subjective. The mapping from raw scores to IQ involves judgment calls about what different performance levels mean relative to human cognitive ability. We document our rationale for each benchmark, but reasonable people can disagree.
- IQ is a metaphor. Human IQ tests measure a specific construct via standardized instruments under controlled conditions. AI benchmark performance is a different thing. The IQ scale provides an intuitive frame of reference, not a claim of equivalence.
- Compressed ceilings are a design choice. The ceiling values directly affect which models benefit and which are penalized. Models that excel on compressed benchmarks will have their contributions limited, which may feel unfair if those benchmarks genuinely reflect high ability. We believe the trade-off — rewarding harder evaluation — is correct, but the specific ceiling values are judgment calls.
- Benchmarks become stale. As models improve and training data evolves, benchmark ceilings, gameability ratings, and compression levels may need revision. This methodology is a living document.
Asymptotic Compression Above IQ 140
The anchor point curves intentionally compress above IQ 140. Each additional percentage point on a benchmark contributes less to the IQ score in the superhuman range than in the human range. This reflects three realities:
- Human IQ distributions compress at the tails. Scores between IQ 100 and IQ 120 are far more common than scores between IQ 140 and IQ 160.
- Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
- Practical discrimination. Without compression, reasoning vs. non-reasoning configurations of the same model produce 20+ point IQ gaps, which is unrealistic. With compression, the gap narrows to ~10-12 points (“smart” vs. “very smart” rather than “above average” vs. “genius”).
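The compression ensures that no benchmark can single-handedly produce an IQ above its curve's ceiling (158 at most under the current curves), regardless of raw score. Because each dimension averages hard and compressed benchmarks with equal weight, a model with perfect scores on every benchmark would reach a composite of roughly 145 under the ceilings listed on this page.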
Data Process
AI IQ keeps source-backed data, extracted updates, and derived scoring separate. That separation matters because a raw benchmark chart should show what a public source actually reported, while the composite IQ can use conservative scoring-only imputations to avoid rewarding missing coverage.
Source-backed benchmark data can exist for models that are not yet shown on public chart surfaces. A model must have launch metadata, must not be hidden, and must have a derived IQ before it appears in the main IQ charts. This keeps placeholder rows inspectable in the data table without letting a single benchmark import promote them into public trend charts.
| Stage | Purpose | What is preserved |
|---|---|---|
| 1. Capture | Save the raw leaderboard or source text used for an update. | The original pasted or scraped source capture. |
| 2. Extract | Map source rows to canonical model entries and fields. | A small reviewable update listing exactly which model fields changed. |
| 3. Apply | Write source-backed values into the model dataset. | Unknown values stay blank; unrelated fields are not guessed. |
| 4. Score | Derive IQ on a temporary scoring copy. | Raw benchmark values remain source-backed; imputed values are used only for derived IQ. |
Manual source captures that may be hard to reproduce exactly are archived so the same raw data can be re-parsed later if the extraction rules improve. Larger generated scrapes can be refreshed from the original source and do not need to be treated as permanent public evidence in the same way.
Chart Inclusion
AI IQ separates model data from chart display policy. A model can exist in the dataset, have source-backed benchmark rows, and still be absent from a public chart if it does not meet that chart's policy or required fields.
Most public chart surfaces use a default policy based on publication status, derived IQ availability, model type, provider-specific tier, and generation recency. Current and previous generations of provider-tier 0 or higher models are included; lower-tier variants such as mini, nano, Sonnet, Haiku, Flash, and smaller open-weight sizes are included only for the current generation. Archive and hidden rows are excluded unless explicitly overridden.
The IQ Over Time chart uses a stricter frontier-timeline policy. It requires a release date, derived IQ, a public model row, a general-purpose model type, provider tier 0 or higher, and membership in a top-lab provider grouping. Provider lines then connect non-decreasing IQ checkpoints rather than every model from that provider.
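One way to read "non-decreasing IQ checkpoints" is as a running maximum over a provider's models in release order. The sketch below takes that interpretation as an assumption; the function name and the sample dates and IQ values are hypothetical.

```python
def frontier_checkpoints(models):
    """Keep only models whose IQ is at least the best IQ seen so far.

    `models` is an iterable of (release_date, iq) pairs with ISO-format
    date strings, so plain sorting orders them chronologically.
    """
    best = float("-inf")
    line = []
    for date, iq in sorted(models):
        if iq >= best:
            line.append((date, iq))
            best = iq
    return line

provider = [("2024-09-01", 105), ("2024-03-01", 110),
            ("2025-06-01", 120), ("2025-02-01", 120)]
print(frontier_checkpoints(provider))
# The 2024-09-01 model (105) is skipped because it falls below the earlier 110.
```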
The policy fields are maintained in the admin dashboard: publication status, model type, provider tier, model line, generation offset, and display override. Provider tier is provider-specific; for example OpenAI mini/nano, Anthropic Sonnet/Haiku, Google Flash, NVIDIA Nano, and Qwen size tiers are not treated as generic cross-provider role names.
Sources
Benchmark scores, prices, and token usage come from publicly published leaderboards. Each source is sampled periodically and reconciled against published numbers before being applied.
- Artificial Analysis Intelligence Index — the primary aggregator. Provides scores for AIME, GPQA Diamond, SWE-Bench Verified, HLE, SciCode, Terminal-Bench 2.0, CritPt, LiveCodeBench, IFBench, MMMU-Pro, and the AA composite indices (Omniscience, GDPval, τ2-Bench Telecom, LCR), plus per-model pricing, response time, median throughput, total evaluation cost for the AA suite, and token-usage data used for task efficiency.
- LM Arena — head-to-head Elo ratings and ranks
- ARC Prize leaderboard — ARC-AGI-1 and ARC-AGI-2 scores and per-task cost; ARC-AGI-3 is tracked in the admin/models-table pipeline but is not yet used in IQ scoring
- Vals.ai — the Vals Index, AIME, ProofBench, SWE-Bench, SWE-Bench Pro, and LiveCodeBench source views where available
- SWE-Bench — SWE-Bench Verified leaderboard rows, using clear single-model agent/model pairs for model-level scoring
- Terminal-Bench — Terminal-Bench 2.0 and Terminal-Bench Hard task accuracy
- SciCode — scientific coding benchmark results
- BrowseComp, OSWorld-Verified, and Toolathlon — curated source captures used for the Agentic Reasoning dimension
- Epoch AI — FrontierMath Tier 1–3 and Tier 4 accuracy
- EQ-Bench 3 — emotional-intelligence Elo
The Artificial Analysis Intelligence Index can list two rows for the same model under one display name when the same underlying model has both a reasoning and a non-reasoning configuration (the reasoning row is marked with a 💡 lightbulb icon). When the two configurations differ meaningfully on cost, latency, or quality, they are tracked as separate model entries (e.g. reasoning vs non-reasoning variants of the same release).