Methodology
AI IQ assigns each model an estimated IQ score by evaluating performance across 4 cognitive dimensions, each measured by multiple benchmarks. Hard, ungameable benchmarks retain full IQ curves, while easier or gameable benchmarks have compressed ceilings that limit their influence. Missing benchmarks and dimensions are conservatively imputed, and the composite IQ is the mean of all four dimension scores.
This page documents the full scoring system: how the four dimensions are defined, how raw scores map to IQ via piecewise-linear interpolation, how benchmark ceilings are compressed for gameability, and how missing values are imputed.
The 4-Dimension Framework
AI IQ organizes evaluation into four cognitive dimensions. Each dimension uses multiple benchmarks that are averaged together, with missing benchmarks conservatively imputed:
- Hard benchmarks are frontier-discriminating tests with low gameability. They retain full IQ curves with ceilings of 143–158 and can differentiate between the strongest models.
- Compressed benchmarks are easier or more gameable tests. Their anchor curves are compressed to lower ceilings (128–140), limiting how much a high score on a gameable benchmark can inflate the composite.
The composite IQ requires at least 2 of 4 dimensions to have data. Models with fewer scored dimensions fall back to a manual IQ estimate.
Formulas
Each benchmark raw score \(s\) is converted to an IQ value via piecewise-linear interpolation over that benchmark's anchor points \(\mathbf{A} = [(s_0, a_0),\, (s_1, a_1), \ldots]\):

\[
f(s) = a_i + \frac{s - s_i}{s_{i+1} - s_i}\,(a_{i+1} - a_i) \quad \text{for } s_i \le s \le s_{i+1},
\]

with \(f(s) = a_0\) below the lowest anchor and \(f(s)\) equal to the ceiling IQ above the highest anchor (no extrapolation).
Each dimension averages the IQ values of its benchmarks. Missing benchmarks are conservatively imputed before averaging using a symmetric 3-tier system (see Benchmark-Level Imputation).
The four dimensions:

\[
\begin{aligned}
D_1 &= \operatorname{mean}\bigl(f_{\text{ARC-AGI-2}},\ f^{135}_{\text{ARC-AGI-1}}\bigr)\\
D_2 &= \operatorname{mean}\bigl(f_{\text{FrontierMath-T4}},\ f^{130}_{\text{AIME}}\bigr)\\
D_3 &= \operatorname{mean}\bigl(f_{\text{Terminal-Bench}},\ f^{128}_{\text{SWE-bench}},\ f^{140}_{\text{SciCode}}\bigr)\\
D_4 &= \operatorname{mean}\bigl(f_{\text{HLE}},\ f_{\text{CritPt}},\ f^{135}_{\text{GPQA}}\bigr)
\end{aligned}
\]

Superscripts denote compressed ceilings. \(f\) is piecewise-linear interpolation over each benchmark's anchor curve, evaluated at the model's raw score on that benchmark.
The composite IQ is the mean of all four dimension scores, requiring at least 2 scored dimensions:

\[
\mathrm{IQ} = \frac{1}{N} \sum_{i=1}^{N} D_i, \qquad N \ge 2,
\]

where \(N\) is the number of dimensions with data.
D1: Abstract Reasoning
Abstract reasoning is the ability to solve novel problems without relying on prior knowledge. This is the closest analogue to the "g factor" in human psychometrics — raw problem-solving ability applied to patterns never seen before.
ARC-AGI-2 Hard
Each puzzle requires identifying a novel visual transformation rule from examples and applying it to a new input. The puzzles are unique and cannot be memorized. This is the purest test of abstract reasoning in the benchmark set: no prior knowledge helps, only the ability to infer abstract rules from examples. The benchmark is far from saturation (top models score ~85%). The curve compresses above IQ 140 to reflect diminishing returns in the superhuman range.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 20 | 85 |
| 40 | 95 |
| 60 | 100 |
| 75 | 115 |
| 85 | 125 |
| 95 | 140 |
| 100 | 143 |
ARC-AGI-1 Compressed · ceil 135
Same format as ARC-AGI-2 but an easier problem set. Top models now score ~96%, so it no longer discriminates at the frontier. The anchor curve is compressed from a ceiling of 152 down to 135 to limit the influence of saturated scores.
| Score % | IQ |
|---|---|
| 0 | 78 |
| 15 | 92 |
| 30 | 102 |
| 50 | 111 |
| 70 | 119 |
| 85 | 127 |
| 95 | 132 |
| 100 | 135 |
D2: Mathematical Reasoning
Mathematical reasoning and quantitative problem-solving — the ability to work with mathematical structures, proofs, and analytical frameworks. The hard benchmarks test novel quantitative reasoning that cannot be memorized from training data.
FrontierMath T4 Hard
Extremely difficult original math problems from Tier 4 of the FrontierMath benchmark. Problems are novel and cannot be found in training data; reported top scores are ~25%. No T4 data is currently available in our dataset for any model, so this benchmark is included in the framework for future use. When data exists, it will be averaged with AIME for the D2 score. The curve compresses above IQ 140.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 100 |
| 15 | 120 |
| 30 | 135 |
| 50 | 142 |
| 70 | 148 |
| 100 | 155 |
AIME Compressed · ceil 130
Competition mathematics with integer answers. Old AIME problems are widely available in training data, with studies detecting 10–20 point contamination boosts. Top models score ~98%. The anchor curve is compressed from a ceiling of 146 to 130 to limit the influence of contamination-driven scores.
| Score % | IQ |
|---|---|
| 0 | 82 |
| 20 | 95 |
| 40 | 104 |
| 60 | 112 |
| 80 | 120 |
| 90 | 124 |
| 100 | 130 |
D3: Programmatic Reasoning
Practical engineering ability — the capacity to solve real-world technical problems in code and systems. The hard benchmark tests execution-based tasks that require genuine interaction with systems, while the compressed benchmarks cover real-world software engineering and scientific computing.
Terminal-Bench 2.0 Hard
Models execute shell commands in isolated Docker containers to complete practical system administration and development tasks. The interactive, execution-based format makes memorization ineffective. One of the highest-integrity benchmarks in the set. The curve compresses above IQ 140.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 10 | 100 |
| 25 | 115 |
| 40 | 125 |
| 55 | 135 |
| 65 | 140 |
| 80 | 145 |
| 100 | 150 |
SWE-bench Verified Compressed · ceil 128
Models generate patches to resolve real GitHub issues and pass unit tests. However, 94% of issues predate model training cutoffs and ~30% have solution leakage. This makes it one of the most gameable benchmarks, resulting in the most aggressive compression — from a ceiling of 144 down to 128.
| Score % | IQ |
|---|---|
| 0 | 80 |
| 15 | 92 |
| 30 | 102 |
| 50 | 110 |
| 65 | 117 |
| 80 | 123 |
| 100 | 128 |
SciCode Compressed · ceil 140
Scientific computing tasks requiring domain expertise in physics, chemistry, and biology alongside programming skill. The bottleneck is understanding the science, not the programming. The interdisciplinary nature provides partial protection against memorization from academic literature. Compressed from ceiling 158 to 140 to account for moderate gameability.
| Score % | IQ |
|---|---|
| 10 | 78 |
| 20 | 88 |
| 30 | 100 |
| 40 | 108 |
| 50 | 117 |
| 60 | 125 |
| 80 | 135 |
| 100 | 140 |
D4: Academic Reasoning
Breadth and depth of expert-level knowledge across academic domains. The hard benchmarks test whether a model can answer questions that push the boundaries of human expertise itself, while the compressed benchmark tests graduate-level science knowledge.
Humanity's Last Exam Hard
Questions contributed by domain experts and explicitly screened to ensure no existing model can answer them at creation time. The benchmark spans the full frontier of human expertise. Current top score is ~48%. The curve compresses significantly above IQ 140 — even though 100% would represent superhuman breadth of knowledge, the IQ ceiling is kept at 158 so that no single benchmark can inflate the composite above ~155.
| Score % | IQ |
|---|---|
| 0 | 70 |
| 5 | 95 |
| 10 | 110 |
| 15 | 120 |
| 20 | 130 |
| 25 | 140 |
| 35 | 145 |
| 50 | 150 |
| 75 | 155 |
| 100 | 158 |
CritPt Hard
Novel mathematical analysis problems that require identifying critical points and applying analytical reasoning. Problems are original, making memorization ineffective. Scores are on a 0–20 scale rather than a percentage; the current top score is ~13/20. The curve compresses above IQ 140.
| Score (0–20) | IQ |
|---|---|
| 0 | 70 |
| 0.6 | 120 |
| 1.6 | 130 |
| 3 | 135 |
| 5 | 140 |
| 8 | 145 |
| 12 | 150 |
| 20 | 155 |
GPQA Diamond Compressed · ceil 135
Graduate-level science questions written by PhD experts. A 25% score equals random guessing. Domain experts score 65–81%. The public question set is widely available in training data, making contamination a significant concern. The anchor curve is compressed from ceiling 148 to 135.
| Score % | IQ |
|---|---|
| 25 | 85 |
| 35 | 98 |
| 50 | 107 |
| 65 | 115 |
| 80 | 123 |
| 90 | 131 |
| 100 | 135 |
Piecewise-Linear Interpolation
Each benchmark defines a set of anchor points mapping raw scores to IQ values. For scores that fall between two anchors, we use piecewise-linear interpolation (see the formula in the Formulas section above).
If the score is at or below the lowest anchor, the model receives that anchor's IQ. If at or above the highest, it receives the ceiling IQ. There is no extrapolation beyond the defined range.
This approach avoids assumptions about the distribution shape between anchors. Each segment can have a different slope, allowing the curve to be steeper where small score improvements represent large cognitive leaps (e.g., going from 0% to 5% on HLE) and flatter where additional points reflect diminishing differentiation.
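To make the rule concrete, here is a minimal Python sketch of the clamped interpolation described above. The function name `iq_from_score` is illustrative (not the production API), and the usage example reuses the ARC-AGI-2 anchor table from D1.

```python
from bisect import bisect_right

def iq_from_score(score: float, anchors: list[tuple[float, float]]) -> float:
    """Map a raw benchmark score to IQ via piecewise-linear interpolation.

    `anchors` is an ascending list of (raw_score, iq) points. Scores below
    the first anchor clamp to its IQ; scores above the last clamp to the
    ceiling IQ. No extrapolation beyond the defined range.
    """
    scores = [s for s, _ in anchors]
    if score <= scores[0]:
        return anchors[0][1]
    if score >= scores[-1]:
        return anchors[-1][1]
    # Index of the segment [s_i, s_{i+1}] containing the score.
    i = bisect_right(scores, score) - 1
    (s0, a0), (s1, a1) = anchors[i], anchors[i + 1]
    return a0 + (score - s0) / (s1 - s0) * (a1 - a0)

# Usage with the ARC-AGI-2 anchor table from D1:
ARC_AGI_2 = [(0, 70), (20, 85), (40, 95), (60, 100),
             (75, 115), (85, 125), (95, 140), (100, 143)]
print(iq_from_score(80, ARC_AGI_2))  # halfway between (75, 115) and (85, 125): 120.0
```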
Benchmark Averaging & Compression
Each dimension averages all its benchmarks together, with missing benchmarks conservatively imputed. Rather than separating benchmarks into primary/fallback tiers with a hard cap, we use compressed anchor curves to limit the influence of easier or gameable benchmarks.
How Compression Works
For compressed benchmarks, the anchor curve is rescaled so that IQ values above 100 are proportionally reduced toward a lower ceiling:

\[
a' = \begin{cases} a, & a \le 100 \\[4pt] 100 + (a - 100)\,\dfrac{c_{\text{new}} - 100}{c_{\text{orig}} - 100}, & a > 100 \end{cases}
\]

where \(c_{\text{orig}}\) is the benchmark's original ceiling and \(c_{\text{new}}\) its compressed ceiling. Values at or below IQ 100 are unchanged. This preserves the low end of the curve (where models genuinely struggle) while compressing the high end, where gameable benchmarks over-reward.
Why compress instead of cap? A hard cap (e.g., IQ 115) discards all discrimination above the cap — a model scoring 80% and one scoring 100% on AIME would both receive 115. Compression preserves the rank ordering while reducing the magnitude of the advantage that gameable benchmarks can confer. A perfect AIME score now yields IQ 130 instead of 146, which still contributes meaningfully but cannot dominate the dimension average.
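As a sketch of the rescaling rule above (the function name is illustrative), with the AIME ceilings taken from the text:

```python
def compress_anchor(iq: float, orig_ceiling: float, new_ceiling: float) -> float:
    """Rescale an anchor IQ above 100 proportionally toward a lower ceiling.

    Values at or below 100 pass through unchanged, preserving the low end
    of the curve while shrinking the high end.
    """
    if iq <= 100:
        return iq
    return 100 + (iq - 100) * (new_ceiling - 100) / (orig_ceiling - 100)

# AIME: original ceiling 146 compressed to 130.
print(compress_anchor(146, 146, 130))  # 130.0, a perfect score now maps to 130
print(compress_anchor(120, 146, 130))  # ~113.0, rank ordering preserved
```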
Benchmark-Level Imputation
When a model is missing benchmark scores, the missing values are filled in before dimension IQs are computed. A symmetric 3-tier imputation system is applied to all 10 benchmarks across all 4 dimensions. For each dimension, the imputation uses only real data from the other 3 dimensions as the predictor (leave-one-dimension-out), preventing circular dependencies.
- Tier 1 — Family match: If a weaker family member (same model family, leave-out IQ at least 3 points lower, benchmark distance ≤ 15) has real data for this benchmark, copy its score. The IQ margin ensures we only impute downward — a model never inherits a score from a stronger sibling.
- Tier 2 — Grouping regression: If the model’s grouping (e.g., China, OpenAI, Anthropic) has a positive-slope linear regression for this benchmark, and the model’s leave-out IQ falls within the grouping’s training range, predict from the within-grouping regression. The prediction is capped at the global regression to moderate outliers.
- Tier 3 — Conservative fallback: Use min(median score, global regression prediction), clamped to [0, 100]. This ensures models without strong cross-dimensional evidence cannot score above the median through imputation.
Why leave-one-dimension-out? To impute a missing benchmark in dimension Di, we compute each model’s “leave-out IQ” from only the other 3 dimensions’ real data. This prevents imputed values from leaking into the predictor axis — all regressions and family comparisons use only original measurements. Every dimension is treated identically; there is no special ordering or phased imputation.
Why impute downward only (Tier 1)? Models from the same family can have very different capabilities. The −3 IQ margin ensures we only copy scores from a demonstrably weaker relative. For example, gpt-5-mini’s ARC-AGI scores can be used for gpt-oss-120b (since gpt-5-mini has lower leave-out IQ), but o3’s scores cannot — o3 may be substantially better at ARC despite similar overall IQ.
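The following is a hedged sketch of the three-tier decision flow for a single missing benchmark. The `Regression` class, function signature, and field names are all hypothetical scaffolding, not the production schema; the thresholds (the −3 IQ margin, benchmark distance ≤ 15, positive slope, training-range check, global-regression cap, and median clamp) come from the tier descriptions above.

```python
from dataclasses import dataclass

@dataclass
class Regression:
    """Linear fit predicting a benchmark score from leave-out IQ (illustrative)."""
    slope: float
    intercept: float
    iq_min: float = 0.0    # training range of the fit (used for grouping fits)
    iq_max: float = 200.0

    def predict(self, iq: float) -> float:
        return self.slope * iq + self.intercept

def impute_benchmark(leave_out_iq: float,
                     family: list[tuple[float, float, float]],
                     grouping_reg: Regression | None,
                     global_reg: Regression,
                     median_score: float) -> float:
    """Sketch of the 3-tier imputation for one missing benchmark score.

    `family` holds (sibling leave-out IQ, benchmark distance, real score)
    triples for same-family models that have real data for this benchmark.
    """
    # Tier 1: copy from a demonstrably weaker family member (downward only:
    # leave-out IQ at least 3 points lower, benchmark distance <= 15).
    for sib_iq, distance, score in family:
        if sib_iq <= leave_out_iq - 3 and abs(distance) <= 15:
            return score

    # Tier 2: within-grouping regression with positive slope, applied only
    # inside the grouping's training range, capped at the global regression.
    if (grouping_reg is not None and grouping_reg.slope > 0
            and grouping_reg.iq_min <= leave_out_iq <= grouping_reg.iq_max):
        return min(grouping_reg.predict(leave_out_iq),
                   global_reg.predict(leave_out_iq))

    # Tier 3: conservative fallback, never above the median, clamped to [0, 100].
    fallback = min(median_score, global_reg.predict(leave_out_iq))
    return max(0.0, min(100.0, fallback))
```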
Composite IQ Calculation
After benchmark-level imputation fills in all missing scores, each dimension has a complete set of benchmarks. The composite IQ is always computed over all 4 dimensions.
Step 1: Score All Dimensions
For each dimension, the dimension IQ is computed by averaging all its benchmarks (hard + compressed). Because the 3-tier imputation has already filled in missing benchmarks, every model has scores for all 10 benchmarks and therefore all 4 dimensions.
Step 2: Safety-Net Dimension Imputation
In the rare case that a model has no real or imputed data for an entire dimension (D2–D4), a dimension-level fallback applies. This safety net rarely triggers, since the benchmark-level 3-tier system fills in missing scores first. D1 is never imputed at the dimension level; if a model has no D1 data even after benchmark imputation, the composite uses the remaining dimensions.
Step 3: Compute the Composite
\[
\mathrm{IQ} = \frac{1}{N} \sum_{i=1}^{N} D_i
\]

where \(N\) is the number of dimensions with data. With the 3-tier imputation, most models have \(N=4\).
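A minimal sketch of the composite rule (names and example values are illustrative); dimension IQs are themselves plain means of their benchmarks' interpolated IQs:

```python
from statistics import mean

def composite_iq(dimension_iqs: dict[str, float | None]) -> float | None:
    """Mean of scored dimensions; requires at least 2 of the 4 (D1-D4).

    Returns None when a derived composite is not possible, in which case
    the site falls back to a manual IQ estimate.
    """
    scored = [iq for iq in dimension_iqs.values() if iq is not None]
    if len(scored) < 2:
        return None
    return mean(scored)

# Illustrative dimension IQs, averaged with equal weight:
print(composite_iq({"D1": 128, "D2": 124, "D3": 131, "D4": 127}))  # 127.5
```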
Key rules:
- Minimum 2 dimensions required. Models with fewer than 2 scored dimensions do not receive a derived composite IQ and instead display a manual estimate.
- All benchmarks pre-filled. The 3-tier imputation ensures every model has all 10 benchmark scores before dimension IQs are computed.
- Transparent count. The display shows `X/4` so readers can see how many dimensions were actually scored vs. imputed.
- Equal weighting. All dimensions contribute equally. Compressed ceilings (not differential weighting) handle benchmark quality differences.
Imputation Examples
The following table shows selected models with imputed benchmarks, illustrating how the 3 tiers work across all dimensions:
| Model | IQ | Imputed | Tier Breakdown |
|---|---|---|---|
| gpt-5.3-codex | 129 | 5/10 | arcAgi2, arcAgi1, fmT4Acc, swebench from gpt-5.2-pro; aime from gpt-5.2 (Family) |
| gemini-3-deep-think | 129 | 5/10 | fmT4Acc, critPt, terminalbench, swebench, sciCode from gemini-3-flash (Family) |
| opus-4.6-nonreasoning | 118 | 5/10 | arcAgi2, arcAgi1, aime, terminalbench, swebench from sonnet-4.5 (Family) |
| gpt-oss-120b | 107 | 3/10 | arcAgi2, arcAgi1 from gpt-5-mini; fmT4Acc from gpt-5-nano (Family) |
| glm-4.7 | 112 | 2/10 | arcAgi2, arcAgi1 (Conservative — no weaker family match) |
| ernie-5.0-thinking-preview | 110 | 5/10 | arcAgi2, arcAgi1 (China regression); fmT4Acc, terminalbench, swebench (Conservative) |
| kimi-k2.5 | 117 | 1/10 | aime (China regression) |
| deepseek-r1 | 105 | 3/10 | fmT4Acc, terminalbench, swebench (Conservative) |
Rank Status
Each model receives a rank status reflecting the completeness of its evaluation:
- Full — All 4 dimensions scored. The most reliable composite.
- Partial — 2–3 dimensions scored. Composite is derived but based on incomplete coverage.
- Provisional — Only 1 dimension scored. Not enough for a derived composite; falls back to manual IQ.
- Unranked — No dimension data available. Uses manual IQ estimate only.
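These four rules translate directly into a small helper; a sketch (the function name is illustrative):

```python
def rank_status(n_scored: int) -> str:
    """Map the number of scored dimensions (0-4) to a rank status."""
    if n_scored == 4:
        return "Full"
    if n_scored >= 2:
        return "Partial"
    if n_scored == 1:
        return "Provisional"
    return "Unranked"
```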
Benchmarks Not Included
Three benchmarks that were part of the previous (v1) flat-averaging system have been removed from the composite IQ calculation:
- LiveCodeBench — While it has very low gameability due to continuously refreshed problems, it overlaps heavily with the Programmatic Reasoning dimension already covered by Terminal-Bench. Its removal avoids double-counting coding ability.
- MMLU-Pro — A 10-choice multiple-choice knowledge test. Overlaps with the Academic Reasoning dimension (GPQA/HLE) and adds limited discrimination at the frontier. Models have converged to similar high scores.
- MMMU-Pro — Multimodal academic questions. While the vision component is interesting, most frontier model evaluation focuses on text-based reasoning. This benchmark is tracked in the data but excluded from the IQ composite.
These benchmarks remain in the database and are viewable on the data page — they are simply not included in the composite IQ computation.
EQ Scoring
AI IQ also estimates an Emotional Quotient (EQ) for each model, measuring social and emotional intelligence across 11 sub-dimensions.
Each sub-dimension is scored on a 0–10 scale and mapped to an EQ value using shared anchor points:
| Raw (0–10) | EQ |
|---|---|
| 0 | 55 |
| 3 | 70 |
| 5 | 85 |
| 6 | 95 |
| 7 | 105 |
| 8 | 115 |
| 9 | 130 |
| 10 | 145 |
EQ-Bench Elo (Preferred Source)
When available, we use a model's EQ-Bench 3 Elo rating as the preferred EQ source. EQ-Bench is a dedicated emotional intelligence benchmark that produces Elo ratings reflecting relative emotional understanding:
| EQ-Bench Elo | EQ |
|---|---|
| 200 | 55 |
| 600 | 70 |
| 900 | 85 |
| 1100 | 95 |
| 1300 | 105 |
| 1500 | 115 |
| 1700 | 130 |
| 2000 | 145 |
When EQ-Bench Elo is not available, the composite EQ is computed as the mean of the 11 sub-dimension EQ scores (minimum 2 required).
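A sketch of the EQ source preference (helper and function names are illustrative; the anchor tables are transcribed from the two tables above):

```python
from bisect import bisect_right

def map_anchors(x: float, anchors: list[tuple[float, float]]) -> float:
    """Same clamped piecewise-linear rule as the IQ interpolation sketch."""
    xs = [s for s, _ in anchors]
    if x <= xs[0]:
        return anchors[0][1]
    if x >= xs[-1]:
        return anchors[-1][1]
    i = bisect_right(xs, x) - 1
    (x0, y0), (x1, y1) = anchors[i], anchors[i + 1]
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)

# Anchor tables from the EQ section above.
ELO_ANCHORS = [(200, 55), (600, 70), (900, 85), (1100, 95),
               (1300, 105), (1500, 115), (1700, 130), (2000, 145)]
SUB_ANCHORS = [(0, 55), (3, 70), (5, 85), (6, 95),
               (7, 105), (8, 115), (9, 130), (10, 145)]

def composite_eq(elo: float | None, sub_scores: list[float]) -> float | None:
    """Prefer EQ-Bench 3 Elo; else mean of sub-dimension EQs (minimum 2)."""
    if elo is not None:
        return map_anchors(elo, ELO_ANCHORS)
    if len(sub_scores) < 2:
        return None
    return sum(map_anchors(s, SUB_ANCHORS) for s in sub_scores) / len(sub_scores)

print(composite_eq(1400, []))  # Elo path: 110.0
```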
Cost & Speed Metrics
Query Assumptions
All cost calculations assume a standard query of 1,000 input tokens and 2,000 output tokens, representing a typical conversational exchange.
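Under these assumptions, per-query cost follows directly from per-token prices. A sketch with hypothetical pricing (the $3 and $15 per-million figures are illustrative, not any model's actual rates):

```python
# Standard query: 1,000 input + 2,000 output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 2_000

def cost_per_1k_queries(input_price_per_m: float,
                        output_price_per_m: float) -> float:
    """Cost in dollars for 1,000 standard queries, given $/1M-token prices."""
    per_query = (INPUT_TOKENS * input_price_per_m
                 + OUTPUT_TOKENS * output_price_per_m) / 1_000_000
    return per_query * 1_000

# Hypothetical pricing: $3 / M input tokens, $15 / M output tokens.
print(cost_per_1k_queries(3.0, 15.0))  # 33.0 dollars per 1,000 queries
```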
Charts display cost per 1,000 queries on a logarithmic scale to handle the wide price range between models. Response time uses a log scale as well. Both axes are reversed so that the upper-right corner of every chart represents the best outcome: high intelligence at low cost and fast speed.
Limitations & Transparency
- Dimension coverage varies. Some models have data for all 4 dimensions; others have as few as 2 (with the rest imputed). A model's composite IQ is most reliable when all dimensions are scored. Always check the `X/4` count and rank status.
- Benchmark mix matters. Two models with the same composite IQ may have very different underlying data quality. One might have all hard benchmarks (ungameable tests with full curves) while another relies mostly on compressed benchmarks (with lower ceilings). The rank status and dimension count help distinguish these cases.
- Imputation is conservative, not clairvoyant. Missing benchmarks are filled using a 3-tier system (family match, grouping regression, or conservative fallback). These are reasonable estimates, not ground truth — a model's true ability on an unevaluated benchmark could be significantly higher or lower.
- Anchor calibration is subjective. The mapping from raw scores to IQ involves judgment calls about what different performance levels mean relative to human cognitive ability. We document our rationale for each benchmark, but reasonable people can disagree.
- IQ is a metaphor. Human IQ tests measure a specific construct via standardized instruments under controlled conditions. AI benchmark performance is a different thing. The IQ scale provides an intuitive frame of reference, not a claim of equivalence.
- Compressed ceilings are a design choice. The ceiling values directly affect which models benefit and which are penalized. Models that excel on compressed benchmarks will have their contributions limited, which may feel unfair if those benchmarks genuinely reflect high ability. We believe the trade-off — rewarding harder evaluation — is correct, but the specific ceiling values are judgment calls.
- Benchmarks become stale. As models improve and training data evolves, benchmark ceilings, gameability ratings, and compression levels may need revision. This methodology is a living document.
Asymptotic Compression Above IQ 140
The anchor point curves intentionally compress above IQ 140. Each additional percentage point on a benchmark contributes less to the IQ score in the superhuman range than in the human range. This reflects three realities:
- Human IQ distributions compress at the tails. Far more people fall between IQ 100 and IQ 120 than between IQ 140 and IQ 160, so equal raw-score gaps correspond to ever-thinner slices of the population at the high end.
- Superhuman benchmark scores are driven by breadth, not depth. A model scoring 50% on FrontierMath T4 isn't twice as smart as one scoring 25% — it covers more mathematical branches rather than being fundamentally more capable in any single branch.
- Practical discrimination. Without compression, reasoning vs. non-reasoning configurations of the same model produce 20+ point IQ gaps, which is unrealistic. With compression, the gap narrows to ~10–12 points (“smart” vs. “very smart” rather than “above average” vs. “genius”).
The compression ensures that no benchmark can single-handedly produce IQ values above ~155, regardless of raw score. The theoretical ceiling of the composite is approximately 150–155 under current benchmarks.