The AI Intelligence Leaderboard
Estimating the intelligence of every major AI model
How AI IQ estimates model intelligence
- We archive source captures from public benchmark leaderboards and extract only source-backed values
- We map each benchmark score to an implied IQ using calibrated difficulty curves
- We group scored benchmarks into seven dimensions: abstract, mathematical, scientific, app building, production engineering, computer use, and reliability
- We conservatively fill missing benchmark and dimension estimates only inside the scoring pipeline
- Every derived IQ averages all seven scored dimensions, so missing coverage cannot make a model look better by omission
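The averaging step above can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the fill rule (lowest observed dimension minus a fixed `penalty`) and the names `DIMENSIONS`, `composite_iq`, and `penalty` are assumptions made for the example, since the page only says that missing values are filled conservatively.

```python
from statistics import mean
from typing import Mapping, Optional

# The seven scored dimensions named on the page (identifiers are illustrative).
DIMENSIONS = (
    "abstract", "mathematical", "scientific", "app_building",
    "production_engineering", "computer_use", "reliability",
)

def composite_iq(scores: Mapping[str, Optional[float]], penalty: float = 5.0) -> float:
    """Average all seven dimensions, filling gaps with a conservative value.

    ASSUMPTION: a missing dimension is filled with the lowest observed
    dimension minus `penalty`. The real fill rule is not documented here.
    """
    observed = [scores[d] for d in DIMENSIONS if scores.get(d) is not None]
    if not observed:
        raise ValueError("at least one scored dimension is required")
    fill = min(observed) - penalty
    filled = [scores[d] if scores.get(d) is not None else fill for d in DIMENSIONS]
    # Always divide by seven, so sparse coverage cannot raise the average.
    return mean(filled)
```

Under this assumed rule, a model scored on only two dimensions always lands below the plain average of those two scores, which is the "no benefit from omission" property described above.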
Effective cost & iso-curves
Effective cost, plotted on the X-axis, is the sticker price for 1M I/O tokens × a token usage multiplier. "1M I/O tokens" means 1M input tokens plus 1M output tokens, priced at the model's published rates.
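As a worked example of that formula, here is a small sketch. The function and parameter names are hypothetical, and the prices in the usage note are made-up numbers, not any real model's rates.

```python
def effective_cost(input_price_per_m: float,
                   output_price_per_m: float,
                   usage_multiplier: float) -> float:
    """Sticker price for 1M input + 1M output tokens, scaled by token usage.

    Prices are in currency units per 1M tokens. `usage_multiplier` stands in
    for the page's "token usage multiplier"; how it is measured is not
    specified here.
    """
    if min(input_price_per_m, output_price_per_m, usage_multiplier) < 0:
        raise ValueError("prices and multiplier must be non-negative")
    sticker_price = input_price_per_m + output_price_per_m
    return sticker_price * usage_multiplier
```

For instance, a hypothetical model priced at 3 per 1M input tokens and 15 per 1M output tokens, with a multiplier of 2, would have an effective cost of (3 + 15) × 2 = 36.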
Iso-curves connect points of equal preference between IQ and cost. The slider sets the weighting of quality against cost: the center is 1:1; drag toward Cost to make cost matter more, or toward IQ to make quality matter more. Models above and to the right of a curve are strictly better than any point on that curve at the chosen weighting.
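One way to picture an iso-curve is as the set of (cost, IQ) points sharing the same preference score. The sketch below is an assumption-heavy illustration: the page does not publish its preference function, so the log-cost form, the `points_per_decade` exchange rate, and the mapping from slider position to weight are all hypothetical.

```python
import math

def _cost_weight(slider: float) -> float:
    # ASSUMPTION: slider runs over (0, 1); 0.5 is the 1:1 center, values
    # above 0.5 lean toward Cost, values below 0.5 lean toward IQ.
    if not 0.0 < slider < 1.0:
        raise ValueError("slider must be strictly between 0 and 1")
    return slider / (1.0 - slider)

def preference_score(iq: float, cost: float, slider: float = 0.5,
                     points_per_decade: float = 10.0) -> float:
    """Hypothetical score: IQ minus a penalty per 10x increase in cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return iq - _cost_weight(slider) * points_per_decade * math.log10(cost)

def iso_curve_iq(score: float, cost: float, slider: float = 0.5,
                 points_per_decade: float = 10.0) -> float:
    """IQ a model needs at `cost` to sit on the iso-curve for `score`."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return score + _cost_weight(slider) * points_per_decade * math.log10(cost)
```

Sweeping `cost` across the axis range and plotting `iso_curve_iq` gives one curve per score level. Moving the slider toward Cost steepens the curves, so a pricier model needs a larger IQ lead to stay on the same curve.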
Tracking frontier progress
Each dot is a model with a known release date and a derived IQ estimate. Models are positioned left-to-right by release date, so the chart shows how the frontier changes over time rather than just where models rank today.
Provider-colored lines connect each lab's flagship frontier checkpoints. Codex, mini, nano, flash, coder, and smaller open-weight variants are omitted so the chart tracks each lab's main offering rather than every SKU.
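The variant filter described above could look roughly like this. It is a sketch under assumptions: it matches only the five name markers listed in the text, the model names in the usage note are invented, and "smaller open-weight variants" cannot be detected from a name alone, so they would need a separate, manually curated list.

```python
import re

# Name markers for the non-flagship variants the chart omits.
VARIANT_MARKERS = frozenset({"codex", "mini", "nano", "flash", "coder"})

def is_flagship(model_name: str) -> bool:
    """Return True if no variant marker appears as a whole token in the name.

    Matching whole tokens (split on non-alphanumerics) avoids false hits such
    as "mini" inside a longer word.
    """
    tokens = re.split(r"[^a-z0-9]+", model_name.lower())
    return VARIANT_MARKERS.isdisjoint(tokens)
```

With the hypothetical names `"example-5"`, `"example-5-mini"`, and `"Example 5 Flash"`, only the first passes the filter. A name like `"gemini-like-3"` also passes, because `mini` is not a standalone token there.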
This view is most useful for spotting whether a new release is actually ahead of its direct predecessor, or whether source coverage and conservative imputations are shaping the comparison.
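The predecessor comparison can be made concrete with a small sketch. The `Release` record and the sample data in the usage note are hypothetical. The function only reports the raw IQ difference and does not try to separate real gains from coverage or imputation effects.

```python
from datetime import date
from typing import NamedTuple, Optional

class Release(NamedTuple):
    provider: str
    name: str
    released: date
    iq: float

def gains_over_predecessor(releases: list[Release]) -> list[tuple[str, Optional[float]]]:
    """For each release, in date order, the IQ change versus the same
    provider's previous release (None for a provider's first release)."""
    latest: dict[str, Release] = {}
    gains: list[tuple[str, Optional[float]]] = []
    for rel in sorted(releases, key=lambda r: r.released):
        prev = latest.get(rel.provider)
        gains.append((rel.name, None if prev is None else rel.iq - prev.iq))
        latest[rel.provider] = rel
    return gains
```

For example, if a hypothetical lab ships `v1` at IQ 100 and later `v2` at IQ 108, the entry for `v2` is a gain of 8 points. A negative value would flag a release that scored below its direct predecessor.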
What it measures
- Mathematical: Multi-step quantitative reasoning, from competition problems to research-level proofs.
- Scientific: Graduate-level reasoning across the natural sciences and applying scientific knowledge to hard problems.
- App building: Turning product and design prompts into usable apps, front-end experiences, and full-stack prototypes.
- Production engineering: Coding fluency, repository repair, debugging, testing, and long-horizon engineering execution.
- Computer use: Agentic operation of real tools and environments: terminals, browsers, and desktop apps.
- Reliability: Following instructions precisely and knowing the limits of its own knowledge instead of guessing.
- Emotional and interpersonal: A diagnostic view of emotional and interpersonal behavior. This is excluded from Composite IQ until the benchmark base becomes more rigorous.
Three tradeoffs at once
Most charts pit two qualities against each other. This view holds all three of the practical tradeoffs in one space: how smart a model is, how fast it answers, and what it costs to run.
IQ rises on the vertical axis, faster models sit to the right, and effective cost runs back into the depth axis on a log scale. The ideal model lives up, right, and toward the front — high intelligence, quick responses, and low cost. Drag to rotate and find where each provider clusters.
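One hedged way to reason about "up, right, and toward the front" is Pareto dominance across the three axes. The sketch below is illustrative only: the `Model` record, the speed unit, and the sample values in the usage note are assumptions, and the chart itself may not compute anything like this.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    iq: float      # higher is better
    speed: float   # higher is better (unit assumed, e.g. tokens per second)
    cost: float    # effective cost, lower is better

def dominates(a: Model, b: Model) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly
    better on at least one."""
    no_worse = a.iq >= b.iq and a.speed >= b.speed and a.cost <= b.cost
    strictly_better = a.iq > b.iq or a.speed > b.speed or a.cost < b.cost
    return no_worse and strictly_better

def pareto_front(models: list[Model]) -> list[Model]:
    """Models that no other model dominates, in their original order."""
    return [m for m in models if not any(dominates(o, m) for o in models)]
```

With three hypothetical models, `a` (IQ 130, speed 50, cost 20), `b` (IQ 120, speed 40, cost 30), and `c` (IQ 110, speed 90, cost 5), `b` drops out because `a` beats it on all three axes. `a` and `c` both stay, since each wins on a different tradeoff.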