AI IQ | The AI Intelligence Leaderboard & Benchmark Charts

The AI Intelligence Leaderboard

Estimating the intelligence of every major AI model

AI Models by IQ

Each model's estimated IQ plotted on a standard normal IQ distribution

How AI IQ estimates model intelligence

We archive source captures from public benchmark leaderboards and extract only source-backed values
We map each benchmark score to an implied IQ using calibrated difficulty curves
We group scored benchmarks into seven dimensions: abstract, mathematical, scientific, app building, production engineering, computer use, and reliability
We conservatively fill missing benchmark and dimension estimates only inside the scoring pipeline
Every derived IQ averages all seven scored dimensions, so missing coverage cannot make a model look better by omission
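The steps above can be sketched roughly as follows. This is a minimal illustration, not the site's actual pipeline: the logistic shape of the difficulty curve, its parameters (midpoint_iq, slope), the rule of filling a missing dimension with the lowest observed dimension, and all function names are assumptions introduced here.

```python
import math
from statistics import mean

# The seven scored dimensions named in the methodology.
DIMENSIONS = [
    "abstract", "mathematical", "scientific", "app_building",
    "production_engineering", "computer_use", "reliability",
]

def implied_iq(score, midpoint_iq, slope):
    """Invert a hypothetical logistic difficulty curve.

    Assumes the expected benchmark score (0..1) follows
    score = 1 / (1 + exp(-(iq - midpoint_iq) / slope)).
    """
    # Clamp so that 0% or 100% scores do not map to infinite IQ.
    s = min(max(score, 0.01), 0.99)
    return midpoint_iq + slope * math.log(s / (1 - s))

def composite_iq(dimension_iqs):
    """Average all seven dimensions, filling gaps conservatively."""
    observed = [dimension_iqs[d] for d in DIMENSIONS if d in dimension_iqs]
    if not observed:
        raise ValueError("no scored dimensions")
    # Assumed conservative fill: a missing dimension gets the lowest observed value.
    fill = min(observed)
    return mean(dimension_iqs.get(d, fill) for d in DIMENSIONS)
```

Under these assumptions, a model scored only on abstract (130) and mathematical (110) gets a composite of about 112.9 rather than the 120 it would get if the five missing dimensions were simply ignored.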

IQ vs Effective Cost

Each model's estimated IQ plotted against effective cost per 1M I/O Tokens (sticker price × blended usage multiplier).

Effective cost & iso-curves

Effective cost on the X-axis is the sticker price for 1M I/O Tokens multiplied by the model's blended token usage multiplier. 1M I/O Tokens means 1M input tokens plus 1M output tokens, each priced at the model's published rate.
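As a small worked example of this definition, the function below multiplies the combined sticker price by the usage multiplier. The function and parameter names are illustrative, and the prices in the usage note are hypothetical.

```python
def effective_cost(input_price_per_m, output_price_per_m, usage_multiplier):
    """Effective cost of 1M I/O Tokens, in the same currency as the prices."""
    # Sticker price: 1M input tokens plus 1M output tokens at published rates.
    sticker = input_price_per_m + output_price_per_m
    return sticker * usage_multiplier
```

With hypothetical rates of 3.0 per 1M input tokens and 15.0 per 1M output tokens and a multiplier of 1.5, effective_cost(3.0, 15.0, 1.5) returns 27.0.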

Iso-curves trace lines of equal preference between IQ and cost: every point on one curve is an equally good trade at the current weighting. The slider sets that weighting: center is 1:1, drag toward Cost to make cost matter more, or toward IQ to make quality matter more. Models above and to the right of a curve are preferred, at that weighting, over any model on the curve.
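One way to picture an iso-curve is as a level set of a preference score that rises with IQ and falls with cost. The sketch below is purely illustrative: the page does not publish its formula, so the linear-in-IQ, log-in-cost form, the points_per_decade constant, and the function names are all assumptions.

```python
import math

def preference(iq, cost, iq_weight=1.0, cost_weight=1.0, points_per_decade=15.0):
    """Hypothetical preference score: higher is better.

    Assumes a 10x increase in cost is worth `points_per_decade` IQ points
    when both weights are equal (the slider's 1:1 position).
    """
    return iq_weight * iq - cost_weight * points_per_decade * math.log10(cost)

def iso_curve_iq(level, cost, iq_weight=1.0, cost_weight=1.0, points_per_decade=15.0):
    """IQ a model needs at `cost` to sit on the iso-curve with score `level`."""
    return (level + cost_weight * points_per_decade * math.log10(cost)) / iq_weight
```

Under these assumptions, a model with IQ 120 at an effective cost of 10 scores 105, and a model costing 100 would need an IQ of 135 to sit on the same curve. Raising cost_weight steepens the curve, which mirrors dragging the slider toward Cost.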

Frontier IQ Over Time

X = release date. Y = estimated IQ. Provider step-lines connect each provider's flagship frontier checkpoints over time.

Tracking frontier progress

Each dot is a model with a known release date and a derived IQ estimate. Models are positioned left-to-right by release date, so the chart shows how the frontier changes over time rather than just where models rank today.

Provider-colored lines connect each lab's flagship frontier checkpoints. Codex, mini, nano, flash, coder, and smaller open-weight variants are omitted so the chart tracks each lab's main offering rather than every SKU.

This view is most useful for spotting whether a new release is actually ahead of its direct predecessor, or whether source coverage and conservative imputations are shaping the comparison.
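A provider's step-line, as described above, could be assembled along these lines. This is a sketch under assumptions: the name-based filter only covers the variant labels listed above (it cannot detect smaller open-weight variants), the tuple layout is hypothetical, and the site's own selection of flagship checkpoints may differ.

```python
import re

# Variant labels that are omitted from the provider lines.
EXCLUDED_MARKERS = {"codex", "mini", "nano", "flash", "coder"}

def is_flagship(name):
    """True if no excluded variant label appears as a whole token in the name."""
    # Token matching avoids false hits on names that merely contain "mini".
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    return not EXCLUDED_MARKERS.intersection(tokens)

def flagship_line(models):
    """models: iterable of (name, release_date, iq) for one provider.

    Returns date-ordered (release_date, name, iq, delta) points, where delta
    is the IQ change versus the direct flagship predecessor (None for the first).
    """
    flagships = sorted((m for m in models if is_flagship(m[0])), key=lambda m: m[1])
    line, prev_iq = [], None
    for name, released, iq in flagships:
        delta = None if prev_iq is None else iq - prev_iq
        line.append((released, name, iq, delta))
        prev_iq = iq
    return line
```

A negative or near-zero delta is the signal to check whether source coverage or conservative imputation, rather than the model itself, is driving the comparison.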

Mathematical Reasoning IQ

Each model's Mathematical Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Multi-step quantitative reasoning, from competition problems to research-level proofs.

Benchmarks: FrontierMath Tier 4, FrontierMath Tier 1-3, AIME, ProofBench, MathArena

Scientific Reasoning IQ

Each model's Scientific Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Graduate-level reasoning across the natural sciences and applying scientific knowledge to hard problems.

Benchmarks: Humanity's Last Exam, CritPt, SciCode, GPQA Diamond

Abstract Reasoning IQ

Each model's Abstract Reasoning IQ plotted on a standard normal IQ distribution

What it measures

Fluid problem-solving on novel puzzles a model cannot have memorized — abstracting patterns from just a few examples.

Benchmarks: ARC-AGI-2, ARC-AGI-1, ARC-AGI-3

App Building IQ

Each model's App Building IQ plotted on a standard normal IQ distribution

What it measures

Turning product and design prompts into usable apps, front-end experiences, and full-stack prototypes.

Benchmarks: Arena.ai WebDev, DesignArena Frontend, DesignArena Full Stack, Vibe Code Bench

Production Engineering IQ

Each model's Production Engineering IQ plotted on a standard normal IQ distribution

What it measures

Coding fluency, repository repair, debugging, testing, and long-horizon engineering execution.

Benchmarks: SWE Marathon, FrontierCode Diamond, SWE-Bench Verified, SWE-Bench Pro, DeepSWE, SWE-rebench, LiveCodeBench

Computer Use IQ

Each model's Computer Use IQ plotted on a standard normal IQ distribution

What it measures

Agentic operation of real tools and environments — terminals, browsers, and desktop apps.

Benchmarks: Terminal-Bench 2.0, Terminal-Bench Hard, BrowseComp, OSWorld-Verified, Toolathlon, MCP Atlas

Reliability IQ

Each model's Reliability IQ plotted on a standard normal IQ distribution

What it measures

Following instructions precisely and knowing the limits of its own knowledge instead of guessing.

Benchmarks: IFBench, AA Omniscience

Emotional Reasoning (EQ)

Diagnostic Emotional Reasoning scores, excluded from Composite IQ

What it measures

A diagnostic view of emotional and interpersonal behavior. This is excluded from Composite IQ until the benchmark base becomes more rigorous.

Benchmarks: EQ-Bench 3, Arena.ai Overall, AttuneBench

IQ vs Speed vs Cost in 3D

3D scatter: X = response time (log, faster to the right), Y = IQ, Z = effective cost (log). Color = provider. Drag to rotate.

Three tradeoffs at once

Most charts pit two qualities against each other. This view holds all three of the practical tradeoffs in one space: how smart a model is, how fast it answers, and what it costs to run.

IQ rises on the vertical axis, faster models sit to the right, and effective cost runs back into the depth axis on a log scale. The ideal model lives up, right, and toward the front — high intelligence, quick responses, and low cost. Drag to rotate and find where each provider clusters.
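The idea of an "ideal" corner can be made concrete with a Pareto filter: a model is worth a look if no other model is at least as smart, as fast, and as cheap while being strictly better on one of the three. The chart itself is not described as computing this; the sketch below swaps in standard Pareto dominance as an illustration, and the tuple layout and function names are assumptions.

```python
def dominates(a, b):
    """True if model a beats model b in the Pareto sense.

    Each model is (iq, response_seconds, effective_cost):
    higher IQ is better, lower response time and lower cost are better.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return at_least_as_good and strictly_better

def pareto_front(models):
    """models: dict of name -> (iq, response_seconds, effective_cost).

    Returns the sorted names of models that no other model dominates.
    """
    return sorted(
        name for name, m in models.items()
        if not any(dominates(other, m) for other in models.values())
    )
```

Models on this front are the ones that sit closest to the up, right, and front corner from at least one viewing angle; everything else is beaten on all three axes by some alternative.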

Scored benchmarks, 7 dimensions

How dimensions relate to composite IQ