BullshitBench v2 Scores

Clear Pushback rate: share of attempts where a model clearly challenges a false premise instead of accepting nonsense. Color = provider.

How To Read This Chart

Each row in this chart is a source-backed benchmark result mapped to a public AI IQ model profile.
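As a rough illustration of how a Clear Pushback rate could be computed from graded attempts, here is a minimal Python sketch. It assumes the score is a percentage on a 0 to 100 scale, which the page does not state explicitly. The label names (`clear_pushback`, `partial`, `accepted`) and the sample counts are hypothetical and are not taken from the benchmark's actual grading scheme or data.

```python
from collections import Counter


def clear_pushback_rate(labels):
    """Return the percentage of attempts labelled 'clear_pushback'.

    `labels` holds one grade per attempt. The label vocabulary is
    hypothetical; the benchmark's real grading categories may differ.
    """
    if not labels:
        raise ValueError("need at least one graded attempt")
    counts = Counter(labels)
    return 100.0 * counts["clear_pushback"] / len(labels)


# Hypothetical grades for 50 attempts by a single model (made-up data).
attempts = ["clear_pushback"] * 47 + ["partial"] * 2 + ["accepted"] * 1
print(round(clear_pushback_rate(attempts)))  # 94
```

Under these assumptions, only attempts graded as a clear challenge count toward the rate. Partial or hedged responses are treated the same as accepting the false premise.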

Rank	Model	Provider	Clear Pushback rate
1	opus-4.8	Anthropic	94
2	sonnet-4.6	Anthropic	91
3	opus-4.5	Anthropic	90
4	opus-4.6	Anthropic	87
5	qwen3.5-397b	Alibaba	78
6	haiku-4.5	Anthropic	77
7	opus-4.7	Anthropic	74
8	minimax-m3	MiniMax	63
9	mimo-v2.5-pro	Xiaomi	62
10	qwen3.6-plus	Alibaba	59
11	fable-5	Anthropic	56
12	qwen3.7-max	Alibaba	56