BullshitBench v2 Scores
Clear Pushback rate: share of attempts where a model clearly challenges a false premise instead of accepting nonsense. Color = provider.
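As a rough illustration of the definition above, the rate can be computed as the fraction of attempts graded as clear pushback, expressed here as a percentage. This is a minimal sketch, not the benchmark's actual scoring code: the function name `clear_pushback_rate` and the label strings `clear_pushback` and `accepted_nonsense` are hypothetical, and it assumes each attempt has already been graded with a single label.

```python
from collections import Counter


def clear_pushback_rate(labels, positive="clear_pushback"):
    """Return the percentage of attempts graded as clear pushback.

    `labels` holds one grade per attempt. The label names are
    illustrative assumptions, not the benchmark's real schema.
    """
    if not labels:
        raise ValueError("no attempts to score")
    counts = Counter(labels)
    return 100.0 * counts[positive] / len(labels)


# Four graded attempts, three of which clearly challenge the false premise.
attempts = [
    "clear_pushback",
    "clear_pushback",
    "accepted_nonsense",
    "clear_pushback",
]
rate = clear_pushback_rate(attempts)
print(f"{rate:.0f}")  # prints 75
```

Under this reading, a higher value means the model challenged false premises in a larger share of its attempts.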
How To Read This Chart
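Each row in this chart is a source-backed benchmark result mapped to a public AI IQ model profile. A higher score means the model clearly pushed back on false premises in a larger share of attempts.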
Top Models
| Rank | Model | Provider | Score (Clear Pushback rate) |
|---|---|---|---|
| 1 | opus-4.8 | Anthropic | 94 |
| 2 | sonnet-4.6 | Anthropic | 91 |
| 3 | opus-4.5 | Anthropic | 90 |
| 4 | opus-4.6 | Anthropic | 87 |
| 5 | qwen3.5-397b | Alibaba | 78 |
| 6 | haiku-4.5 | Anthropic | 77 |
| 7 | opus-4.7 | Anthropic | 74 |
| 8 | minimax-m3 | MiniMax | 63 |
| 9 | mimo-v2.5-pro | Xiaomi | 62 |
| 10 | qwen3.6-plus | Alibaba | 59 |
| 11 | fable-5 | Anthropic | 56 |
| 12 | qwen3.7-max | Alibaba | 56 |