Combination Models

Fusion, mixture-of-agents, and router-orchestrated systems that behave like one model surface

Tracked separately

Model-like systems, not ordinary checkpoints

Combination models route, fan out, debate, verify, or synthesize across multiple model calls while exposing a single endpoint, preset, or product identity. They belong on a separate page until their costs, latency, benchmark coverage, and component recipes are source-backed enough to compare directly with single models in the main leaderboard.

Page rule Include systems only when the public source describes a multi-model or multi-agent recipe presented as one model-like surface.

Source Benchmark Charts

Featured Combination Systems

OpenRouter Fusion Source

Frontier panels synthesized by Opus 4.8

OpenRouter Fusion dispatches a prompt to participant models and uses a judge/synthesizer model to produce the final answer. The published DRACO table reports several Fusion rows above the solo frontier baselines used in that test.

Fable 5 + GPT-5.5, synthesized by Opus 4.8: 69.0%
Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro, synthesized by Opus 4.8: 68.3%
Opus 4.8 + GPT-5.5, synthesized by Opus 4.8: 67.6%
Opus 4.8 + Opus 4.8, synthesized by Opus 4.8: 65.5%

Sakana Fugu Source

Fugu and Fugu Ultra

Sakana describes Fugu as a multi-agent system delivered through one OpenAI-compatible API. Fugu dynamically coordinates a pool of expert agents; Fugu Ultra uses a deeper pool for higher-quality answers on hard work.

SWE Bench Pro: 59.0 Fugu, 73.7 Fugu Ultra
TerminalBench 2.1: 80.2 Fugu, 82.1 Fugu Ultra
LiveCodeBench Pro: 87.8 Fugu, 90.8 Fugu Ultra
Humanity's Last Exam: 47.2 Fugu, 50.0 Fugu Ultra

Cognition Devin Fusion Source

Devin plus a sidekick model

Cognition frames Devin Fusion as a harness where Devin keeps the main task context and delegates suitable work to a sidekick model with its own cached context. The public post emphasizes cost reductions with task-dependent quality changes rather than a single leaderboard row.

Fusion + Fable 5: 57.6 score at $3.00/task on FrontierCode Extended
Fable 5 (medium): 57.0 score at $5.12/task
Fusion: 47.9 score at $2.38/task
Fable 5 Fusion test: 41% cheaper than pure Fable 5 harness
Opus and GPT-5.5-level tests: 35% cheaper

Hermes Agent MoA Source

Mixture of Agents presets

Hermes documents Mixture of Agents as a virtual model provider. Reference models run first, their outputs are added as private context, and the configured aggregator acts as the model that writes the response and emits tool calls.

Example tracked recipe: Opus 4.8 + GPT-5.5, aggregated by Opus 4.8
Primary strength: model-picker integration with normal Hermes agent loops
Status: no comparable public benchmark table in the cited documentation

vLLM Semantic Router Source

Micro-agent recipes

vLLM describes router-owned collaboration patterns such as Confidence, Ratings, ReMoM, Fusion, and Workflows. The public surface can remain one model name while the router selects the task-shaped recipe underneath.

VSR Closed on LiveCodeBench Jan-Apr 2025: 92.6
VSR Closed on GPQA-Diamond: 96.0
VSR Closed on Humanity's Last Exam: 50.0
VSR Hybrid on Humanity's Last Exam: 47.1

Comparison Snapshot

System	Surface	Combination Pattern	Public Evidence	AI IQ Treatment
OpenRouter Fusion	API model/tool/plugin/chatroom	Parallel panel plus judge/synthesizer	DRACO deep-research scores for named panels	Track here; do not derive Composite IQ from one benchmark family
Sakana Fugu	OpenAI-compatible model API	Learned orchestration over a model pool	Multi-benchmark table for Fugu and Fugu Ultra	Candidate for future sparse benchmark rows if source definitions align
Devin Fusion	Devin harness behavior	Main agent plus sidekick delegation	Cost/quality examples and average cost-reduction claims	Track as product-system evidence, not a standalone model row
Hermes MoA	Virtual model provider / preset	Reference models feeding an aggregator	Implementation documentation, no comparable scorecard	List without ranking until source-backed evals exist
vLLM VSR	Router-level model alias	Task-shaped router recipes and micro-agents	Scorecard rows for VSR Closed and VSR Hybrid	Track here; benchmark fields need recipe/source disambiguation first

Why this is not in the main model table yet

AI IQ's main model rankings assume a model row represents a reasonably stable model identity with source-backed benchmark fields and documented pricing. Combination systems can change recipes, fan-out width, synthesizer choice, and routing policy without changing the outward name. This page keeps the category visible while avoiding accidental apples-to-oranges IQ scoring.

Future promotion into the dataset should require a stable public model identifier, published pricing or hosted cost basis, source-backed benchmark rows that map to existing fields, and clear notes about whether benchmark numbers include tools, web access, multi-turn orchestration, or hidden routing.