Combination Models
Fusion, mixture-of-agents, and router-orchestrated systems that behave like one model surface
Model-like systems, not ordinary checkpoints
Combination models route, fan out, debate, verify, or synthesize across multiple model calls while exposing a single endpoint, preset, or product identity. They belong on a separate page until their costs, latency, benchmark coverage, and component recipes are source-backed enough to compare directly with single models in the main leaderboard.
Frontier panels synthesized by Opus 4.8
OpenRouter Fusion dispatches a prompt to participant models and uses a judge/synthesizer model to produce the final answer. The published DRACO table reports several Fusion rows above the solo frontier baselines used in that test.
- Fable 5 + GPT-5.5, synthesized by Opus 4.8: 69.0%
- Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro, synthesized by Opus 4.8: 68.3%
- Opus 4.8 + GPT-5.5, synthesized by Opus 4.8: 67.6%
- Opus 4.8 + Opus 4.8, synthesized by Opus 4.8: 65.5%
Fugu and Fugu Ultra
Sakana describes Fugu as a multi-agent system delivered through one OpenAI-compatible API. Fugu dynamically coordinates a pool of expert agents; Fugu Ultra uses a deeper pool for higher-quality answers on hard work.
- SWE Bench Pro: 59.0 Fugu, 73.7 Fugu Ultra
- TerminalBench 2.1: 80.2 Fugu, 82.1 Fugu Ultra
- LiveCodeBench Pro: 87.8 Fugu, 90.8 Fugu Ultra
- Humanity's Last Exam: 47.2 Fugu, 50.0 Fugu Ultra
Devin plus a sidekick model
Cognition frames Devin Fusion as a harness where Devin keeps the main task context and delegates suitable work to a sidekick model with its own cached context. The public post emphasizes cost reductions with task-dependent quality changes rather than a single leaderboard row.
- Fusion + Fable 5: 57.6 score at $3.00/task on FrontierCode Extended
- Fable 5 (medium): 57.0 score at $5.12/task
- Fusion: 47.9 score at $2.38/task
- Fable 5 Fusion test: 41% cheaper than pure Fable 5 harness
- Opus and GPT-5.5-level tests: 35% cheaper
Mixture of Agents presets
Hermes documents Mixture of Agents as a virtual model provider. Reference models run first, their outputs are added as private context, and the configured aggregator acts as the model that writes the response and emits tool calls.
- Example tracked recipe: Opus 4.8 + GPT-5.5, aggregated by Opus 4.8
- Primary strength: model-picker integration with normal Hermes agent loops
- Status: no comparable public benchmark table in the cited documentation
Micro-agent recipes
vLLM describes router-owned collaboration patterns such as Confidence, Ratings, ReMoM, Fusion, and Workflows. The public surface can remain one model name while the router selects the task-shaped recipe underneath.
- VSR Closed on LiveCodeBench Jan-Apr 2025: 92.6
- VSR Closed on GPQA-Diamond: 96.0
- VSR Closed on Humanity's Last Exam: 50.0
- VSR Hybrid on Humanity's Last Exam: 47.1
| System | Surface | Combination Pattern | Public Evidence | AI IQ Treatment |
|---|---|---|---|---|
| OpenRouter Fusion | API model/tool/plugin/chatroom | Parallel panel plus judge/synthesizer | DRACO deep-research scores for named panels | Track here; do not derive Composite IQ from one benchmark family |
| Sakana Fugu | OpenAI-compatible model API | Learned orchestration over a model pool | Multi-benchmark table for Fugu and Fugu Ultra | Candidate for future sparse benchmark rows if source definitions align |
| Devin Fusion | Devin harness behavior | Main agent plus sidekick delegation | Cost/quality examples and average cost-reduction claims | Track as product-system evidence, not a standalone model row |
| Hermes MoA | Virtual model provider / preset | Reference models feeding an aggregator | Implementation documentation, no comparable scorecard | List without ranking until source-backed evals exist |
| vLLM VSR | Router-level model alias | Task-shaped router recipes and micro-agents | Scorecard rows for VSR Closed and VSR Hybrid | Track here; benchmark fields need recipe/source disambiguation first |
Why this is not in the main model table yet
AI IQ's main model rankings assume a model row represents a reasonably stable model identity with source-backed benchmark fields and documented pricing. Combination systems can change recipes, fan-out width, synthesizer choice, and routing policy without changing the outward name. This page keeps the category visible while avoiding accidental apples-to-oranges IQ scoring.
Future promotion into the dataset should require a stable public model identifier, published pricing or hosted cost basis, source-backed benchmark rows that map to existing fields, and clear notes about whether benchmark numbers include tools, web access, multi-turn orchestration, or hidden routing.