On July 1, 2026, Cognition shipped Devin Fusion, a hybrid-model harness for agentic coding that pairs a frontier model with a smaller "sidekick" model running in parallel. The latest smol.ai AINews issue flagged it as the week's standout release because Devin Fusion posts Fable-level performance at roughly 35% lower cost on the FrontierCode benchmark without swapping models mid-chat. For teams shipping mobile apps, n8n automation, and AI agent orchestration, like the work we do at Halmob, Fusion is the first hybrid harness cheap enough to run in a production loop and disciplined enough to trust.
The release matters for a specific reason. Most teams that tried to cut agent costs did it by swapping the whole model mid-session — cheap when the request is easy, frontier when it is hard. That approach breaks the prompt cache on every swap and quietly wipes out most of the savings. Devin Fusion inverts the pattern: the main model stays on the wire the entire session, and a smaller sidekick runs alongside it for well-scoped subtasks. The cache stays warm, the frontier model stays in charge, and the average cost per task drops.
The 30-Second Version
What Devin Fusion Actually Is
Fusion is the productisation of a pattern Cognition has been iterating on inside Devin for the last two quarters. Instead of picking one model per request, the harness runs two roles concurrently: a lead model that owns planning, review, and the outer loop, and a sidekick model that handles the delegated grunt work. The delegation contract is narrow on purpose — a well-scoped subtask with the context it needs, not a hand-off of the whole conversation.
The technical trick that makes this cheap is cache locality. When you swap the lead model mid-chat, you invalidate the prompt cache on both sides — the session you left and the session you enter — so a "cheap" model call ends up billed at cold-cache rates on the swap. Fusion keeps the lead model warm on the primary session for the entire run and only spawns the sidekick as a side conversation with its own cache. That single design choice is where most of the 35% comes from.
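The cache argument can be made concrete with a toy cost model. The sketch below is an illustration, not Cognition's accounting: the prices are invented, the context size is held constant across turns, and it assumes a model's cache is warm only if that model also served the previous turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical input prices in USD per million tokens (not vendor rates)."""
    uncached: float  # context read cold
    cached: float    # context served from a warm prompt cache


def session_input_cost(schedule, context_tokens, prices):
    """Sum the context-read cost of a session, one entry in `schedule` per turn.

    Assumption: a model's cache is warm only if it also served the previous
    turn; any swap means the incoming model reads the context cold.
    """
    total = 0.0
    previous = None
    for model in schedule:
        price = prices[model]
        rate = price.cached if model == previous else price.uncached
        total += context_tokens / 1_000_000 * rate
        previous = model
    return total


# Made-up prices: the "cheap" model is 5x cheaper than the lead at both rates.
PRICES = {
    "lead": Pricing(uncached=10.0, cached=1.0),
    "cheap": Pricing(uncached=2.0, cached=0.2),
}

pinned = session_input_cost(["lead"] * 6, 100_000, PRICES)
swapped = session_input_cost(
    ["lead", "lead", "cheap", "lead", "cheap", "lead"], 100_000, PRICES
)
print(f"pinned lead: ${pinned:.2f}  swapping: ${swapped:.2f}")
```

Under these invented numbers, the six-turn session that swaps twice costs more than double the session that stays on the lead, because every swap pays a cold read on the model it enters. Real ratios depend on your vendor's cache pricing and cache lifetime.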
Why It Is Not Just Model Routing
A normal router is a rules engine. You write a rule: if the request is easy, send it to the cheap model; otherwise, send it to the expensive one. That works in a demo and breaks the first time an "easy" request turns hard three turns in. Fusion moves the decision from the outer boundary to the inner loop: the lead model reads the situation as it unfolds, and if a subtask fits the sidekick's window, it delegates. If not, the lead model does it. If the sidekick's subtask escalates mid-run, the lead model takes it back.
The practical consequence is that the model choice becomes a per-step decision made by a model that already understands the task, not a per-request decision made by a rules engine that does not. That is the same pattern we walked through in the executor-advisor orchestration piece, applied to model selection instead of tool selection.
The lead model plans. The sidekick fetches. If the fetch turns into thinking, it bounces back. The cache never leaves the room.
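The delegate-and-take-back loop can be sketched in a few lines. Everything here is hypothetical: `Step`, `Escalate`, `run_session`, and the stub models are names invented for illustration, the `fits_sidekick` flag stands in for a judgement the lead model would make live, and the loop runs sequentially where the real harness runs the sidekick in parallel.

```python
from dataclasses import dataclass


class Escalate(Exception):
    """Raised by the sidekick when a subtask needs more than it can handle."""


@dataclass
class Step:
    description: str
    fits_sidekick: bool  # stands in for the lead model's live judgement


def run_session(steps, lead, sidekick):
    """Run each step on the sidekick when it fits, otherwise on the lead.

    Returns a list of (role, result) pairs so the routing is auditable.
    """
    log = []
    for step in steps:
        if step.fits_sidekick:
            try:
                log.append(("sidekick", sidekick(step.description)))
                continue
            except Escalate:
                pass  # the fetch turned into thinking: the lead takes it back
        log.append(("lead", lead(step.description)))
    return log


def stub_lead(task):
    return f"lead handled: {task}"


def stub_sidekick(task):
    # Toy escalation rule: anything mentioning "design" is too hard.
    if "design" in task:
        raise Escalate(task)
    return f"sidekick handled: {task}"


steps = [
    Step("plan the migration", fits_sidekick=False),
    Step("list files touching the schema", fits_sidekick=True),
    Step("fetch docs, then design the rollback", fits_sidekick=True),
]
for role, result in run_session(steps, stub_lead, stub_sidekick):
    print(role, "->", result)
```

The third step shows the bounce: it is delegated, the sidekick escalates, and the lead finishes it without the session ever changing hands.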
How Fusion, Sakana Fugu, and Manual Routing Compare
| Approach | Where the choice lives | What breaks the prompt cache |
|---|---|---|
| Manual mid-chat model swap | Application code (if/else on task type) | Every swap invalidates the cache on both sides |
| Sakana Fugu | A trained orchestrator behind one OpenAI-compatible endpoint | The orchestrator hides the cache boundary from you |
| Cognition Devin Fusion | The lead model, delegating live to a parallel sidekick | Cache stays warm on the lead; sidekick has its own |
| LangGraph / Hermes graph | Nodes you author and pin manually | You engineer the cache strategy per node |
The right row depends on the shape of the workload. A team already deep in a Sakana Fugu-style API — covered in our Sakana Fugu write-up — gets model variety by default. A team building inside the Devin harness gets Fusion as a flag. A team owning its own orchestration graph — the shape we covered in the Hermes Workspace piece — borrows the pattern rather than the product.
What The 35% Cost Cut Actually Buys
The headline number is measured on FrontierCode without Fable 5 in the pool: Fusion matches frontier-model quality against Opus 4.8 and GPT-5.5 at roughly 35% lower spend. That is the load-bearing claim. Four secondary effects matter more for teams running agents in production.
- Lower spend at the same latency budget. The sidekick runs in parallel, not in series, so the wall-clock cost of the delegation is close to zero when the task fits.
- Higher throughput per session. With a warm cache and a parallel worker, more subtasks fit inside the same context window before you have to summarise and re-hydrate.
- Fewer "model regressions" after a swap. Because the lead model never leaves, its tool-call shape, style, and safety posture stay consistent for the whole session.
- Cost that scales with task shape, not request count. Ten short lookups add little beyond the cost of one, because they all route to the sidekick and never touch the lead model's budget. That is the shape most n8n workflows already have.
Why This Matters for Mobile and Automation Teams
Two patterns Halmob ships every week look different the day after Fusion lands. The first is a mobile app whose backend calls a coding agent for a codegen or code-review step — a lint fixer, a schema-migration helper, a "what changed in this diff" summary. The second is an n8n workflow whose AI node has been pinned to one frontier model because model swaps historically broke the flow. Both benefit from the same seam: the app calls one endpoint, and inside that endpoint the harness — Devin, or a Fusion-style clone — decides per step who runs.
For mobile, the win is that the cost of an AI-assisted feature can drop without a client release. The phone still calls the same server-side proxy. The proxy calls the Fusion-enabled harness. The savings show up in the bill, not in the app store. That pairs naturally with the server-owned loop we walked through in the loop engineering piece: the phone never knows which model did the work.
For n8n automation, the practical move is to point the AI node at a Fusion-backed endpoint (Devin's API, or a self-hosted implementation of the same pattern) and stop hand-tuning which nodes get which model. The workflow keeps its shape. The bill changes.
Where Fusion Fits in the 2026 Agent Stack
Fusion does not replace the agent stack. It collapses the model-selection layer into a per-step decision the lead model owns, and leaves every other layer alone.
| Layer | What it owns | Where Fusion sits |
|---|---|---|
| Application loop | Retries, schema validation, durable state | Above Fusion — your loop still owns this |
| Model selection | Deciding lead vs sidekick per step | Replaced by Fusion |
| Prompt cache | Reusing context across turns | Preserved by Fusion — cache stays on the lead |
| Tools and MCP | External actions the agent can call | Unchanged — surfaced through the harness |
| Distribution | Mobile app, webhook, n8n node, chat surface | Untouched — Fusion is a harness swap, not an API swap |
The layout pairs cleanly with the orchestration pattern we covered in the LangChain Deep Agents harness-profiles piece and the model-neutrality argument in the orchestration-era write-up. Fusion is a harness feature that respects those boundaries; it does not try to be the whole graph.
The Failure Modes Worth Designing Around
- Silent quality drift on sidekick tasks. A task the lead model would have caught can slip past a smaller sidekick. Instrument sample checks — spot-audit sidekick outputs and compare to lead-model outputs on the same input weekly.
- Escalation loops. If the sidekick bounces work back too often, you pay for both models on the same subtask. Watch the escalation rate; a healthy Fusion run keeps it well under half.
- Cache-warm math that only works at load. The 35% number assumes the cache is being reused. A cold, one-shot request does not see the savings. Cost-model your low-QPS workflows separately.
- Vendor coupling to Devin. Fusion, as shipped, is a Cognition harness. If you depend on it, you depend on their harness roadmap. Keep the loop portable — the pattern is reproducible even if the product is not.
- Sidekick tool-call shape drift. When the sidekick uses a different tool-call format than the lead, downstream parsers can break. Normalise tool calls at the harness boundary, not at the client.
- Bench-vs-prod delta. FrontierCode is a benchmark. Your workflow is not. Measure the same 35% on your top three workflows before promoting Fusion everywhere.
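Of the failure modes above, the escalation rate is the easiest to instrument. A minimal sketch follows, assuming your harness can report, per delegated subtask, whether it bounced back to the lead; `EscalationMonitor` and its 0.5 threshold are hypothetical, with the threshold only echoing the "well under half" rule of thumb.

```python
from collections import Counter


class EscalationMonitor:
    """Counts delegated subtasks and how many bounced back to the lead."""

    def __init__(self, max_rate=0.5):
        # 0.5 mirrors "well under half"; pick your own alert threshold.
        self.max_rate = max_rate
        self.counts = Counter()

    def record(self, escalated):
        """Record one delegated subtask; `escalated` is True if it bounced."""
        self.counts["escalated" if escalated else "completed"] += 1

    def rate(self):
        """Share of delegated subtasks that escalated (0.0 if none recorded)."""
        delegated = self.counts["escalated"] + self.counts["completed"]
        return self.counts["escalated"] / delegated if delegated else 0.0

    def healthy(self):
        return self.rate() < self.max_rate


monitor = EscalationMonitor()
for escalated in [False, False, True, False]:
    monitor.record(escalated)
print(f"escalation rate: {monitor.rate():.0%}, healthy: {monitor.healthy()}")
```

In practice you would export the rate as a metric per workflow rather than print it, so a workflow that quietly pays for both models shows up on a dashboard.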
A Practical Migration Plan for Existing Workloads
1. Pick the one AI workflow with the largest monthly bill. The workflow whose cost line item you already argue about is the right first candidate for Fusion.
2. Run Fusion in shadow mode. Send the request to both the pinned frontier model and the Fusion endpoint, log both outputs, do not show Fusion's output to the user yet. A week of shadow data shows agreement, divergence, and the actual cost delta.
3. Promote low-stakes flows first. Internal summaries, code review comments, draft generation. Keep the user-facing money path on the pinned model until telemetry is convincing.
4. Instrument the escalation rate before you scale. A Fusion workflow that escalates most subtasks back to the lead is a Fusion workflow that quietly costs more, not less.
5. Keep a single-model fallback in the harness. If the Fusion endpoint degrades, your loop should swap back to a pinned model without a release. That is the same kill switch any vendor in the path deserves.
6. Document the contract you depend on. "We call the Fusion harness through this endpoint with these system messages, and we expect this tool-call shape." That sentence is your spec; if it ever breaks, you know what to rebuild.
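The shadow-mode step in the plan above can be sketched as a thin wrapper. This is a sketch under assumptions: `shadow_call`, `summarise`, and the stub endpoints are hypothetical, responses are assumed to be dicts with `text` and `cost_usd` keys, and exact string equality is a deliberately crude stand-in for whatever agreement check suits your outputs.

```python
def shadow_call(request, pinned, fusion, log):
    """Serve the pinned model's answer; record Fusion's answer on the side."""
    primary = pinned(request)
    try:
        shadow = fusion(request)
        log.append({
            "request": request,
            "agree": primary["text"] == shadow["text"],
            "cost_delta_usd": shadow["cost_usd"] - primary["cost_usd"],
        })
    except Exception as exc:  # a shadow failure must never reach the user
        log.append({"request": request, "error": repr(exc)})
    return primary


def summarise(log):
    """Agreement rate and total cost delta over the successful comparisons."""
    compared = [row for row in log if "error" not in row]
    if not compared:
        return {"compared": 0, "agreement": None, "cost_delta_usd": 0.0}
    return {
        "compared": len(compared),
        "agreement": sum(row["agree"] for row in compared) / len(compared),
        "cost_delta_usd": sum(row["cost_delta_usd"] for row in compared),
    }


def stub_pinned(request):
    return {"text": request.upper(), "cost_usd": 0.10}


def stub_fusion(request):
    if request == "boom":
        raise RuntimeError("shadow endpoint down")
    text = request.upper() if request != "edge case" else "different"
    return {"text": text, "cost_usd": 0.06}


log = []
answers = [
    shadow_call(r, stub_pinned, stub_fusion, log)
    for r in ["fix lint", "edge case", "boom"]
]
print(summarise(log))
```

The user only ever sees the pinned answer, even when the shadow endpoint raises. A production version would likely issue the shadow call asynchronously so it adds no latency to the user-facing path.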
When to Pick Fusion, When to Stay Pinned
How It Fits the Halmob Stack
Most of what we ship at Halmob is the layer above the model: a mobile app talking to an n8n workflow that drives an AI agent. Fusion changes what happens inside that agent step. The workflow keeps its shape; the harness underneath decides who runs each subtask. The mobile client does not need to know, the automation does not need to change nodes, and the cost line moves because the harness is smarter about cache locality — not because the model changed.
For iOS and Android teams the immediate move is to keep the server-side proxy you already put in front of the model and point it at a Fusion-backed endpoint under a feature flag. Ship the flag off. Turn it on for a canary. Measure. That single seam is what makes the next harness swap a configuration change instead of an app release.
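A minimal sketch of that seam inside the server-side proxy, assuming a boolean flag and a canary percentage read from your own config. The names `in_canary` and `handle` and the stub endpoints are hypothetical; the hashing only makes bucket assignment stable per user.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


def handle(request, user_id, flag_on, canary_percent, fusion, pinned):
    """Route to the Fusion-backed endpoint for canary users, else stay pinned.

    Any Fusion failure falls back to the pinned model, so the flag doubles
    as a kill switch.
    """
    if flag_on and in_canary(user_id, canary_percent):
        try:
            return "fusion", fusion(request)
        except Exception:
            pass  # degrade quietly to the pinned model
    return "pinned", pinned(request)


def stub_fusion(request):
    return f"fusion:{request}"


def stub_pinned(request):
    return f"pinned:{request}"


# Flag shipped off: everyone stays on the pinned model.
print(handle("summarise diff", "user-42", False, 100, stub_fusion, stub_pinned))
# Flag on at 100%: everyone goes through the Fusion-backed endpoint.
print(handle("summarise diff", "user-42", True, 100, stub_fusion, stub_pinned))
```

Because the client only talks to the proxy, moving the canary from 0 to 100 percent, or back, is a config change on the server and never an app release.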
The Bottom Line
Devin Fusion is the first hybrid-model coding harness to publish a real cost number — 35% off FrontierCode at frontier quality — and pair it with a design that keeps the prompt cache alive. The pattern generalises: a lead model that plans and reviews, a sidekick that runs scoped subtasks in parallel, and a delegation loop the lead model owns. Whether you adopt Cognition's implementation or borrow the pattern in your own harness, the interesting question stops being which frontier model do we pick and becomes how much of the work can a smaller model do without the lead ever leaving the session.
The right question for the next sprint is not "is Fusion cheaper than pinning Opus 4.8." It is "which of our workflows are chatty enough that a sidekick eats most of the tokens without moving the quality bar." At Halmob we pair mobile development with n8n automation and AI agent orchestration for teams that want the answer to that question to be most of them.
For sources, see the smol.ai AINews newsletter coverage of the Fusion release, the official Cognition Devin Fusion announcement, and the Latent Space discussion of hybrid-model harness patterns.