Devin Fusion: Hybrid-Model Agent Orchestration Guide

Devin Fusion runs a frontier main agent and a cheap sidekick agent, routes mid-session without cache loss, and cuts coding-agent cost by 35%.

İlker Ulusoy 2026-07-03 10 min read

The June 29, 2026 smol.ai newsletter led with one launch: Cognition shipped Devin Fusion, a hybrid-model harness for agentic coding that runs a frontier "main" agent and a cheap "sidekick" agent side by side, then routes work between them mid-session without breaking the prompt cache. Cognition reports a 35% cost cut against Opus 4.8 and GPT-5.5 on FrontierCode Extended at the same intelligence level, and 88% of merged pull requests in internal testing were driven end to end by the Fusion router. For teams shipping mobile apps, n8n workflows, and AI agents like the ones we build at Halmob, Fusion is the first coding harness that treats hybrid-model orchestration as a first-class product surface rather than a routing hack.

Devin Fusion sits next to the Sakana Fugu launch we wrote about a week earlier: Fugu productises orchestration as one OpenAI-compatible endpoint, and Fusion productises orchestration as a coding harness that owns the whole loop. Different shapes, same underlying bet — the abstraction line is moving up from "pick the best model" to "let the harness pick per task, per turn, per cache boundary."

The 30-Second Version

Devin Fusion runs two coding agents in parallel: a frontier main agent that plans, decides, and reviews, and a smaller sidekick agent that does the grunt work — code exploration, broad edits, tests, lint. Each keeps its own persistent cached context, so the router can hand work between them mid-session without the usual cache-miss tax. The published number is a 35% cost reduction at frontier intelligence; the more interesting number is 88% of merged PRs driven fully by the router.

What Devin Fusion Actually Is

Fusion is a harness, not a model. It wraps two coding agents behind one interface. The main agent runs on a frontier model — Opus 4.8 or GPT-5.5 in the launch benchmark. The sidekick runs on a smaller, cheaper model. The main agent takes minimal actions by default: it plans, delegates, monitors, and reviews. The sidekick does the mechanical work — reading files, applying broad edits, writing tests, fixing lint, chasing typecheck failures.

The important design choice is that both agents keep their own persistent cached contexts. Prompt caching is why frontier tokens are affordable at all on multi-turn work — a cache miss inside a long agent session can multiply token cost several times over. Old-school model routing swapped models mid-chat, which busted the cache every time. Fusion avoids that by giving each agent its own conversation and its own cache, and only routing at the boundaries where a cache miss was going to happen anyway — the context-compaction moments where the harness rewrites the working set to fit the window.
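
To see why a mid-session cache miss is expensive, here is a minimal cost sketch. It assumes a simplified pricing model that is not taken from Cognition or any provider: cached prefix tokens are billed at a fraction of the normal input price (`cached_rate`, set to 0.1 here purely for illustration), new tokens are billed at full price, and a model swap forces the whole prefix to be re-read at full price. The function name and parameters are hypothetical, and output tokens and cache-write premiums are ignored.

```python
def session_input_cost(turns, tokens_per_turn, cached_rate=0.1, swap_turns=()):
    """Input cost of one session, in full-price token equivalents.

    Assumptions (illustrative only): the context grows by
    `tokens_per_turn` each turn, a warm cache bills the existing prefix
    at `cached_rate`, and a swap on a turn in `swap_turns` re-reads the
    whole prefix at full price.
    """
    swaps = set(swap_turns)
    cost = 0.0
    context = 0  # tokens already in the conversation before this turn
    for turn in range(turns):
        if turn in swaps:
            cost += context                # cache miss: prefix at full price
        else:
            cost += context * cached_rate  # cache hit: discounted prefix
        cost += tokens_per_turn            # new tokens are never cached yet
        context += tokens_per_turn
    return cost


warm = session_input_cost(turns=40, tokens_per_turn=2000)
swapped = session_input_cost(turns=40, tokens_per_turn=2000,
                             swap_turns=(10, 20, 30))
print(f"warm cache: {warm:,.0f}  with three swaps: {swapped:,.0f}")
```

Under these assumptions every swap adds the full prefix length to the bill, so swaps late in a long session cost the most. That is the tax the two-cache design is meant to avoid.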

Why "Conventional Routing" Was Not Enough

Model routing has been shipping for two years. The pitch was always the same: classify the task, send easy questions to a cheap model, send hard questions to a frontier model, save money. In benchmarks it worked. In production coding work it did not. The failure mode is well-known to anyone who has watched an agent write real code: task difficulty is not visible up front. A ticket that reads "fix the flake" can be a five-line change or a two-day refactor, and you do not know which until the agent has already read half the repository.

Cognition's framing — "conventional model routing passes benchmarks but fails to write code you'd actually merge" — is the reason Fusion exists. Instead of routing the task, Fusion routes the sub-task. The main agent is always in the loop; the sidekick is a delegate the main agent supervises. That shape maps onto how a senior engineer works with a fast intern: the intern does the reading and the mechanical edits, the senior owns the design decisions, and neither of them starts over from scratch when the other one finishes a step.

Approach | How it routes | What happens in production coding
--- | --- | ---
Single frontier model | Everything runs on Opus 4.8 or GPT-5.5 | Every token is priced at the top of the market, most of them on mechanical work
Classifier-based routing | A small model picks the model per task at the start | Difficulty is not visible up front; wrong routing wastes the whole session
Mid-chat model swap | Switch model inside one conversation | Busts the prompt cache on every swap, multiplying cost and latency
Devin Fusion | Two agents, two caches, router acts at compaction boundaries | Cache stays warm, main agent stays in the loop, sidekick handles the grunt work

The Sidekick Pattern, Explained Simply

Think of the sidekick as a fast intern with a scratchpad. The senior engineer — the main agent — reads the ticket, writes down the plan, and hands over bounded steps: "read every file that references this function, tell me what changes," or "apply this refactor to the six files I listed, run the tests, report failures." The intern does that work in its own workspace and comes back with a summary. The senior reviews, decides what to keep, and moves to the next step. Neither of them re-reads the codebase every time. That is the shape Fusion encodes.
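
The hand-off described above can be sketched as two small records and a review check. Everything here (`Brief`, `Report`, `review`, and the field names) is hypothetical and only illustrates the shape of a bounded brief; it is not Devin's actual interface.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Brief:
    """What the main agent hands to the sidekick (hypothetical shape)."""
    goal: str
    allowed_files: FrozenSet[str]  # the step is bounded to these files
    must_run_tests: bool = True


@dataclass(frozen=True)
class Report:
    """What the sidekick hands back (hypothetical shape)."""
    summary: str
    changed_files: FrozenSet[str]
    tests_passed: bool


def review(brief: Brief, report: Report) -> List[str]:
    """Return the problems the main agent should act on; empty means accept."""
    problems = []
    out_of_scope = report.changed_files - brief.allowed_files
    if out_of_scope:
        problems.append("out of scope: " + ", ".join(sorted(out_of_scope)))
    if brief.must_run_tests and not report.tests_passed:
        problems.append("tests failed")
    return problems


brief = Brief(goal="rename fetchUser to loadUser",
              allowed_files=frozenset({"api.py", "views.py"}))
report = Report(summary="renamed in two files",
                changed_files=frozenset({"api.py", "views.py"}),
                tests_passed=True)
print(review(brief, report))
```

The point of the sketch is that the review turn checks the report against the written brief rather than re-reading the codebase.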

Two agents, two caches, one plan — the router only moves at the seams where a cache miss was already going to happen.

Dynamic Mid-Session Routing Without Cache Loss

The technical piece that makes the sidekick pattern practical is the routing rule. Fusion does not swap models on every turn. It moves work between the two agents at moments the harness would have paid a cache miss anyway — a context compaction, a fresh planning turn, an escalation from "this is a broad edit" to "this is a design change." At those seams, the cache is going to warm from scratch either way, so the router picks the agent that fits the next chunk of work and hands over. The result is that a hard task escalates back to the frontier main agent without a cache-cost penalty, and an easy stretch stays on the sidekick for as long as the sidekick can carry it.
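
A minimal sketch of that routing rule, under stated assumptions: the event names in `SEAM_EVENTS` and the work categories in `DESIGN_WORK` are invented labels for the seams and task types the paragraph describes, and `next_agent` is a hypothetical function, not Fusion's router.

```python
from enum import Enum


class Agent(Enum):
    MAIN = "main"
    SIDEKICK = "sidekick"


# Hypothetical labels for moments where a cache miss happens anyway.
SEAM_EVENTS = {"compaction", "fresh_plan", "escalation"}
# Hypothetical labels for work that needs the frontier model's judgment.
DESIGN_WORK = {"plan", "design_change", "review"}


def next_agent(current: Agent, event: str, work_kind: str) -> Agent:
    """Pick who handles the next chunk of work.

    Between seams the current agent keeps going so its cache stays warm.
    At a seam, the router picks whichever agent fits the upcoming work.
    """
    if event not in SEAM_EVENTS:
        return current
    return Agent.MAIN if work_kind in DESIGN_WORK else Agent.SIDEKICK


# Mid-stretch, even design-flavoured work does not trigger a hand-off.
print(next_agent(Agent.SIDEKICK, "tool_call", "design_change"))
# At an escalation seam, the same work moves to the main agent.
print(next_agent(Agent.SIDEKICK, "escalation", "design_change"))
```

The design choice worth noticing is the early return: the decision about which model fits best is only consulted when switching is free.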

This is the same idea that made loop engineering worth writing about — reliability comes from designing the seams of a long agent run, not from picking a smarter model. We covered that pattern in the loop engineering piece and the broader orchestration shift in the orchestration era write-up.

What the Numbers Actually Say

Metric | Number Cognition reported | Reading
--- | --- | ---
Cost reduction on FrontierCode Extended | 35% vs Opus 4.8 and GPT-5.5 alone | Same intelligence, roughly one-third cheaper: budget headroom, not a demo trick
Cost reduction with Fable 5 as main | 41% | Fable 5 access was later restricted; the number matters as a ceiling, not a shipping default
Merged PRs driven fully by the router | 88% of Cognition's internal test set | The important one: the router's choices held up to what a senior reviewer would sign off on

The 35% figure is what gets quoted, but the 88% figure is the one to sit with. It says the routing decisions the harness made were good enough that a reviewer merged them. That is a different quality bar from "the answer looked correct on a benchmark." It is the bar a coding agent has to clear to be worth wiring into a real repository.
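
For intuition on where a saving of that size can come from, here is a back-of-envelope sketch. The two inputs are assumptions chosen for illustration, not figures Cognition published: `delegated_share` is the fraction of frontier spend that moves to the sidekick, and `sidekick_price_ratio` is the sidekick's cost relative to the frontier model for the same work. The model ignores the overhead of briefing and reviewing.

```python
def blended_saving(delegated_share: float, sidekick_price_ratio: float) -> float:
    """Fraction of the single-frontier bill saved by delegating.

    Spend that stays on the main agent is unchanged; delegated spend is
    multiplied by the sidekick's relative price.
    """
    if not 0.0 <= delegated_share <= 1.0:
        raise ValueError("delegated_share must be between 0 and 1")
    if not 0.0 <= sidekick_price_ratio <= 1.0:
        raise ValueError("sidekick_price_ratio must be between 0 and 1")
    return delegated_share * (1.0 - sidekick_price_ratio)


# Illustrative inputs only: half the spend delegated, sidekick at 30% of the price.
print(f"{blended_saving(0.5, 0.3):.0%}")
```

Many different pairs of inputs land near the same saving, so the reported number alone does not reveal how much work the sidekick carried.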

Why This Matters for Mobile Teams

Mobile app work is where the sidekick pattern earns its keep fastest. A typical iOS or Android task looks like: touch three ViewModels, update the corresponding UI test, run the linter, run the snapshot tests, fix the snapshots, push. Most of that is mechanical. The design decisions — how state is threaded, when to invalidate a cache, what the failure UX should be — are a small fraction of the tokens. A single-frontier-model agent pays the top price for every one of those mechanical turns. Fusion routes them to the sidekick and only escalates when the main agent needs to make a call.

For the mobile development work we ship, the interesting property is not the raw cost saving. It is that the cost of running an agent on a live mobile branch no longer scales with codebase size the way it used to, because the file walks and mechanical edits that grow with the codebase run at sidekick prices. That changes which sprints an agent can justify its seat in. The related pattern for coding-agent CI, which we already use, is documented in the Cursor SDK CI/CD piece.

Where Fusion Fits in an n8n Automation Flow

n8n workflows and Devin agents already share the same job shape — a stateful loop over a work item, checkpoints for human review, retries and idempotency at the boundaries. Fusion plugs in cleanly. The workflow node that today calls a coding-agent endpoint keeps its contract; the harness on the other side of that call is now hybrid-model. The classifier node, the human-approval branch, the persistence step, the queue — they do not change. The token bill drops. The main agent stays in the loop for the design-change turns; the sidekick handles the file walks and the test runs.

That is the same pattern we use in the n8n automation work we ship, and the production shape — retries, queues, wall-clock caps — is covered in our n8n on ECS Fargate load test.

How Fusion Compares to Sakana Fugu

Both launches point at the same trend: orchestration as a productised layer instead of a hand-written routing rule. They solve different pieces of it.

Aspect | Sakana Fugu | Devin Fusion
--- | --- | ---
Form factor | OpenAI-compatible model endpoint | Coding harness (Devin) with hybrid backend
Scope | General chat and tool-use tasks | Agentic coding: read, edit, test, PR
Routing unit | Whole task, sub-task within the model | Sub-task within the harness, across two agents
Cache strategy | Managed inside the orchestration model | Two persistent caches, one per agent, routed at seams
Ownership boundary | The API (you keep your SDK) | The harness (you use Devin's runner)

If you already have a coding agent embedded in a repository, Fusion is the near-term win. If you have a chat- or tool-use surface where the model choice was the recurring decision, Fugu is the shorter migration. The two are complementary, and we covered Fugu in more detail in the Sakana Fugu orchestration API write-up.

Risks and Pitfalls Worth Designing Around

  • Sidekick drift. The sidekick can silently do the wrong broad edit if the main agent's brief was vague. The fix is not a bigger sidekick — it is a stricter delegation contract. Keep the brief written, keep the diff small, keep the review turn.
  • Router opacity. Fusion does not surface a per-turn trace of which agent handled which step. For a regulated workload, insist on a route log before you wire it into the workflow.
  • Cost surprises on hard sessions. The 35% number is an average across FrontierCode Extended. A session that stays escalated to the main agent for its full length looks like an Opus 4.8 session on the bill. Set a per-session token cap.
  • Cache assumption. The cost story rests on the caches staying warm. If your workflow re-instantiates the harness per task, the cache advantage disappears. Batch related tasks into one Fusion session.
  • Vendor lock at the harness layer. You stop locking to one model, and start locking to Devin's harness. Treat the harness as the abstraction boundary; keep the underlying prompts and skills portable.
  • Fable 5 is not a shipping default. The 41% number used Fable 5 as the main agent. Access to Fable 5 was restricted in June. Plan against the Opus 4.8 / GPT-5.5 pairing that ships today.
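
Two of the mitigations above, the per-session cap and the route log, can be sketched as a thin wrapper kept on the caller's side. This is a minimal sketch with hypothetical names (`SessionBudget`, `BudgetExceeded`, `record`). It assumes your integration can observe, per step, a token count and which agent handled it; the article does not confirm that Fusion exposes either.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a session passes its token or wall-clock cap."""


class SessionBudget:
    def __init__(self, max_tokens, max_seconds, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock  # injectable so the cap can be tested
        self._started = clock()
        self.tokens_used = 0
        self.route_log = []  # one entry per observed step

    def record(self, agent, step, tokens):
        """Log one step, then stop the session if a cap is exceeded."""
        self.tokens_used += tokens
        elapsed = self._clock() - self._started
        self.route_log.append(
            {"agent": agent, "step": step, "tokens": tokens, "elapsed_s": elapsed}
        )
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens_used}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"wall-clock cap hit: {elapsed:.0f}s")


budget = SessionBudget(max_tokens=200_000, max_seconds=1800)
budget.record("sidekick", "read call sites", 12_000)
budget.record("main", "review diff", 3_000)
print(budget.tokens_used, len(budget.route_log))
```

The step is logged before the cap is checked, so the entry that tripped the limit is still in the log when someone debugs the session later.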

A Practical Adoption Plan

  1. Pick one repository, not the whole org. Start with a codebase where the CI is fast and the test set is trustworthy. Fusion's router leans on the test run to know when to escalate.
  2. Set a per-session token and wall-clock cap. The averages do not protect you from a bad session. Bound the blast radius before the first run.
  3. Log the route decision. Capture whichever telemetry the harness exposes about main-versus-sidekick turns. Without it, you cannot debug a regression a week later.
  4. Mirror one workflow behind a flag. Run Fusion side by side with your current single-model agent on the same tickets for a week. Compare cost, wall-clock, and merged-PR rate.
  5. Write the delegation contract down. The main agent's brief to the sidekick is the piece that decides quality. Treat it like a code-owner file.
  6. Batch related tasks in one session. The cache is the cost story. Group tickets that touch the same area so the caches carry across.
  7. Keep an exit plan. If Cognition changes the pricing or the runner, your fallback is the single-frontier agent you shipped last quarter. Keep it one config flip away.
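
The side-by-side comparison in step 4 comes down to three numbers per arm. A minimal sketch, assuming you record cost, wall-clock time, and merge outcome per ticket yourself; the `Run` record and both helpers are hypothetical, and the sample figures are invented to show the arithmetic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    ticket: str
    cost_usd: float
    wall_clock_s: float
    merged: bool


def summarize(runs):
    """Mean cost, mean wall-clock, and merged-PR rate for one arm."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_wall_clock_s": mean(r.wall_clock_s for r in runs),
        "merged_rate": sum(r.merged for r in runs) / len(runs),
    }


def compare(baseline, candidate):
    """Relative change of the candidate arm against the baseline arm.

    Assumes the baseline means are non-zero.
    """
    b, c = summarize(baseline), summarize(candidate)
    return {
        "cost_change": c["mean_cost_usd"] / b["mean_cost_usd"] - 1,
        "wall_clock_change": c["mean_wall_clock_s"] / b["mean_wall_clock_s"] - 1,
        "merged_rate_delta": c["merged_rate"] - b["merged_rate"],
    }


# Invented sample data: the same two tickets run through both arms.
single_model = [Run("T-1", 4.0, 600, True), Run("T-2", 6.0, 900, False)]
fusion = [Run("T-1", 3.0, 500, True), Run("T-2", 3.5, 700, True)]
print(compare(single_model, fusion))
```

Running both arms on the same tickets matters more than the sample size: it keeps task difficulty from leaking into the cost comparison.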

When to Reach for Fusion, When to Stay Single-Model

Reach for Fusion when the coding workload has a clear split between design turns and mechanical turns, and the cost of the mechanical turns is a line on your invoice. Stay single-model when the codebase is small enough that a frontier session costs less than the wiring effort, or when regulatory constraints demand a fully traceable per-turn model choice you cannot get from the harness today.

How It Fits the Halmob Stack

Most of what we ship at Halmob sits at the meeting point of mobile apps, n8n automation, and AI agent orchestration. Devin Fusion lands inside the coding-agent slot of that third column. For a client whose mobile team already runs an agent on pull requests, the migration is a harness swap and a session-cap rule. For a client whose n8n flow already delegates a coding step to Devin, Fusion changes the bill and the reliability of that step without changing the workflow around it.

The interesting shift is not the 35% saving. It is that "which model do we run this coding agent on" stops being a monthly decision the team owns. The harness owns it, at the sub-task grain, with the cache in mind. That is the same abstraction move we saw with Fugu — one layer up, packaged as a product.

The Bottom Line

Devin Fusion is the first coding harness that treats hybrid-model orchestration as the default shape rather than a routing optimisation bolted on top. Two agents, two caches, a router that only acts at the seams — that is a small piece of design and a large change in the cost curve of running a coding agent on real work. The 35% number is the headline. The 88% merged-PR number is the reason to plan around it.

The right question for the next sprint is no longer "which frontier model do we point the coding agent at." It is "does the sidekick pattern fit our repository, and where is the delegation contract written down." At Halmob we pair mobile development with n8n automation and AI agent orchestration so teams that want that question answered cleanly get a shorter answer.

For sources, see the smol.ai AINews newsletter coverage of the June 29, 2026 launch, the official Cognition Devin Fusion announcement, and the TLDR AI 2026-06-30 recap.