Minitap Mobile-Use: 100% AndroidWorld Mobile AI Agents

Minitap mobile-use is the open-source multi-agent framework that hit 100% on AndroidWorld. How its six-agent design changes mobile automation.

İlker Ulusoy 2026-06-23 9 min read

The latest smol.ai AINews issues kept naming the same surprise: an open-source mobile-agent stack from a tiny Paris team climbed to the top of Google DeepMind's AndroidWorld leaderboard and then went past the human baseline. The project is Minitap mobile-use, a six-agent system that drives real Android and iOS apps the way a person does. For teams shipping mobile apps, n8n automation, and AI agent orchestration, which is the work we do at Halmob, this is the first serious open template for an agent that lives inside the phone, not next to it.

Minitap's claim is concrete: 100% on AndroidWorld's 116 tasks, 31-second average task time, and the whole architecture released as an Apache-licensed repo. The wider story is what the design choices say about where mobile AI agents are heading in the second half of 2026 — away from single-prompt screen agents and toward small, specialised loops that route control between each other.

The 30-Second Version

Minitap mobile-use is an open-source multi-agent framework that controls real Android and iOS apps. It scored 100% on AndroidWorld by splitting the job across six specialised agents — Planner, Orchestrator, Contextor, Cortex, Executor, and Screen Analyzer. The interesting part is not the score; it is that a six-agent split is now cheaper and more reliable than one giant screen-reading prompt.

What Minitap Mobile-Use Actually Is

Mobile-use is an open-source framework, published by Minitap, that lets an AI agent operate a real phone — tapping, scrolling, typing, reading the screen — across stock Android and iOS apps. Underneath, it is a multi-agent system: one agent decomposes the task, another picks the next action, a third verifies what is actually on screen, and so on. The user-facing surface is a single goal in natural language; the system splits the work behind the scenes.

The framework sits on top of Minitap Cloud, the company's commercial device farm, but the agent itself runs on any device the operator can reach. That separation matters. The same loop that ranks #1 on AndroidWorld is the loop a Halmob client can run against their own staging build of a React Native or Swift app, on a real device, from a CI pipeline.

Why a Six-Agent Split Beats One Big Prompt

Single-agent screen drivers have been around for two years and have a well-known failure mode: the context window fills up with screen dumps, the model loses track of the plan, and one wrong tap cascades. Minitap's answer is task decomposition. Each agent owns a narrow context and a narrow contract.

| Agent | Job | What it owns |
| --- | --- | --- |
| Planner | Turn the user goal into ordered subgoals | The high-level plan, nothing about pixels |
| Orchestrator | Dispatch the next subgoal to the right specialist | Routing decisions and progress tracking |
| Contextor | Keep a compact summary of what has happened | Long-term memory of the run |
| Cortex | Choose the next concrete action | Tap, type, swipe, or stop |
| Executor | Send the action to the device and confirm side effects | Idempotency and verification |
| Screen Analyzer | Parse the current view into structured elements | Visual grounding only |

Each row is a separate loop with its own prompt and its own retries. The team reports that this split cuts average task time by more than half versus a single-agent baseline (31 seconds versus 68, a reduction of about 54%), which is the opposite of what most people expect, since more agents usually mean more latency. The math works because small contexts let smaller, cheaper models drive most steps, and only the hard ones escalate.
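
The escalation idea can be sketched in a few lines. This is a minimal illustration, not Minitap's implementation: the `Decision` type, the confidence score, the 0.8 threshold, and both stub models are assumptions made up for the example. The source only says that cheaper models drive most steps and hard ones escalate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str
    confidence: float  # assumed score in [0.0, 1.0] returned by a model wrapper
    model: str         # which tier produced the decision


def choose_action(
    context: str,
    small: Callable[[str], Decision],
    large: Callable[[str], Decision],
    threshold: float = 0.8,  # hypothetical cut-off, tune per flow
) -> Decision:
    """Ask the cheap model first; escalate only when it is unsure."""
    first = small(context)
    if first.confidence >= threshold:
        return first
    return large(context)


# Stub models standing in for real LLM calls (purely illustrative).
def small_model(context: str) -> Decision:
    confidence = 0.95 if "button" in context else 0.4
    return Decision("tap", confidence, "small")


def large_model(context: str) -> Decision:
    return Decision("scroll", 0.9, "large")
```

With these stubs, a context that mentions a button stays on the small model, and anything else is handed to the large one. The narrow context per agent is what makes the cheap first attempt viable.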

That shape will look familiar if you read our write-up on the Executor-Advisor pattern or the deeper one on Sakana Conductor multi-agent orchestration. Different domains, same lesson: split the loop before you scale the model.

Why This Matters For Mobile Teams

Most of the mobile AI work shipped in 2024 and 2025 was an LLM bolted to the side of an app — chat in a sheet, summarise in a callout. The next wave is the inverse: an agent that uses the whole app the way a human does, on the phone, end-to-end. Minitap is the first credible open implementation of that, and it changes three concrete things for teams shipping mobile features.

  • QA stops being a bottleneck. Regression flows that took a manual tester twenty minutes — onboarding, checkout, ride booking, food order — now run in under a minute on a device farm. We sketched a related shape for n8n in our n8n YOLOv26 edge device automation piece.
  • Accessibility tooling gets a real backbone. An agent that can drive any app turns into a screen reader replacement for tasks that screen readers cannot complete. The same plumbing is the foundation for true voice control on a phone.
  • Operations agents can finally live on mobile. The Hermes Workspace and Apple Poke patterns we covered handle approvals on a phone; Minitap handles the doing. Together, they sketch what an iPhone-native operations console looks like in 2027.

The Architecture in Production Shape

Stripping the marketing off, Minitap's production stack maps cleanly to the loop most agent teams already operate. The six-agent decomposition lives inside step three of the reference loop we walked through in the loop engineering write-up. Everything else — durable state, retries, schema-validated tool calls — sits around it.

  1. Receive the goal. A natural-language task lands from a webhook, a UI, or a parent agent. Persist it with a stable ID before the phone wakes up.
  2. Acquire a device. Pull an Android or iOS device from the pool. Reset its state to a known baseline so the run is reproducible.
  3. Run the multi-agent loop. Planner emits subgoals, Orchestrator dispatches, Cortex picks the action, Executor performs it, Screen Analyzer parses the result, Contextor compresses history. Loop until done or stuck.
  4. Verify the side effect. A booking that did not generate a confirmation page did not happen. The verifier is part of the agent, not part of the harness.
  5. Emit a trace. Screenshot timeline, per-step decisions, tool calls. The trace is the artifact the operator inspects when a flow regresses.
  6. Release the device. Snapshot the final state, push it to the queue, and reset for the next run.
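
Step three of the list above can be sketched as a toy loop. Every function here is a hypothetical stub named after the agent role it stands in for. None of it is the mobile-use API, and the "screen" is a plain dictionary, not a real device.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    goal: str
    subgoals: list = field(default_factory=list)
    summary: str = ""                      # Contextor's compact memory
    actions: list = field(default_factory=list)


def planner(goal: str) -> list:
    # Planner: goal -> ordered subgoals (stub: split on " then ").
    return [part.strip() for part in goal.split(" then ")]


def screen_analyzer(screen: dict) -> list:
    # Screen Analyzer: raw view -> structured element labels.
    return sorted(screen.get("elements", []))


def cortex(subgoal: str, elements: list, summary: str) -> str:
    # Cortex: pick one concrete action. This stub ignores the summary;
    # a real Cortex would condition on it.
    for element in elements:
        if element in subgoal:
            return f"tap:{element}"
    return "stop"


def executor(action: str, screen: dict) -> bool:
    # Executor: perform the action and report whether it took effect.
    if action == "stop":
        return False
    screen.setdefault("tapped", []).append(action.split(":", 1)[1])
    return True


def contextor(summary: str, subgoal: str, action: str) -> str:
    # Contextor: fold the latest step into a short running summary.
    prefix = summary + "; " if summary else ""
    return prefix + f"{subgoal} -> {action}"


def orchestrate(goal: str, screen: dict) -> RunState:
    # Orchestrator: dispatch each subgoal and track progress.
    state = RunState(goal=goal, subgoals=planner(goal))
    elements = screen_analyzer(screen)
    for subgoal in state.subgoals:
        action = cortex(subgoal, elements, state.summary)
        if not executor(action, screen):
            break                          # stuck: stop and let the harness decide
        elements = screen_analyzer(screen) # parse the result of the action
        state.actions.append(action)
        state.summary = contextor(state.summary, subgoal, action)
    return state
```

The point of the sketch is the shape: each role sees only the inputs it needs, so any one of them can be swapped for a model call without touching the others.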

Steps one, two, five, and six are exactly what n8n is good at. We covered the production shape of that under load in our n8n on ECS Fargate load test. Steps three and four are where Minitap earns its score. The discipline is keeping the seam between them clean so you can swap the model — or even the framework — without rebuilding the harness.
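
One way to keep that seam clean is to put the agent behind a small interface and let the harness own everything else. The sketch below assumes in-memory stores and a hypothetical `AgentRunner` protocol. It is not the mobile-use or n8n API, only an illustration of where the boundary could sit.

```python
import uuid
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RunResult:
    verified: bool                         # step 4: did the side effect happen?
    trace: list = field(default_factory=list)


class AgentRunner(Protocol):
    # The seam: any framework that offers this method can be plugged in.
    def run(self, goal: str, device: str) -> RunResult: ...


class DevicePool:
    def __init__(self, devices: list) -> None:
        self._free = list(devices)

    def acquire(self) -> str:
        return self._free.pop(0)           # a real pool would also reset state

    def release(self, device: str) -> None:
        self._free.append(device)

    def free_count(self) -> int:
        return len(self._free)


def run_job(goal: str, agent: AgentRunner, pool: DevicePool,
            jobs: dict, traces: dict) -> str:
    job_id = str(uuid.uuid4())
    jobs[job_id] = goal                    # step 1: persist before touching a device
    device = pool.acquire()                # step 2: acquire a device
    try:
        result = agent.run(goal, device)   # steps 3 and 4 live behind the seam
        traces[job_id] = result.trace      # step 5: emit the trace
    finally:
        pool.release(device)               # step 6: always return the device
    return job_id


class EchoAgent:
    # Stand-in agent used only to exercise the harness.
    def run(self, goal: str, device: str) -> RunResult:
        return RunResult(verified=True, trace=[f"{device}: {goal}"])
```

Because `run_job` depends only on the protocol, replacing `EchoAgent` with a different model or framework leaves the persistence, device handling, and trace storage untouched.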

Single-agent screen drivers fail on the third tap. Six small agents with narrow contracts get to a hundred.

How It Compares With The Other 2026 Mobile-Agent Bets

Minitap is not alone. The June 2026 newsletters tracked three other mobile-agent shapes worth holding next to it. The differences are about who owns the runtime — the OS, the app, the device farm, or the workspace.

| Project | Who owns the runtime | Best fit |
| --- | --- | --- |
| Minitap mobile-use | Device farm + open framework | End-to-end task automation, QA, accessibility |
| Apple Poke (iMessage Business) | The OS, via Messages | Approval-style agents, customer-facing concierge |
| Hermes Workspace Mobile | Multi-agent workspace on the phone | Long-running operations consoles for engineers |
| Vercel AI SDK 6 mobile agents | The app developer, inside their own RN/Swift app | In-app assistants for product flows |

We covered the other three in the Apple Poke iMessage write-up, the Hermes Workspace mobile orchestration piece, and the Vercel AI SDK 6 mobile agents post. They are complements, not competitors — most production setups will pick two of the four, not one of them.

Where The 100% Score Is Misleading

AndroidWorld is a curated benchmark with 116 verifiable tasks across 20 apps. Hitting 100% is a real achievement, but it is not the same as "your customer's app, today." The benchmark has three friendly properties that the wild does not.

  • Programmatic success verification. Each task has a unit test that confirms it. Real apps rarely do. A flow that "completes" without setting the right server-side state is silently wrong.
  • Stable UI surfaces. Benchmark apps do not push A/B tests, dark patterns, or unexpected paywalls. Production apps do. An agent that scores 100 on the benchmark may still need a human-in-the-loop for the first hundred runs against a live app.
  • No regulatory friction. The benchmark does not include flows that demand SMS OTP, biometric prompts, or KYC. Anything financial, medical, or government will hit a wall the agent cannot pass on its own.
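
The first gap, a flow that looks complete on screen but never reached the server, is something a team has to check for itself. A minimal sketch follows, assuming a hypothetical `FlowOutcome` record and a set of order IDs fetched from the backend. Both are illustrative and not part of AndroidWorld or mobile-use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowOutcome:
    ui_shows_confirmation: bool   # what the agent saw on screen
    order_id: Optional[str]       # ID scraped from the confirmation page, if any


def verify_booking(outcome: FlowOutcome, backend_orders: set) -> str:
    """Classify a run by comparing the UI result with server-side state."""
    if not outcome.ui_shows_confirmation:
        return "failed"
    if outcome.order_id is None or outcome.order_id not in backend_orders:
        # The screen said "done" but the server disagrees.
        return "silently_wrong"
    return "verified"
```

The `silently_wrong` bucket is the one the benchmark never produces, because its unit tests already check the final state.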

Read The Score Right

100% on AndroidWorld means "the architecture can solve every task the benchmark asks." It does not mean "ship it against your bank's app and walk away." The real production work is the harness — observability, approvals, idempotency — that sits around the agent.

What To Do This Quarter If You Ship A Mobile Product

  1. Run the open framework against your own staging build. Pick the three flows your QA team runs most often. Wire them up as natural-language goals. See where the agent breaks. Each break is a backlog item with a clear ROI number.
  2. Decide who owns the trace. Screenshots, decisions, tool calls. Without the trace, every failure looks identical. Treat the trace store as a first-class service, not as logs.
  3. Add approval gates at the side-effect boundary. Payments, account changes, irreversible writes. The agent runs the flow; a human signs the last step until the regression rate is known.
  4. Pick one flow to convert to a permanent automation. Onboarding smoke test, in-app purchase regression, or a daily QA loop. Schedule it on every build, route the trace to chat, and watch the dashboard for two weeks before you trust it.
  5. Plan the human handoff. When the agent stops, where does the work go? If the answer is "nowhere, it just fails," the pilot will be cancelled by the second week.
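
Items two, three, and five above fit together in one small pattern: record every step, pause at irreversible actions, and return an explicit handoff status when approval is withheld. The sketch below uses made-up action names and an `approve` callback. It is an assumption about how a team might wire this, not something the framework prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed names for actions that cross the side-effect boundary.
IRREVERSIBLE = {"pay", "delete_account", "change_email"}


@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, action: str, status: str) -> None:
        self.steps.append({"action": action, "status": status})


def run_with_gates(actions: list, approve: Callable[[str], bool],
                   trace: Trace) -> str:
    """Run actions in order, pausing for a human at irreversible ones."""
    for action in actions:
        if action in IRREVERSIBLE and not approve(action):
            # No silent failure: the trace says where the work was handed off.
            trace.record(action, "handed_off")
            return "handed_off"
        trace.record(action, "done")
    return "completed"
```

In practice `approve` would post to a chat channel or ticket queue and wait for a response. Here it is a plain callable so the control flow stays visible.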

When Mobile-Use Is The Wrong Tool

Mobile-use is the right tool when the agent has to drive real apps end-to-end — QA, accessibility, repetitive operations. It is the wrong tool when the same job can be done through a documented API. An agent tapping a checkout button is impressive; an agent calling the checkout endpoint is cheaper, faster, and easier to monitor. Reach for screen control after you have ruled out the API.

How It Fits The Halmob Stack

The work we ship at Halmob lives at exactly the intersection Minitap targets: mobile apps wired through n8n orchestrations into AI agent workflows. Minitap is the part of the stack that turns a flaky two-tester QA cycle into a one-minute automated run, and turns a backlog of "please tap through this one more time" into something a workflow can own overnight.

For iOS and Android teams the next step is small: clone the mobile-use repo, point it at a staging build of a flow your team already runs by hand, and watch where the agent stops. The first three stops are the roadmap. The team that does that this quarter will own the testing and operations pipeline that everyone else writes off as "too brittle for AI" for another twelve months.

The Bottom Line

Minitap mobile-use is the open template the mobile-agent space was missing. It combines a six-agent split, an Apache licence, a benchmark result that proves the architecture is real, and an API clean enough that a competent team can stand it up in a week. The teams that internalise it this quarter will ship the mobile agents that the rest of the industry is still arguing about in slide decks.

The right question is not "does the agent score 100 on the benchmark." It is "which one of our existing flows is one harness change away from being driven by it." At Halmob we pair mobile development with n8n automation and AI agent orchestration for teams that want that question to have a short answer.

For sources, see the smol.ai AINews newsletter coverage of mobile agents and AndroidWorld, the Minitap mobile-use repository on GitHub, and the underlying paper "Do Multi-Agents Dream of Electric Screens?" on arXiv.