On June 24, 2026, Google made computer use a built-in capability of Gemini 3.5 Flash. The same model that already does function calling, Search grounding, and Maps now also watches a screen and returns clicks, taps, and keystrokes — across a browser tab, an Android phone, and a desktop window from a single API. For teams shipping mobile apps, n8n workflows, and AI agent automation, like the ones we build at Halmob, the mobile-control side of that release is the one that changes the most about what a production agent can actually do.
Until this release, controlling a phone from an agent was its own project. You wrote a per-app adapter, paid for a third-party device-cloud SDK, or pinned yourself to the standalone Gemini 2.5 computer use preview and held two model bindings open. The June 24 update collapses the three surfaces (browser, mobile, desktop) into one model call. The signal in the day's smol.ai AINews issue was not the headline benchmark; it was the Android quickstart that ships on day one.
The 30-Second Version
What Native Computer Use Actually Ships
Computer use is no longer a separate model binding. The same gemini-3.5-flash endpoint you call for chat, function calling, and Search grounding now accepts screenshots and returns UI-level actions. The model output is a structured function call — click, type, scroll, key_press, wait — that your client-side runner translates to the right primitive: a Chromium DevTools call for the browser, an adb input tap for Android, an OS driver for desktop. The contract is the same across all three.
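A minimal sketch of the client side of that contract, assuming the function call reaches your code as a dict with `name` and `args` keys. That shape, the `Action` class, and the argument names are illustrative assumptions rather than the documented Gemini response schema; only the five action names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Any

# The five action names listed above. The argument names are assumptions.
REQUIRED_ARGS: dict[str, tuple[str, ...]] = {
    "click": ("x", "y"),
    "type": ("text",),
    "scroll": ("direction",),
    "key_press": ("key",),
    "wait": ("seconds",),
}


@dataclass(frozen=True)
class Action:
    """A validated UI action, independent of the surface that will run it."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def parse_action(call: dict[str, Any]) -> Action:
    """Validate a raw function call (hypothetical shape) into an Action."""
    name = call.get("name")
    if name not in REQUIRED_ARGS:
        raise ValueError(f"unsupported action: {name!r}")
    args = dict(call.get("args") or {})
    missing = [key for key in REQUIRED_ARGS[name] if key not in args]
    if missing:
        raise ValueError(f"{name} is missing arguments: {missing}")
    return Action(name=name, args=args)
```

Validating before dispatch keeps the surface-specific runners simple: each one only ever sees an action it is expected to handle.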
Two things change because of the merge. First, the cost of trying computer use on a new surface drops to writing the runner — the model is already there. Second, an agent loop can hop surfaces inside a single session without re-authenticating against a different endpoint. The same turn can compare two prices in a browser, open an Android app to confirm an order, and click Send in a desktop email client.
The Mobile Story: adb, Emulators, and Real Devices
Philipp Schmid, who leads DevRel for Gemini, published an Android quickstart hours after the launch. It is the cleanest read on what actually works today:
1. One script spins up an emulator. A single shell command installs Android Studio command-line tools, creates an AVD, and boots the emulator headlessly. No GUI install, no SDK manager dance.
2. The agent loop is small. Screenshot via adb → send to Gemini with the interactions API → parse the function call → execute with adb shell input → repeat. The reference loop is under 200 lines of Python.
3. The same loop works on a real device. Swap the emulator for adb connect <ip>:5555 against a phone on your Wi-Fi, and the agent drives the physical device with no code change. The runner does not know the difference.
4. Function calls map 1:1 to adb input. click(x, y) becomes adb shell input tap x y; type(text) becomes adb shell input text; scroll becomes adb shell input swipe. Nothing exotic.
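The loop described in the list above can be sketched roughly as follows. This is not the quickstart's code. The `adb` subcommands used (`exec-out screencap -p`, `shell input tap`, `input text`, `input swipe`, `input keyevent`) are standard, but `ask_model` is a hypothetical callable standing in for the Gemini request, and the function-call dict shape and the scroll argument names are assumptions.

```python
import subprocess
import time
from typing import Callable, Optional

# Hypothetical wrapper around the model call: (png_bytes, goal) -> call or None.
ModelFn = Callable[[bytes, str], Optional[dict]]


def adb_prefix(serial: Optional[str] = None) -> list[str]:
    """Base adb command; -s targets one emulator or device by serial."""
    return ["adb", "-s", serial] if serial else ["adb"]


def to_adb_args(call: dict) -> list[str]:
    """Translate one function call (assumed shape) to `adb shell input` args."""
    name, args = call["name"], call.get("args", {})
    if name == "click":
        return ["shell", "input", "tap", str(args["x"]), str(args["y"])]
    if name == "type":
        # `input text` treats %s as a space. Other shell metacharacters
        # would still need escaping; that is omitted in this sketch.
        return ["shell", "input", "text", args["text"].replace(" ", "%s")]
    if name == "scroll":
        # Swipe from (x1, y1) to (x2, y2) over duration_ms milliseconds.
        return ["shell", "input", "swipe",
                str(args["x1"]), str(args["y1"]),
                str(args["x2"]), str(args["y2"]),
                str(args.get("duration_ms", 300))]
    if name == "key_press":
        return ["shell", "input", "keyevent", str(args["key"])]
    raise ValueError(f"no adb mapping for {name!r}")


def screenshot(serial: Optional[str] = None) -> bytes:
    """Capture the current screen as PNG bytes."""
    cmd = adb_prefix(serial) + ["exec-out", "screencap", "-p"]
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def run_agent(goal: str, ask_model: ModelFn,
              serial: Optional[str] = None, max_steps: int = 30) -> int:
    """Screenshot, ask the model, execute; stop when no call is returned."""
    for step in range(max_steps):
        call = ask_model(screenshot(serial), goal)
        if call is None:
            return step
        if call["name"] == "wait":
            time.sleep(float(call.get("args", {}).get("seconds", 1)))
            continue
        subprocess.run(adb_prefix(serial) + to_adb_args(call), check=True)
    return max_steps
```

After `adb connect <ip>:5555`, a networked device is addressed by that same `<ip>:5555` string, so passing it as `serial` should be the only difference between driving the emulator and driving a physical phone.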
This shape matters because it is the first time an Android agent runner can be written by one engineer in one afternoon. The MiniTap and AndroidWorld benchmarks we covered in the MiniTap mobile agents write-up established the evaluation harness; Gemini 3.5 Flash brings a production model into that harness.
How the Three Surfaces Compare
| Surface | Runner primitive | What changes vs the old standalone model |
|---|---|---|
| Browser | Chromium DevTools Protocol (Playwright, Puppeteer) | Same screenshot loop; no separate model binding |
| Android | adb shell input + screencap | First-class quickstart; works over USB and Wi-Fi adb |
| iOS | WebDriverAgent or XCTest (community runners) | No official runner; pattern is extensible but you write the bridge |
| Desktop (macOS, Windows, Linux) | OS accessibility APIs or a screen-driver SDK | Same function-call vocabulary; you supply the executor |
The iOS row is the honest one. The launch covers Android natively because adb is a stable, programmable surface that Google controls. iOS automation still goes through WebDriverAgent or a similar community bridge, and the agent does not care — the function calls are identical. The work moves from "teach the model the platform" to "wire the runner to the platform."
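One way to picture "wire the runner to the platform" is a small interface that every surface implements. The sketch below is an assumption about how you might structure it: `Runner`, `RecordingRunner`, `IOSRunner`, and `replay` are hypothetical names, and the iOS class is deliberately left as a stub because the bridge is yours to write.

```python
from typing import Protocol


class Runner(Protocol):
    """What the agent loop needs from any surface (names are illustrative)."""

    def screenshot(self) -> bytes: ...

    def execute(self, call: dict) -> None: ...


class RecordingRunner:
    """In-memory stand-in that records calls; handy for testing the loop."""

    def __init__(self, frame: bytes = b"") -> None:
        self.frame = frame
        self.calls: list[dict] = []

    def screenshot(self) -> bytes:
        return self.frame

    def execute(self, call: dict) -> None:
        self.calls.append(call)


class IOSRunner:
    """Placeholder for a WebDriverAgent or XCTest bridge you would write."""

    def screenshot(self) -> bytes:
        raise NotImplementedError("wire this to your iOS bridge")

    def execute(self, call: dict) -> None:
        raise NotImplementedError("wire this to your iOS bridge")


def replay(runner: Runner, calls: list[dict]) -> int:
    """Send the same function calls to any runner; returns how many ran."""
    for call in calls:
        runner.execute(call)
    return len(calls)
```

Because the loop depends only on `screenshot` and `execute`, swapping Android for iOS or desktop changes which class you construct and nothing else.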
Why This Lands Now
Three things converged in June 2026 to make this release more than a feature flag. Anthropic shipped dynamic workflows with 1000 subagents for code; Sakana shipped Fugu multi-model orchestration for routing; and Hermes shipped /learn for capturing procedures as skills. All three assumed the agent had a reliable way to act on a UI surface. Native computer use in a commodity model is the missing layer underneath those releases — the execution arm for everything the orchestration layer decides.
Orchestration decides which agent runs. Skills decide what it knows how to do. Computer use decides what it can actually touch.
The Build Pattern We Use at Halmob
For mobile development engagements that need agent-driven QA or operations, the pattern we land on with Gemini 3.5 Flash computer use looks like this:
1. One adb runner per device pool. A small Python or Node service that owns the adb connection, takes screenshots, and executes function calls. Keep it dumb. The model lives outside.
2. One n8n workflow per business procedure. The trigger is whatever drives the procedure: a webhook, a queue, or a schedule. The action is a call into the runner with a goal and a starting screenshot. We documented the production shape of this pattern in the n8n on AWS ECS Fargate load test.
3. One skill per repeatable step. When a procedure is run more than once a week, lift it into a skill so the agent does not re-derive it from screenshots every time. /learn writes the SKILL.md for you.
4. One audit log per session. Every screenshot, every function call, every adb command. Computer use without an audit log is a liability. Treat the log the way you treat database query logs.
5. One human-in-the-loop checkpoint per write. The model is good enough to click buttons. It is not good enough to click Confirm Payment without a human approval gate the first thirty times. Start with approval on every write and shrink the set of gated actions over time, rather than starting ungated and adding gates later.
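The audit log and the approval checkpoint from the list above can share one wrapper around the runner. This is a sketch under assumptions: the `label` argument is something your runner would have to attach (the text above does not say the model's function call carries one), and `WRITE_LABELS`, `audit`, and `guarded_execute` are hypothetical names.

```python
import json
import time
from typing import Callable, Iterable, TextIO

# Hypothetical labels that must never be clicked without a person approving.
WRITE_LABELS = frozenset({"confirm payment", "send", "delete"})


def needs_approval(call: dict,
                   write_labels: Iterable[str] = WRITE_LABELS) -> bool:
    """True when a click targets a label on the approval list.

    Assumes the runner attaches a `label` for the element under the click.
    """
    if call.get("name") != "click":
        return False
    label = str(call.get("args", {}).get("label", "")).strip().lower()
    return label in set(write_labels)


def audit(log: TextIO, session_id: str, event: str, payload: dict) -> None:
    """Append one JSON line per event so a session can be reviewed later."""
    record = {"ts": time.time(), "session": session_id,
              "event": event, **payload}
    log.write(json.dumps(record, sort_keys=True) + "\n")
    log.flush()


def guarded_execute(call: dict,
                    execute: Callable[[dict], None],
                    approve: Callable[[dict], bool],
                    log: TextIO,
                    session_id: str) -> bool:
    """Log the call, ask a human for write actions, then run it if allowed."""
    audit(log, session_id, "proposed", {"call": call})
    if needs_approval(call) and not approve(call):
        audit(log, session_id, "rejected", {"call": call})
        return False
    execute(call)
    audit(log, session_id, "executed", {"call": call})
    return True
```

Shrinking the approval surface then becomes a data change: remove a label from the list once the agent has earned trust on that action, and the log shows exactly when that happened.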
Where This Replaces Older Stacks
| What it used to take | What it takes now | Where the savings show up |
|---|---|---|
| Per-app Espresso or XCUITest suites | One adb runner + Gemini 3.5 Flash agent loop | QA coverage scales without per-screen test code |
| Third-party mobile device-cloud SDK | adb over Wi-Fi to your own device shelf | Spend moves from per-minute device-time to your own hardware |
| Standalone Gemini 2.5 computer use binding | Native 3.5 Flash tool call | One model binding, one bill, one rate limit |
| Custom screen-scraping for a vendor portal | Browser surface of the same agent | Same code path for browser, Android, desktop tasks |
The Limits Worth Knowing Before You Ship
- Coordinate precision is not pixel-perfect. Gemini returns a target point; the runner has to handle near-misses with a retry-and-relocate loop. Treat click(x, y) as a hint, not a primitive.
- Long sessions drift. A 200-step session on a phone accumulates more screen state than the model can hold in context. Checkpoint with a fresh screenshot every N steps and let the loop re-orient.
- Latency adds up. Screenshot upload + model inference + adb execute is roughly 1.5–3 seconds per step in our measurements, so a 30-step procedure takes about 45 to 90 seconds, not one.
- OAuth, 2FA, and CAPTCHA are still hard. The model can drive the form; it cannot read the SMS code your bank sent. Wire those through your runner, not the agent.
- App version drift is silent. An agent run that worked yesterday can fail tomorrow because the in-app onboarding shifted. Pin the test pool to a known app version and rev deliberately.
- Public preview, not GA. The feature is live in the Gemini API and the Gemini Enterprise Agent Platform as a public preview. Treat SLAs and pricing as subject to change before GA.
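Three of the limits above (coordinate near-misses, session drift, and latency) lend themselves to small helpers in the runner. The sketch below is illustrative only: it assumes an unchanged screenshot means the tap missed, which will not hold on screens with animations or clocks, and `relocate` is a hypothetical hook for asking the model for fresh coordinates.

```python
import hashlib
from typing import Callable


def estimate_seconds(steps: int, low: float = 1.5,
                     high: float = 3.0) -> tuple[float, float]:
    """Wall-clock range for a procedure, using the per-step range above."""
    return steps * low, steps * high


def should_checkpoint(step: int, every: int = 20) -> bool:
    """True on every Nth step, when the loop should re-orient itself."""
    return step > 0 and step % every == 0


def click_with_retry(call: dict,
                     screenshot: Callable[[], bytes],
                     execute: Callable[[dict], None],
                     relocate: Callable[[dict], dict] = lambda c: c,
                     max_attempts: int = 3) -> bool:
    """Treat click(x, y) as a hint: retry when the screen did not change."""
    before = hashlib.sha256(screenshot()).hexdigest()
    for _ in range(max_attempts):
        execute(call)
        after = hashlib.sha256(screenshot()).hexdigest()
        if after != before:
            return True
        # Nothing changed, so assume a miss and let the caller re-target.
        call = relocate(call)
    return False
```

`estimate_seconds(30)` returns `(45.0, 90.0)`, which is the 30-step figure from the latency bullet expressed as a range.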
How This Fits Halmob's Stack
Most of what we build at Halmob sits at the meeting point of mobile apps, n8n automation, and AI agent orchestration. Native computer use in Gemini 3.5 Flash lands inside the mobile-execution side of that triangle. The agent loop is what we already build for n8n; the runner is what we already build for mobile; the model is now a single endpoint either side can call. For OpenClaw consultancy engagements, this is the piece that lets us answer "can the agent actually do that on a phone" with a yes instead of a maybe.
The consumer side of Gemini on Android — the one we covered in the Gemini Intelligence on Android write-up — gives end users a proactive assistant. The developer side we are covering here gives engineering teams the API to build their own. The two sit on the same model; the difference is who owns the loop.
When to Reach for Computer Use and When Not To
A Simple Decision Rule
The Bottom Line
The native release is the difference between an interesting demo and a stack you can build a product on. The model is the same one your function-calling agent already uses, the Android quickstart works on the first try, and the runner is small enough to own. The mobile agents we build at Halmob have been waiting for exactly this shape: one model, three surfaces, an adb-shaped contract for the phone.
The right question for the next sprint is not "should we add a computer use agent." It is "which of our existing procedures already depends on someone clicking through a UI that nobody wants to maintain a scraper for, and can we put the agent there first?" At Halmob we pair mobile development with n8n automation and AI agent orchestration so the answer comes back as one engagement, not three.
For sources, see the smol.ai AINews coverage of the June 2026 Gemini release, the Google announcement for computer use in Gemini 3.5 Flash, the official Gemini 3.5 release notes, and Philipp Schmid's Android quickstart thread.