Every modality will browse.

Our human autonomy is predicated on our ability to sense, navigate and discern the way we do. We execute based on what we see, hear, and sense in our real, material world.

Agency is more or less the doctrine of mankind as we know it.

Agents have access to vision, audio, and text inputs/outputs. But access alone is not a robust enough policy or learning framework for true navigation and discernment through all modalities.

Your models still have no clue how to intuitively map out and synthesize these contexts in a way that’s optimal. Browser infrastructure is that map. The map that agents need to truly understand the world they’re navigating.

Browsers are the breeding ground for infinite knowledge and true continual learning for humans and computers alike. All modalities will soon explore the world wide web.

Agents need to understand real world interaction.

The browser as an interaction environment

The dynamic web, the live web, is intuitively the correct environment for multimodal agents. It’s visual, textual, structured, dynamic, and adversarially inconsistent. By nature, this makes it the most powerful training ground.

A few months back, we collaborated with Prime Intellect to create browserenv.com, for training and evaluating computer use agents (CUAs) on the dynamic web. There, the same task may require reading text, interpreting layout, using visual hierarchy, following a workflow, and deciding when not to act.

For multimodal tasks like this, browser environments can expose many views of the same state:

  • screenshot → what the user sees
  • DOM → what the page contains
  • accessibility tree → what the user can interact with
  • network events → what changed
  • URL/session state → where the agent is
  • replay → how the agent got there
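
As a rough sketch, the views above could be bundled into one observation record per step. Every type and field name here is hypothetical and chosen for illustration; none of it is taken from browserenv.com or a Browserbase API.

```typescript
// Hypothetical accessibility node and network event shapes (illustrative only).
interface AxNode {
  role: string;
  name: string;
  disabled?: boolean;
}

interface NetworkEvent {
  method: string;
  url: string;
  status: number;
}

// One browser step seen through several views of the same state.
interface BrowserObservation {
  screenshotPng?: Uint8Array;     // what the user sees
  domHtml?: string;               // what the page contains
  accessibilityTree?: AxNode[];   // what the user can interact with
  networkEvents?: NetworkEvent[]; // what changed
  url?: string;                   // where the agent is
  replayId?: string;              // how the agent got there
}

type ViewName = keyof BrowserObservation;

// Report which views are present so a consumer knows what it can rely on.
function availableViews(observation: BrowserObservation): ViewName[] {
  const names: ViewName[] = [
    "screenshotPng",
    "domHtml",
    "accessibilityTree",
    "networkEvents",
    "url",
    "replayId",
  ];
  return names.filter((name) => observation[name] !== undefined);
}

const sampleObservation: BrowserObservation = {
  url: "https://example.com/checkout",
  domHtml: "<button disabled>Pay</button>",
  accessibilityTree: [{ role: "button", name: "Pay", disabled: true }],
};

console.log(availableViews(sampleObservation));
// ["domHtml", "accessibilityTree", "url"]
```

Keeping every view optional lets the same record describe a cheap text-only step and a full multimodal step.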

Each modality matters. Predictable visual states (recordings, movies, playbacks) are useful, but the novelty of a live browser makes it a true agentic unlock. A live browser is unpredictable in a way recordings and snapshots aren't, making it a richer observation space than anything that's not live.

There’s been plenty of discussion around how vision will matter enormously going forward for CUAs. Many UI states are only obvious visually: disabled buttons, hidden menus, loading states, canvas-based charts, visual editors, drag-and-drop affordances, map regions, and generated layouts. We’ve read papers like Pix2Struct, ScreenAI, SeeClick, and Ferret-UI, which indicate how screens and user interfaces are becoming serious multimodal pretraining surfaces.

But the structured state is too valuable to ignore. The DOM and accessibility tree can make agents more reliable, cheaper to run, easier to evaluate, and easier to debug. The most capable systems are likely to be hybrid: vision-heavy where visual grounding is needed, structure-heavy where precision matters, and trace-aware when learning from prior behavior (Browserbase’s Stagehand is exactly this, more on that further down).
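
One way to picture the hybrid idea is a small router that picks a grounding mode per step. This is a minimal sketch under assumed inputs: the `StepContext` fields and the priority order are hypothetical, and it does not describe how Stagehand decides internally.

```typescript
type GroundingMode = "vision" | "structure" | "trace";

// Hypothetical signals an agent might have about the next step.
interface StepContext {
  isVisualOnly: boolean;      // e.g. canvas chart, map region, drag-and-drop
  hasStableSelector: boolean; // DOM or accessibility tree exposes the target
  hasPriorTrace: boolean;     // a recorded trajectory exists for this workflow
}

// Assumed priority: visual-only targets need vision, precise targets use
// structure, and a prior trace is the fallback before defaulting to vision.
function chooseGrounding(context: StepContext): GroundingMode {
  if (context.isVisualOnly) return "vision";
  if (context.hasStableSelector) return "structure";
  if (context.hasPriorTrace) return "trace";
  return "vision";
}

console.log(
  chooseGrounding({ isVisualOnly: false, hasStableSelector: true, hasPriorTrace: true }),
); // "structure"
```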

The browser is one of the few environments where all of these modalities can be captured together. Your agent learns how to roll with the punches when it’s trained on unpredictability.

That’s why it needs to train on the live web.

Roadmapping the browser environment

1. Agent cohorts need browser workspaces

Beyond HCI (human-computer interaction), agent-to-agent communication inherently relies on concurrent browser sessions.

Multi-agent systems like AutoGen, CAMEL, and MetaGPT point toward workflows where specialized agents collaborate. In a computer-use setting, those agents need separate browser contexts.

One agent keeps the conversation moving. Another verifies sources. Another compares products. Another reproduces a bug. Another tests a generated UI. Another monitors a dashboard. Another handles a logged-in workflow.

Each browser context likely needs separate credentials, location, viewport, cookies, permissions, and traceability. So here, browser concurrency enables parallel, isolated workspaces for agent cohorts.
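
A minimal sketch of what one workspace per agent could look like as plain configuration. The `WorkspaceConfig` shape, the default values, and the `secrets/...` naming are all assumptions made for illustration; a real setup would hand these values to whatever creates the browser session.

```typescript
// Hypothetical per-agent workspace description (illustrative field names).
interface WorkspaceConfig {
  agentRole: string;
  credentialsRef: string; // a reference to a secret, never the secret itself
  geolocation: { latitude: number; longitude: number };
  viewport: { width: number; height: number };
  cookies: { name: string; value: string; domain: string }[];
  permissions: string[];
  traceId: string;        // ties every action back to one agent and one run
}

// Build one isolated workspace per role. Each workspace gets fresh objects,
// so state written by one agent cannot leak into another.
function createCohortWorkspaces(roles: string[], runId: string): WorkspaceConfig[] {
  return roles.map((role, index) => ({
    agentRole: role,
    credentialsRef: `secrets/${runId}/${role}`,
    geolocation: { latitude: 0, longitude: 0 },
    viewport: { width: 1280, height: 720 },
    cookies: [],
    permissions: [],
    traceId: `${runId}-${index}-${role}`,
  }));
}

const cohort = createCohortWorkspaces(["source-verifier", "bug-reproducer"], "run42");
console.log(cohort.map((workspace) => workspace.traceId));
// ["run42-0-source-verifier", "run42-1-bug-reproducer"]
```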

This is what seamlessly ties together the collaboration between human-facing and background agents.

2. “Browsing” as a tool or environment?

Trick question. You can pick both.

A useful interaction agent must be fast enough to feel conversational, deep enough to solve real tasks, cheap enough to run repeatedly, safe enough to trust, and observable enough to improve.

This is the difference between a browser as a tool and a browser as an environment. A tool executes an action, while an environment lets the model experience consequences.
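
The distinction can be sketched in types. A tool would be a function that takes an action and returns nothing; an environment returns the consequence of every step. The interface and the toy login page below are hypothetical stand-ins, not a real browser binding.

```typescript
type AgentAction =
  | { kind: "click"; target: string }
  | { kind: "type"; target: string; text: string };

// What the environment hands back after each action.
interface StepResult {
  observation: string; // stand-in for a real multi-view observation
  succeeded: boolean;
  done: boolean;
}

interface BrowserEnvironment {
  reset(): string;
  step(action: AgentAction): StepResult;
}

// Toy page: the submit button stays disabled until a username is typed.
class ToyLoginEnv implements BrowserEnvironment {
  private username = "";

  reset(): string {
    this.username = "";
    return "login form, submit disabled";
  }

  step(action: AgentAction): StepResult {
    if (action.kind === "type" && action.target === "username") {
      this.username = action.text;
      const state = this.username === "" ? "disabled" : "enabled";
      return { observation: `login form, submit ${state}`, succeeded: true, done: false };
    }
    if (action.kind === "click" && action.target === "submit") {
      if (this.username === "") {
        // The consequence is visible: the click did nothing.
        return { observation: "login form, submit disabled", succeeded: false, done: false };
      }
      return { observation: "dashboard", succeeded: true, done: true };
    }
    return { observation: "unknown target", succeeded: false, done: false };
  }
}

const env = new ToyLoginEnv();
env.reset();
console.log(env.step({ kind: "click", target: "submit" }).succeeded); // false
env.step({ kind: "type", target: "username", text: "ada" });
console.log(env.step({ kind: "click", target: "submit" }).done); // true
```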

Browser automation has historically meant making an objective workflow pass:

Click this selector → Fill this form → Download this report → Test this checkout flow → Scrape this page. Repeat until reliable.

That work still matters, but computer use models create a broader infrastructure need.

The question is no longer only how to automate a given browser task. It is now this: Can we make a given workflow observable, learnable, repeatable, safe, and useful to models? When it fails, how will it pick itself up and improve?

This requires a different type of artifact.

For interaction agents, the consequences are kind of the point. Any given modality may prompt a decision about whether to proceed, ask the user, or stop. A repeated action may reveal where an agent is stuck. A dynamic environment is needed for dynamic recovery and self-healing models.

Every one of those moments can become training data, eval data, or a recovery opportunity. The TL;DR is fairly clear: multimodal models need environments where they can observe, act, fail, and recover.
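
As an illustration, a trace of step events can drive both signals described above: spotting a repeated failed action and choosing between proceeding, asking the user, or stopping. The event shape, the window of three, and the step budget are assumptions made for this sketch.

```typescript
type Decision = "proceed" | "ask_user" | "stop";

// Hypothetical trace record for one step.
interface TraceEvent {
  step: number;
  action: string; // serialized action, e.g. "click:#submit"
  succeeded: boolean;
}

// Treat the agent as stuck when its last `window` events are the same failed action.
function isStuck(trace: TraceEvent[], window = 3): boolean {
  if (window < 1 || trace.length < window) return false;
  const recent = trace.slice(-window);
  const first = recent[0];
  if (first === undefined) return false;
  return recent.every((event) => !event.succeeded && event.action === first.action);
}

// Assumed policy: stop when the step budget is spent, ask the user when stuck.
function decide(trace: TraceEvent[], maxSteps: number): Decision {
  if (trace.length >= maxSteps) return "stop";
  if (isStuck(trace)) return "ask_user";
  return "proceed";
}

const stuckTrace: TraceEvent[] = [
  { step: 0, action: "type:#email", succeeded: true },
  { step: 1, action: "click:#submit", succeeded: false },
  { step: 2, action: "click:#submit", succeeded: false },
  { step: 3, action: "click:#submit", succeeded: false },
];

console.log(decide(stuckTrace, 20)); // "ask_user"
```

The same events that trigger the decision can be stored unchanged as training or eval data.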

3. Our primitives enable your model

At Browserbase, we work with browser infrastructure every day. It is no mystery that the best system chooses the cheapest path that preserves correctness.
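
A minimal sketch of "cheapest path that preserves correctness": try strategies in order of cost and escalate only when a result fails verification. The `Strategy` shape, the cost numbers, and the strategy names are hypothetical and do not describe Browserbase's actual routing.

```typescript
interface Strategy<T> {
  name: string;
  cost: number; // relative cost, e.g. tokens or milliseconds
  run: () => T; // may throw when the strategy cannot complete
}

// Return the first result, in ascending cost order, that passes the check.
function cheapestCorrect<T>(
  strategies: Strategy<T>[],
  isCorrect: (result: T) => boolean,
): { name: string; result: T } | undefined {
  const ordered = [...strategies].sort((a, b) => a.cost - b.cost);
  for (const strategy of ordered) {
    try {
      const result = strategy.run();
      if (isCorrect(result)) return { name: strategy.name, result };
    } catch {
      // A failed strategy is not fatal; escalate to the next one.
    }
  }
  return undefined;
}

const picked = cheapestCorrect<string>(
  [
    { name: "agent", cost: 100, run: () => "Sign in" },
    {
      name: "cached-selector",
      cost: 1,
      run: () => {
        throw new Error("selector went stale");
      },
    },
    { name: "accessibility-tree", cost: 10, run: () => "Sign in" },
  ],
  (label) => label === "Sign in",
);

console.log(picked?.name); // "accessibility-tree"
```

The verification step is what keeps the cheap path honest: a fast answer that fails the check costs one retry, not a wrong action.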

Latency demands for multimodal tasks vary greatly between long-horizon and short-horizon work. The question is how to optimize for them. In people, we measure latency as reaction time.

On Human Benchmark's reaction-time test, the median lands around 273ms, about a quarter second from stimulus to click. The detail worth borrowing is that the score isn't pure biology: it also includes the latency of your screen and mouse. A display can add anywhere from 15 to 150ms of lag, so the hardware between your brain and the click matters.
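
The arithmetic is worth spelling out. Assuming display lag simply adds to the measured score, and ignoring the mouse, subtracting the quoted lag range from the median shows how much of the number is left for everything else. The helper name is hypothetical.

```typescript
interface MsRange {
  min: number;
  max: number;
}

// Subtract an overhead range from a total. The largest overhead leaves the
// smallest remainder, so min and max swap roles.
function remainingAfterOverhead(totalMs: number, overhead: MsRange): MsRange {
  return { min: totalMs - overhead.max, max: totalMs - overhead.min };
}

const medianReactionMs = 273;
const displayLagMs: MsRange = { min: 15, max: 150 };

console.log(remainingAfterOverhead(medianReactionMs, displayLagMs));
// { min: 123, max: 258 }
```

Under that assumption the display alone could account for anywhere from about 5% to about 55% of the measured time, which is the argument for treating infrastructure latency as part of the score.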

Inference dictates the speed at which your visual cortex performs. Inference also dictates the speed at which an agent observes and executes. It’s the part you can actually optimize. That's the job browser infrastructure does.

You can even extend this to our own primitives. Stagehand (our SDK for browser agents) neatly maps the loop:

observe discovers available actions and current page state in a single call → act executes them → extract pulls structured data without spinning up another reasoning loop just to parse → agent handles higher-agency multi-step workflows whenever escalation is needed.

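import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

// Setup - assumes Browserbase credentials are set in the environment
const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

// Act - Execute natural language actions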
await stagehand.act("click the login button");

// Extract - Pull structured data
const price = await stagehand.extract(
  "extract the price",
  z.number()
);

// Observe - Discover available actions
const actions = await stagehand.observe("find submit buttons");

// Agent - Automate entire workflows
const agent = stagehand.agent({
  mode: "cua",
  model: "gpt-5.5-extra-high",
});
await agent.execute("apply for this job");

A resilient infrastructure layer is what ultimately helps multimodal interaction systems spend fewer tokens and less time translating model intent into browser action. The wheel is not being reinvented, but rather optimized. The opportunity to leverage browser infrastructure when training several modalities is right in front of you.

Evidenced by research

Labs training computer-use models will continue to see essential use cases for traces from agentic work on the live web.

The multimodal frontier hinges on two points: the input streams (vision/audio/action all need to be captured in the same trace format) and the policy step (improvements have to generalize across modalities, not just one channel). The compounding only works if traces faithfully record all the channels the policy will eventually act on. One inconsistency or drop in a single modality, and the next cycle can't learn from it.
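
A small sketch of the kind of check this implies: before a trace is used for training, confirm that every step carries every channel. The trace shape and the three channel names mirror the sentence above but are otherwise assumptions.

```typescript
const REQUIRED_CHANNELS = ["vision", "audio", "action"] as const;
type Channel = (typeof REQUIRED_CHANNELS)[number];

// Hypothetical trace step: each channel holds whatever payload was captured.
interface TraceStep {
  index: number;
  channels: Partial<Record<Channel, unknown>>;
}

interface ChannelGap {
  index: number;
  missing: Channel[];
}

// List every step where at least one required channel was dropped.
function findDroppedChannels(steps: TraceStep[]): ChannelGap[] {
  const gaps: ChannelGap[] = [];
  for (const step of steps) {
    const missing = REQUIRED_CHANNELS.filter(
      (channel) => step.channels[channel] === undefined,
    );
    if (missing.length > 0) gaps.push({ index: step.index, missing });
  }
  return gaps;
}

const sampleTrace: TraceStep[] = [
  { index: 0, channels: { vision: "frame-0", audio: "chunk-0", action: "click" } },
  { index: 1, channels: { vision: "frame-1", action: "type" } },
];

console.log(findDroppedChannels(sampleTrace));
// [{ index: 1, missing: ["audio"] }]
```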

Research benchmarks are increasingly built around embodied web environments. Back in 2024, researchers at CMU notably gave us WebArena, which introduced reproducible web environments for autonomous agents and showed large gaps between agent and human performance. This extended to VisualWebArena, which explicitly moved toward visually grounded web tasks.

This aligns with recent training and eval work. BrowserGym frames browser-agent research around standardized observation and action spaces. More recently, WebGym gave us the largest open-source training dataset for visual web agents, scaling RL training environments with realistic tasks and rollouts. WebRL trains web agents with self-evolving curriculum reinforcement learning. We further explored scalable task and trajectory generation for generalist computer-use agents with AgentSynth.

Dynamic continual learning environments enable limitless learning.

The future of computer usage is undoubtedly multimodal, be it HCI or between agents. The user will speak, interrupt, share visual context, ask follow-ups, and expect progress while background work continues. The model will search, browse, test, generate UI, coordinate agents, and return with evidence. Recursing on this pattern is what enables a nearly unbounded learning sequence.

Interactive systems will become dependent on real world embodiment and presence. Background systems need to be accurate and efficient in both short and long horizon tasks.

Browser infrastructure is what makes this not just possible, but successful.

I’ll say it again: agents in all modalities need environments where they can observe, act, fail, and recover.

The agents want to browse. Let them.
