When you build a browser agent, one of the first decisions you make is how much of the work the AI should own. Should it follow a script and only reason when needed? Should it own the full loop and figure out the path itself?

This post walks through the 4 levels of browser agents and when each one is the right fit.

A brief history of browser agents

Browser agents did not appear overnight. They are the product of three waves of tooling, each one adding a new capability to how programs interact with the web.

2022: scripted browsers.

Playwright and Puppeteer were the state of the art: deterministic automation that worked well until the web inevitably changed. When layouts shifted, selectors broke, scripts stalled, and there was no built-in way to recover without a developer stepping in.

2023: AI in the loop.

The first wave of LLM-assisted web tooling arrived. Instead of relying on selectors, developers could describe what they wanted in natural language and let a model handle the interaction. Extraction stopped breaking when layouts shifted, and individual steps became self-healing.

The script still owned the overall flow, but the model took over the moments where rigidity used to hurt.

2025: browser agents.

Computer use became production-ready. Models could now look at a page, decide what to do, take an action, and recover from unexpected states.

The browser stopped being a target for scripts and became a runtime for agents. A program could now be given a goal instead of a sequence of steps.

None of these waves retired the previous one. Scripts still work where scripts are the right tool. AI-in-the-loop is still the right answer for many workflows. Fully autonomous browser agents are the right fit for a growing set of problems.

The result is that you now have a menu of options rather than a single way to do the job. That menu is what the rest of this post is about.

The 4 levels of browser agents

Browser agents sit on a spectrum of agency. At one end, the program is fully in charge and the model is a helper for individual steps. At the other end, the model is in charge of the entire loop and the program is the runtime that supports it.

Let's review each level, what it looks like in code, and when to reach for it.

Level 1: AI as a helper in the loop

The program controls the flow. The model handles individual interactions.

Stagehand replaces rigid selectors with natural-language methods like act() and extract(). When the site changes, the agent keeps running.

AI is used as a self-healing mechanism, making the overall workflow more robust to website changes.

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

const page = stagehand.page;
await page.goto("https://news.ycombinator.com");
await page.act("click the top story");
const { title } = await page.extract({
  instruction: "extract the article title",
  schema: z.object({ title: z.string() }),
});

When to reach for it. The workflow is fixed, but the sites you target change often or vary across runs. You want self-healing interactions without giving up control of the flow.

A common case is lead-enrichment and market-intelligence teams that visit thousands of company sites to pull the same shape of data (name, industry, headcount, recent news). The flow is fixed (visit → extract → store), but no two sites look the same and they redesign without warning. With selector-based extraction, every layout change is an outage. With extract(), the workflow keeps running because the model re-reads the page each time. The same applies to regulatory monitoring (filings portals that get re-skinned), price tracking across competitor catalogs, and job-board ingestion.
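The fixed visit → extract → store flow can be sketched as a plain loop in which only the extraction step is model-backed. This is a minimal sketch, not Stagehand's API: `CompanyRecord`, `enrichAll`, and the injected `extractFrom` and `store` callbacks are hypothetical names. In practice `extractFrom` might wrap `page.goto` plus `page.extract` with a schema.

```typescript
// Hypothetical record shape: the same fields are pulled from every site.
interface CompanyRecord {
  name: string;
  industry: string;
  headcount?: number;
}

// Assumption: extractFrom wraps the model-backed step (e.g. goto + extract),
// and store writes to whatever database you use. Both are injected so the
// flow itself stays deterministic and easy to test.
type ExtractFrom = (url: string) => Promise<CompanyRecord>;
type Store = (url: string, record: CompanyRecord) => Promise<void>;

interface EnrichmentReport {
  saved: string[];
  failed: { url: string; reason: string }[];
}

// Fixed flow: visit -> extract -> store, one site at a time.
// A failure on one site is recorded and does not stop the rest of the run.
async function enrichAll(
  urls: string[],
  extractFrom: ExtractFrom,
  store: Store,
): Promise<EnrichmentReport> {
  const report: EnrichmentReport = { saved: [], failed: [] };
  for (const url of urls) {
    try {
      const record = await extractFrom(url);
      await store(url, record);
      report.saved.push(url);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      report.failed.push({ url, reason });
    }
  }
  return report;
}
```

The program decides the order of steps and what happens on failure. The model is only consulted inside `extractFrom`.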

Level 2: Agent handoff inside a script

The program still owns most of the flow. The model takes over at the steps that need reasoning.

You hand off to an agent for a sub-task, the agent takes a few actions on its own, and control returns to the script. This is the first level where the program is no longer fully in charge.

This pattern adds more agency to overcome a challenging part of the agent’s workflow, such as logging into a website or completing a form.

import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

const page = stagehand.page;

// scripted setup
await page.goto("https://example-store.com");
await page.act("search for waterproof hiking boots");

// agentic sub-task: reasoning over the result list
const agent = stagehand.agent();
await agent.execute("review the first 5 results and add the best-reviewed waterproof option in size 10 to the cart");

// back to scripted execution
await page.act("proceed to checkout");
await page.act("apply the promo code SAVE10");

When to reach for it. Most of the workflow is deterministic, but one or two steps require reasoning you cannot script ahead of time:

  • picking the right product variant
  • navigating a settings panel that changes per account
  • resolving an ambiguous result list

You keep the scaffolding tight where it matters and open the loop where rigidity would fail.

A good example is a corporate travel tool: the script opens the booking site, plugs in the dates and destination from the request, and filters by airline policy compliance.

The result list is where things get fuzzy. "Pick the cheapest flight that lands before 6pm and has at most one stop" isn't expressible as a selector. The agent reviews the options, picks one, and hands control back for the deterministic seat-selection and confirmation steps.
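One way to keep the handoff bounded is for the script to check what the agent reports back before the deterministic steps resume. The sketch below assumes the agent's pick comes back as structured data. `FlightChoice`, `TravelPolicy`, and `findViolations` are hypothetical names, not part of Stagehand, and the thresholds are illustrative.

```typescript
// Hypothetical shape of what the agent reports back after picking a flight.
interface FlightChoice {
  flightId: string;
  arrivalMinutes: number; // minutes after local midnight, e.g. 17:30 -> 1050
  stops: number;
}

// Hypothetical policy the script enforces regardless of what the agent chose.
interface TravelPolicy {
  latestArrivalMinutes: number; // the flight must land strictly before this
  maxStops: number;
}

// Returns a list of human-readable violations. An empty list means the
// script can resume its deterministic steps.
function findViolations(choice: FlightChoice, policy: TravelPolicy): string[] {
  const violations: string[] = [];
  if (choice.arrivalMinutes >= policy.latestArrivalMinutes) {
    violations.push("lands too late");
  }
  if (choice.stops > policy.maxStops) {
    violations.push("too many stops");
  }
  return violations;
}
```

If the list is not empty, the script might rerun the handoff or flag the request for a human instead of continuing to seat selection. The check covers only the hard constraints. Whether the pick was the cheapest is left to the agent, since the script never sees the full result list in this sketch.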

Level 3: Agent owns the loop, you own the tools

The agent owns the loop. The program is the runtime.

This is where browser agents start to look like agents in the sense the rest of the agent ecosystem uses the word: the AI takes over control of the workflow.

You give the agent a goal, expose a set of tools (act, extract, navigation, framework integrations like LangChain or Mastra), and let it decide what to do next.

The example below uses a Stagehand agent pattern:

import { Stagehand, tool } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "BROWSERBASE" });
await stagehand.init();

// Assumes `crm` is your own CRM client, defined elsewhere in the program.
const agent = stagehand.agent({
  mode: "cua",
  model: "anthropic/claude-sonnet-4-6",
  tools: {
    lookupCRM: tool({
      description: "Look up a company in our CRM to check if it is already a lead",
      inputSchema: z.object({
        domain: z.string().describe("The company domain, e.g. acme.com"),
      }),
      execute: async ({ domain }) => {
        const record = await crm.findByDomain(domain);
        return { exists: !!record, status: record?.status ?? null };
      },
    }),

    saveLead: tool({
      description: "Save a new lead to the CRM with the data extracted from the company website",
      inputSchema: z.object({
        domain: z.string(),
        companyName: z.string(),
        industry: z.string(),
        employeeCount: z.number().optional(),
      }),
      execute: async (lead) => {
        const id = await crm.createLead(lead);
        return { id };
      },
    }),
  },
});

await agent.execute(
  "Visit each company on the Y Combinator W26 batch page, check if we already have them in the CRM, and save new leads with their company info."
);

Providing the tools gives your program some control over the workflow: each tool call acts as a checkpoint (a predictable safety net) throughout the execution.
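To make the checkpoint idea concrete, a tool's `execute` function can be wrapped so that every call is validated, counted against a budget, and logged before it runs. This is a minimal sketch under stated assumptions: `withCheckpoint`, `CheckpointOptions`, and `AuditEntry` are hypothetical helpers, not part of Stagehand. How a thrown error is surfaced to the model depends on the framework.

```typescript
// One audit record per tool call that was allowed to run.
interface AuditEntry {
  tool: string;
  input: unknown;
  at: string; // ISO timestamp
}

interface CheckpointOptions<I> {
  maxCalls: number; // hard budget for this tool within one agent run
  validate?: (input: I) => string | null; // return a reason to reject, or null
}

// Hypothetical helper (not part of Stagehand): wraps a tool's execute function
// so every call is validated, counted against a budget, and logged.
function withCheckpoint<I, O>(
  name: string,
  execute: (input: I) => Promise<O>,
  options: CheckpointOptions<I>,
  audit: AuditEntry[],
): (input: I) => Promise<O> {
  let calls = 0;
  return async (input: I): Promise<O> => {
    if (calls >= options.maxCalls) {
      throw new Error(`${name}: call budget of ${options.maxCalls} exhausted`);
    }
    const problem = options.validate ? options.validate(input) : null;
    if (problem !== null) {
      throw new Error(`${name}: rejected input (${problem})`);
    }
    calls += 1; // only calls that pass validation count against the budget
    audit.push({ tool: name, input, at: new Date().toISOString() });
    return execute(input);
  };
}
```

In the lead example above, the `execute` function of `saveLead` could be passed through `withCheckpoint` before being handed to `tool(...)`. The program then caps how many leads a single run can create and keeps a record of each write.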

When to reach for it. The path is unpredictable. You do not know in advance:

  • what sites the agent will visit
  • what the page structures will be
  • how many steps it will take

The example code above isn't hypothetical. This is what sales-prospecting agents look like in production. Hand the agent a list of company domains and the tools to check the CRM, save leads, and read sites. It decides which site to visit first, when to try the careers page because a site has no "about" page, and when a domain redirects to a parent company and the lead should be merged.

You couldn't script this. Every company site is different and the long tail is the point. The same pattern shows up in customer-support agents, competitive-research agents, and the AI-QA agents that test other AI apps (where the tools are assert, screenshot, and fileFailure instead of CRM lookups).

Level 4: Fully autonomous browser agent

You stop writing the loop. You ask for a result.

You hand the agent a goal and it handles the reasoning, the browser, and the recovery in one place. No scripted scaffolding around it.

Here’s an example that uses the Claude Agent SDK as the agent harness and grants the AI model full control of the browser:

import { defineFn } from "@browserbasehq/sdk-functions";
import { query } from "@anthropic-ai/claude-agent-sdk";

defineFn("my-function", async (context, params) => {
  const { session } = context;
  const { task, modelApiKey } = params;

  console.log("Connecting to browser session:", session.id);

  let result;

  for await (const message of query({
    prompt: task,
    options: {
      env: {
        ...process.env,
        ...(modelApiKey ? { ANTHROPIC_API_KEY: modelApiKey } : {}),
      },
      allowedTools: ["WebSearch", "WebFetch"],
      permissionMode: "bypassPermissions",
      allowDangerouslySkipPermissions: true,
      maxTurns: 30,
      mcpServers: {
        playwright: {
          command: "npx",
          args: [
            "@playwright/mcp@latest",
            "--cdp-endpoint",
            session.connectUrl,
          ],
        },
      },
    },
  })) {
    if ("result" in message) {
      result = message.result;
    }
  }

  return {
    message: "Task completed",
    timestamp: new Date().toISOString(),
    result,
  };
});

This is the part of the spectrum that is moving fastest right now. Developers shipping OpenClaw and Hermes agents are doing this in production today. Give the agent a goal, give it a session, walk away.

Teams are starting to build fully autonomous browser agents as managed agents, where the loop, the model, and the browser are bundled into one system you call like an API. That is its own conversation and one we will come back to in a future post.

An orthogonal pattern: AI-maintained scripts

The 4 levels above measure how much of the runtime loop the model owns. There is another pattern that does not fit on that axis: keep the runtime fully deterministic and use an AI coding agent to improve the script between runs.

The loop looks like this:

  • the script runs on schedule
  • when it fails, the logs and a session replay are handed to a coding agent
  • the coding agent diagnoses what changed and updates the selectors or flow
  • the next run uses the patched script
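The loop above can be sketched as a single scheduled cycle. This is an illustration under stated assumptions: `Script`, `RunResult`, `maintenanceCycle`, and the injected `runScript` and `repairScript` callbacks are hypothetical names, and the coding agent is represented only by the `repairScript` function.

```typescript
// Hypothetical shapes: a versioned script and the outcome of one scheduled run.
interface Script {
  version: number;
  source: string;
}

interface RunResult {
  ok: boolean;
  logs: string[];
  replayUrl?: string; // link to a session replay, if one was recorded
}

// Assumptions: runScript executes the script deterministically (e.g. raw
// Playwright), and repairScript calls a coding agent that returns a patched
// script. Both are injected; neither is a real library API.
type RunScript = (script: Script) => Promise<RunResult>;
type RepairScript = (script: Script, failure: RunResult) => Promise<Script>;

// One scheduled cycle. The AI never acts inside the run itself: on failure it
// only produces the script that the NEXT cycle will execute.
async function maintenanceCycle(
  current: Script,
  runScript: RunScript,
  repairScript: RepairScript,
): Promise<{ result: RunResult; next: Script }> {
  const result = await runScript(current);
  if (result.ok) {
    return { result, next: current };
  }
  const patched = await repairScript(current, result);
  return { result, next: patched };
}
```

In higher-risk settings, a review step (a test suite or a human approval) could sit between `repairScript` and the next scheduled run, so no unreviewed patch reaches production.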

The browser is still running raw Playwright. The AI is upstream of the code, not inside the runtime loop. This is closer in spirit to reflective system-improvement research (see GEPA for the same idea applied to prompt optimization).

When to reach for it. You want to minimize production runtime risk: a single mistake is not an option and you need determinism at runtime (for example, healthcare workflows). Or you have an existing Playwright or Puppeteer codebase, the cost of a full rewrite is high, and selector maintenance is what is actually hurting you.

With AI-maintained scripts, you get most of the benefit of AI without changing what runs in production.

How to pick: risk and scale

Reading the 4 levels back-to-back, it is tempting to treat them as a maturity curve. They are not. They are a menu, and the right choice depends on two properties of the job you are doing.

Risk pushes for more control.

If a wrong click costs money, breaks trust, or fails an audit, you want determinism. You want to replay the run, point to the exact step that ran, and explain why it happened.

  • raw Playwright or Puppeteer scripts give you that out of the box, and AI-maintained scripts (the orthogonal pattern) keep that property while solving the maintenance burden
  • level 1 (Stagehand act and extract) keeps the deterministic flow and adds resilience to layout changes at runtime
  • level 2 (agent handoff inside a script) bounds the agentic part to specific sub-tasks

Scale pushes for more agency.

If you cannot enumerate the sites, the layouts, or the workflows ahead of time, scripting becomes a losing fight.

  • every new site is a new script
  • every layout change is a new bug
  • at a certain volume of variety, the cost of writing and maintaining scripts overtakes the cost of unpredictability

Discovery, navigation across unfamiliar sites, and any workflow where the long tail matters all live in level 3 (agent owns the loop, you own the tools) or level 4 (fully autonomous, the right answer when even the tool surface is hard to enumerate).

Most production systems sit somewhere in between, and the best ones are hybrids. The same workflow can use level 3 to discover what to do and level 1 to execute the part that matters. The same codebase can use AI-maintained scripts for legacy code and level 2 or 3 for new work.

Risk and scale rarely point in the same direction. The design of a good browser agent is mostly the work of deciding which parts of the workflow are which.
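As a rough illustration, the risk and scale heuristics can be written as a small function applied to one part of a workflow at a time. It is a deliberately simplified sketch: `JobProfile` and `suggestApproach` are hypothetical, the four yes/no questions are an assumption about how to summarize the criteria above, and hybrids come from calling it separately for each part of the workflow.

```typescript
type Approach =
  | "ai-maintained script"
  | "level 1"
  | "level 2"
  | "level 3"
  | "level 4";

// Hypothetical questionnaire for ONE part of a workflow; field names are
// illustrative only.
interface JobProfile {
  pathKnownUpFront: boolean; // can you enumerate sites, layouts, and steps?
  needsRuntimeDeterminism: boolean; // is a single wrong click unacceptable?
  hasReasoningSteps: boolean; // do one or two steps need judgment?
  toolsEnumerable: boolean; // can you list the tools the agent needs?
}

function suggestApproach(job: JobProfile): Approach {
  if (job.pathKnownUpFront) {
    // Risk pushes for control: keep the runtime deterministic where required.
    if (job.needsRuntimeDeterminism) return "ai-maintained script";
    return job.hasReasoningSteps ? "level 2" : "level 1";
  }
  // Scale pushes for agency: the path cannot be scripted ahead of time.
  return job.toolsEnumerable ? "level 3" : "level 4";
}
```

The sketch does not resolve the case where the path is unknown and determinism is still required. That conflict is the design work described above, and it usually means splitting the workflow into smaller parts.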

Browser agents are the future of the web

Browser agents are not one thing. They are a spectrum, and the level of agency you pick is one of the most important decisions you make when you build one.

  • some jobs want a script with a few smart steps
  • some jobs want an agent with a goal
  • most jobs want a combination of both, in the same workflow, on the same infrastructure

That last part matters. Whatever level you need today, you should be able to move up or down the spectrum tomorrow without changing platforms or rewriting your stack.

One API key, browsers first, Stagehand as the SDK for browser agents, agent loop when you want it.

Build browser agents.