TL;DR: We published Universal Verifier, our first paper with Microsoft Research, showing how to reliably verify whether a browser agent actually succeeded. It cuts false positives to ~0% and enables trustworthy browser agent evals and training signals.

In the last year, we’ve invested significantly in partnerships with frontier labs like Google DeepMind and Microsoft to help push the state of the art for browser agents. While staying true to our open source commitment, we also partnered with companies like Prime Intellect to create BrowserEnv, democratizing the tools required to train frontier browser agents. This is our first academic step toward automating the web.

Replacing trajectories with verifiers

Since its inception, Stagehand’s shape and focus have been heavily influenced by evals. We relied on deterministic trajectories, defining the “correct” path step-by-step on static websites. As models improved, that approach broke down: real tasks have many valid ways to succeed, and the web is constantly changing.

This method became counterproductive. Maintaining ground-truth paths didn’t scale and increasingly failed to reflect how agents actually complete tasks.

Instead of prescribing the path, we shifted to judging it.

We built Evaluator, an AI judge that scores agent trajectories based on how they achieve outcomes, not whether they follow a single predefined route. This paper builds on that work with an architecture that outperforms every previous AI judge.

Introducing the Universal Verifier (UV): a new philosophy for architecting AI verifiers for browser agents, built on several findings from extensive experimentation:

  • Good verifiers rely on rubric design, and good rubrics must have specific, non-overlapping criteria
  • Separating process from outcome and controllable from uncontrollable failures is a core design principle
  • Verifiers deserve the same rigorous evaluation and iterative improvement we apply to models
  • Auto-research agents can't fully replace human experts in verifier design yet (but are getting close)

As a result, UV reaches new heights, even matching human-level agreement:

Cohen’s κ is the standard metric for measuring inter-annotator agreement. We show that UV agrees with humans just as often as humans agree with one another.
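For readers unfamiliar with the metric, Cohen’s κ corrects raw agreement for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement from each rater’s label frequencies. A minimal sketch (not code from the paper) for two raters with binary success/failure verdicts:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two raters' label sequences."""
    n = len(a)
    # p_o: fraction of items where the raters agree
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # p_e: chance agreement from each rater's marginal label frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement yields kappa = 1.0; chance-level agreement yields 0.
print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # → 1.0
```

A κ near 1 means the verifier’s verdicts are nearly interchangeable with a human annotator’s, which is the bar UV is measured against here.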

More importantly, UV drops false positive rates in task success verification to essentially 0%. Prior judges trail by a wide margin: ≥45% for WebVoyager and ≥22% for WebJudge. When UV says an agent succeeded, it’s because it actually did.

With RL, the distinction between a verifier that rewards actual success and one that rewards plausibility is the difference between training a model to be accurate and training it to get lucky. Every false positive is a reward signal teaching your model to fake it. UV is the first verifier accurate enough to trust as a reward model, not just a benchmark judge.

Raising the bar together

This is AI and human judgment reinforcing each other in a loop. Humans define the structure while AI improves it.

UV was trained by humans iterating over 96 experiments and weeks of failure analysis. Given the same setup, an auto-research agent plateaued at 70% of expert quality. In other words, it was good at fine-tuning, but not at stepping back and asking "what category of problem am I looking at?" But when the auto-research agent was initialized from the human expert's best configuration, it surpassed the human peak. Humans discovered the structural principles, while AI out-tuned them at the margins.

This loop runs both ways. UV sharpened the humans who worked on it. As one of the human annotators involved in the research told us:

“I was genuinely impressed by the AI’s ability to analyze large amounts of data and accurately identify small mistakes. For example, it was able to review around 15 different data objects and pinpoint subtle errors that could easily be overlooked. This level of detail and consistency is particularly valuable and shows strong potential for improving accuracy and efficiency in our work.”

Humans teach UV where to look. UV teaches humans what to look for. Each pass of the flywheel raises the ceiling for the other.

The power of the Browserbase Platform

Running the experiments for this research required infrastructure that scales to large numbers of real browser sessions. We built an evaluation platform in a weekend, gluing together screenshots and recordings, action and reasoning traces from both the model under test and UV, and annotator tooling with real-time computation of agreement, false positive rate, false negative rate, and accuracy.
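The per-annotation metrics above are standard confusion-matrix quantities. As a rough sketch (the platform’s actual implementation is not public), here is how a verifier’s verdicts can be scored against human ground-truth labels, where `True` means the task succeeded:

```python
def verifier_metrics(pred, gold):
    """Score verifier success verdicts (pred) against human labels (gold)."""
    tp = sum(p and g for p, g in zip(pred, gold))          # true positives
    fp = sum(p and not g for p, g in zip(pred, gold))      # false positives
    tn = sum(not p and not g for p, g in zip(pred, gold))  # true negatives
    fn = sum(not p and g for p, g in zip(pred, gold))      # false negatives
    return {
        "accuracy": (tp + tn) / len(gold),
        # FPR: failed runs the verifier wrongly marks as successes --
        # the quantity UV drives to ~0%
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        # FNR: real successes the verifier misses
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

The false positive rate is the key number for training: every false positive is a spurious reward, so it is measured over failed runs only (fp / (fp + tn)), not over the whole dataset.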

That platform fed CUAVerifierBench (246 human-labeled trajectories, 106 of them contributed by Browserbase), now open-sourced alongside the paper.

Today, building such a production system for browser agents is effortless with the Browserbase Platform.

  • Observability - Session recordings and traces for both actions and reasoning, built in.
  • Stagehand - The open source browser agent SDK and CLI gives you every rung of the automation ladder: understudy, act/extract/observe, agent.
  • Evals runner - A single CLI that lets you define your own tasks, reuse OSS benchmarks, and run them at high concurrency (collapsing total runtime from weeks to minutes). Grade with the Universal Verifier-backed Evaluator, or use it to generate reward signals during training.
  • Model Gateway - Swap between frontier models in a single line of code, with no rate limits and no extra cost.

What’s next

Universal Verifier is the missing piece for training and developing browser agents, closing the gap between best-effort and production-ready. CUAVerifierBench lets anyone measure whether the next verifier does better. And the platform makes both available to any team building browser agents today.

If your team is interested in training or evaluating models with Browserbase, we’d love to talk to you!