I talk to enterprise engineering teams every week. Ramp, Stripe, CVS Health, Bill.com, Flex, Bilt. Companies building browser automation that actually has to work in production, at scale, with real money on the line.
The conversation always follows the same arc. They’re excited about AI agents. Then they do the math.
Every action a browser agent takes (clicking a button, filling a form field, extracting a price) requires an LLM to look at the page, reason about what it sees, and decide what to do. That’s an inference call. Sometimes several. For one workflow on one website, that’s manageable. At typical enterprise scale, the monthly inference bill easily hits $17,000. And that’s after cutting volume by 75% just to keep costs under control.
Here’s what makes that number hard to accept: those invoices mostly land on the same portals. Same login page. Same form layout. Same submit button. The LLM doesn’t know that. It re-discovers the page structure from scratch, every single time.
You’re paying your most expensive resource to solve problems it already solved yesterday.
The pattern I keep seeing
I’ve had this conversation over 200 times in the past year. The arc is consistent.
It starts with excitement. You show someone page.act("click the submit button") and watch an AI figure out which button to click. No CSS selectors. No XPaths. No scripts that shatter when the page changes. A company told us they have a dedicated team whose entire job is fixing broken bots across 650+ vendor websites. Scripts break every 2-3 weeks. That team spends more time patching than building. Natural language automation feels like a way out.
Then comes the speed question. One company needs sub-5-second response times because its agents answer customer calls live. Another is targeting 250,000 users across 40 payment portals. They can’t have each session take two and a half minutes.
Then the cost question. A payment company scaled back from 32 million to 8 million URLs per year for merchant compliance crawling because LLM costs were too high. A sales company told us their budget was one cent per agent run, inclusive of everything: browser time, tool calls, inference. At raw LLM prices, that’s not possible.
Then the worst phase: “we’ll build it ourselves.”
This is where smart teams go to burn months.
A fintech company we worked with built a Redis-backed XPath cache: MD5 hashes of prompts mapped to previously discovered selectors, with a custom cached_act function that checks the local store before making an observe call. It worked, sort of.
Another one tried something similar for their visa automation platform. Their cache was creating separate entries for every user because the form-filling prompts included the applicant’s name. Different name, cache miss, full LLM inference. Forms were taking 10 minutes to fill out. The cache file grew without bound.
URLs with tracking parameters cause cache misses on pages that are functionally identical. Pages that look the same but have subtly different DOMs return stale selectors. Variables in prompts (names, addresses, payment amounts) create unique keys for actions that are structurally the same.
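The failure mode is easy to reproduce. Here is a minimal sketch of the naive approach — keying the cache on the raw prompt and URL, as the homegrown systems above did. All names are illustrative:

```python
import hashlib

# A naive cache keyed on the raw URL + prompt, similar to the
# homegrown approaches described above.
cache: dict[str, str] = {}

def naive_cache_key(url: str, prompt: str) -> str:
    return hashlib.md5(f"{url}|{prompt}".encode()).hexdigest()

# Two structurally identical actions on the same form...
key_a = naive_cache_key(
    "https://portal.example.com/apply?utm_source=email",
    "fill the applicant name field with John Smith",
)
key_b = naive_cache_key(
    "https://portal.example.com/apply",
    "fill the applicant name field with Jane Doe",
)

# ...produce different keys. Every run is a cache miss, a full
# inference call, and a new entry — the cache grows without bound.
print(key_a == key_b)  # False
```

Both the tracking parameter and the applicant’s name poison the key, even though the action and the page are the same.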
Caching browser automation is a genuinely hard infrastructure problem. It looks easy until you try it.
So we built it into Stagehand
Most browser automation is repetitive. That’s the core insight.
Receipt automation hits the same vendor portals. Healthcare companies navigate the same provider directories. Bill-pay companies process rent payments through the same property management sites. GTM tools run enrichment workflows against the same databases. The data changes. The page structure doesn’t.
When you make a request to the Stagehand API, it checks whether you’ve performed an equivalent action on an equivalent page before. If you have, it maps your natural language instruction to a cached selector instantly. No LLM call. No inference cost. No waiting.
Three components make this work:
Tree Builder. Captures a structural snapshot of the DOM by traversing all elements, shadow DOM boundaries, and iframes. Pages are compared across requests not by URL alone, but by what’s actually rendered on the page.
DOM Hasher. Creates a deterministic fingerprint for each action. The key distinction: it separates what you’re doing from the variable data you’re doing it with. Filling in “John Smith” and filling in “Jane Doe” on the same form produces the same cache key. The action is identical. The variables don’t matter. This is exactly the problem a company we’re working with hit. Their homegrown cache treated every unique prompt as a unique action. Stagehand’s hasher doesn’t.
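A minimal sketch of the idea (not Stagehand’s actual implementation): hash the instruction template together with the page’s structural fingerprint, and keep variable values out of the key entirely.

```python
import hashlib
import json

def action_cache_key(page_fingerprint: str, instruction: str,
                     variables: dict[str, str]) -> str:
    """Key on the instruction *template* plus the page's structural
    fingerprint. `variables` are deliberately excluded from the hash —
    they are substituted when the cached selector is executed."""
    payload = json.dumps(
        {"page": page_fingerprint, "action": instruction}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# The caller separates the action from its data — an instruction
# template with placeholders (names here are hypothetical):
key_smith = action_cache_key("dom-v1", "fill the %name% field",
                             {"name": "John Smith"})
key_doe = action_cache_key("dom-v1", "fill the %name% field",
                           {"name": "Jane Doe"})

print(key_smith == key_doe)  # True: same action, different data, one entry
```

One cache entry serves every applicant, which is exactly what the prompt-keyed homegrown caches couldn’t do.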
URL Normalizer. Standardizes page URLs before including them in the cache key. Referral trackers, analytics parameters, session tokens, all the noise that would otherwise cause cache misses on functionally identical pages gets stripped.
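A simplified sketch of URL normalization using Python’s standard library; the parameter denylist here is illustrative, not Stagehand’s actual list:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical denylist — a real normalizer would be far more thorough.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "gclid", "fbclid", "ref", "sessionid"}

def normalize_url(url: str) -> str:
    """Strip tracking parameters and fragments, and sort what remains,
    so functionally identical pages produce identical cache keys."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(kept)),
                                     fragment=""))

print(normalize_url(
    "https://pay.example.com/invoice?id=42&utm_source=email&gclid=abc#top"
))
# → https://pay.example.com/invoice?id=42
```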
If a page actually changes (buttons move, layout shifts, new elements appear) the system detects the mismatch and regenerates the cache. For the vast majority of runs where page structure is stable, it’s instant.
What changes for the teams using this
Speed. A payment portal automation went from 2 minutes 20 seconds down to 30 seconds. A CRM lookup workflow dropped from 41 seconds to sub-5. A healthcare company put it best: “superhuman speed with caching, human-level speed with the LLM in the loop.”
Cost. If 80% of your actions hit cache, your inference bill drops 80%. For Ramp, that’s roughly $13K/month back. For a GTM company, it’s the difference between a viable product and a money pit. Crawling 10,000 customer websites daily with caching means only processing the 0.1% of pages that actually changed.
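The arithmetic is straightforward. Using the $17K/month figure from earlier:

```python
# Back-of-envelope savings: every cached action is an inference call
# you don't pay for.
monthly_inference_bill = 17_000  # USD, the figure from earlier
cache_hit_rate = 0.80            # fraction of actions served from cache

savings = monthly_inference_bill * cache_hit_rate
print(f"${savings:,.0f}/month back")  # $13,600/month back
```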
Reliability. LLM responses are non-deterministic: same prompt, same page, slightly different selector strategy each time. Cached responses are deterministic: same input, same output, every run. Deterministic cached execution is how you reach success rates in the mid-90s on happy-path flows.
Who this is built for
The companies that get the most out of caching share one trait: they hit the same sites with the same workflow structure, at scale, with different data each time.
Ramp — same bill-pay flow across hundreds of vendor portals, 4.4 million browser minutes in a single month.
Clay — enrichment workflows crawling the same directories and databases for different queries. $150K deal. Probably the single biggest beneficiary of a global cache.
Garner Health — 8.5 million browser minutes over six months navigating the same healthcare provider portals. Same portal structures, same form flows, same extraction patterns.
Flex — 250,000 users, 40+ payment portals, scaling to 100 new integrations per week. Caching is the only way the math works.
Bilt — rent payments through property management portals. Same login, same payment flow, thousands of times per month.
Shopify — 3.8 million browser minutes, testing and monitoring flows against merchant sites with repetitive structures.
Stripe — merchant compliance evaluation across millions of URLs. They already cut their volume 4x because of LLM costs. Caching brings that number back up without the bill.
Bill.com — 650+ vendor payment portals. A new site added every other week. Caching means the agent doesn’t need to re-learn portal structures it’s already seen.
Pursuit — 7.3 million browser minutes on recruiting platforms. Job boards have some of the most repetitive DOM structures on the web.
If your automation does the same type of thing on the same type of page — even with different data flowing through it — you’re leaving speed and money on the table.
The thing nobody’s talking about: agents that actually learn
There’s a bigger idea here that goes beyond cost savings.
Dwarkesh Patel wrote recently about what he sees as the core bottleneck holding back AI: continual learning. LLMs don’t get better over time the way a human would. You can have a great session with a model, teach it exactly how you like your writing or your code, and all of that understanding vanishes when the session ends. He compared it to teaching saxophone by mailing students written instructions after each lesson. Nobody learns that way.
He’s right. Jessy Lin at Stanford has written about how traditional approaches to continual learning — updating model weights as new data arrives — cause catastrophic forgetting. The model gets better at the new thing and worse at everything else.
So where does agent memory actually live today? Mostly in files. The most practical form of “continual learning” in 2026 is agents saving notes to markdown files and reading them back later. One of our customers, Starlight, built a system where a “discovery agent” maps out UI capabilities and stores them as “skills” in the file system. Another agent absorbs those skills and performs actions. It’s clever. It also means your agent’s memory is a folder of text files on a server somewhere.
That works for general-purpose agents. It doesn’t work for the web.
Web pages are structural, visual, interactive. You can’t capture “how to navigate this payment portal” in a markdown file and expect an agent to reliably reconstruct that knowledge later. You need something that understands page structure natively.
This is what Stagehand Caching actually is, underneath the performance story. It’s memory for web agents.
When your agent successfully navigates a page, Stagehand remembers how it did it. Not in a text file. In a structural fingerprint that maps intent to specific DOM elements. The next time any agent encounters that page, it doesn’t reason from scratch. It remembers.
When the page changes, it notices. The fingerprint no longer matches, so it falls back to the LLM, re-learns the page, and updates the cache. Bill.com described this as “self-healing through LLM fallback when cached actions fail.” The agent tries the cached path first. If the page has changed, it figures out the new path and remembers that instead.
This is a loop: encounter a page, reason about it, cache the result, reuse it until the page changes, then re-learn. That’s not just caching. That’s a form of learning.
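The loop above can be sketched in a few lines. Everything here is an illustrative stand-in, not Stagehand’s API: `llm_discover` plays the role of the LLM inference call, and a page is modeled as a fingerprint plus its live selectors.

```python
# Sketch of the reason → cache → reuse → re-learn loop.
cache: dict[tuple[str, str], str] = {}
llm_calls = 0

def llm_discover(page: dict, instruction: str) -> str:
    """Stand-in for an LLM call that reasons about the page."""
    global llm_calls
    llm_calls += 1
    return page["selectors"][instruction]

def cached_act(page: dict, instruction: str) -> str:
    key = (page["fingerprint"], instruction)
    selector = cache.get(key)
    if selector in page["selectors"].values():
        return selector                          # fast, deterministic path
    cache.pop(key, None)                         # stale entry: self-heal
    selector = llm_discover(page, instruction)   # fall back to inference
    cache[key] = selector                        # remember the new path
    return selector

portal = {"fingerprint": "v1", "selectors": {"click pay": "#pay-btn"}}
cached_act(portal, "click pay")  # first run: LLM call
cached_act(portal, "click pay")  # cache hit: no LLM call
print(llm_calls)  # 1

# The portal redesigns its checkout flow:
portal = {"fingerprint": "v2", "selectors": {"click pay": "#checkout-pay"}}
cached_act(portal, "click pay")  # mismatch detected, re-learned once
print(llm_calls)  # 2
```

Two runs, one redesign, two inference calls total — every other action rides the cache.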
It’s narrow — it only works for web interactions. But it’s real. The agent gets faster and cheaper the more it runs. It builds up institutional knowledge about the sites it automates. And unlike a markdown file, that knowledge is structural, verifiable, and automatically invalidated when it goes stale.
Patel puts the timeline for human-level continual learning at 2032. For web automation, we’re not waiting. The primitive is here.
The real shift
The goal was never to put AI in every step. The goal was to build automation that works — fast, cheap, and reliable.
LLMs are extraordinary at understanding new pages, adapting to changes, reasoning about complex UIs. But once they’ve done that reasoning, the value isn’t in doing it again. The value is in remembering it.
The first time your agent encounters a payment portal, it needs AI. The hundredth time, it needs a cache. The thousand-and-first time, when the portal redesigns its checkout flow, it needs AI again — briefly — and then it’s back to cache.
That’s the loop: reason, remember, reuse, re-learn when things change. It’s not AGI. But it’s an agent that genuinely improves at its job over time, in a way that’s measurable, reliable, and already running in production.
For the enterprise teams I work with, this changes the conversation entirely. It goes from “how do we afford LLMs at scale?” to “how fast can we deploy the next 50 workflows?” From “will this agent break when the website changes?” to “it already handled that.”
That’s when things start moving.
