
Taming iframes: A Stagehand Update
Iframes (or as I like to call them: cry-frames) are unavoidable in the world of web automation. Iframes are basically just mini-websites embedded in other websites; anytime you've seen "embedded" content like a tweet or YouTube video, that's an iframe. They are simultaneously a super convenient tool for web developers, and a source of existential despair for automation engineers.
Writing an automation script for a site loaded with iframes feels like cooking one meal across multiple kitchens simultaneously. Each kitchen appears identical, but surprise! They all have completely different ingredients. Sometimes, there are even secret kitchens hiding inside other kitchens. Handling iframes, even in established browser automation frameworks like Playwright, Puppeteer, and Selenium, is a common point of confusion for people developing web automations. This post is about how Stagehand changed that.
A quick aside on Playwright and CDP
Playwright is a browser automation framework developed by Microsoft that lets developers script and control real browser engines: Chromium, Firefox, and WebKit (the engine behind Safari). It is most popular for end-to-end testing, but it's also commonly used for tasks like web scraping, performance monitoring, and automating routine web workflows. Under the hood, Playwright uses the Chrome DevTools Protocol (CDP) [2] to talk directly to Chromium-based browsers, but it abstracts away that complexity with a clean, high-level API. For example, instead of manually sending CDP commands, you can just use page.click('#submit') or page.locator('input[name="email"]').fill('user@example.com').
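To make "abstracts away that complexity" a bit more concrete, here is a rough sketch of the kind of raw CDP traffic a single click boils down to. Input.dispatchMouseEvent is a real CDP method, but the helper name buildClickRequests, the coordinates, and the request ids are made up for illustration, and this is not a claim about the exact sequence Playwright sends.

```typescript
// Shape of a CDP request: an id used to match the response, a method, and params.
interface CdpRequest {
  id: number;
  method: string;
  params: Record<string, unknown>;
}

// Hypothetical helper: build the two Input.dispatchMouseEvent requests that
// make up a left click at viewport coordinates (x, y). Working out those
// coordinates for a selector is extra work a high-level click() hides.
function buildClickRequests(x: number, y: number, firstId = 1): CdpRequest[] {
  const base = { x, y, button: "left", clickCount: 1 };
  return [
    { id: firstId, method: "Input.dispatchMouseEvent", params: { type: "mousePressed", ...base } },
    { id: firstId + 1, method: "Input.dispatchMouseEvent", params: { type: "mouseReleased", ...base } },
  ];
}

// Each request travels to the browser as a JSON string.
const clickPayloads: string[] = buildClickRequests(120, 48).map((r) => JSON.stringify(r));
console.log(clickPayloads.join("\n"));
```

Compare that with the one-liner page.click('#submit') and it is easy to see why most people never touch CDP directly.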
How Stagehand makes complex web pages LLM-parseable
First, Stagehand uses the Accessibility.getFullAXTree CDP command, which returns a structured set of Accessibility Nodes (AXNode). These nodes come with rich semantic details about every interactive or structural element on the page, including their roles (like buttons, links, or input fields), names, labels, and states. Here’s an example of an AXNode:
{
  "nodeId": "30",
  "role": { "value": "button" },
  "name": { "value": "Apply for this Job" },
  "properties": [
    { "name": "focusable", "value": { "value": true } }
  ],
  "backendDOMNodeId": 30
}
Second, Stagehand uses DOM.getDocument to get the full hierarchical structure of the DOM starting from the root node. A DOM node looks something like this:
{
  "nodeId": 42,
  "parentId": 41,
  "backendNodeId": 3,
  "nodeType": 1,
  "nodeName": "BUTTON",
  "localName": "button",
  "nodeValue": "",
  "childNodeCount": 2,
  "children": [...],
  "attributes": [
    "class",
    "_button_8wvgw_29 _quaternary_8wvgw_91 _back_14ib5_49 ashby-job-board-back-to-all-jobs-button"
  ]
}
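One detail worth knowing when reading DOM nodes like the one above: CDP reports attributes as a flat array of alternating names and values rather than as an object. A small helper can turn that into something easier to work with. The name attributesToRecord is hypothetical, not part of Stagehand or CDP.

```typescript
// Convert CDP's flat [name1, value1, name2, value2, ...] attribute array
// into a plain object. A trailing name without a value is ignored.
function attributesToRecord(flat: readonly string[]): Record<string, string> {
  const record: Record<string, string> = {};
  for (let i = 0; i + 1 < flat.length; i += 2) {
    const attrName = flat[i];
    const attrValue = flat[i + 1];
    if (attrName !== undefined && attrValue !== undefined) {
      record[attrName] = attrValue;
    }
  }
  return record;
}

const buttonAttributes = attributesToRecord(["class", "btn primary", "type", "submit"]);
console.log(buttonAttributes); // { class: "btn primary", type: "submit" }
```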
Stagehand then combines these data sources to create a hierarchical string representation of the page. This representation makes the page understandable and actionable for the LLM. Here’s a small snippet illustrating this structure:
Accessibility Tree:
[1] RootWebArea: Careers at Y Combinator | Y Combinator
[7] scrollable
[311] div
[314] banner
[315] navigation
[316] div
[318] link: Y Combinator
[3] image: Y Combinator
[319] div
[321] link: About
[339] link: Companies
[351] link: Startup Jobs
[371] link: Find a Co-Founder
[377] link: Library
[383] link: SAFE
[389] link: Resources
[419] link: Apply
...
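An outline like the one above can be produced with a short recursive walk. This is a minimal sketch, not Stagehand's actual code: the OutlineNode shape, the renderOutline name, and the two-space indentation per level are all assumptions made for illustration.

```typescript
// Hypothetical simplified node: an id, a role, an optional accessible name.
interface OutlineNode {
  id: number;
  role: string;
  name?: string;
  children?: OutlineNode[];
}

// Render "[id] role: name" lines, indenting children two spaces per level.
function renderOutline(node: OutlineNode, depth = 0): string {
  const label = node.name ? `${node.role}: ${node.name}` : node.role;
  const line = `${"  ".repeat(depth)}[${node.id}] ${label}`;
  const childLines = (node.children ?? []).map((child) => renderOutline(child, depth + 1));
  return [line, ...childLines].join("\n");
}

const sampleOutline = renderOutline({
  id: 1,
  role: "RootWebArea",
  name: "Careers",
  children: [
    { id: 315, role: "navigation", children: [{ id: 321, role: "link", name: "About" }] },
  ],
});
console.log(sampleOutline);
```

Keeping the id in square brackets at the start of every line gives the LLM a short, unambiguous handle to refer back to.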
Stagehand also uses this information to construct an ID → XPath map. When the LLM selects an element by its ID from this structured representation, Stagehand quickly retrieves the corresponding XPath and executes the desired action.
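Here is a rough sketch of how such an ID → XPath map could be built from the DOM tree. It assumes the walk starts at the root element (for example html) rather than the document node, uses a trimmed-down DomNode shape, and always writes explicit positional indexes. buildXPathMap is a hypothetical name, and Stagehand's real implementation may differ.

```typescript
// Trimmed-down view of a CDP DOM node; only the fields this sketch needs.
interface DomNode {
  backendNodeId: number;
  nodeType: number; // 1 means element
  localName: string;
  children?: DomNode[];
}

// Map every element's backendNodeId to an absolute, positional XPath.
function buildXPathMap(rootElement: DomNode): Map<number, string> {
  const xpaths = new Map<number, string>();
  const visit = (node: DomNode, path: string): void => {
    xpaths.set(node.backendNodeId, path);
    const seenByTag = new Map<string, number>();
    for (const child of node.children ?? []) {
      if (child.nodeType !== 1) continue; // skip text, comments, etc.
      const position = (seenByTag.get(child.localName) ?? 0) + 1;
      seenByTag.set(child.localName, position);
      visit(child, `${path}/${child.localName}[${position}]`);
    }
  };
  visit(rootElement, `/${rootElement.localName}[1]`);
  return xpaths;
}

const el = (backendNodeId: number, localName: string, children: DomNode[] = []): DomNode =>
  ({ backendNodeId, nodeType: 1, localName, children });

const sampleXPaths = buildXPathMap(
  el(1, "html", [el(2, "body", [el(3, "div"), el(4, "button"), el(5, "div", [el(6, "button")])])]),
);
// If the LLM picks the element with ID 6, the lookup is a single map read.
console.log(sampleXPaths.get(6)); // /html[1]/body[1]/div[2]/button[1]
```

Because an AXNode carries a backendDOMNodeId, the same key links an entry in the accessibility outline to its XPath.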
Yet, Playwright and CDP don’t always play nicely together, particularly around iframes:
- Managing contexts [1]: Want to send a CDP command to a specific iframe? Now you have to handle the CDP session context yourself—goodbye Playwright’s magical auto-context switching.
- Frame IDs: Playwright doesn’t expose stable IDs for frames. This omission complicates creating composite IDs.
- Node IDs aren’t unique across iframes: The backendNodeId provided by CDP, which Stagehand relies upon internally, is not globally unique across iframes. The same ID might point to two entirely different elements in different iframes, causing major ambiguities.
So how can we enable users to extract, observe, and act upon elements in deeply nested iframes?
Why iframes are so painful
Users just want to be able to write Stagehand code like this:
await page.act({
  action: "click the submit button",
  iframes: true,
});
Since Stagehand operates primarily on the HTML body tag, it’s relatively easy to resolve a natural language command to a Playwright-compatible DOM selector.
However, with iframes, you don’t know exactly where your element lives, and we don’t want Stagehand users to tear their hair out trying to get the right DOM selectors. A major part of the value of Stagehand is to free users from the tedium of manually finding elements, frames and selectors.
Assuming you know what selectors to use, here is what clicking inside a nested iframe might look like with Playwright:
import type { FrameLocator } from 'playwright';

// Descend through each nested iframe by its selector.
const inner: FrameLocator = page
  .frameLocator('iframe.lvl1') // level 1
  .frameLocator('iframe.lvl2') // level 2
  .frameLocator('iframe.lvl3'); // level 3 – form lives here

await inner.locator('button[type="submit"]').click();
The Solution: One Tree to Rule Them All
Previously, Stagehand only operated on the root DOM. Now, Stagehand systematically traverses through every frame, both same-process and out-of-process iframes (OOPIF). This traversal is done through a depth-first search:
- We start at the main document and recursively explore each child frame.
- For each frame, we capture its accessibility tree and calculate an absolute XPath pointing to its iframe element within the parent frame.
- We construct a snapshot containing each frame’s simplified tree (hierarchical string representation), XPath mappings, and URL mappings.
- Each node in a snapshot is tagged with a unique EncodedId, combining a frame ordinal and node ID, to guarantee global uniqueness.
- Finally, these snapshots are stitched together into a single, combined accessibility tree. The resulting unified representation ensures all elements, regardless of nesting depth, are accurately identifiable and interactable by Stagehand.
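The steps above can be sketched as a pre-order depth-first walk over a tree of frames. Everything here is a simplified assumption rather than Stagehand's real data model: the FrameInfo and FrameSnapshot shapes, the "ordinal-nodeId" ID format, and stitching by indented concatenation are all made up to show the idea.

```typescript
// Hypothetical input: what we already know about each frame.
interface FrameInfo {
  url: string;
  iframeXPath: string | null; // XPath of the <iframe> element in the parent; null for the main frame
  nodes: { backendNodeId: number; text: string }[];
  children: FrameInfo[];
}

// Hypothetical output: one snapshot per frame.
interface FrameSnapshot {
  ordinal: number;
  url: string;
  iframeXPath: string | null;
  lines: string[];
}

// Visit the main frame first, then each child frame recursively.
function snapshotFrames(main: FrameInfo): FrameSnapshot[] {
  const snapshots: FrameSnapshot[] = [];
  const visit = (frame: FrameInfo, depth: number): void => {
    const ordinal = snapshots.length; // position in DFS order
    const indent = "  ".repeat(depth);
    snapshots.push({
      ordinal,
      url: frame.url,
      iframeXPath: frame.iframeXPath,
      lines: frame.nodes.map((n) => `${indent}[${ordinal}-${n.backendNodeId}] ${n.text}`),
    });
    for (const child of frame.children) visit(child, depth + 1);
  };
  visit(main, 0);
  return snapshots;
}

// Stitch the per-frame outlines into one combined outline.
function stitchSnapshots(snapshots: FrameSnapshot[]): string {
  return snapshots.flatMap((s) => s.lines).join("\n");
}
```

Note that every frame in this sketch could contain a node with backend ID 30, and the combined outline would still give each one a distinct label.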
Globally Unique IDs
To preserve uniqueness of nodes across iframes, we assign each one a simple composite ID: a frame ordinal combined with its backend node ID. This gives us a stable way to distinguish otherwise identical node IDs that come from different frames.
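As a minimal sketch of the idea, assuming a "frameOrdinal-backendNodeId" string format (the separator and the helper names encodeId and decodeId are assumptions for illustration, not confirmed details of Stagehand):

```typescript
// Combine a frame ordinal and a backend node ID into one string key.
function encodeId(frameOrdinal: number, backendNodeId: number): string {
  return `${frameOrdinal}-${backendNodeId}`;
}

// Split a composite ID back into its two parts.
function decodeId(encoded: string): { frameOrdinal: number; backendNodeId: number } {
  const match = /^(\d+)-(\d+)$/.exec(encoded);
  if (!match) throw new Error(`Not an encoded id: ${encoded}`);
  return { frameOrdinal: Number(match[1]), backendNodeId: Number(match[2]) };
}

// Two different frames can both report backendNodeId 30 without colliding.
const xpathByEncodedId = new Map<string, string>();
xpathByEncodedId.set(encodeId(0, 30), "/html[1]/body[1]/button[1]");
xpathByEncodedId.set(encodeId(2, 30), "/html[1]/body[1]/form[1]/input[1]");
console.log(xpathByEncodedId.size); // 2
```

A plain backendNodeId key would have let the second entry overwrite the first.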
Deep XPath Locator
To reliably find elements within nested iframes, we added a “deep” XPath locator. It works by breaking the XPath into chunks, and every time it hits an <iframe> step (like /.../iframe[2]/../...), it uses Playwright’s frameLocator() to descend into that frame. It then continues building the path inside that new context. For an XPath like /html/body/iframe[2]/iframe[1]/div[3], the deep locator would handle each iframe hop internally and return a working Playwright Locator for the final element.
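Here is a small sketch of that chunking logic. To keep it runnable without a browser, it is written against a minimal FrameScope interface modeled on the frameLocator() and locator() methods that Playwright's Page and FrameLocator expose. The names splitDeepXPath, deepLocator, and FrameScope are hypothetical, and Stagehand's actual implementation may handle more edge cases.

```typescript
interface DeepPath {
  frameHops: string[]; // XPath to each <iframe> element, relative to the enclosing frame
  elementPath: string; // XPath of the target inside the innermost frame
}

// Cut the XPath after every iframe step.
function splitDeepXPath(xpath: string): DeepPath {
  const steps = xpath.split("/").filter((step) => step.length > 0);
  const frameHops: string[] = [];
  let buffer: string[] = [];
  for (const step of steps) {
    buffer.push(step);
    if (/^iframe(\[\d+\])?$/i.test(step)) {
      frameHops.push("/" + buffer.join("/"));
      buffer = [];
    }
  }
  return { frameHops, elementPath: "/" + buffer.join("/") };
}

// Minimal stand-in for the two Playwright methods this sketch relies on.
interface FrameScope<L> {
  frameLocator(selector: string): FrameScope<L>;
  locator(selector: string): L;
}

// Descend one frame per hop, then locate the element in the last frame.
function deepLocator<L>(root: FrameScope<L>, xpath: string): L {
  const { frameHops, elementPath } = splitDeepXPath(xpath);
  let scope = root;
  for (const hop of frameHops) scope = scope.frameLocator(`xpath=${hop}`);
  return scope.locator(`xpath=${elementPath}`);
}

console.log(splitDeepXPath("/html/body/iframe[2]/iframe[1]/div[3]"));
```

The caller only ever passes one XPath string; the hops between frames stay an internal detail.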
CDP Session Management
We updated how CDP sessions are managed by tying each session directly to its corresponding Page or Frame. Internally, we use a WeakMap to store sessions per frame, so we can easily look them up when needed without keeping extra references around. This makes it easier to send CDP commands to the right frame, especially when dealing with nested or out-of-process iframes.
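The caching pattern described here might look roughly like the following. It is a generic sketch, not Stagehand's code: SessionCache is a hypothetical name, and the create callback stands in for whatever call actually opens a CDP session for a given page or frame.

```typescript
// Cache one session per page/frame object. Because the keys are held weakly,
// an entry disappears once its page or frame is garbage collected.
class SessionCache<T extends object, S> {
  private readonly sessions = new WeakMap<T, S>();
  private readonly create: (target: T) => S;

  constructor(create: (target: T) => S) {
    this.create = create;
  }

  // Return the cached session for this target, creating it on first use.
  get(target: T): S {
    if (!this.sessions.has(target)) {
      this.sessions.set(target, this.create(target));
    }
    return this.sessions.get(target) as S;
  }
}

// Usage with plain objects standing in for frames.
let sessionsOpened = 0;
const demoCache = new SessionCache<object, { sessionNumber: number }>(() => ({
  sessionNumber: ++sessionsOpened,
}));
const demoFrame = {};
demoCache.get(demoFrame);
demoCache.get(demoFrame);
console.log(sessionsOpened); // 1
```

In practice S would likely be a Promise of a session, so that concurrent callers share the same pending session instead of each opening their own.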
End-User Impact
- extract now sees everything.
- observe produces XPaths for elements across all frames.
- act can traverse and act upon elements across all frames.
Most of all, you don’t need to think about any of it—unless you want to.
What’s Next?
The next step is to skip empty iframes entirely. If a frame doesn’t have meaningful content, there’s no reason to waste time indexing it.
But for now, we’re just excited that you no longer have to worry about what kitchen your ingredients are in. Stagehand just lets you cook.
If you’ve been waiting for iframe support—this is it.
Footnotes
- [1] Managing "contexts" is essentially managing the environment or scope in which browser commands run. In CDP, a context usually refers to an individual browsing context such as a specific webpage or iframe, each requiring its own CDP session to control it independently. When switching between Playwright frames, Playwright abstracts this complexity by automatically switching contexts behind the scenes. However, Playwright does not directly provide the ability to send CDP commands that are scoped to a frame, which means you lose Playwright's automated context handling and have to track each session context yourself so that commands reach the correct iframe.
- [2] The Chrome DevTools Protocol (CDP) is Chrome’s remote-debugging API. It exposes the same low-level hooks that power the DevTools UI, but over a public JSON-RPC API. This allows external programs to inspect, control, and automate Chrome browsers directly. Playwright itself uses CDP internally when interacting with Chromium-based browsers.