Mark web pages for use with vision-language models
Overview
WebMarker is a JavaScript library used for adding visual markers and labels to elements on a web page. This can be used for Set-of-Mark prompting, which improves visual grounding abilities of vision-language models such as GPT-4o, Claude 3.5, and Google Gemini 1.5. This library aims to:
- Improve LLM performance on vision tasks referencing web pages
- Enable reliable web page interactions based on LLM responses
How it works
1. Call the mark() function
This marks the interactive elements on the page, and returns an object containing the marked elements, where each key is a mark label string, and each value is an object with the following properties:
- element: The interactive element that was marked.
- markElement: The label element that was added to the page.
- boundingBoxElement: The bounding box element that was added to the page.
A data-mark-label attribute containing the label is also added to each marked element.
You can use this information to build your prompt for the large language model.
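As a rough sketch of how the returned object might feed a prompt, the helper below turns it into one line per mark. `describeMarks` is a hypothetical name, not part of WebMarker; it reads only the `element` property described above, plus the standard DOM properties `tagName` and `textContent`.

```javascript
// Hypothetical helper (not part of WebMarker): summarize marked elements for a prompt.
// Assumes `markedElements` has the shape described above:
// { [label]: { element, markElement, boundingBoxElement } }
function describeMarks(markedElements) {
  return Object.entries(markedElements)
    .map(([label, { element }]) => {
      const tag = element.tagName.toLowerCase();
      // Collapse whitespace and truncate long text so the prompt stays short.
      const text = (element.textContent || "")
        .replace(/\s+/g, " ")
        .trim()
        .slice(0, 40);
      return text ? `- ${label}: <${tag}> "${text}"` : `- ${label}: <${tag}>`;
    })
    .join("\n");
}
```

Giving the model a short text description next to each label is a design choice, not a requirement; the labels alone may be enough when the screenshot is clear.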
2. Send a screenshot of the marked page to a large language model, along with your prompt
Example prompt:
let markedElements = mark();
let prompt = `The following is a screenshot of a web page.
Interactive elements have been marked with red bounding boxes and labels.
When referring to elements, use the labels to identify them.
Return an action and element to perform the action on.
Available actions: click, hover
Available elements:
${Object.keys(markedElements)
  .map((label) => `- ${label}`)
  .join("\n")}
Example response: click 0
`;
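The model's reply then has to be turned back into something a program can act on. The sketch below parses replies in the "click 0" format used by the example prompt; `parseAction` and `ALLOWED_ACTIONS` are hypothetical names, and real model output may need more lenient handling than this.

```javascript
// Hypothetical parser for replies such as "click 0" (the format shown in the example prompt).
const ALLOWED_ACTIONS = ["click", "hover"];

function parseAction(response, markedElements) {
  // Expect "<action> <label>"; the label may contain spaces (e.g. "Element 3").
  const match = response.trim().match(/^(\w+)\s+(.+)$/);
  if (!match) return null;
  const action = match[1].toLowerCase();
  const label = match[2].trim();
  // Reject unknown actions and labels that were never placed on the page.
  if (!ALLOWED_ACTIONS.includes(action)) return null;
  if (!Object.prototype.hasOwnProperty.call(markedElements, label)) return null;
  return { action, label };
}
```

Validating the label against the keys of the marked elements guards against the model naming a mark that does not exist.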
3. Programmatically interact with the marked elements
In a web browser (e.g. via Playwright), interact with elements as needed.
For prompting or web agent ideas, see the WebVoyager paper.
Playwright example
// Inject the WebMarker library into the page
await page.addScriptTag({
  url: "https://cdn.jsdelivr.net/npm/webmarker-js/dist/main.js",
});
// Mark the page and get the marked elements
let markedElements = await page.evaluate(async () => await WebMarker.mark());
// Click a marked element
await page.locator('[data-mark-label="0"]').click();
// (Optional) Check if page is marked
let isMarked = await page.evaluate(async () => await WebMarker.isMarked());
// (Optional) Unmark the page
await page.evaluate(async () => await WebMarker.unmark());
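To connect a model reply to the page, a small dispatcher can map an action and label onto Playwright locator calls. The sketch below is one way to do that; `selectorForLabel` and `performAction` are hypothetical names, and only `page.locator`, `locator.click`, and `locator.hover` are Playwright APIs.

```javascript
// Hypothetical helper: build a CSS attribute selector for a marked element.
// Pass a different attribute name if the markAttribute option was customized.
function selectorForLabel(label, markAttribute = "data-mark-label") {
  // Escape backslashes and double quotes so the label is a valid CSS string.
  const escaped = String(label).replace(/["\\]/g, "\\$&");
  return `[${markAttribute}="${escaped}"]`;
}

// Hypothetical dispatcher: runs a parsed { action, label } pair against a Playwright page.
async function performAction(page, { action, label }, markAttribute) {
  const locator = page.locator(selectorForLabel(label, markAttribute));
  switch (action) {
    case "click":
      await locator.click();
      break;
    case "hover":
      await locator.hover();
      break;
    default:
      throw new Error(`Unsupported action: ${action}`);
  }
}
```

For example, `await performAction(page, { action: "click", label: "0" })` is equivalent to the click shown in the Playwright example.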
Installation
yarn add webmarker-js
API
mark()
Marks the page and returns the marked elements.
const markedElements = mark();
// markedElements["0"].element returns the first marked element
// markedElements["0"].markElement returns the label element for the first marked element
// markedElements["0"].boundingBoxElement returns the bounding box element for the first marked element
Optionally accepts an options object that supports the properties below.
Options
selector
A custom CSS selector to specify which elements to mark.
- Type: string
- Default: 'a[href], button, input:not([type="hidden"]), select, textarea, summary, [role="button"], [tabindex]:not([tabindex="-1"])'
getLabel
Provide a function for generating labels. By default, labels are generated as integers starting from 0.
- Type: (element: Element, index: number) => string
- Default: (_, index) => index.toString()
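As an illustration of the getLabel signature, the sketch below produces spreadsheet-style letter labels instead of integers. `alphabeticLabel` is a hypothetical name; letters are only one possible scheme, chosen here because they are less likely to be confused with numbers that already appear on the page.

```javascript
// Hypothetical getLabel: 0 -> "A", 25 -> "Z", 26 -> "AA", 27 -> "AB", ...
function alphabeticLabel(_element, index) {
  let label = "";
  let n = index;
  do {
    // Prepend the letter for the current base-26 digit.
    label = String.fromCharCode(65 + (n % 26)) + label;
    n = Math.floor(n / 26) - 1;
  } while (n >= 0);
  return label;
}
```

It would be passed as `mark({ getLabel: alphabeticLabel })`.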
markAttribute
A custom attribute to add to the marked elements. This attribute contains the label of the mark.
- Type: string
- Default: "data-mark-label"
markPlacement
The placement of the mark relative to the element.
- Type: 'top' | 'top-start' | 'top-end' | 'right' | 'right-start' | 'right-end' | 'bottom' | 'bottom-start' | 'bottom-end' | 'left' | 'left-start' | 'left-end'
- Default: 'top-start'
markStyle
A CSS style to apply to the label element. You can also specify a function that returns a CSS style object.
- Type: Partial<CSSStyleDeclaration> | (element: Element) => Partial<CSSStyleDeclaration>
- Default: {backgroundColor: "red", color: "white", padding: "2px 4px", fontSize: "12px", fontWeight: "bold"}
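The function form of markStyle can vary the label style per element. The sketch below colors labels by element type; `styleByTag` and the color choices are hypothetical, and it relies only on the standard DOM `tagName` property (uppercase for HTML elements).

```javascript
// Hypothetical markStyle function: color the label by element type.
function styleByTag(element) {
  const base = {
    color: "white",
    padding: "2px 4px",
    fontSize: "12px",
    fontWeight: "bold",
  };
  switch (element.tagName) {
    case "A":
      return { ...base, backgroundColor: "blue" };
    case "INPUT":
    case "TEXTAREA":
    case "SELECT":
      return { ...base, backgroundColor: "green" };
    default:
      return { ...base, backgroundColor: "red" };
  }
}
```

It would be passed as `mark({ markStyle: styleByTag })`. If label colors carry meaning like this, the prompt should describe the color scheme rather than saying that all marks are red.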
boundingBoxStyle
A CSS style to apply to the bounding box element. You can also specify a function that returns a CSS style object. Bounding boxes are only shown if showBoundingBoxes is true.
- Type: Partial<CSSStyleDeclaration> | (element: Element) => Partial<CSSStyleDeclaration>
- Default: {outline: "2px dashed red", backgroundColor: "transparent"}
showBoundingBoxes
Whether or not to show bounding boxes around the elements.
- Type: boolean
- Default: true
containerElement
The container element within which to query for elements to mark. By default, this is document.body.
- Type: Element
- Default: document.body
viewPortOnly
Only mark elements that are visible in the current viewport.
- Type: boolean
- Default: false
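For intuition about what "visible in the current viewport" can mean, the sketch below shows one common overlap test on a bounding rectangle. This is illustrative only and is not taken from WebMarker's source; whether the library requires full or partial visibility is not stated here. `overlapsViewport` is a hypothetical name.

```javascript
// Illustrative only: one common way to test whether a rectangle overlaps the viewport.
// `rect` has the shape returned by element.getBoundingClientRect().
function overlapsViewport(rect, viewportWidth, viewportHeight) {
  return (
    rect.bottom > 0 &&
    rect.right > 0 &&
    rect.top < viewportHeight &&
    rect.left < viewportWidth
  );
}
```

In a browser this could be called as `overlapsViewport(el.getBoundingClientRect(), window.innerWidth, window.innerHeight)`.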
Advanced example
const markedElements = mark({
  // Only mark buttons and inputs
  selector: "button, input",
  // Use test id attribute for marker labels
  markAttribute: "data-test-id",
  // Use a blue mark with white text
  markStyle: { color: "white", backgroundColor: "blue", padding: "5px" },
  // Use a blue dashed outline with a transparent and slightly blue background
  boundingBoxStyle: { outline: "2px dashed blue", backgroundColor: "rgba(0, 0, 255, 0.1)" },
  // Place the mark at the top right corner of the element
  markPlacement: "top-end",
  // Show bounding boxes over elements (defaults to true)
  showBoundingBoxes: true,
  // Generate labels as 'Element 0', 'Element 1', 'Element 2'...
  // Defaults to '0', '1', '2'... if not provided.
  getLabel: (element, index) => `Element ${index}`,
  // A custom container element to query the elements to be marked.
  // Defaults to the document.body.
  containerElement: document.body.querySelector("main"),
  // Only mark elements that are visible in the current viewport
  viewPortOnly: true,
});
isMarked()
Checks whether the page is marked.
// Use a different variable name so the isMarked function is not shadowed
const pageIsMarked = isMarked();
unmark()
Unmarks the page.
unmark();
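When marks are only needed for one screenshot-and-act cycle, it can help to guarantee cleanup even if something fails along the way. The sketch below wraps that cycle for Playwright; `withMarks` is a hypothetical name, and it assumes WebMarker has already been injected into the page so that the `WebMarker` global exists there. It returns only the label strings from the page, since those are plain serializable values.

```javascript
// Hypothetical helper: mark the page, run a callback, and always unmark afterwards.
// Assumes the WebMarker script has already been injected into the page.
async function withMarks(page, callback) {
  // Collect only the label strings inside the page context.
  const labels = await page.evaluate(async () =>
    Object.keys(await WebMarker.mark())
  );
  try {
    return await callback(labels);
  } finally {
    // Runs whether the callback succeeded or threw.
    await page.evaluate(async () => {
      await WebMarker.unmark();
    });
  }
}
```

A callback might take a screenshot with `page.screenshot()`, send it to the model together with the labels, and perform the returned action before the marks are removed.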