| name | web-data-extraction |
| description | Extract structured data from websites using RLM decomposition and the browser tools bundled with this example's self-contained project setup. |
Web Data Extraction with RLM (Self-Contained Example)
Use this skill from examples/web-data-extraction/.
This example is isolated with its own .pi/settings.json, local package setup, and wrapper scripts so it does not interfere with other projects.
Setup (run once)
npm run setup
This installs npm:pi-rlm scoped to this example and prepares the local browser-tools dependencies under .pi/.
Then start Chrome for browser automation:
./scripts/browser-start.sh
./scripts/browser-start.sh --profile
Important Runtime Rule
RLM subtask nodes run with --no-skills, so they do not load this skill's instructions on their own.
Task instructions must therefore tell solver nodes explicitly to use the local wrappers (not bare browser-content.js):
./scripts/browser-content.sh
./scripts/browser-eval.sh
./scripts/browser-nav.sh
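One way to make sure the wrapper rule always reaches the solver nodes is to append it to every task prompt before handing it to the rlm tool. The sketch below is only an illustration: the wrapper paths come from this example, but the helper name `with_wrapper_rule` and the exact wording of the appended instruction are assumptions, not part of pi-rlm.

```python
# Wrapper paths are the ones listed in this example.
WRAPPERS = (
    "./scripts/browser-content.sh",
    "./scripts/browser-eval.sh",
    "./scripts/browser-nav.sh",
)


def with_wrapper_rule(task: str, wrappers=WRAPPERS) -> str:
    """Append an explicit 'use these wrappers' instruction to a task prompt.

    Hypothetical helper: the name and the wording of the rule are illustrative.
    """
    lines = [
        task.rstrip(),
        "",
        "Use only these local wrapper scripts for browser access:",
    ]
    lines += [f"- {path}" for path in wrappers]
    return "\n".join(lines)


if __name__ == "__main__":
    print(with_wrapper_rule("Extract all book titles from https://books.toscrape.com/."))
```

Keeping the rule in one place means a new prompt cannot silently omit it.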
Usage
Single page extraction
Use the rlm tool to extract all product information from https://books.toscrape.com/.
For each book, extract: title, price, rating, and availability.
Return the results as a JSON array.
Use ./scripts/browser-content.sh for page extraction and ./scripts/browser-eval.sh for targeted DOM queries.
Start the browser first with ./scripts/browser-start.sh if it is not already running.
RLM settings: backend=sdk, mode=auto, maxDepth=2, maxNodes=12, toolsProfile=coding
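Because the prompt above asks for a JSON array with four fields per book, it can help to check the returned text before using it. The following is a minimal sketch, assuming the result arrives as a JSON string and that the keys are spelled exactly `title`, `price`, `rating`, and `availability`; the function name `validate_books` is hypothetical and the sample record is made up for illustration.

```python
import json

# Field names taken from the prompt; their exact spelling in the output is an assumption.
REQUIRED_FIELDS = ("title", "price", "rating", "availability")


def validate_books(raw: str) -> list[dict]:
    """Parse a JSON array and check that every record has the requested fields."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing: {', '.join(missing)}")
    return records


if __name__ == "__main__":
    # Made-up sample record, not real scraped data.
    sample = '[{"title": "Example Book", "price": "10.00", "rating": 3, "availability": "In stock"}]'
    print(len(validate_books(sample)), "record(s) passed validation")
```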
Multi-page extraction
Use the rlm tool to extract conference talk information from these pages:
- https://example.com/talks/day1
- https://example.com/talks/day2
- https://example.com/talks/day3
For each talk extract: title, speaker, time slot, room, and abstract.
Decompose by page so each page is scraped in parallel.
Return a unified JSON array sorted by time slot.
Use ./scripts/browser-content.sh for full page extraction.
Start the browser first with ./scripts/browser-start.sh if it is not already running.
RLM settings: backend=sdk, mode=decompose, maxDepth=1, maxNodes=8, toolsProfile=coding
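The "unified JSON array sorted by time slot" step can also be done or double-checked outside RLM. This sketch assumes each per-page result is a list of dicts, that each talk carries a hypothetical integer `day` field, and that `time_slot` starts with a zero-padded 24-hour time such as `09:30`, so that plain string comparison gives chronological order. None of these field formats are guaranteed by the prompt.

```python
def merge_talks(pages: list[list[dict]]) -> list[dict]:
    """Flatten per-page talk lists and sort them by (day, time_slot).

    Assumes 'day' is an int and 'time_slot' begins with zero-padded HH:MM,
    which makes lexicographic order match chronological order.
    """
    merged = [talk for page in pages for talk in page]
    return sorted(merged, key=lambda talk: (talk["day"], talk["time_slot"]))


if __name__ == "__main__":
    # Made-up records for illustration only.
    day1 = [{"day": 1, "time_slot": "14:00", "title": "B"},
            {"day": 1, "time_slot": "09:30", "title": "A"}]
    day2 = [{"day": 2, "time_slot": "09:00", "title": "C"}]
    for talk in merge_talks([day2, day1]):
        print(talk["day"], talk["time_slot"], talk["title"])
```

If the scraped time slots are not zero-padded (for example `9:30 AM`), they would need to be parsed into real time values before sorting.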
Paginated extraction
Use the rlm tool to extract all job listings from https://example.com/careers.
The page is paginated; extract from the first 5 pages only.
For each listing extract: title, department, location, and posting date.
Return as a JSON array.
Use ./scripts/browser-nav.sh and ./scripts/browser-eval.sh to navigate pages and extract listings.
Start the browser first with ./scripts/browser-start.sh if it is not already running.
RLM settings: backend=sdk, mode=auto, maxDepth=2, maxNodes=16, toolsProfile=coding
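Paginated sites can repeat a listing on two consecutive pages when new postings shift the list during the crawl. A small post-processing step, sketched below, caps the input at five pages and drops exact duplicates. The helper name `collect_listings` and the key spelling `posting_date` are assumptions for illustration; the four fields themselves come from the prompt.

```python
from itertools import islice

MAX_PAGES = 5  # matches "the first 5 pages" in the prompt
# Key names are assumed; adjust them to whatever the extraction actually returns.
FIELDS = ("title", "department", "location", "posting_date")


def collect_listings(pages, max_pages: int = MAX_PAGES) -> list[dict]:
    """Concatenate at most max_pages pages of listings, skipping exact duplicates."""
    seen = set()
    unique = []
    for page in islice(pages, max_pages):
        for listing in page:
            key = tuple(listing.get(field) for field in FIELDS)
            if key not in seen:
                seen.add(key)
                unique.append(listing)
    return unique


if __name__ == "__main__":
    # Made-up records for illustration only.
    job = {"title": "Engineer", "department": "R&D",
           "location": "Remote", "posting_date": "2024-01-01"}
    print(len(collect_listings([[job], [job]])), "unique listing(s)")
```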