| name | opendevbrowser-data-extraction |
| description | This skill should be used when the user asks to "extract data from a page", "scrape tables", "collect paginated results", "parse list/card content", or "export structured web data" with OpenDevBrowser. |
| version | 2.0.0 |
Data Extraction Skill
Use this skill to extract structured, auditable datasets from dynamic pages with compliance-aware workflows.
Pack Contents
artifacts/extraction-workflows.md
assets/templates/extraction-schema.json
assets/templates/pagination-state.json
assets/templates/quality-gates.json
assets/templates/compliance-checklist.md
scripts/run-extraction-workflow.sh
scripts/validate-skill-assets.sh
- Shared robustness matrix: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md
Fast Start
./skills/opendevbrowser-data-extraction/scripts/validate-skill-assets.sh
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh list
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh pagination
./skills/opendevbrowser-data-extraction/scripts/run-extraction-workflow.sh infinite-scroll
Supporting Surfaces
- Use browser replay (screencast-start / screencast-stop) when lazy loading, infinite scroll, or pagination drift needs time-based proof.
- Use desktop observation only for read-only evidence around sibling desktop surfaces; most extraction flows should stay browser-only.
- Use --challenge-automation-mode off|browser|browser_with_helper only for bounded browser-scoped computer use when provider challenges appear; stop before any desktop-control interpretation.
Core Rules
- Define schema before extraction.
- Track provenance for each record (source_url, provider, captured_at, page).
- Prefer embedded structured data (JSON-LD/microdata) where available.
- Stop on sustained anti-bot pressure (repeated 403/429/challenge loops).
- Honor Retry-After and preserve checkpoint state before retrying pagination.
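The provenance rule above can be sketched as a small wrapper applied to every extracted record. This is a minimal illustration: the four field names come from the rule itself, while the `with_provenance` helper, the record shape, and the example URL are assumptions and not part of OpenDevBrowser.

```python
from datetime import datetime, timezone


def with_provenance(record: dict, source_url: str, provider: str, page: int) -> dict:
    """Return a copy of `record` with the four provenance fields attached.

    Hypothetical helper: only the field names are taken from the rule above.
    """
    return {
        **record,
        "source_url": source_url,
        "provider": provider,
        # Timezone-aware UTC timestamp so captures from different workers compare cleanly.
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "page": page,
    }


row = with_provenance(
    {"title": "Example item"},
    source_url="https://example.com/list?page=2",
    provider="example",
    page=2,
)
```

Stamping records at capture time, rather than at export time, keeps the audit trail accurate when a run is resumed from a checkpoint.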
Parallel Multitab Alignment
- Apply the shared concurrency policy from ../opendevbrowser-best-practices/SKILL.md ("Parallel Operations").
- Run extraction acceptance on managed, extension, and cdpConnect before claiming mode parity.
- Keep one session per worker; avoid interleaving target-use streams inside a single session.
Robustness Coverage (Known-Issue Matrix)
Matrix source: ../opendevbrowser-best-practices/artifacts/browser-agent-known-issues-matrix.md
ISSUE-01: stale refs after dynamic content updates
ISSUE-06: 429/backoff and retry budgeting
ISSUE-08: blocked/restricted origins and policy checks
ISSUE-09: pagination drift, duplicate accumulation, terminal detection
ISSUE-10: locale/currency parsing consistency
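To make ISSUE-10 concrete, here is a rough sketch of normalizing price strings whose decimal separator varies by locale. The `parse_amount` function and its heuristic are assumptions for illustration only; the matrix may prescribe a different approach, and the heuristic will misread ambiguous inputs that a known page locale would resolve.

```python
import re
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse a price string whose decimal mark may be '.' or ','.

    Assumed heuristic: the last separator is the decimal mark when it is
    followed by one or two digits; otherwise every separator is a grouping mark.
    Raises decimal.InvalidOperation when the text contains no digits.
    """
    digits = re.sub(r"[^\d.,]", "", text)  # drop currency symbols and spaces
    last = max(digits.rfind("."), digits.rfind(","))
    if last != -1 and 1 <= len(digits) - last - 1 <= 2:
        whole = re.sub(r"[.,]", "", digits[:last])
        return Decimal(f"{whole or '0'}.{digits[last + 1:]}")
    return Decimal(re.sub(r"[.,]", "", digits))
```

Returning `Decimal` rather than `float` avoids rounding surprises when amounts are later compared or summed across pages.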
Extraction Planning
- Define required fields and null policy.
- Snapshot and map refs to schema.
- Choose pagination strategy.
- Apply quality gates on each page.
opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
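The first planning step (required fields and a null policy) might look like the sketch below. The `SCHEMA` layout and the `apply_schema` helper are hypothetical; they are not the contents of assets/templates/extraction-schema.json, which should be treated as the actual source of truth.

```python
# Hypothetical schema: each field declares whether it is required and,
# for optional fields, which default stands in for a missing value.
SCHEMA = {
    "title": {"required": True},
    "price": {"required": True},
    "rating": {"required": False, "default": None},
}


def apply_schema(raw: dict, schema: dict = SCHEMA) -> tuple[dict, list[str]]:
    """Project `raw` onto the schema and report missing required fields."""
    record, missing = {}, []
    for field, rule in schema.items():
        value = raw.get(field)
        if value in (None, ""):
            if rule["required"]:
                missing.append(field)
            value = rule.get("default")  # None when no default is declared
        record[field] = value
    return record, missing


record, missing = apply_schema({"title": "A", "price": "", "extra": 1})
```

Fields not named in the schema are dropped, so the output shape stays stable even when a page exposes extra attributes.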
Structured Data First
Attempt extraction in this order:
- JSON-LD product/article blocks
- semantic table/list/card DOM
- fallback text parsing
opendevbrowser_dom_get_text sessionId="<session-id>" ref="<json-ld-ref>"
opendevbrowser_dom_get_html sessionId="<session-id>" ref="<table-ref>"
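Once page HTML is in hand, the JSON-LD-first step can be sketched with the Python standard library alone. This assumes the HTML has already been retrieved as a string; `JsonLdParser` and `extract_json_ld` are illustrative names, not OpenDevBrowser APIs.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: leave it for the DOM fallback
            self._buffer = None


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD block in `html` that parses as JSON."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

An empty result is the signal to move on to the semantic table/list/card DOM, and then to text parsing.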
Pagination Patterns
Numbered/Next pagination
opendevbrowser_click sessionId="<session-id>" ref="<next-ref>"
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
opendevbrowser_snapshot sessionId="<session-id>" format="actionables"
Infinite scroll
opendevbrowser_scroll sessionId="<session-id>" dy=1000
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
Load more
opendevbrowser_click sessionId="<session-id>" ref="<load-more-ref>"
opendevbrowser_wait sessionId="<session-id>" until="networkidle"
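All three patterns share the same control loop: advance, wait, collect, and stop when a step yields nothing new. A minimal sketch of that loop follows, with checkpointing and terminal detection. Here `fetch_page` is a hypothetical callable standing in for the click/scroll, wait, and snapshot commands above, and the `state` layout is an assumption rather than the format of assets/templates/pagination-state.json.

```python
from typing import Callable, Iterable


def paginate(fetch_page: Callable[[int], Iterable[dict]], key: str, max_pages: int = 50) -> dict:
    """Collect records step by step until a step adds nothing new.

    `fetch_page(n)` is assumed to return the records visible after advancing
    to step n (next page, another scroll, or another "load more" click).
    """
    state = {"page": 0, "seen": set(), "records": [], "stop_reason": None}
    while state["page"] < max_pages:
        page = state["page"] + 1
        fresh = []
        for record in fetch_page(page):
            if record[key] not in state["seen"]:  # dedupe across and within pages
                state["seen"].add(record[key])
                fresh.append(record)
        if not fresh:
            state["stop_reason"] = "no_new_records"  # terminal page or drift
            break
        state["records"].extend(fresh)
        state["page"] = page  # checkpoint: last step fully processed
    else:
        state["stop_reason"] = "max_pages"
    return state
```

Because `state["page"]` only advances after a step is fully processed, persisting `state` between steps lets a retry resume from the last good page instead of restarting.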
Quality Gates
Apply per page:
- dedupe by stable key (URL or canonical ID)
- null-rate check for required fields
- count delta check (each page must add at least one new unique record)
- consistency check for currency/units
- max consecutive challenge/429 loops before stop
Use assets/templates/quality-gates.json.
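As an illustration of the first three gates, the sketch below checks one page of records. The `check_page` function, its return shape, and the 10% null-rate threshold are assumptions; real thresholds belong in the quality-gates template.

```python
def check_page(records: list[dict], seen: set, key: str,
               required: list[str], max_null_rate: float = 0.1) -> dict:
    """Run dedupe, null-rate, and count-delta gates for one page.

    `seen` holds stable keys from earlier pages and is updated in place.
    """
    unique = []
    for record in records:
        k = record.get(key)
        if k is not None and k not in seen:
            seen.add(k)
            unique.append(record)
    cells = len(unique) * len(required)
    nulls = sum(1 for r in unique for f in required if r.get(f) in (None, ""))
    null_rate = nulls / cells if cells else 0.0
    return {
        "new_records": unique,
        "count_delta_ok": len(unique) > 0,  # page must contribute something new
        "null_rate": null_rate,
        "null_rate_ok": null_rate <= max_null_rate,
    }
```

A failed gate is a reason to pause and inspect the page, not to discard it silently; keeping `new_records` in the result preserves the evidence.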
Compliance and Safety
- Respect robots directives and site terms.
- Use pacing; do not flood endpoints.
- Treat robots directives as policy guidance, not as authorization.
- Stop or back off on repeated 429/403 and challenges.
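The back-off and stop rules can be sketched as two small helpers. `retry_after_seconds` handles the two value forms HTTP allows for Retry-After (delta-seconds or an HTTP-date), and `should_stop` treats a run of consecutive 403/429 responses as sustained pressure. Both function names and the default budget of three are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value to a number of seconds to wait."""
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    now = now or datetime.now(timezone.utc)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    return max(0.0, (parsedate_to_datetime(value) - now).total_seconds())


def should_stop(statuses: list[int], max_consecutive: int = 3) -> bool:
    """True when the most recent `max_consecutive` responses are all 403/429."""
    tail = statuses[-max_consecutive:]
    return len(tail) == max_consecutive and all(s in (403, 429) for s in tail)
```

Saving checkpoint state before sleeping for the computed delay keeps a retry from re-walking pages that were already collected.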
References