debug-crawler
// Investigate a failing crawler from an issues.json artifact URL and propose a fix. Covers fetching error details, inspecting source data via Zyte, and common failure patterns.
| name | debug-crawler |
| description | Investigate a failing crawler from an issues.json artifact URL and propose a fix. Covers fetching error details, inspecting source data via Zyte, and common failure patterns. |
| argument-hint | <issues.json URL> |
| allowed-tools | Read, Edit, Glob, Grep, Bash, WebFetch |
The user has provided an issues.json artifact URL: $ARGUMENTS
Read zavod/docs as needed to understand how crawlers are normally written — the goal
here is to fix the failing crawler in accordance with existing practices, not to
refactor or standardise it.
Fetch the issues.json URL to understand the error:
WebFetch <issues.json URL>
prompt: "Show all issues, especially errors and warnings. Include full message text and any data fields."
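When the artifact has been downloaded rather than read through WebFetch, the same triage can be sketched in Python. This is a minimal sketch that assumes the file is a JSON object with an `issues` list whose entries carry `level` and `message` keys; that shape, the helper name `summarize_issues`, and the sample records are illustrative assumptions, not taken from a real run.

```python
from collections import Counter
from typing import Any


def summarize_issues(payload: dict[str, Any]) -> tuple[Counter, list[dict[str, Any]]]:
    """Count issues per level and return errors and warnings in source order.

    Assumes payload looks like {"issues": [{"level": ..., "message": ...}, ...]}.
    """
    issues = payload.get("issues", [])
    counts = Counter(issue.get("level", "unknown") for issue in issues)
    serious = [issue for issue in issues if issue.get("level") in ("error", "warning")]
    return counts, serious


# Made-up sample records, for illustration only.
sample = {
    "issues": [
        {"level": "warning", "message": "Unexpected keys in audit_data"},
        {"level": "info", "message": "Fetched source file"},
        {"level": "error", "message": "Expected column not found"},
    ]
}

counts, serious = summarize_issues(sample)
for issue in serious:
    print(issue["level"], "-", issue["message"])
```

Keeping the full issue dictionaries, rather than only the messages, preserves any extra data fields for later inspection.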
Note the dataset name (e.g. us_ne_med_exclusions), then locate its metadata file:
Glob datasets/**/<dataset_name>.yml
Read the crawler's .yml and crawler.py.
The source has likely changed. Use OPENSANCTIONS_ZYTE_API_KEY (already set in the
environment) to fetch via Zyte when direct access times out or is blocked:
python3 -c "
import os
from base64 import b64decode

import requests

ZYTE_API_KEY = os.environ['OPENSANCTIONS_ZYTE_API_KEY']
url = '<data_url from .yml>'

# Ask Zyte to fetch the page and return the raw body, base64-encoded
resp = requests.post(
    'https://api.zyte.com/v1/extract',
    auth=(ZYTE_API_KEY, ''),
    json={'url': url, 'httpResponseBody': True, 'httpResponseHeaders': True},
    timeout=60,
)
resp.raise_for_status()
content = b64decode(resp.json()['httpResponseBody'])
# then parse content as appropriate for the source format
"
Add 'geolocation': 'US' (or the relevant country code) to the Zyte request when
the source geo-restricts access — and add the matching geolocation= argument to
the fetch_resource / fetch_html call in the crawler.
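The optional geolocation field can be kept out of the request unless it is needed. The sketch below builds the same JSON body as the request above and adds `geolocation` only when a country code is given; the helper name `build_zyte_payload` is hypothetical and not part of zavod or of any Zyte client library.

```python
from typing import Any, Optional


def build_zyte_payload(url: str, geolocation: Optional[str] = None) -> dict[str, Any]:
    """Build the JSON body for a Zyte extract request.

    The geolocation key is included only when a country code such as "US" is passed.
    """
    payload: dict[str, Any] = {
        "url": url,
        "httpResponseBody": True,
        "httpResponseHeaders": True,
    }
    if geolocation is not None:
        payload["geolocation"] = geolocation
    return payload


# Placeholder URL, for illustration only.
print(build_zyte_payload("https://example.com/list.csv", geolocation="US"))
```

The returned dictionary would be passed as the `json=` argument of the `requests.post` call.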
Compare what the source actually contains against what the crawler expects.
| Symptom | Cause | Fix |
|---|---|---|
| Expected field/column not found | Source renamed or restructured columns | Update the crawler to match the new structure |
| First page parses fine, later pages fail | Per-page header handling no longer matches source | Adjust header-reading logic to match current source |
| 403 / empty response from Zyte | Source geo-restricts content | Add geolocation= to the fetch call |
| Assertion on entity count fails | Source grew or shrank | Verify the count is real, then update assertions: bounds |
| Unexpected keys in audit_data | New columns added to source | Pop and handle (or explicitly ignore) the new fields |
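For the renamed-column and new-column rows above, a quick header diff can show what changed before any crawler code is touched. This is a minimal sketch that assumes the source is a CSV file with a single header row; `diff_columns` and the sample column names are hypothetical, and the expected list would come from the field names the crawler actually reads.

```python
import csv
import io


def diff_columns(
    expected: list[str], content: bytes, encoding: str = "utf-8"
) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) column names for a CSV payload.

    missing: columns the crawler expects that are absent from the header.
    unexpected: header columns the crawler does not know about.
    """
    reader = csv.reader(io.StringIO(content.decode(encoding)))
    header = [column.strip() for column in next(reader, [])]
    missing = [column for column in expected if column not in header]
    unexpected = [column for column in header if column not in expected]
    return missing, unexpected


# Made-up column names, for illustration only.
missing, unexpected = diff_columns(
    ["Name", "DOB"], b"Name,Date of Birth,NPI\nJane Doe,1970-01-01,123\n"
)
print("missing:", missing)
print("unexpected:", unexpected)
```

A column that shows up in both lists under slightly different names usually points to a rename rather than a removal.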
After making code changes, delete the cached source file so the fresh copy is fetched:
rm -f data/datasets/<dataset_name>/source.*
zavod crawl datasets/<path>/<dataset_name>.yml
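The cache-clearing step can also be scripted when several datasets need a fresh fetch. The sketch below mirrors `rm -f data/datasets/<dataset_name>/source.*` using only the standard library; the function name `clear_cached_source` is hypothetical, and the directory layout is assumed to match the path in the command above.

```python
from pathlib import Path


def clear_cached_source(data_root: Path, dataset_name: str) -> list[Path]:
    """Delete files named source.* under <data_root>/datasets/<dataset_name>.

    Returns the paths that were removed. A missing directory yields an empty list.
    """
    dataset_dir = data_root / "datasets" / dataset_name
    removed: list[Path] = []
    for path in sorted(dataset_dir.glob("source.*")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `clear_cached_source(Path("data"), "<dataset_name>")` from the repository root would correspond to the shell command; other files in the dataset directory are left alone.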
Check data/datasets/<dataset_name>/issues.log for remaining warnings. Then export
and confirm the delta is plausible:
zavod export --rebuild-store datasets/<path>/<dataset_name>.yml
A healthy run shows entity counts within the assertions: bounds in the .yml.
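Checking counts against bounds by hand can be error-prone on larger datasets, so a small comparison helper may help. This is a sketch under stated assumptions: the per-schema counts and the minimum and maximum bounds are passed in as plain dictionaries, and how they are read from the export output and the .yml is left out because that mapping is not described here. The name `out_of_bounds` and the sample numbers are hypothetical.

```python
def out_of_bounds(
    counts: dict[str, int], minimums: dict[str, int], maximums: dict[str, int]
) -> list[str]:
    """Return one message per schema whose count falls outside its bounds."""
    problems: list[str] = []
    for schema, low in minimums.items():
        actual = counts.get(schema, 0)
        if actual < low:
            problems.append(f"{schema}: {actual} is below minimum {low}")
    for schema, high in maximums.items():
        actual = counts.get(schema, 0)
        if actual > high:
            problems.append(f"{schema}: {actual} is above maximum {high}")
    return problems


# Made-up numbers, for illustration only.
for problem in out_of_bounds({"Person": 90}, {"Person": 100}, {"Person": 500}):
    print(problem)
```

An empty result means every checked schema is within bounds; a schema missing from the counts is treated as zero.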