crawler-pep
// Scaffold a new PEP (Politically Exposed Persons) crawler from a source URL or GitHub issue
| name | crawler-pep |
| description | Scaffold a new PEP (Politically Exposed Persons) crawler from a source URL or GitHub issue |
Create a new PEP crawler. The user will provide a target path, source data URL, and/or a GitHub issue URL: $ARGUMENTS
If given a GitHub issue URL, fetch it first to extract the data source URL and any context about the dataset.
Read upfront:
- .claude/docs/crawler-guide.md: shared crawler patterns (YAML, fetching, entities, helpers, lookups)

Consult on demand (open only when you actually need the section; don't pre-load):
- .claude/skills/crawler-pep/examples.md: full code examples (Patterns A/B/C, subnational variant, freshness check, qsv recipes). Open when you're stuck on a pattern or want a worked example.
- zavod/docs/peps.md: depth on Position naming, categorise(), Occupancy duration rules. Open if you need more than the summary in this skill.
- zavod/docs/metadata.md: full YAML field reference. Open if you're using a field not covered by the template in crawler-guide.md.
- zavod/docs/extract/names.md: open only if you're doing LLM-assisted or reviewed name cleaning.

Prefer section reads over full reads. All of these docs are well-headered: use Grep to find the symbol/topic you need (make_occupancy, apply_date, coverage.start, etc.) and Read with offset/limit instead of reading the whole file.
Do NOT search the repo for similar crawlers. Use only the files listed above.
In addition to the general checks (fields, date formats, language, record count):
- … zavod/docs/peps.md; skip QIDs for per-municipality / per-region positions.)
- … examples.md → "New-election freshness check".
- … Agent with WebSearch/WebFetch) to find the legal document (electoral law, constitution, official government guidance) that stipulates the citizenship requirement for this specific position. In a code comment next to the person.add("citizenship", ...) call (or, if citizenship is not required, next to the omission), include the URL to that legal document.

Full field reference: zavod/docs/metadata.md. PEP-specific additions:
tags:
  - list.pep
assertions:
  min:
    schema_entities:
      Person: 100  # ~80% of expected count
      Position: 1
    country_entities:
      cc: 50
  max:
    schema_entities:
      Person: 1000  # ~150% of expected count
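
The "~80%" and "~150%" comments in the template imply a small piece of arithmetic. A minimal sketch of it follows; the helper name assertion_bounds and its percentage parameters are hypothetical and not part of zavod.

```python
def assertion_bounds(expected: int, low_pct: int = 80, high_pct: int = 150) -> dict:
    """Derive min/max Person assertion values from an expected record count.

    Hypothetical helper: integer arithmetic avoids float rounding surprises.
    The minimum is rounded down, the maximum is rounded up.
    """
    if expected <= 0:
        raise ValueError("expected must be a positive record count")
    return {
        "min": expected * low_pct // 100,
        "max": (expected * high_pct + 99) // 100,
    }


# Example: a source that is expected to list about 650 office holders.
bounds = assertion_bounds(650)
print(f"min Person: {bounds['min']}, max Person: {bounds['max']}")
```

The resulting numbers would be copied by hand into the min and max blocks of the dataset YAML.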
- … Position counts in assertions when the crawler creates multiple position types.
- … frequency matches source update cadence (daily/weekly/monthly). PEP crawlers do not have to be monthly.
- … position lookup to translate non-English role labels into standard English names (see examples.md). Beyond that, lookups rarely go past type.*.

from zavod import Context
from zavod import helpers as h
from zavod.entity import Entity
from zavod.stateful.positions import PositionCategorisation, categorise
Build position names with h.make_position. Rules:
- … Landtag of Mecklenburg-Vorpommern). When the source labels roles in another language, declare a position lookup in the YAML to translate them before passing to h.make_position.
- … citizenship (except UK Parliament).
- wikidata_id becomes the position's entity ID, so never pass the same QID to multiple distinct positions, because they would collapse into one entity. Per-municipality/region positions usually omit wikidata_id (per-locality QIDs rarely exist on Wikidata) and rely on subnational_area=... to disambiguate; pass a QID only when each subnational position has its own unique Wikidata entry.

Depth on edge cases: zavod/docs/peps.md → "Selecting a position name".
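
The rule that one QID must not be shared by several distinct positions can be checked before any entity is emitted. This is a minimal sketch in plain Python; find_shared_qids, the position names and the QID strings are placeholders invented for illustration, not zavod helpers or real Wikidata IDs.

```python
from collections import defaultdict
from typing import Optional


def find_shared_qids(
    positions: list[tuple[str, Optional[str]]],
) -> dict[str, list[str]]:
    """Return QIDs that are attached to more than one distinct position name.

    `positions` is a list of (position_name, wikidata_id) pairs collected by
    the crawler. Positions without a QID (None) are ignored.
    """
    names_by_qid: dict[str, set[str]] = defaultdict(set)
    for name, qid in positions:
        if qid is not None:
            names_by_qid[qid].add(name)
    return {
        qid: sorted(names)
        for qid, names in names_by_qid.items()
        if len(names) > 1
    }


# Placeholder data: two mayors wrongly share one QID, a third omits it.
planned = [
    ("Member of the Example Parliament", "Q_PLACEHOLDER_1"),
    ("Mayor of Town A", "Q_PLACEHOLDER_2"),
    ("Mayor of Town B", "Q_PLACEHOLDER_2"),
    ("Mayor of Town C", None),
]
clashes = find_shared_qids(planned)
print(clashes)
```

A non-empty result suggests dropping the QID from the per-locality positions and relying on subnational_area instead.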
Full reference: zavod/docs/peps.md. categorise() is a stateful DB operation; is_pep/topics only matter on first insertion — subsequent crawls return DB values (including UI edits).
Three is_pep calling patterns:
| is_pep arg | When to use | Example datasets |
|---|---|---|
| True | Source definitionally contains PEPs (parliament, cabinet, judges) | fr_assemblee, ie_parliament, ky_judicial |
| None | Mixed dataset, or per-locality positions where UI decides PEP status | fr_hatvp_declarations, lu_bourgmestres |
| False | Explicitly not PEP (rare in PEP crawlers) | - |
Pass the returned categorisation to make_occupancy().
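
To make the "only matters on first insertion" behaviour concrete, here is a toy in-memory model. It is not zavod's implementation: InMemoryPositionStore, StoredCategorisation and the edit() method are hypothetical stand-ins for the database and the review UI.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StoredCategorisation:
    is_pep: Optional[bool]
    topics: list[str] = field(default_factory=list)


class InMemoryPositionStore:
    """Toy stand-in for the stateful store behind categorise()."""

    def __init__(self) -> None:
        self._rows: dict[str, StoredCategorisation] = {}

    def categorise(
        self,
        position_id: str,
        is_pep: Optional[bool] = None,
        topics: Optional[list[str]] = None,
    ) -> StoredCategorisation:
        row = self._rows.get(position_id)
        if row is None:
            # First insertion: the crawler's arguments are stored.
            row = StoredCategorisation(is_pep=is_pep, topics=list(topics or []))
            self._rows[position_id] = row
        # Later calls ignore the arguments and return the stored row.
        return row

    def edit(self, position_id: str, is_pep: bool) -> None:
        """Simulate a reviewer changing the PEP status in the UI."""
        self._rows[position_id].is_pep = is_pep


store = InMemoryPositionStore()
first = store.categorise("pos-example", is_pep=None)  # undecided on first crawl
store.edit("pos-example", is_pep=True)                # reviewer decides
later = store.categorise("pos-example", is_pep=False) # argument is ignored
print(later.is_pep)
```

In this model, changing the is_pep argument in crawler code after the first crawl has no effect on positions that are already stored.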
- … zavod/docs/peps.md)
- … make_occupancy(); it reads them to determine PEP status.
- make_occupancy() returns None if the occupancy doesn't meet PEP criteria. Only emit persons with at least one valid occupancy.
- … make_occupancy; it mutates person.topics.
- … person.add("topics", "role.judge").
- … crawl() that fails when the source signature changes (dataset count, page URL, etc.).

no_end_implies_current
- True (default): no end date → still in office. Use for live official rosters.
- False: no end date → unknown. Use for declarations, point-in-time snapshots, historical data.

LLM-assisted (h.clean_names()) and reviewed-name (h.apply_reviewed_names()) helpers are both acceptable for PEP data; full reference: zavod/docs/extract/names.md. (Unlike sanctions, where LLM cleaning is forbidden.)
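
The no_end_implies_current rule above can be sketched as a small decision function. This is a simplified model under stated assumptions: occupancy_status and its return labels are hypothetical, and the duration rules that make make_occupancy() return None (documented in zavod/docs/peps.md) are deliberately not modelled.

```python
from datetime import date
from typing import Optional


def occupancy_status(
    end: Optional[date],
    today: date,
    no_end_implies_current: bool = True,
) -> str:
    """Classify an occupancy as 'current', 'ended' or 'unknown'.

    Simplified illustration only: a known end date decides the status
    directly; a missing end date is interpreted according to the flag.
    """
    if end is not None:
        return "ended" if end < today else "current"
    # No end date in the source.
    return "current" if no_end_implies_current else "unknown"


today = date(2024, 6, 1)
roster = occupancy_status(None, today, no_end_implies_current=True)
declaration = occupancy_status(None, today, no_end_implies_current=False)
former = occupancy_status(date(2020, 1, 31), today)
print(roster, declaration, former)
```

A live official roster would take the first branch for missing end dates, while a declarations dataset or historical snapshot would pass False so that a missing end date is not read as "still in office".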
Run zavod crawl <path> then zavod validate <path>. PEP-specific qsv recipes (referential integrity for Person↔Occupancy, country propagation check) are in examples.md → "qsv validation checks".
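
The qsv recipes themselves live in examples.md and are not reproduced here. As a rough Python equivalent of the Person↔Occupancy referential-integrity check, the sketch below assumes the crawler output has been loaded as FollowTheMoney-style dicts with id, schema and properties keys, and that each Occupancy points at its Person through a holder property. The function name and the sample records are invented for illustration.

```python
def check_person_occupancy(entities: list[dict]) -> dict[str, list[str]]:
    """Report Occupancies with unknown holders and Persons with no Occupancy.

    Assumes each entity is a dict shaped like
    {"id": str, "schema": str, "properties": {name: [values]}}.
    """
    person_ids = {e["id"] for e in entities if e["schema"] == "Person"}
    linked_persons: set[str] = set()
    dangling: set[str] = set()
    for entity in entities:
        if entity["schema"] != "Occupancy":
            continue
        for holder_id in entity["properties"].get("holder", []):
            if holder_id in person_ids:
                linked_persons.add(holder_id)
            else:
                dangling.add(entity["id"])
    return {
        "dangling_occupancies": sorted(dangling),
        "persons_without_occupancy": sorted(person_ids - linked_persons),
    }


# Invented sample records: p2 has no occupancy, o2 points at a missing person.
sample = [
    {"id": "p1", "schema": "Person", "properties": {"name": ["Example One"]}},
    {"id": "p2", "schema": "Person", "properties": {"name": ["Example Two"]}},
    {"id": "o1", "schema": "Occupancy", "properties": {"holder": ["p1"]}},
    {"id": "o2", "schema": "Occupancy", "properties": {"holder": ["p9"]}},
]
report = check_person_occupancy(sample)
print(report)
```

Both lists should be empty for a healthy PEP dataset, since only persons with at least one valid occupancy are meant to be emitted.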