crawler-sanctions
// Scaffold a new sanctions list crawler from a source URL or GitHub issue
| name | crawler-sanctions |
| description | Scaffold a new sanctions list crawler from a source URL or GitHub issue |
| allowed-tools | Read, Edit, Write, Glob, Grep, Bash, WebFetch, WebSearch, Agent |
Create a new sanctions list crawler. The user will provide a target path, source data URL, and/or a GitHub issue URL: $ARGUMENTS
If given a GitHub issue URL, fetch it first to extract the data source URL and any context about the dataset before proceeding.
Before writing any code, read these files — they contain everything you need:
- .claude/docs/crawler-guide.md: shared crawler patterns (YAML template, fetching data, entity creation, helpers, lookups, FTM schemata, qsv analysis)
- .claude/skills/crawler-sanctions/examples.md: full sanctions code examples

Do NOT search the repository for similar crawlers or patterns. The guide and examples
above are the authoritative reference. Do not read datasets/CLAUDE.md or other crawler
source files for patterns; use only the files listed above.
Before writing any code, inspect the data source. In addition to the general checks (fields, date formats, language, record count), sanctions sources need:
Use the generic YAML template from the crawler guide. Sanctions-specific additions:
tags:
  - list.sanction
  - issuer.west # optional
assertions:
  min:
    schema_entities:
      Person: 1000 # ~80% of expected count
      Organization: 200
      Sanction: 1000
    country_entities:
      cc: 100
  max:
    schema_entities:
      Person: 5000 # ~150% of expected count
      Organization: 1000
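The "~80%" and "~150%" comments in the assertions above can be turned into concrete thresholds with a little arithmetic. The following is a minimal sketch, not part of zavod: the helper name `assertion_bounds` and the observed counts are hypothetical, and the percentages are simply the ones quoted in the YAML comments.

```python
from typing import Dict, Tuple


def assertion_bounds(expected: int, low_pct: int = 80, high_pct: int = 150) -> Tuple[int, int]:
    """Return (min, max) thresholds around an expected entity count.

    Integer arithmetic avoids floating-point rounding surprises:
    the minimum is rounded down and the maximum is rounded up.
    """
    if expected < 0:
        raise ValueError("expected must be non-negative")
    minimum = expected * low_pct // 100
    maximum = -(-expected * high_pct // 100)  # ceiling division
    return minimum, maximum


# Hypothetical counts observed while inspecting the source data.
observed: Dict[str, int] = {"Person": 1250, "Organization": 250}
for schema, count in observed.items():
    lo, hi = assertion_bounds(count)
    print(f"{schema}: min={lo} max={hi}")
```

The numbers printed this way are only starting points; round them to values that will survive normal growth of the list.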
Use frequency: daily with a cron schedule: entry.
The most important sanctions lookup maps source program names to OpenSanctions keys:
lookups:
  # Entity type dispatch (when source uses custom type labels)
  type.entity:
    lowercase: true
    options:
      - match: [individual, person]
        value: Person
      - match: [entity, company, organization]
        value: Organization
      - match: [vessel, ship]
        value: Vessel
  # Map source program names to OpenSanctions program keys
  sanction.program:
    options:
      - match: "Executive Order 13224"
        value: US-EO13224
  # Date edge cases common in sanctions data
  type.date:
    options:
      - match: "1972-08-10 or 1972-08-11"
        values: ["1972-08-10", "1972-08-11"]
      - match: "1975-19-25" # typo
        value: "1975"
type.* lookups are applied automatically by entity.add(). The sanction.program
lookup must be called explicitly via h.lookup_sanction_program_key().
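To make the matching behaviour of these lookups concrete, here is a minimal pure-Python sketch of an exact-match lookup with an optional `lowercase` flag. It is an illustration only and not zavod's implementation: `LOOKUPS` is a hand-copied subset of the YAML above, and `lookup_value` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Hand-copied subset of the YAML lookups above (assumption: in a real
# crawler these are read from the dataset YAML, not defined in code).
LOOKUPS: Dict[str, Dict[str, Any]] = {
    "type.entity": {
        "lowercase": True,
        "options": [
            {"match": ["individual", "person"], "value": "Person"},
            {"match": ["entity", "company", "organization"], "value": "Organization"},
            {"match": ["vessel", "ship"], "value": "Vessel"},
        ],
    },
    "sanction.program": {
        "options": [
            {"match": ["Executive Order 13224"], "value": "US-EO13224"},
        ],
    },
}


def lookup_value(name: str, raw: str) -> Optional[str]:
    """Return the mapped value for raw, or None when no option matches."""
    config = LOOKUPS[name]
    needle = raw.lower() if config.get("lowercase") else raw
    for option in config["options"]:
        if needle in option["match"]:
            return option["value"]
    return None


print(lookup_value("type.entity", "Individual"))
print(lookup_value("sanction.program", "Executive Order 13224"))
print(lookup_value("sanction.program", "Unknown Programme"))
```

In this sketch an unmatched program name returns `None`, which is the signal to add a new entry to the `sanction.program` lookup rather than to guess a key.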
Full reference: zavod/docs/programs.md
h.make_sanction() automatically sets country, authority, and sourceUrl from
dataset metadata. The key parameters:
sanction = h.make_sanction(
    context,
    entity,  # the sanctioned entity (required)
    key=entry_id,  # disambiguator when entity has multiple sanctions
    program_name=program,  # human-readable program name
    source_program_key=program,  # raw value from source (preserved as original_value)
    program_key=h.lookup_sanction_program_key(  # OpenSanctions program key from yaml lookup
        context, program
    ),
    start_date=listing_date,  # optional: when sanction began
    end_date=end_date,  # optional: when sanction ended
)
- key: Use when an entity appears on multiple sanctions lists/programs. The sanction ID is make_id("Sanction", entity.id, key), so key disambiguates multiple sanctions per entity.
- program_key: Always go through h.lookup_sanction_program_key(), which reads the sanction.program yaml lookup. Add entries to the lookup as you encounter new program names.
- source_program_key: The raw program string from the source, preserved as original_value on the programId property for auditability.
- Call entity.add("topics", "sanction") on the sanctioned entity.

For simple datasets with a single known program, you can skip the lookup:
sanction = h.make_sanction(context, entity, program_key="US-DOS-CU-PAL")
# Only mark as sanctioned if the sanction is currently active
if h.is_active(sanction):
    entity.add("topics", "sanction")
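The role of key in the sanction ID, as described for make_id("Sanction", entity.id, key), can be illustrated with a stand-in hash function. This is a sketch under stated assumptions: `make_id_sketch` is hypothetical and does not reproduce zavod's actual ID format; it only shows that identical parts collapse to one ID while distinct keys yield distinct IDs.

```python
import hashlib
from typing import Optional


def make_id_sketch(*parts: Optional[str]) -> Optional[str]:
    """Hash the non-empty parts into a stable ID (stand-in for the real make_id)."""
    cleaned = [part for part in parts if part]
    if not cleaned:
        return None
    digest = hashlib.sha1("|".join(cleaned).encode("utf-8")).hexdigest()
    return f"sketch-{digest}"


entity_id = "xx-person-123"  # hypothetical entity ID

# Without a key, two listings of the same entity produce the same sanction ID,
# so they would be merged into a single Sanction.
no_key_a = make_id_sketch("Sanction", entity_id, None)
no_key_b = make_id_sketch("Sanction", entity_id, None)

# With a per-listing key, each listing gets its own sanction ID.
with_key_a = make_id_sketch("Sanction", entity_id, "entry-1")
with_key_b = make_id_sketch("Sanction", entity_id, "entry-2")

print(no_key_a == no_key_b)
print(with_key_a != with_key_b)
```

A stable identifier from the source (an entry or reference number) makes a better key than a row index, since row order can change between crawls.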
Full reference: zavod/docs/extract/names.md
Sanctioned names are legal designations — do not use LLM-based name cleaning. Any normalisation must be human-reviewed via the stateful review system, or handled with explicit lookup entries.
See the crawler guide for the generic Family and Ownership patterns. See examples.md for UnknownLink (sanctions-specific untyped relationships).
When the source tracks modifications and de-listings, use sanction.add("endDate", ...)
for de-listings and sanction.add("modifiedAt", ...) for amendments. See
examples.md for the full pattern.
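As a rough illustration of the endDate / modifiedAt split described above, the sketch below dispatches source change records onto a stand-in object. Everything here is an assumption for illustration: `SanctionStub` only mimics an `add(prop, value)` method, and the `kind` and `date` field names are hypothetical, not taken from any real source. The authoritative pattern is the one in examples.md.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class SanctionStub:
    """Minimal stand-in for an entity: collects property values via add()."""

    def __init__(self) -> None:
        self.props: Dict[str, List[str]] = defaultdict(list)

    def add(self, prop: str, value: Optional[str]) -> None:
        # Ignore empty values and avoid storing duplicates.
        if value and value not in self.props[prop]:
            self.props[prop].append(value)


def apply_change(sanction: SanctionStub, change: Dict[str, str]) -> None:
    """Record a de-listing as endDate and an amendment as modifiedAt."""
    kind = change.get("kind")
    if kind == "delisted":
        sanction.add("endDate", change.get("date"))
    elif kind == "amended":
        sanction.add("modifiedAt", change.get("date"))


sanction = SanctionStub()
apply_change(sanction, {"kind": "amended", "date": "2023-06-01"})
apply_change(sanction, {"kind": "delisted", "date": "2024-01-15"})
print(dict(sanction.props))
```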
Full reference: zavod/docs/data_reviews.md
For sources with unstructured "remarks" fields, use GPT extraction with the stateful
review system. Requires ci_test: false. See examples.md for the pattern.
After running zavod crawl, use these sanctions-specific qsv checks (see the crawler
guide for general qsv patterns):
# Entity counts by schema
qsv search -s prop "^Person:id$" data/datasets/cc_dataset/statements.pack | qsv count
qsv search -s prop "^Organization:id$" data/datasets/cc_dataset/statements.pack | qsv count
qsv search -s prop "^Sanction:id$" data/datasets/cc_dataset/statements.pack | qsv count
# Sanction program distribution
qsv search -s prop "^Sanction:program$" data/datasets/cc_dataset/statements.pack | qsv frequency -s value
# Every Sanction:entity must point to a real entity
qsv search -s prop "^Sanction:entity$" data/datasets/cc_dataset/statements.pack | qsv select value | qsv behead | sort > /tmp/sanction_targets.txt && qsv search -s prop ":id$" data/datasets/cc_dataset/statements.pack | qsv select entity_id | qsv behead | sort -u > /tmp/all_entities.txt && comm -23 /tmp/sanction_targets.txt /tmp/all_entities.txt
# Check all entities have topics=sanction
qsv search -s prop ":id$" data/datasets/cc_dataset/statements.pack | qsv select entity_id | qsv behead | sort -u > /tmp/all_ids.txt && qsv search -s prop ":topics$" data/datasets/cc_dataset/statements.pack | qsv search -s value "^sanction$" | qsv select entity_id | qsv behead | sort -u > /tmp/sanctioned.txt && comm -23 /tmp/all_ids.txt /tmp/sanctioned.txt
Then run zavod validate datasets/cc/dataset/cc_dataset.yml.
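The orphan check above (every Sanction:entity value must appear as an entity ID) is a set difference, which `comm -23` computes on sorted files. The same logic can be sketched in Python for use in a notebook or test. This assumes the statements have been exported to CSV with `entity_id`, `prop` and `value` columns, as the qsv commands imply; `SAMPLE` is made-up data and `orphan_sanction_targets` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable, List

# Made-up statements: sanction s2 points at p9, which has no :id statement.
SAMPLE = """entity_id,prop,value
p1,Person:id,p1
s1,Sanction:id,s1
s1,Sanction:entity,p1
s2,Sanction:id,s2
s2,Sanction:entity,p9
"""


def orphan_sanction_targets(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return Sanction:entity targets that do not exist as entities."""
    all_ids = set()
    targets = set()
    for row in rows:
        if row["prop"].endswith(":id"):
            all_ids.add(row["entity_id"])
        if row["prop"] == "Sanction:entity":
            targets.add(row["value"])
    return sorted(targets - all_ids)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(orphan_sanction_targets(rows))
```

An empty result means every sanction points at an entity that was actually emitted.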
See the crawler guide for Person, Organization, LegalEntity, Address, Family, Ownership, and other shared schemata.
Vessel:
name, flag, imoNumber, mmsi, callSign
type, tonnage, buildDate
alias, previousName, topics (sanction)

Airplane:
name, serialNumber, registrationNumber
model, type
alias, topics (sanction)

Sanction:
entity (required: the sanctioned entity)
authority, authorityId
program, programId, programUrl
unscId (UN Security Council ID)
startDate, endDate, listingDate, modifiedAt
reason, provisions, status, country
sourceUrl, summary

Identification:
holder, number, type, country, authority
startDate, endDate, summary

UnknownLink:
subject, object, role