crawler-pep

Name: Crawler Pep
Author: opensanctions

// Scaffold a new PEP (Politically Exposed Persons) crawler from a source URL or GitHub issue


stars: 741

forks: 162

updated: 29 May 2026, 11:04

SKILL.md


name: crawler-pep
description: Scaffold a new PEP (Politically Exposed Persons) crawler from a source URL or GitHub issue

New PEP Crawler

Create a new PEP crawler. The user will provide a target path, source data URL, and/or a GitHub issue URL: $ARGUMENTS

If given a GitHub issue URL, fetch it first to extract the data source URL and any context about the dataset.

Read upfront:

.claude/docs/crawler-guide.md — shared crawler patterns (YAML, fetching, entities, helpers, lookups)

Consult on demand (open only when you actually need the section — don't pre-load):

.claude/skills/crawler-pep/examples.md — full code examples (Patterns A/B/C, subnational variant, freshness check, qsv recipes). Open when you're stuck on a pattern or want a worked example.
zavod/docs/peps.md — depth on Position naming, categorise(), Occupancy duration rules. Open if you need more than the summary in this skill.
zavod/docs/metadata.md — full YAML field reference. Open if you're using a field not covered by the template in crawler-guide.md.
zavod/docs/extract/names.md — open only if you're doing LLM-assisted or reviewed name cleaning.

Prefer section reads over full reads. All of these docs are well-headered — use Grep to find the symbol/topic you need (make_occupancy, apply_date, coverage.start, etc.) and Read with offset/limit instead of reading the whole file.

Do NOT search the repo for similar crawlers. Use only the files listed above.

Step 1: Understand the source

In addition to the general checks (fields, date formats, language, record count):

Is there a Wikidata ID for the position(s)? (See zavod/docs/peps.md; skip QIDs for per-municipality / per-region positions.)
What are the position types (parliament, cabinet, judiciary, etc.)?
Current members only, or historical terms too?
Are start/end dates provided?
Term-bounded data? Identify a freshness signal (dataset count, page URL, file name) so the crawler fails loudly when a new term lands. See examples.md → "New-election freshness check".
Does the position legally require citizenship? Don't assume from position type — national parliaments usually do (UK is an exception), but sub-national elected positions (mayors, councils) often don't. Spawn a subagent (Agent with WebSearch/WebFetch) to find the legal document (electoral law, constitution, official government guidance) that stipulates the citizenship requirement for this specific position. In a code comment next to the person.add("citizenship", ...) call — or, if citizenship is not required, next to the omission — include the URL to that legal document.
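
The freshness signal mentioned in the list above can be turned into a hard failure. A minimal sketch, assuming the signal is a single string (here a page URL) recorded when the crawler is written. `EXPECTED_TERM_URL` and `check_freshness` are hypothetical names, not zavod helpers, and the URL is a placeholder; examples.md remains the reference for the project's own pattern.

```python
# Hypothetical sketch: fail loudly when a term-bounded source changes shape.
# EXPECTED_TERM_URL and check_freshness are illustrative names, not zavod APIs.

EXPECTED_TERM_URL = "https://example.org/parliament/term-10/members"


def check_freshness(observed: str, expected: str = EXPECTED_TERM_URL) -> None:
    """Raise if the source signature differs from the one the crawler was built for."""
    if observed != expected:
        raise RuntimeError(
            f"Source signature changed: expected {expected!r}, got {observed!r}. "
            "A new term may have started; review the crawler before re-running."
        )
```

Calling this at the top of crawl() with the value actually observed in the source stops the run before stale or mixed-term data is emitted.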

Step 2: YAML metadata — PEP-specific parts

Full field reference: zavod/docs/metadata.md. PEP-specific additions:

tags:
  - list.pep

assertions:
  min:
    schema_entities:
      Person: 100        # ~80% of expected count
      Position: 1
    country_entities:
      cc: 50
  max:
    schema_entities:
      Person: 1000       # ~150% of expected count

Include Position counts in assertions when the crawler creates multiple position types.
Set frequency to match the source's update cadence (daily/weekly/monthly). PEP crawlers do not have to be monthly.
PEP crawlers may need a position lookup to translate non-English role labels into standard English names (see examples.md). Beyond that, lookups rarely go past type.*.
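
The "~80%" and "~150%" comments in the template are plain arithmetic on the expected record count. A small sketch of that calculation, using integer maths to avoid rounding surprises; `assertion_bounds` is a hypothetical helper for working out the numbers by hand, not part of zavod.

```python
from typing import Tuple


def assertion_bounds(expected: int) -> Tuple[int, int]:
    """Return (min, max) Person counts at roughly 80% and 150% of the expected count."""
    lower = (expected * 4) // 5      # 80%, rounded down
    upper = (expected * 3 + 1) // 2  # 150%, rounded up
    return lower, upper
```

For an expected count of 125 this gives a minimum of 100 and a maximum of 188.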

Step 3: Write the crawler module

Required imports

from zavod import Context
from zavod import helpers as h
from zavod.entity import Entity
from zavod.stateful.positions import PositionCategorisation, categorise

Position naming

Build position names with h.make_position. Rules:

Name positions in English. Use the standard English term for the role; keep native-language terminology only for proper nouns of specific institutions (e.g. Landtag of Mecklenburg-Vorpommern). When the source labels roles in another language, declare a position lookup in the YAML to translate them before passing to h.make_position.
Include the role, the organisational body where relevant, and the geographic jurisdiction. For members of national parliaments, include citizenship (except UK Parliament).
Avoid: legislative term, an elected official's constituency, or the country for sub-national representatives.
wikidata_id becomes the position's entity ID, so never pass the same QID to multiple distinct positions — they'd collapse into one entity. Per-municipality/region positions usually omit wikidata_id (per-locality QIDs rarely exist on Wikidata) and rely on subnational_area=... to disambiguate; pass a QID only when each subnational position has its own unique Wikidata entry.
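
The QID collision described in the last rule can be caught before any position is created. A minimal sketch, assuming you have collected (position name, QID) pairs while parsing the source; `find_shared_qids` is a hypothetical helper, and the QIDs in the usage note are placeholders rather than real Wikidata items.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def find_shared_qids(
    positions: List[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Map each QID used by more than one distinct position name to those names."""
    names_by_qid: Dict[str, Set[str]] = defaultdict(set)
    for name, qid in positions:
        if qid is not None:  # positions without a QID cannot collide
            names_by_qid[qid].add(name)
    return {
        qid: sorted(names)
        for qid, names in names_by_qid.items()
        if len(names) > 1
    }
```

Given `[("Mayor of A", "QXXX1"), ("Mayor of B", "QXXX1"), ("Mayor of C", None)]`, the result flags `QXXX1` as shared by two positions, which is the case where the QID should be dropped in favour of subnational_area.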

Depth on edge cases: zavod/docs/peps.md → "Selecting a position name".
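
The translation step for non-English role labels can be pictured as a strict mapping. In a real crawler this lives in the YAML lookups; the sketch below uses a plain dict as a stand-in, and both the labels and the `translate_role` helper are illustrative assumptions rather than entries from a real dataset.

```python
# Stand-in for a YAML position lookup: native-language role label -> English name.
# The labels and the translate_role helper are illustrative, not from a real dataset.
ROLE_LABELS = {
    "bürgermeister": "Mayor",
    "abgeordneter": "Member of Parliament",
}


def translate_role(label: str) -> str:
    """Return the English role name, failing loudly on labels the lookup does not cover."""
    key = label.strip().lower()
    if key not in ROLE_LABELS:
        raise KeyError(f"No English position name configured for role label {label!r}")
    return ROLE_LABELS[key]
```

Raising on unknown labels, instead of passing them through, keeps untranslated role names out of the position name that is later built with h.make_position.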

Position categorisation

Full reference: zavod/docs/peps.md. categorise() is a stateful DB operation; is_pep/topics only matter on first insertion — subsequent crawls return DB values (including UI edits).

Three is_pep calling patterns:

| `is_pep` arg | When to use | Example datasets |
| --- | --- | --- |
| `True` | Source definitionally contains PEPs (parliament, cabinet, judges) | fr_assemblee, ie_parliament, ky_judicial |
| `None` | Mixed dataset, or per-locality positions where UI decides PEP status | fr_hatvp_declarations, lu_bourgmestres |
| `False` | Explicitly not PEP (rare in PEP crawlers) | (none) |

Pass the returned categorisation to make_occupancy().
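
The "only matters on first insertion" behaviour is easy to misread, so here is a toy in-memory model of it. This is not zavod's implementation: the real categorise() talks to a database and returns a PositionCategorisation, while `categorise_once` and the dict store below are hypothetical stand-ins that only mirror the described semantics.

```python
from typing import Dict, Optional


def categorise_once(
    store: Dict[str, Optional[bool]],
    position_id: str,
    is_pep: Optional[bool],
) -> Optional[bool]:
    """Toy model: the is_pep argument is only used the first time a position is seen."""
    if position_id not in store:
        store[position_id] = is_pep  # first insertion: the crawler's value is stored
    return store[position_id]        # later calls: the stored value wins
```

If a position is first inserted with `None` and then set to `True` out of band (the equivalent of a UI edit), every later call returns `True` regardless of the argument the crawler passes.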

Critical rules (in addition to `zavod/docs/peps.md`)

Set ALL person props (birthDate, deathDate, etc.) BEFORE calling make_occupancy() — it reads them to determine PEP status.
make_occupancy() returns None if the occupancy doesn't meet PEP criteria. Only emit persons with at least one valid occupancy.
Emit the person AFTER make_occupancy — it mutates person.topics.
For judicial crawlers, also person.add("topics", "role.judge").
Term-bounded sources (election cycles, fixed mandates): add a freshness check in crawl() that fails when the source signature changes (dataset count, page URL, etc.).
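
The ordering rules above can be sketched as one function. To keep the example runnable without zavod installed, the occupancy helper is passed in as a parameter; in a crawler you would pass h.make_occupancy. The function name `emit_if_pep` is hypothetical, and the keyword names `start_date`, `end_date` and `categorisation` are assumptions to check against zavod/docs/peps.md.

```python
from typing import Any, Callable, Optional


def emit_if_pep(
    context: Any,
    person: Any,
    position: Any,
    categorisation: Any,
    make_occupancy: Callable[..., Optional[Any]],
    birth_date: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> bool:
    """Apply the ordering rules; return True if the person was emitted."""
    # 1. Set every person property first, because make_occupancy reads them.
    if birth_date is not None:
        person.add("birthDate", birth_date)
    # 2. Build the occupancy; None means it does not meet the PEP criteria.
    occupancy = make_occupancy(
        context,
        person,
        position,
        start_date=start_date,
        end_date=end_date,
        categorisation=categorisation,
    )
    if occupancy is None:
        return False
    # 3. Emit the person only now, since make_occupancy may have changed its topics.
    context.emit(occupancy)
    context.emit(position)
    context.emit(person)
    return True
```

A person with several candidate occupancies needs a small variation: collect the non-None occupancies first and emit the person once if the collection is non-empty.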

`no_end_implies_current`

True (default): no end date → still in office. Use for live official rosters.
False: no end date → unknown. Use for declarations, point-in-time snapshots, historical data.
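
A simplified model of the flag, for intuition only. It ignores the duration rules that zavod/docs/peps.md describes, and the function name and the status strings "current", "ended" and "unknown" are illustrative assumptions. Dates are full ISO strings (YYYY-MM-DD), which compare correctly as text.

```python
from typing import Optional


def occupancy_status(
    end_date: Optional[str],
    today: str,
    no_end_implies_current: bool = True,
) -> str:
    """Simplified model of how a missing end date is interpreted."""
    if end_date is None:
        # The flag decides what silence about the end date means.
        return "current" if no_end_implies_current else "unknown"
    return "ended" if end_date < today else "current"
```

With a live roster, a member without an end date is treated as "current"; with a one-off declaration and the flag set to False, the same record is "unknown".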

Name cleaning

LLM-assisted (h.clean_names()) and reviewed-name (h.apply_reviewed_names()) helpers are both acceptable for PEP data — full reference: zavod/docs/extract/names.md. (Unlike sanctions, where LLM cleaning is forbidden.)

Step 4: Validate

Run zavod crawl <path> then zavod validate <path>. PEP-specific qsv recipes (referential integrity for Person↔Occupancy, country propagation check) are in examples.md → "qsv validation checks".
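
The Person↔Occupancy referential-integrity idea behind the qsv recipes can also be expressed in a few lines of Python. This is a sketch under assumptions: it takes entities flattened to dicts with `id`, `schema` and a single `holder` value, which is not the crawler's actual export format, and `dangling_occupancies` is a hypothetical name. The recipes in examples.md remain the ones to run.

```python
from typing import Dict, List


def dangling_occupancies(entities: List[Dict[str, object]]) -> List[str]:
    """Return IDs of Occupancy records whose holder is not an emitted Person."""
    person_ids = {e["id"] for e in entities if e["schema"] == "Person"}
    return [
        str(e["id"])
        for e in entities
        if e["schema"] == "Occupancy" and e.get("holder") not in person_ids
    ]
```

An empty result means every occupancy points at a person that was emitted; any ID in the list points at a person that was skipped or emitted under a different ID.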

package.json

"author": "opensanctions"

"repository": "opensanctions/opensanctions"
