---
name: name-framework-migration-first-step
description: Migrate ad-hoc name cleaning in a crawler to h.review_names (Step 1 of the name framework migration). Use when a crawler.py contains delimiter splits, regex substitutions, bracket stripping, or conditional logic applied to name strings before the name is added or applied.
argument-hint: [crawler.py path]
disable-model-invocation: true
---
Perform Step 1 of the name framework migration in $ARGUMENTS: introduce h.review_names alongside the existing cleaning logic. Existing entity.add / h.apply_name calls remain in place and continue to drive output; reviews are not applied until Step 3 of the procedure.
## Branch setup
Before making any changes:
- Derive a branch name from the crawler path: take the path segments under `datasets/` (the directories containing crawler.py), join them with hyphens, replace underscores with hyphens, and prefix the result with `name-migration/`. For example, datasets/us/ga/med_exclusions/crawler.py → name-migration/us-ga-med-exclusions.
- Create and check out the branch: `git checkout -b <branch-name>`
- Confirm you are on the new branch before proceeding.
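The branch-name derivation above can be sketched as follows (a hypothetical helper for illustration, not part of the repo):

```python
from pathlib import Path

def branch_name(crawler_path: str) -> str:
    # Path segments of the directory containing crawler.py,
    # e.g. ("datasets", "us", "ga", "med_exclusions")
    parts = Path(crawler_path).parent.parts
    # Drop the leading "datasets", hyphenate underscores, join with hyphens
    slug = "-".join(p.replace("_", "-") for p in parts[1:])
    return f"name-migration/{slug}"

print(branch_name("datasets/us/ga/med_exclusions/crawler.py"))
# name-migration/us-ga-med-exclusions
```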
## Crawler source
!cat $ARGUMENTS
## Read first (in order)
- examples/migrations.md — real before/after migrations; read before touching any crawler
- zavod/docs/extract/names.md#migrating-to-the-name-cleaning-helpers — the full three-step procedure and rationale
- zavod/zavod/helpers/names.py — exact signatures for review_names, check_names_regularity, and Names
- datasets/CLAUDE.md — name cleaning section
## Trigger patterns to find
Scan the crawler for any of these patterns before acting:
```python
last_name, name = name_raw.split(",", 1)
name, *aliases = h.multi_split(raw, SPLITS)
name = name.replace("(Acting)", "")
name = name.strip("„")
parts = re.split(r"(?i)\baka\b", name, maxsplit=1)
names = h.multi_split(name_raw, ["(w zapisie także", "(", ")"])
if len(name_split) > 1:
    entity.add("alias", name_split[1:])
```
## Migration steps
- Capture the raw name string before any cleaning: `original = h.Names(name=<raw>)`.
- Initialise `suggested = h.Names()`.
- For each existing `entity.add(name_prop, value)` or `h.apply_name(...)` call, add a mirroring entry to `suggested` — see examples/migrations.md for the exact patterns.
- After all name-setting calls, add:
  ```python
  is_irregular, suggested = h.check_names_regularity(entity, suggested)
  h.review_names(context, entity, original=original, suggested=suggested, is_irregular=is_irregular)
  ```
- For non-sanctions crawlers, pass `llm_cleaning=True` and omit `suggested` and `is_irregular`:
  ```python
  h.review_names(context, entity, original=original, llm_cleaning=True)
  ```
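Put together, the Step-1 call shape looks like the sketch below. The `h.*` objects here are stand-in stubs so the ordering is runnable in isolation; they are NOT the real zavod helpers, and real field names and signatures must come from zavod/zavod/helpers/names.py.

```python
from types import SimpleNamespace

class Names:  # stub for h.Names; only the documented `name` field is assumed
    def __init__(self, name=None):
        self.name = name

def check_names_regularity(entity, suggested):
    return False, suggested  # stub: the real helper inspects the entity's names

reviews = []
def review_names(context, entity, original, suggested, is_irregular):
    reviews.append((original.name, is_irregular))  # stub: real helper records a review

h = SimpleNamespace(Names=Names,
                    check_names_regularity=check_names_regularity,
                    review_names=review_names)

# --- inside the crawl function, alongside the untouched cleaning logic ---
context, entity = object(), object()  # placeholders for the real objects
name_raw = "SMITH, John (Acting)"     # hypothetical raw source value

original = h.Names(name=name_raw)     # captured before any cleaning
suggested = h.Names()                 # mirrors each existing name-setting call

# ... existing entity.add / h.apply_name calls stay in place here ...

is_irregular, suggested = h.check_names_regularity(entity, suggested)
h.review_names(context, entity, original=original,
               suggested=suggested, is_irregular=is_irregular)
```

Note that `review_names` is called exactly once, after every existing name-setting call has run.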
## After changes
After every edit to the crawler file, run:
```bash
uvx ruff check --fix $ARGUMENTS && uvx ruff format $ARGUMENTS
```
Fix any errors ruff reports before proceeding.
Once all changes are complete and ruff passes, stage the file:
```bash
git add $ARGUMENTS
```
Then output the suggested commit message (do not commit):
```
[<dataset_slug>] name migration
```
where `<dataset_slug>` is derived from the path by stripping `datasets/` and `/crawler.py` and replacing `/` with `_` (e.g. datasets/us/ga/med_exclusions/crawler.py → `[us_ga_med_exclusions] name migration`).
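The slug derivation can be sketched as follows (a hypothetical helper for illustration only):

```python
def dataset_slug(crawler_path: str) -> str:
    # Strip "datasets/" and "/crawler.py", then replace "/" with "_"
    inner = crawler_path.removeprefix("datasets/").removesuffix("/crawler.py")
    return inner.replace("/", "_")

print(f"[{dataset_slug('datasets/us/ga/med_exclusions/crawler.py')}] name migration")
# [us_ga_med_exclusions] name migration
```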
## Do not
- Do not remove or modify any existing `entity.add` / `h.apply_name` calls.
- Do not pass a cleaned or intermediate string as `original` — always use the unmodified raw source string.
- Do not use `llm_cleaning=True` for sanctions crawlers.
- Do not proceed to Step 3 of the three-step migration procedure (switching to `apply_reviewed_names`) — that requires completed reviews first.
- Do not construct `Names` by guessing field names — read zavod/zavod/helpers/names.py first.
- Do not call `h.review_names` more than once per entity.
- Do not add explanatory comments beyond what the code requires.