typechecker-fixes
Fix mypy --strict type errors in crawler files. Use when the user asks to make the typechecker happy, fix types, or add type annotations to a crawler.
| name | typechecker-fixes |
| description | Fix mypy --strict type errors in crawler files. Use when the user asks to make the typechecker happy, fix types, or add type annotations to a crawler. |
| argument-hint | [crawler.py path] |
Fix mypy strict-mode errors in opensanctions crawler files. Run mypy --strict --explicit-package-bases on the target directory to find errors, then apply the patterns below.

References:
- zavod/zavod/helpers/html.py: typed HTML helpers (xpath_strings, xpath_elements, xpath_element, element_text)
- zavod/zavod/util.py: Element and ElementOrTree type aliases

Workflow:
1. Run mypy --strict <crawler.py> to see current errors.
2. Apply the fix patterns below.
3. Run mypy --strict <crawler.py> again to verify the errors are resolved.

1. Add -> None return type to all functions that don't return a value
This is the single most common fix. Every crawl(), crawl_row(), crawl_item(), parse_*(), and helper function that doesn't return a value needs -> None.
# Before
def crawl(context: Context):
def crawl_row(context: Context, row: dict):
def apply_identifier(context: Context, entity: Entity, id_number_line: str):
# After — pick the narrowest value type the row actually contains (see pattern #3)
def crawl(context: Context) -> None:
def crawl_row(context: Context, row: dict[str, str | None]) -> None:
def apply_identifier(context: Context, entity: Entity, id_number_line: str) -> None:
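
To get a quick list of candidates before running mypy, a small script along these lines may help. It is only a sketch built on the standard-library ast module, not part of zavod; the names functions_missing_return_annotation and EXAMPLE are made up for illustration, and mypy remains the authority on what actually needs fixing.

```python
from __future__ import annotations

import ast


def functions_missing_return_annotation(source: str) -> list[str]:
    """Return the sorted names of functions in `source` with no return annotation."""
    tree = ast.parse(source)
    missing: list[str] = []
    for node in ast.walk(tree):
        # Both plain and async function definitions carry a `returns` field.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.returns is None:
                missing.append(node.name)
    return sorted(missing)


# Hypothetical crawler source: crawl() lacks "-> None", crawl_row() has it.
EXAMPLE = '''
def crawl(context):
    pass

def crawl_row(context, row) -> None:
    pass
'''

print(functions_missing_return_annotation(EXAMPLE))
```

The script only reports missing return annotations; whether the right annotation is -> None or something else still depends on what the function body returns.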

2. Add type annotations to untyped parameters
# Before
def crawl_item(input_html, context: Context):
def crawl_term(context, link: HtmlElement, ...):
# After
def crawl_item(input_html: Element, context: Context) -> None:
def crawl_term(context: Context, link: HtmlElement, ...) -> None:

3. Annotate dict with the narrowest value type the code actually uses
Pick the value type by looking at every assignment into the dict. Use Any only as a last resort. In order of preference:
- dict[str, str]: all values are strings (e.g. h.xpath_strings(...)[0], string literals, city.get("key", "")).
- dict[str, str | None]: some values can legitimately be None (e.g. element.text, element.get("attr"), record.get(key) with no default). At consumer sites that need str, narrow with a local + assert value is not None, or relax the consumer's signature to accept None.
- dict[str, Any]: only when values are genuinely heterogeneous (mixed types that can't be expressed as a simple union, e.g. str, int, nested list/dict). Prefer a TypedDict if the dict has a fixed schema.
# Before
def crawl_item(input_dict: dict, context: Context):
json_data = { ... }
# After — values are all strings
def crawl_item(input_dict: dict[str, str], context: Context) -> None:
json_data: dict[str, str] = { ... }
# After — values include str | None from element.text
record: dict[str, str | None] = {}
record["name"] = row[0].text # str | None
record["url"] = urljoin(base, row[0].get("href")) # str
# After — truly heterogeneous (resort to Any)
from typing import Any
item: dict[str, Any] = {"name": "x", "count": 3, "tags": [...]}
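
For the fixed-schema case, a TypedDict can replace dict[str, Any] so that each key keeps its own type. The following is a minimal sketch; the ItemRecord class, its fields, and summarize are hypothetical and chosen only to mirror the heterogeneous example above.

```python
from typing import TypedDict


class ItemRecord(TypedDict):
    # Hypothetical fixed schema, for illustration only.
    name: str
    count: int
    tags: list[str]


def summarize(item: ItemRecord) -> str:
    # The checker knows item["count"] is an int and item["tags"] is a list[str].
    return f'{item["name"]}: {item["count"]} ({", ".join(item["tags"])})'


item: ItemRecord = {"name": "x", "count": 3, "tags": ["a", "b"]}
print(summarize(item))
```

With this in place, a typo in a key or a wrong value type is reported by the type checker instead of slipping through as Any.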
When you tighten to dict[str, str | None], expect two or three follow-up errors at call sites that expect strict str (e.g. context.fetch_html). Handle each by either:
- Narrowing with a local + assert: url = record["url"]; assert url is not None.
- Relaxing the consumer's signature when None is a semantically valid input (e.g. is_valid(regno: str | None) returning False for None).

Use lowercase dict, list, set, tuple, not the deprecated Dict, List, Set, Tuple from typing. While fixing types, also migrate any existing typing.Dict etc. to builtins.
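
The two ways of handling a possibly-None value at a call site can be sketched as follows. The fetch function here is a hypothetical stand-in for any consumer that requires a strict str, and the record contents are invented for the example.

```python
from __future__ import annotations


def fetch(url: str) -> str:
    """Hypothetical consumer that requires a strict str."""
    return f"fetched {url}"


def is_valid(regno: str | None) -> bool:
    """Consumer relaxed to accept None, since None simply means 'not valid'."""
    return regno is not None and regno.strip() != ""


record: dict[str, str | None] = {"url": "https://example.com/list", "regno": None}

# Option 1: narrow with a local variable plus assert before the strict consumer.
url = record["url"]
assert url is not None
page = fetch(url)

# Option 2: pass the possibly-None value straight to the relaxed consumer.
valid = is_valid(record["regno"])

print(page, valid)
```

Option 1 suits values that must be present for the crawler to continue; option 2 suits checks where a missing value is an expected case.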

4. Replace raw .xpath(), .find() and .findall() calls with typed h.xpath_* helpers
The raw lxml .xpath() returns Any. Use the zavod helpers instead:
# Before — returns Any
links = doc.xpath(".//a/@href")
elements = doc.xpath('.//div[@class="item"]')
text = doc.xpath(".//h1/text()")[0]
# After — properly typed
links = h.xpath_strings(doc, ".//a/@href")
elements = h.xpath_elements(doc, './/div[@class="item"]')
text = h.xpath_string(doc, ".//h1/text()")
When iterating over elements only to extract an attribute (e.g. .get("href")), move the attribute into the xpath and use h.xpath_strings instead:
# Before
for anchor in doc.xpath('//a[contains(@class, "name")]'):
url = anchor.get("href")
crawl_page(context, url)
# Also before (already migrated to xpath_elements but still using .get)
for anchor in h.xpath_elements(doc, '//a[contains(@class, "name")]'):
url = anchor.get("href")
crawl_page(context, url)
# After
for url in h.xpath_strings(doc, '//a[contains(@class, "name")]/@href'):
crawl_page(context, url)
Use h.xpath_element() (singular) when you expect exactly one match:
# Before
divs = doc.xpath(divs_xpath)
assert len(divs) == 1
content = divs[0]
# After
content = h.xpath_element(doc, divs_xpath)

5. Replace .text_content() with h.element_text()
h.element_text() calls text_content() internally and applies collapse_spaces + strip. If the original code was calling squash_spaces or collapse_spaces on the result, that call is now redundant and should be removed. If the extracted text is used for lookups (check the lookups: section in the crawler's .yml file) or exact comparisons, pass squash=False to preserve the original whitespace.
# Before
name_info = summary.text.strip()
body = body_els[0].text_content().strip()
category = squash_spaces(row.pop("category").text_content())
# After
name_info = h.element_text(summary)
body = h.element_text(body_els[0])
category = h.element_text(row.pop("category"))
# When exact text matters (used in lookups or comparisons):
label = h.element_text(el, squash=False)
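
Why squashing matters for lookups can be shown with a standard-library stand-in. The text_of function below is hypothetical and is not the zavod implementation; it only assumes, as described above, that squashing collapses whitespace runs and strips, while squash=False leaves the text untouched. The lookup table is invented.

```python
import re


def text_of(raw: str, squash: bool = True) -> str:
    """Hypothetical stand-in for a text helper with a squash switch."""
    if squash:
        # Collapse every whitespace run to one space, then strip the ends.
        return re.sub(r"\s+", " ", raw).strip()
    return raw


# Invented lookup whose key contains a double space, as it appears in the source.
LOOKUP = {"Head  of State": "role.head"}

raw = "Head  of State"
squashed_hit = LOOKUP.get(text_of(raw))
exact_hit = LOOKUP.get(text_of(raw, squash=False))
print(squashed_hit, exact_hit)
```

The squashed text no longer matches the key, so the lookup misses; the unsquashed text still matches.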

6. Use from zavod.util import Element, ElementOrTree for lxml type annotations
Don't use lxml.etree._Element (private API) or xml.etree.ElementTree. Use the re-exported types:
# Before
from lxml import etree
def parse_record(context: Context, el: etree._Element):
# After
from zavod.util import Element
def parse_record(context: Context, el: Element) -> None:

7. Use Iterator instead of Generator when only yielding
# Before
from typing import Generator
def parse_csv(context: Context, path: str) -> Generator[Item, None, None]:
# After
from typing import Iterator
def parse_csv(context: Context, path: str) -> Iterator[Item]:
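
Generator[Item, None, None] spells out a send type and a return type that are both None here, so Iterator[Item] says the same thing more briefly. A self-contained sketch, with a hypothetical parse_rows function and invented data standing in for a real CSV parser:

```python
from typing import Iterator


def parse_rows(text: str) -> Iterator[dict[str, str]]:
    """Yield one dict per data line; nothing is sent in and nothing is returned."""
    lines = text.strip().splitlines()
    header = lines[0].split(",")
    for line in lines[1:]:
        # Pair each header name with the value in the same column.
        yield dict(zip(header, line.split(",")))


rows = list(parse_rows("name,country\nAcme,us\nGlobex,de\n"))
print(rows)
```

Keep Generator only when the function actually uses send() or returns a value through StopIteration.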

8. Annotate helper functions with explicit parameter and return types
# Before
def clean_address(text):
def extract_passport_no(text):
# After
def clean_address(text: str | None) -> list[str] | None:
def extract_passport_no(text: str | None) -> list[str] | None:
When a function has many parameters, add * to force keyword arguments — this catches argument-order bugs at the call site:
# Before
def emit_linked_org(context, vessel_id, names, role, date):
...
emit_linked_org(context, vessel.id, related_ros, "Related Recognised Organization", start_date)
# After
def emit_linked_org(context: Context, *, vessel_id: str | None, names: str, role: str, date: str | None) -> None:
...
emit_linked_org(context, vessel_id=vessel.id, names=related_ros, role="Related Recognised Organization", date=start_date)
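
The effect of the bare * can be seen in a small runnable sketch. The describe_link function and its arguments are hypothetical; the point is only that parameters after * cannot be passed positionally.

```python
from __future__ import annotations


def describe_link(label: str, *, names: str, role: str, date: str | None) -> str:
    """Everything after the bare * must be passed by keyword."""
    return f"{label}: {names} as {role} since {date or 'unknown'}"


ok = describe_link("vessel-1", names="Acme Register", role="Related Organization", date=None)

try:
    # A type checker rejects this call too; it is ignored here only to show
    # the runtime error raised for positional use.
    describe_link("vessel-1", "Acme Register", "Related Organization", None)  # type: ignore
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

Because every argument is named at the call site, swapping two str parameters by accident becomes visible in review and cannot happen silently through ordering.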
If the function returns a tuple, add a docstring comment to briefly describe the contents of the tuple. This is not a typechecker fix, but it helps readability since tuples don't have named fields.
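
A minimal illustration of such a docstring, using a hypothetical split_name helper:

```python
def split_name(full_name: str) -> tuple[str, str]:
    """Split a full name on its last space.

    Returns:
        (given_names, family_name): the text before the last space and the
        final word. given_names is empty when the name has no space.
    """
    given, _, family = full_name.rpartition(" ")
    return given, family


print(split_name("Ada Lovelace"))
```

The annotation tuple[str, str] tells the checker the shape; the docstring tells the reader which position holds which value.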

- Prefer dict[str, str] > dict[str, str | None] > dict[str, Any]. Only fall back to Any when the values are genuinely heterogeneous.
- Use str | None union syntax, not Optional[str], and lowercase dict/list/set, not Dict/List/Set, but only when you're already editing the line for another reason. Do not make cosmetic-only changes to lines that have no type errors.
- The context: Context parameter should always be first in crawler functions.