update-dataset
// End-to-end dataset update workflow with PR creation, snapshot, meadow, garden, and grapher steps. Use when user wants to update a dataset, refresh data, run ETL update, or mentions updating dataset versions.
| name | update-dataset |
| description | End-to-end dataset update workflow with PR creation, snapshot, meadow, garden, and grapher steps. Use when user wants to update a dataset, refresh data, run ETL update, or mentions updating dataset versions. |
| metadata | {"internal":true} |
Use this skill to run a complete dataset update with Claude Code subagents, keep a live progress checklist, and pause for user approval only when something needs attention.
- `<namespace>/<old_version>/<name>`
- `<new_version>` as today's date by running `date -u +"%Y-%m-%d"`

Optional trailing args:
Assumptions:
- `workbench/<short_name>/`.
- `workbench/<short_name>/progress.md` and update it after each step.
- `workbench/<short_name>/update-context.yml` as they are discovered. This is the canonical context artifact for the PR description, review handoff, and `data-updates-comms`.
- `workbench/<short_name>` unless continuing existing update
- `etl-update` subagent (help → dry run → approval → real run)
- `# NOTE:` / `# TODO:` comments carried over from the old step files into `notes_to_check.md`
- `sanity_checks` functions and their log-control flags; append to `notes_to_check.md`
- `sanity_checks` output (enable log flag, re-run, scan log, revert flag); skip if none found
- `.countries.json` against canonical regions, audit `.excluded_countries.json`, scan garden log for missing/unused/unknown warnings
- `# NOTE:` / `# TODO:` against fresh data; delete resolved workarounds + comments together, or record status in PR body
- `dataset.update_period_days`, and that all URLs resolve (HEAD-check)
- `update-context.yml` with published chart count and 1–3 chart views for the public announcement
- `data-updates-comms`, add to PR description, post `@codex review` as a separate PR comment, and notify user to post it to #data-updates-comms
- `dag/archive/` AND relocate the new entries into the old slot (see "DAG archiving & reordering"); don't forget this step

Persistence:

- `workbench/<short_name>/progress.md` with the current checklist state and a timestamp.

Default: keep going. Run through the full workflow without stopping unless one of the conditions below is met.
Stop and ask the user when:
Don't stop for:
When you do stop, present a concise summary of the issue and what options exist.
Some updates carry structural changes that make the standard rename-only flow the wrong tool. Recognise them up front and adjust the workflow.
Triggers — any of these means you're in restructure territory, not a version bump:
- `short_name` changes (producer rebranded the dataset).

Workflow adjustments:

- `etl update`. The rename-only flow copies the old step files into a new folder, which is useless when the schema is different. Author the new step chain by hand, using the old version as inspiration but not as a starting copy.
- `title`, so if v2 titles are descriptive (full sentences rather than the bare `short_name`), you can hand the user a table of v1 title → v2 title pairs and they can drive the chart remap from there. Generate this table from the v1 `meta.yml` + the v2 grapher catalog.
- /latest announcements until charts have been remapped. Both posts depend on `charts.published_count` and `charts.selected_views` from the v2 chain. Drafting them before the remap gives the wrong count (zero) and no representative views. Tell the user to ping you when the chart remap is done, then run steps 8 / 9 / 9b.

For the long-format with dimensions sub-case specifically (e.g. one row per `(country, year, <dim1>, <dim2>)`), use the modern OWID pattern:

- `tb.format(["country", "year", <dim1>, <dim2>, ...], sort_columns=True)`.
- `paths.regions.add_aggregates(tb, index_columns=[...full key...], regions=REGIONS, aggregations={...})`.
- `<% if <dim> == "X" and <dim2> == "Y" %>...<% endif %>` Jinja blocks inside `title`, `description_short`, `display.name`. Grep this repo for `tb.format(["country", "year"` with more than two index entries to find current reference examples.

Initial setup
- `workbench/<short_name>/progress.md` exists to determine if continuing existing update
- `workbench/<short_name>` directory if it exists
- `workbench/<short_name>` directory for artifacts

Run ETL update command (etl-update subagent)

- `<namespace>/<old_version>/<short_name>` plus any required flags
- `etl update` ONCE for the full step URI (e.g., `data://garden/namespace/old_version/short_name`). Do NOT run it separately per channel (snapshot, meadow, garden, grapher). Running it once ensures all cross-step DAG dependencies are updated together. Running it per-channel leaves stale version references in `dag/main.yml` (e.g., garden pointing to old meadow version).
- `dag/main.yml`: grep for the old version and confirm all internal references between the new steps point to the new version (e.g., garden depends on new meadow, not old meadow).

1b) Check for outdated practices (check-outdated-practices skill)

- `etl update` creates new step files, run the `/check-outdated-practices` skill on the newly created files
- `if __name__ == "__main__"`, `geo.harmonize_countries()`, `dest_dir`, `paths.load_dependency()`, etc. that were copied from old versions

1c) Catalog `# NOTE:` / `# TODO:` comments in the copied step files (don't resolve yet)
- `rg -n "#\s*(NOTE|TODO|FIXME|HACK|XXX):" snapshots/<namespace>/<new_version>/ etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/`.
- `# NOTE: To learn more about the fields, hover over their names.` at the top of `.meta.yml`).
- `workbench/<short_name>/notes_to_check.md`: one entry per annotation, recording file path, line number, which step it lives in (meadow/garden/grapher), and what the workaround does.

1d) Detect sanity-check logic in the copied step files

Sanity checks live in two different forms; detect both:

- `def sanity_check…` / `sanity_check…(` call sites. Often gated by a module-level boolean flag (`DEBUG`, `SHOW_SANITY_CHECK_LOGS`, `LONG_FORMAT`) that defaults to `False` to keep normal runs quiet. Examples: `etl/steps/data/garden/wb/.../world_bank_pip.py` (`SHOW_SANITY_CHECK_LOGS`), `etl/steps/data/garden/wid/.../world_inequality_database.py` (`DEBUG` + `LONG_FORMAT`), `etl/steps/data/garden/lis/.../luxembourg_income_study.py` (no flag; prints unconditionally via `tabulate`).
- `# Sanity check` / `# Sanity checks` / `# sanity check` marking an inline assertion block that isn't wrapped in a dedicated function. Very common: `etl/steps/data/garden/emdat/.../natural_disasters.py`, `etl/steps/data/garden/emissions/.../national_contributions.py`, `etl/steps/data/garden/irena/.../renewable_capacity_statistics.py`. These usually have no log flag: the block simply runs on every step execution and either passes or raises.

Run a combined sweep:
rg -n -i "def sanity_check|sanity_check\w*\(|#\s*sanity check" \
snapshots/<namespace>/<new_version>/ \
etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/
Append a "Sanity checks" section to workbench/<short_name>/notes_to_check.md listing each hit — for each, record: file path + line number, which form (function vs. inline comment), the name of any log-control flag (function form only), and a one-line description of what's being asserted (read the surrounding 5–10 lines).
Don't act yet — the review happens in step 5b once the garden step has been run on the new data.
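For cases where the catalogue entries need to be built programmatically, the sweep and classification above can be sketched in Python. This is a minimal sketch, not part of the skill itself: the function name `find_sanity_checks`, the `Hit` tuple, and the heuristic that a log-control flag is a module-level ALL_CAPS boolean are all assumptions made for illustration. The function-form pattern uses `\w*` so that `sanity_checks(` call sites are also caught.

```python
import re
from typing import NamedTuple

# Modelled on the rg sweep, split by form. `\w*` also matches `sanity_checks(`.
FUNCTION_RE = re.compile(r"def sanity_check|sanity_check\w*\(", re.IGNORECASE)
INLINE_RE = re.compile(r"#\s*sanity check", re.IGNORECASE)
# Assumption: a log-control flag is a module-level ALL_CAPS name set to True/False.
FLAG_RE = re.compile(r"^([A-Z][A-Z0-9_]*)\s*=\s*(?:True|False)\b")


class Hit(NamedTuple):
    line_no: int
    form: str  # "function" or "inline comment"
    text: str


def find_sanity_checks(source: str) -> tuple[list[Hit], list[str]]:
    """Return (hits, candidate_flags) for one step file's source text."""
    hits: list[Hit] = []
    flags: list[str] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        flag = FLAG_RE.match(line)
        if flag:
            flags.append(flag.group(1))
        if INLINE_RE.search(line):
            hits.append(Hit(line_no, "inline comment", line.strip()))
        elif FUNCTION_RE.search(line):
            hits.append(Hit(line_no, "function", line.strip()))
    return hits, flags
```

Each `Hit` carries the line number and form needed for a `notes_to_check.md` entry; the one-line description of what is asserted still has to come from reading the surrounding lines.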
Create PR and integrate update via subagent (etl-pr)
- `<namespace>/<old_version>/<short_name>`

Snapshot run & compare (snapshot-runner subagent)

- `<namespace>/<new_version>/<short_name>` and `<old_version>`

Meadow step repair/verify (step-fixer subagent, channel=meadow)
Garden step repair/verify (step-fixer subagent, channel=garden)
5b) Review sanity-checks output (only if step 1d catalogued any)

Handling depends on the form catalogued in step 1d.
Function form with a log-control flag (e.g. SHOW_SANITY_CHECK_LOGS, DEBUG):
- `True` at the top of the garden step file.

.venv/bin/etlr data://garden/<namespace>/<new_version>/<short_name> --private --force --only \
  > workbench/<short_name>/sanity_checks.log 2>&1

- `AssertionError`, `error`, `warning`, `dropped`, outliers flagged by country/year, unexpected totals. Surface actionable findings in the PR description under a "Sanity-check findings" collapsed section.
- `False`) before committing. Verify with `git diff` that the garden file has no unintended changes.

Function form with no flag, or inline `# Sanity check(s)` comment blocks:

- `assert`, `raise`) or by logging (`paths.log.warning`, `.critical`, even `.fatal`). Logging variants do NOT fail the step, so "step 5 passed" is not proof that every invariant held. If the block uses logging, re-run the step and scan stdout/stderr for the relevant keywords; don't trust the exit code alone.
- `.venv/bin/python` snippet.

In either form: if `sanity_checks` raise `AssertionError` on the new data, stop and decide with the user whether the assertion needs a threshold bump, whether upstream data genuinely broke, or whether the invariant is obsolete. If the check only logs, treat a new/expanding set of warnings the same way; they're the signal the sanity check was written to produce.
Watch for silent-delete patterns. Some sanity_checks functions also mutate the table — e.g. world_bank_pip's sanity_checks drops rows that fail invariants and reports the count via the log-control flag. With the flag off the deletions still happen; the reviewer just never learns which rows disappeared. When reading a sanity_checks function, scan for drop, filter, tb = tb[...] — anything that removes rows — and list every deletion in the PR body, not just the warning counts. If the deletion seems newly applicable to upstream fixes (e.g. the row should no longer be anomalous in the new release), that's a candidate for removing the workaround entirely.
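The scan for row-removing statements described above can be approximated with a few regular expressions. This is a rough sketch under stated assumptions: the function name `find_row_removals` and the exact patterns are illustrative, and the heuristic will over-report (for example `drop(columns=...)` removes columns, not rows), so every finding still needs a human read.

```python
import re

# Heuristic patterns for the three shapes named in the text.
REMOVAL_PATTERNS = {
    "drop": re.compile(r"\.drop\w*\("),  # .drop(, .dropna(, .drop_duplicates(
    "filter": re.compile(r"\.filter\("),
    "mask reassignment": re.compile(r"\b(\w+)\s*=\s*\1\["),  # tb = tb[...]
}


def find_row_removals(function_source: str) -> list[tuple[int, str, str]]:
    """Return (line_no, pattern_label, stripped_line) for each suspicious line."""
    findings = []
    for line_no, line in enumerate(function_source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        for label, pattern in REMOVAL_PATTERNS.items():
            if pattern.search(code):
                findings.append((line_no, label, line.strip()))
    return findings
```

Feeding it the body of a `sanity_checks` function gives a starting list for the PR body; whether each hit really deletes rows is a judgment call the sketch cannot make.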
5c) Country harmonization audit
Run after the garden step completes (and after 5b if it ran). Verifies that the country entities reaching the garden output are canonical, and that the mappings/exclusions consumed by paths.regions.harmonize_names(...) are well-formed. Output: workbench/<short_name>/harmonization_audit.md.
Modern API. Garden steps should be calling paths.regions.harmonize_names(tb, country_col=..., countries_file=..., excluded_countries_file=...) — the wrapper in etl/data_helpers/geo.py:1874. If you find a step still using the deprecated geo.harmonize_countries(...) directly, step 1b's /check-outdated-practices should already have flagged it; treat that as a separate cleanup. The audit below is API-agnostic — both call sites end up emitting the same three warning strings. Some garden steps don't use the harmonizer at all and instead assign country inline in Python (no .countries.json involved); for those, the JSON checks below have nothing to look at — the garden-output check in step 5 is what catches non-canonical entities, so always run it.
Source of truth. Canonical names come from two datasets, both consulted by the harmonizer:
- `data/garden/regions/2023-01-01/regions`: countries, continents, and OWID-defined aggregates. The runtime authority is `paths.regions.tb_regions["name"]`. This is built from `etl/steps/data/garden/regions/2023-01-01/regions.yml` plus a merge with `regions.codes.csv` and field defaults; don't parse the YAML in isolation or you'll miss the legacy entries and produce false positives.
- `data/garden/wb/<latest>/income_groups`: the four World Bank income-group aggregates (High-income countries, Upper-middle-income countries, Lower-middle-income countries, Low-income countries). OWID treats the latest version of this dataset as the official one, so the audit must resolve the version dynamically (don't pin a date; it goes stale when WB publishes a refresh). The names live in the `classification` column of the `income_groups_latest` table.

The audit's "canonical" set is the union of these two. A `.countries.json` entry looks like `"Source name": "Target name"`; the audit checks that every target name (the value the source gets harmonized to) appears in either dataset. Anything else is flagged.
Capture a fresh garden log:
.venv/bin/etlr data://garden/<namespace>/<new_version>/<short_name> --private --force --only \
> workbench/<short_name>/harmonization.log 2>&1
Scan the log for the three harmonization warnings. These are emitted by etl/data_helpers/geo.py (excluded list) and lib/datautils/owid/datautils/dataframes.py (mapping warnings) — the wording is stable:
rg -n "missing values in mapping\.|unused values in mapping\.|Unknown country names in excluded countries file:" \
workbench/<short_name>/harmonization.log
For each warning, the entity list follows on subsequent lines (because harmonize_countries() is called with show_full_warning=True by default). Capture them.
Validate .countries.json target names against canonical regions + income groups. Each entry maps a source name (key) to a target / harmonized name (value); this check looks at the values. For each garden step in this update:
import json
from pathlib import Path

from owid.catalog import Dataset

# Resolve the canonical regions dataset dynamically (latest built version).
# Don't pin a date: when the regions step version advances, a hard-coded path
# would validate against a stale catalog and flag valid targets as non-canonical.
regions_dirs = sorted(Path("data/garden/regions").glob("*/regions"))
if not regions_dirs:
    raise RuntimeError(
        "No data/garden/regions/<version>/regions built locally; the audit can't "
        "run without the canonical regions catalog. Build it first with "
        "`.venv/bin/etlr data://garden/regions/<latest>/regions --private`."
    )
tb_regions = Dataset(str(regions_dirs[-1]))["regions"]
canonical_regions = set(tb_regions["name"].dropna().astype(str))

# Add OWID's official income-group aggregates to the canonical set, if available.
# OWID treats the latest income_groups version as official. This artifact is
# often not built locally during a non-income-groups dataset refresh, so degrade
# gracefully (warn and skip) rather than aborting the audit.
ig_dirs = sorted(Path("data/garden/wb").glob("*/income_groups"))
if ig_dirs:
    ds_ig = Dataset(str(ig_dirs[-1]))
    canonical_income = set(ds_ig["income_groups_latest"]["classification"].dropna().astype(str).unique())
else:
    print(
        "[WARN] No data/garden/wb/<version>/income_groups built locally; "
        "skipping income-group enrichment. The four WB income-group aggregates "
        "(High/Upper-middle/Lower-middle/Low-income countries) may surface as "
        "'not in canonical' until you build that dataset."
    )
    canonical_income = set()

canonical = canonical_regions | canonical_income

mapping = json.loads(Path("etl/steps/data/garden/<namespace>/<new_version>/<short_name>.countries.json").read_text())
not_in_canonical = sorted({v for v in mapping.values() if v and v not in canonical})
print("Targets not in OWID's canonical regions or income groups:", not_in_canonical)
A non-empty not_in_canonical list means the mapping points at entities that aren't registered in either the regions catalog or the income-groups dataset. This isn't automatically a bug — it's a heads-up. Stop and decide with the user before proceeding — same pattern as the global "Checkpoints — when to pause" section at the top of this skill. Common causes (in order from "fix" to "accept"): typo, retired alias used as canonical, casing/whitespace mismatch, or a legitimately custom aggregate the source defines that OWID has no equivalent for (e.g. ILO's " (ILO)"-suffixed regions, World Bank's " (WB)"-suffixed sub-Saharan splits, BRICS, G7, G20). For typos/casing — fix the JSON. For legitimately custom aggregates — accept and note in the PR description that those entities live outside the canonical system and won't merge with population/regions infrastructure. For a real new historical region — add an entry to regions.yml in a separate PR.
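To speed up the triage of a non-empty list, a small helper can guess which of the listed causes applies before the user is asked. The sketch below is an illustration only: `triage_target`, the cause labels, and the 0.8 similarity cutoff are assumptions, and a "possible typo" suggestion is a hint to review, not a fix to apply automatically. It cannot detect a retired alias; that still requires checking the regions catalog.

```python
from __future__ import annotations

import difflib


def triage_target(name: str, canonical: set[str]) -> tuple[str, str | None]:
    """Guess why `name` is non-canonical; return (likely_cause, suggestion)."""
    # Compare on a normalized form so casing and stray whitespace don't matter.
    normalized = {c.strip().casefold(): c for c in canonical}
    key = name.strip().casefold()
    if key in normalized:
        return "casing/whitespace mismatch", normalized[key]
    # Fuzzy match for likely typos; the cutoff is an arbitrary illustrative value.
    close = difflib.get_close_matches(key, list(normalized), n=1, cutoff=0.8)
    if close:
        return "possible typo", normalized[close[0]]
    return "custom aggregate or unregistered entity", None
```

Run it over each entry in `not_in_canonical` and present the grouped results, so the user sees which entries look like JSON fixes and which look like custom aggregates to accept.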
Audit .excluded_countries.json. The file is optional; skip if it doesn't exist:
excluded_path = Path("etl/steps/data/garden/<namespace>/<new_version>/<short_name>.excluded_countries.json")
if excluded_path.exists():
    excluded = json.loads(excluded_path.read_text())
    suspicious_canonical = sorted(set(excluded) & canonical)
    # Also surface continents and aggregates separately for review
    aggregates = set(tb_regions[tb_regions["region_type"].isin(["continent", "aggregate"])]["name"].dropna().astype(str))
    suspicious_aggregates = sorted(set(excluded) & aggregates)
    print("Excluded entries that ARE canonical regions:", suspicious_canonical)
    print("Excluded entries that are continents/aggregates:", suspicious_aggregates)
    print("Full excluded list for review:", sorted(excluded))
suspicious_canonical is the actionable signal: each entry is a known country/region that we are dropping. Sometimes this is intentional (e.g. dropping "World" rows because the source double-counts them) — surface, don't auto-fix. Pause and ask the user if the list is non-empty. The full list is dumped so the LLM can also eyeball it for entities that aren't in canonical but look like real countries (typos, alternative names) we should be mapping rather than dropping.
Audit garden output entities. Always run this check, regardless of whether .countries.json exists or is populated — JSON mappings describe inputs to the harmonizer, but the entities that actually reach Grapher are whatever sits in the country column/index of the built garden tables. Inline country assignments (e.g. hardcoded tb["country"] = "England and Wales") and post-harmonization mutations both bypass the JSON check entirely; this is the only step that catches them.
from pathlib import Path

from owid.catalog import Dataset

garden_dir = Path("data/garden/<namespace>/<new_version>/<short_name>")
ds_garden = Dataset(str(garden_dir))
entities: set[str] = set()
for tname in ds_garden.table_names:
    tb = ds_garden[tname]
    # `country` can live in the index (after .format()) or as a regular column.
    if "country" in tb.index.names:
        entities.update(tb.index.get_level_values("country").dropna().astype(str).unique())
    elif "country" in tb.columns:
        entities.update(tb["country"].dropna().astype(str).unique())
    # tables with no country column are silently skipped (e.g. reference tables)

# `canonical` is the set built in the .countries.json check above.
output_not_in_canonical = sorted(entities - canonical)
print("Garden output entities not in OWID's canonical regions or income groups:",
      output_not_in_canonical)
Same triage rules as the JSON-targets check (Python check #3): typo / casing / alias / legitimately custom aggregate. A non-empty list means at least one entity that ships to Grapher isn't registered in either the regions catalog or the income-groups dataset. Stop and decide with the user before proceeding. Common fixes: typo or casing → patch the inline assignment (or .countries.json, whichever is the source) so the value matches the canonical name; alias → switch to the canonical name; legitimate custom aggregate → accept and note in the PR description that the entity lives outside the canonical system.
Write findings to workbench/<short_name>/harmonization_audit.md with six sections, populated only when non-empty. Each section must list every flagged entity, not just a count — counts alone aren't actionable, the user (or you) needs to read the actual names to judge whether each is intentional. For long lists (>20 entries) group by pattern when the grouping is obvious (e.g. ILO's " (ILO)"-suffixed regions vs. international orgs vs. derived "World ..." aggregates) so the reviewer can scan categories instead of one flat list. Sections:
- `## Missing in mapping`: countries in source data not in `.countries.json` (from log warning #1); list each missing source name
- `## Unused mappings`: `.countries.json` entries the data never used (warning #2); list each unused source→target pair
- `## Unknown excluded entries`: `.excluded_countries.json` entries not present in source data (warning #3); list each
- `## Targets not in OWID's canonical regions or income groups`: target names from `.countries.json` that aren't registered in either dataset (Python check #3); list each target name and the source names that map to it
- `## Excluded entries matching canonical regions`: possible over-exclusion (Python check #4); list each
- `## Garden output entities not in OWID's canonical regions or income groups`: distinct country values found in the built garden tables that aren't in canonical regions or income groups (Python check #5); list each entity

Surface in PR. If any section was populated, add a collapsed "Harmonization audit" section to the PR description (after the per-step sections, before the Slack announcement) with the same listings, not just a summary. Empty sections can be omitted.
When you report progress to the user during the workflow, never just give a count — always include the list (or grouped categories) so they can judge in one glance.
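The reporting rules above (omit empty sections, list every entity, group long lists by pattern) can be sketched as a small renderer for `harmonization_audit.md`. Everything here is an assumption for illustration: the function names, the choice to group by a trailing parenthesised suffix such as " (ILO)", and the "Other" bucket for entities without one.

```python
import re
from collections import defaultdict

# Assumed grouping rule: a trailing "(...)" suffix such as " (ILO)" or " (WB)".
SUFFIX_RE = re.compile(r"\(([^()]+)\)\s*$")


def group_by_suffix(entities: list[str]) -> dict[str, list[str]]:
    """Bucket entity names by their trailing parenthesised suffix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in entities:
        match = SUFFIX_RE.search(name)
        groups[match.group(1) if match else "Other"].append(name)
    return dict(groups)


def render_audit(sections: dict[str, list[str]], group_threshold: int = 20) -> str:
    """Render non-empty sections as markdown, always listing the names."""
    lines: list[str] = []
    for heading, entities in sections.items():
        if not entities:
            continue  # sections are populated only when non-empty
        lines.append(f"## {heading}")
        if len(entities) > group_threshold:
            for group, names in sorted(group_by_suffix(entities).items()):
                lines.append(f"- {group}: " + ", ".join(sorted(names)))
        else:
            lines.extend(f"- {name}" for name in sorted(entities))
        lines.append("")
    return "\n".join(lines)
```

The same output can be pasted into the collapsed PR section, since it carries the full listings rather than counts.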
Checkpoint summary:
6a) Re-evaluate # NOTE: / # TODO: items from step 1c against fresh data
Now that meadow, garden, and grapher have run on the new data, go back to workbench/<short_name>/notes_to_check.md and decide each item's fate. For each entry:
- `owid.catalog.Dataset` (or inspect the raw snapshot) and compare corrected vs. uncorrected values. Cross-check the producer's release notes / changelog if available.
- `# NOTE:` / `# TODO:` comments in the same commit, then re-run the affected step (use `--force --only`, add `--grapher` for grapher) so downstream artifacts pick up the change.

Do this before step 6b (metadata checks) so any re-runs triggered by comment-removal happen before the metadata sweep, not after.
6b) Metadata quality checks: run after all ETL steps are built

Run all four checks on the newly built garden and grapher datasets so every issue surfaces together. Each skill writes results to the terminal; fix what comes up before moving on.

- `/check-metadata-typos` scoped to the current step. Run on each of the new `.meta.yml` files (garden first, then grapher). Accept or skip each suggested fix.
- `/check-metadata-spacing` on the built garden and grapher datasets. Catches template artifacts like doubled spaces or stray newlines that only appear after Jinja rendering.
- `/check-metadata-style` on the grapher step. Audits user-facing fields (`title`, `subtitle`, `description_short`, `display.name`, `presentation.*`) against OWID's Writing and Style Guide. Rules live in `.claude/skills/check-metadata-style/STYLE_GUIDE.md`, so no Notion access is needed; if the guide looks out of date, refresh that file from Notion in a separate PR.

OWID readers are not domain experts. Walk each indicator's user-facing fields and flag anything that requires inside knowledge to parse.
| Field | Clarity check |
|---|---|
| title / presentation.title_public | A non-specialist should know what the indicator measures from the title alone. Expand acronyms unless universally known (skip GDP; expand GWIS, MFI, SDG, IHME). Don't cram units into the title. |
| description_short | One or two short sentences: what the metric is and what it covers. No jargon without a gloss. Active voice. The chart subtitle is short by design: no run-on or stacked clauses. |
| description_key | Each bullet should land a distinct, useful fact. Skip filler ("this dataset is widely used"); prefer substantive caveats (coverage gaps, methodology limits, what counts/doesn't count). |
| display.name | Short legend label. Reads naturally on a chart axis/legend; doesn't restate the title. |
| presentation.grapher_config.note | Concise footnote, ≤1 sentence ideally. |
Flag and rewrite when you find:
When a phrasing is ambiguous, propose a concrete rewrite — don't just flag it.
If any skill rewrites a .meta.yml, re-run the affected step so the built catalog reflects the edits. Add --grapher when the affected step is on the grapher channel — without it the local catalog is updated but staging stays stale, so the step 7 indicator upgrade sees the old text.
# garden / meadow:
.venv/bin/etlr <channel>/<namespace>/<new_version>/<short_name> --private --force --only
# grapher:
.venv/bin/etlr grapher/<namespace>/<new_version>/<short_name> --grapher --private --force --only
Then re-run the relevant check to confirm zero remaining violations.
6c) Indicator metadata coverage, dataset block, and link verification

The other quality checks catch content issues; this step catches missing fields and broken URLs before they reach review.
Mandatory fields per indicator. For every indicator in the garden .meta.yml, confirm these are set (either on definitions.common or per-indicator):
| Field | Notes |
|---|---|
| title | Per-indicator |
| unit | Common is fine |
| short_unit | Common is fine |
| description_short | Per-indicator |
| description_key | At least one bullet; usually common |
| processing_level | minor or major |
| presentation.topic_tags | At least one tag |
| display.numDecimalPlaces | Common is fine |
| display.tolerance | Common is fine (chart tolerance for missing years) |
| display.name | Per-indicator; required for legend labels |
| presentation.attribution_short | Set explicitly; does NOT inherit from the origin's attribution_short (verified: MySQL variables.attributionShort stays NULL if it's only on the origin). Place under definitions.common.presentation for the common case. |
Conditional: if processing_level: major, every indicator with that level MUST also have description_processing.
Not mandatory (skip if you don't need them): presentation.title_public, presentation.title_variant, presentation.attribution.
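The coverage rule above (each mandatory field set either per-indicator or on `definitions.common`, plus the `description_processing` condition) can be sketched as a check over already-parsed metadata. This is a hedged illustration: `missing_fields` and `get_path` are hypothetical helpers, the inputs are plain dicts as a YAML parser would produce them, and allowing `description_processing` to come from the common block is an assumption the text does not confirm.

```python
from __future__ import annotations

MANDATORY = [
    "title", "unit", "short_unit", "description_short", "description_key",
    "processing_level", "presentation.topic_tags", "display.numDecimalPlaces",
    "display.tolerance", "display.name", "presentation.attribution_short",
]


def get_path(data: dict, dotted: str):
    """Look up a dotted key such as 'display.name'; return None if absent."""
    current = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def resolve(indicator: dict, common: dict, field: str):
    """Per-indicator value wins; otherwise fall back to definitions.common."""
    value = get_path(indicator, field)
    return value if value is not None else get_path(common, field)


def missing_fields(indicator: dict, common: dict) -> list[str]:
    """List mandatory fields that are unset (None) or empty lists."""
    missing = [
        field for field in MANDATORY
        if resolve(indicator, common, field) in (None, [])
    ]
    # Conditional rule: processing_level 'major' requires description_processing.
    if resolve(indicator, common, "processing_level") == "major":
        if not resolve(indicator, common, "description_processing"):
            missing.append("description_processing")
    return missing
```

Empty strings and zeros count as set here, since values such as `short_unit: ""` or `numDecimalPlaces: 0` can be deliberate.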
Dataset block. Garden .meta.yml MUST include update_period_days:
dataset:
  update_period_days: <N>
This controls the auto-update cadence. Even when the rest of the dataset: block is empty, never strip update_period_days — leave the block in place with just that field.
Link verification. Run a HEAD request on every URL in the new .dvc and .meta.yml files (all channels — meadow .meta.yml files matter when they exist). Anything non-2xx is a signal, not a guaranteed break — always double-check before acting:
for url in $(rg --no-filename -No "https?://[^\"' ]+" snapshots/<namespace>/<new_version>/ etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/ \
| sed -E 's/[).,;:>]+$//' \
| sort -u); do
printf "%s %s\n" "$(curl -sI -L -o /dev/null -w '%{http_code}' --max-time 15 -A 'Mozilla/5.0' "$url")" "$url"
done
The --no-filename flag prevents rg from prepending path: to each match (otherwise the for-loop tries to curl path:url and every check returns 000). -A 'Mozilla/5.0' sometimes coaxes a real response out of Cloudflare-fronted hosts, but it doesn't always work — see the next note.
curl non-2xx ≠ broken. Cloudflare-fronted sites (notably ourworldindata.org) can return 404 to curl on URLs that work fine in a browser, depending on edge-node routing, IP geolocation, and cached state. Before treating a 4xx as a real failure:
- `WebFetch` (the built-in tool). It uses a different code path and a Mozilla/5.0 UA that Cloudflare usually accepts. A 200 with a coherent page body is authoritative; trust it over curl.
- `WebFetch` also fails, sanity-check the Wayback Machine: `https://web.archive.org/web/<year>/<url>`. A recent successful snapshot means the URL is reachable on the public internet and your local route is the problem.
- `WebFetch` and Wayback unable to reach the URL, and even then flag and ask the user before silently rewriting an external link in metadata. Replacing a working link with a "safer" alternative because of a curl false-positive is worse than leaving the original. Apply the same restraint here as the global "Checkpoints — when to pause" section.

Fix any genuinely-non-2xx hit on `url_main`, `url_download`, `license.url`, or URLs referenced from `description` / `description_key` before continuing. The `sed` strips trailing markdown/punctuation chars (`)`, `.`, `,`, `;`, `:`, `>`) so URLs inside `[text](url)` aren't reported as broken because of a stray closing paren.
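The extraction and trailing-punctuation strip performed by the `rg` and `sed` stages can be mirrored in Python when the URL list is needed inside a script. A minimal sketch: `extract_urls` is a hypothetical helper, and the regex stops at any whitespace, which is slightly broader than the shell pattern that only excludes a space.

```python
import re

# Stop at quotes or whitespace, like the shell pattern.
URL_RE = re.compile(r"https?://[^\"'\s]+")
# Same trailing characters the sed expression strips.
TRAILING = ").,;:>"


def extract_urls(text: str) -> list[str]:
    """Return the sorted, de-duplicated URLs found in `text`."""
    return sorted({match.rstrip(TRAILING) for match in URL_RE.findall(text)})
```

As with the `sed` version, a URL that legitimately ends in `)` loses that character, so treat a failure on such a URL as one more reason to double-check before acting.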
Verification. After editing, re-run the affected step (with --grapher if grapher) so the catalog reflects the changes. Then confirm presentation.attribution_short actually landed:
from owid.catalog import Dataset
ds = Dataset("data/grapher/<ns>/<v>/<short_name>")
tb = ds["<table>"]
print(tb["<col>"].metadata.presentation.attribution_short) # must NOT be None
Or after the staging upload:
make query SQL="SELECT shortName, attributionShort FROM variables WHERE catalogPath LIKE '%<ns>/<v>/<short_name>%'"
Indicator upgrade (optional, staging only)
STAGING=<branch> .venv/bin/etlr data://grapher/<namespace>/<new_version>/<short_name> --grapher --private
STAGING=<branch> .venv/bin/etl indicator-upgrade auto
mysql -h "staging-site-<branch>" -u owid --port 3306 -D owid -e "SELECT COUNT(*) FROM chart_dimensions cd JOIN variables v ON cd.variableId = v.id WHERE v.catalogPath LIKE '%<namespace>/<new_version>%'"
If the count is 0, the upgrade did not run; re-run it.

Update context for public announcement
- `workbench/<short_name>/update-context.yml` as the canonical record of facts discovered during the update. Do not wait until the end if a fact is already known; append/update as each step completes.

dataset:
  namespace: <namespace>
  old_version: <old_version>
  new_version: <new_version>
  short_name: <short_name>
  title: <public dataset title, if known>
  producer: <producer, if known>
source:
  release_date: <snapshot origin date_published, if known>
  next_release: <best-effort, or null>
  url_main: <source page, if known>
  citation_full: <citation, if known>
coverage:
  year_min: <garden min year>
  year_max: <garden max year>
  countries: <distinct countries/entities>
  includes_regions: <true/false>
  sparse_recent_year_note: <note, or null>
charts:
  published_count: <published chart count>
  size_qualifier: <handful|moderate|large|massive>
  selected_views:
    - title: <chart title>
      slug: <chart slug>
      rationale: <why this represents the dataset>
update_summary:
  snapshot_diff: <short summary or artifact path>
  meadow_diff: <short summary or artifact path>
  garden_diff: <short summary or artifact path>
  notable_changes: []
  sanity_check_findings: []
  resolved_workarounds: []
editorial_context:
  why_it_matters_snippets: []
  caveat_snippets: []
  interesting_update_snippets: []
Count only published charts (c.publishedAt IS NOT NULL). Draft/unlisted charts must not be counted in the announcement:
SELECT c.id, cc.slug, cc.full->>'$.title' as title, cc.full->>'$.type' as type, cc.full->>'$.hasMapTab' as hasMapTab
FROM charts c
JOIN chart_configs cc ON cc.id = c.configId
JOIN chart_dimensions cd ON cd.chartId = c.id
JOIN variables v ON cd.variableId = v.id
WHERE v.catalogPath LIKE '%<namespace>/<new_version>%'
AND c.publishedAt IS NOT NULL
GROUP BY c.id
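The number of rows this query returns is what feeds charts.published_count, and charts.size_qualifier is derived from it using the thresholds this section gives (1–9, 10–49, 50–199, 200+). A minimal sketch of that bucketing follows. The function name is hypothetical, and raising on a count below 1 is an assumption, since the section does not say what to record when no charts are published.

```python
def size_qualifier(published_count: int) -> str:
    """Map a published chart count to the size_qualifier label."""
    if published_count < 1:
        # Assumption: the thresholds start at 1, so 0 is treated as an error
        # here rather than silently mapped to a label.
        raise ValueError("no published charts; size_qualifier is undefined")
    if published_count <= 9:
        return "handful"
    if published_count <= 49:
        return "moderate"
    if published_count <= 199:
        return "large"
    return "massive"


if __name__ == "__main__":
    for count in (3, 10, 120, 450):
        print(count, size_qualifier(count))
```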
Set size_qualifier from the published count: 1–9 = handful, 10–49 = moderate, 50–199 = large, 200+ = massive.
Pick selected_views using these criteria (in order of preference):
Slack announcement & PR update
Invoke the data-updates-comms skill with workbench/<short_name>/update-context.yml as input. data-updates-comms is the canonical owner of the Slack form wording, copy-paste format, editorial framing, search URL, and any standalone fallback gathering. Do not duplicate that rendering logic here.
Save the rendered announcement to workbench/<short_name>/slack-announcement.md.
If data-updates-comms reports missing mechanical fields, gather them, update update-context.yml, and re-render rather than inventing values. Ask the user if a missing field requires judgment.
Add a <details> section titled "Slack Announcement" to the PR description, with the file content embedded inside a triple-backtick markdown fence.
Post @codex review as a separate PR comment (not in the PR description) to trigger an automated code review. Use:
gh pr comment <pr_number> --body "@codex review"
Tell the user: "Slack announcement drafted at [workbench/<short_name>/slack-announcement.md](workbench/<short_name>/slack-announcement.md) and added to the PR description. Please review and post it to #data-updates-comms." Always render the path as a markdown link […](…), not as inline code; the chat UI renders it as clickable that way.
9b) Data update post (for OWID /latest)
Draft the short reader-facing post that gets published on https://ourworldindata.org/latest. The team drafts these in Google Docs in the shared /Data updates Drive folder (https://drive.google.com/drive/folders/1oL0uLHKI6f2qi1rJA6-qFFRYEBw_-rfm), and OWID's CMS ingests the doc into the published feed.
The skill's job is to produce paste-ready Google Doc content in the exact CMS format the team uses (frontmatter title / excerpt / type / authors / kicker → \[+body\] marker → body prose with inline markdown links → {.cta} block → {.image} block → \[\] end marker). Don't invent your own format — every published post in the Drive folder follows the same shape.
This is separate from the Slack announcement — that one is a 10-field form for the internal channel; this one is a mini-blog-post for OWID readers, and the format is structured for CMS ingestion.
Steps:
Read .claude/skills/update-dataset/data-update-template.md and follow it. The template has the exact paste-ready format plus three worked examples (NVIDIA, H5N1, World Bank PIP) lifted verbatim from the Drive folder.
Gather the inputs from workbench/<short_name>/update-context.yml (step 8): dataset.title, dataset.producer, source.url_main, source.citation_full, coverage.*, charts.published_count, charts.selected_views, and the editorial_context.* snippet lists. Also pull from workbench/<short_name>/slack-announcement.md (step 9 output); the editorial framing already drafted there is the closest cousin. If a field needed for the post isn't yet in update-context.yml, gather it (snapshot DVC, garden .meta.yml, or url_main via WebFetch) and persist it back to the YAML so the next consumer doesn't redo the work.
Use *italics* for emphasis, sparingly.
https://ourworldindata.org/grapher/<slug>.
https://ourworldindata.org/search?datasetProducts=<URL-encoded dataset title>. The value is the dataset title, resolved with this priority: (a) the dataset.title field in the garden .meta.yml if it's set there (an override), otherwise (b) the meta.origin.title field in the snapshot .dvc. It often includes a parenthetical acronym like Luxembourg Income Study (LIS) or World Bank Poverty and Inequality Platform (PIP). It is not the bare producer field.
https://ourworldindata.org/explorers/<name>.
/sdgs).
/collection/custom?charts=… URLs.
Image filename: YYYY-MM-data-update-<slug>.png (e.g. 2026-04-data-update-h5n1-flu.png). The skill doesn't generate the image; the user adds it to the Doc separately.
Save the draft to workbench/<short_name>/data-update.md.
Add a <details> section titled "Data update post (for OWID /latest)" to the PR description, placed after the Slack-announcement section, with the file content embedded inside a triple-backtick markdown fence.
Tell the user: "Data update post drafted at [workbench/<short_name>/data-update.md](workbench/<short_name>/data-update.md) in the Google Docs CMS format. Please create a new Google Doc in /Data updates, paste the draft, attach the chart screenshot, and share for review." Always render workbench/<short_name>/data-update.md as a markdown link […](…) rather than as a bare path or inline-code path; the chat UI renders it as clickable that way.
After requesting @codex review, poll for inline review comments:
gh api repos/owid/etl/pulls/<pr_number>/comments | python3 -m json.tool
gh api graphql -f query='{ repository(owner:"owid", name:"etl") { pullRequest(number:<pr_number>) { reviewThreads(first:20) { nodes { id isResolved comments(first:1) { nodes { body } } } } } } }'
gh api graphql -f query='mutation { resolveReviewThread(input:{threadId:"<thread_id>"}) { thread { id isResolved } } }'
gh api repos/owid/etl/pulls/<pr_number>/comments/<comment_id>/replies -f body="<explanation>"
gh api graphql -f query='mutation { resolveReviewThread(input:{threadId:"<thread_id>"}) { thread { id isResolved } } }'
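A GraphQL response mirrors the shape of the query, so the reviewThreads query above can be post-processed to list only the threads that still need attention. The sketch below assumes the response has already been captured as a string (for example from the gh command's standard output). The function name unresolved_threads is hypothetical and not part of gh or the ETL repo.

```python
import json


def unresolved_threads(response_text: str) -> list[tuple[str, str]]:
    """Return (thread_id, first_comment_body) for each unresolved review thread."""
    payload = json.loads(response_text)
    threads = payload["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
    pending = []
    for thread in threads:
        if thread["isResolved"]:
            continue
        comments = thread["comments"]["nodes"]
        # The query asks for comments(first:1), so at most one body is present.
        first_body = comments[0]["body"] if comments else ""
        pending.append((thread["id"], first_body))
    return pending


if __name__ == "__main__":
    example = json.dumps({"data": {"repository": {"pullRequest": {"reviewThreads": {"nodes": [
        {"id": "T1", "isResolved": True, "comments": {"nodes": [{"body": "fixed"}]}},
        {"id": "T2", "isResolved": False, "comments": {"nodes": [{"body": "check dep"}]}},
    ]}}}}})
    for thread_id, body in unresolved_threads(example):
        print(thread_id, body)
```

Each returned thread id is the value to substitute for <thread_id> in the resolveReviewThread mutation.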
Commit and push incrementally as you go — after each step that produces code changes. Don't wait until the end. Use descriptive commit messages with appropriate emojis (the one auto-prepended by etl pr for the chosen category + 🤖 for AI-written code).
At the end of the workflow, update the PR description with:
Tracks: [owid/owid-issues#NNNN](https://github.com/owid/owid-issues/issues/NNNN). Most data updates have a corresponding owid-issues ticket; try to find it by searching the title or <short_name> first, and if you can't locate one, ask the user for the issue number rather than skipping the link silently.
After completing the update, check if any other datasets depend on the old version of the updated dataset:
rg "<namespace>/<old_version>/<short_name>" dag/ -g "*.yml" | grep -v "^dag/archive"
Filter out the old dataset's own DAG entries (snapshot → meadow → garden → grapher chain). Any remaining references are downstream dependents that still point to the old version.
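The filtering described above can also be done on a parsed DAG instead of on rg output. The sketch below assumes the DAG has already been loaded into a plain mapping of step URI to a list of dependency URIs; how the YAML is loaded is left out, and the function name downstream_dependents is hypothetical.

```python
def downstream_dependents(dag: dict[str, list[str]], old_ref: str) -> dict[str, list[str]]:
    """Find steps that depend on old_ref without belonging to the old dataset's own chain.

    old_ref is the "<namespace>/<old_version>/<short_name>" fragment.
    """
    dependents = {}
    for step, dependencies in dag.items():
        if old_ref in step:
            # The old dataset's own snapshot/meadow/garden/grapher entry.
            continue
        stale = [dep for dep in dependencies if old_ref in dep]
        if stale:
            dependents[step] = stale
    return dependents


if __name__ == "__main__":
    example_dag = {
        "data://meadow/ns/2023-01-01/name": ["snapshot://ns/2023-01-01/name.csv"],
        "data://garden/ns/2023-01-01/name": ["data://meadow/ns/2023-01-01/name"],
        "data://garden/other/2024-05-05/combined": ["data://garden/ns/2023-01-01/name"],
    }
    print(downstream_dependents(example_dag, "ns/2023-01-01/name"))
```

Because the check only excludes steps whose own URI contains the old reference, it would also flag a new-version step that still points at the old meadow or snapshot.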
If downstream dependents exist:
After the ETL update, etl update appends the new version entries to the bottom of the main DAG file while the old version's entries stay in their original slot. Always ask the user whether to archive — but never skip this checklist item, and when the user agrees, always do the reorder too (not just the archive).
Workflow when the user agrees:
Move the old version's entries from the main DAG file (e.g., dag/poverty_inequality.yml) to the bottom of the corresponding archive file (dag/archive/<same_file>.yml). Include the original section comment (e.g., # 1000 Binned Global Distribution (World Bank PIP)) above the archived entries.
Move the new version's entries from the bottom of the main DAG file into the slot the old entries occupied, under the section comment.
Verify that rg "<namespace>/<old_version>/<short_name>" dag/ -g "*.yml" | grep -v "^dag/archive" returns nothing, and that rg "<namespace>/<new_version>/<short_name>" dag/ -g "*.yml" shows the entries only in the main file (under the section comment), not at the bottom.
Run make check and commit with 🔨🤖 Archive old <name> entries and reorder DAG.
This is the last step, after the DAG archive has been committed. Don't auto-run these; they're human-judgment tools. Hand off the two staging links so the user can review and click through:
http://staging-site-<container_branch>/etl/wizard/anomalist
http://staging-site-<container_branch>/etl/wizard/chart-diff
Important: derive <container_branch> correctly. The staging hostname is not simply staging-site-<branch>. The container name is produced by get_container_name(branch) in etl/config.py:
Replaces /, ., _ with - in the branch name.
Strips staging-site- if present.
Truncates to 28 characters and strips any trailing -.
Branches over 28 chars therefore get clipped. Example: data-military-expenditure-2026 (30 chars) → container data-military-expenditure-20 → hostname staging-site-data-military-expenditure-20. The simplest way to get the correct value is to call the helper:
.venv/bin/python -c "from etl.config import get_container_name; print(get_container_name('<branch>'))"
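For intuition only, the normalisation described above can be re-expressed as a few lines of standalone Python. This is a sketch of the rules as stated in this section, not the code in etl/config.py. The function names are hypothetical, and treating staging-site- as a prefix and stripping a trailing hyphen after truncation are assumptions. When in doubt, the real get_container_name helper is authoritative.

```python
MAX_BRANCH_CHARS = 28
PREFIX = "staging-site-"


def container_branch_sketch(branch: str) -> str:
    """Approximate the <container_branch> part of the staging hostname."""
    normalized = branch.replace("/", "-").replace(".", "-").replace("_", "-")
    if normalized.startswith(PREFIX):
        # Assumption: the prefix is removed so it is not doubled in the hostname.
        normalized = normalized[len(PREFIX):]
    # Clip to 28 characters, then drop any hyphen left dangling at the end.
    return normalized[:MAX_BRANCH_CHARS].rstrip("-")


def staging_host_sketch(branch: str) -> str:
    """Build the hostname used in the Wizard links."""
    return f"{PREFIX}{container_branch_sketch(branch)}"


if __name__ == "__main__":
    print(staging_host_sketch("data-military-expenditure-2026"))
```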
Tell the user something like: "Final QA: please review Anomalist and Chart Diff in the Wizard. If anything looks off, let me know and I'll investigate."
These pages need a fresh staging build, so they're only meaningful after the PR's grapher upload to staging has completed and the staging server has rebuilt.
After etl update, always verify that all new steps in dag/main.yml reference each other with the new version. A common bug is garden depending on old meadow or old snapshot; this silently loads stale data.
Watch for index columns left over from reset_index(); format tables with tb.format(["country", "year"]) as appropriate.
When pd.read_csv produces an object column that mixes strings and NaN (common for sparse text columns like sources/comments/punishments), the feather repacker rejects it. Cast those columns to pandas "string" dtype before tb.format(...).
paths.regions auto-resolves DAG dependencies: paths.regions.add_population(tb) and paths.regions.add_aggregates(tb, regions=[...]) pick up the population and income_groups datasets directly from the DAG. Don't call paths.load_dataset("population") and pass it through unless the helper specifically asks for the dataset; the parameter is unused.
For income-group aggregates, add the income-group names (High-income countries, Upper-middle-income countries, Lower-middle-income countries, Low-income countries) to your REGIONS list and add data://garden/wb/<latest>/income_groups to the DAG. paths.regions.add_aggregates(...) auto-resolves the classification.
Add codebook-driven structural assertions: compute the count from the data (e.g. groupby(...).max() == 0) and assert the count matches the codebook. A coding change in the source then surfaces as a test failure instead of silently shipping noise.
processing_level: major requires description_processing: keep processing_level: minor as the common default and override to major only on indicators that have a description_processing field. Don't blanket-set major on the common block and then leave country-level proportions without their own processing note.
description_key in definitions.common propagates only to indicators without their own list: if you want a bullet to appear on every indicator, either keep it on common.description_key and don't define per-indicator lists (it inherits), or prepend it explicitly to each per-indicator list (a "first bullet" pattern).
Audit sort: labels against the unique values that actually appear in the data. Phantom labels (declared in sort: or in a category map but never produced) clutter chart legends with empty buckets. Either drop them from sort: and description_key, or remove them from the map if they can never occur given the data shape. Re-run the audit on every data refresh; phantoms can reappear when a category is dropped upstream.
Leave # NOTE: comments for the next maintainer when behaviour is data-conditional: when something in the code holds only because of the current data shape (e.g. "only 4 indicators have an EoE=0 row", "only Brazil 2025 is a transition-year artefact"), leave a # NOTE: comment near the relevant block asking the next data update to re-audit. This helps future maintainers spot which assumptions might decay before they bite.
To map indicators manually, call WizardDB.add_variable_mapping(mapping={old_id: new_id, ...}, dataset_id_old=..., dataset_id_new=..., comments="...") with the explicit pairs, then run from apps.indicator_upgrade.upgrade import cli_upgrade_indicators; cli_upgrade_indicators(dry_run=True) to preview affected charts, and (dry_run=False) to apply. Mappings stay in the wizard DB until WizardDB.delete_variable_mapping() is called, so a slug-collision failure can be recovered by fixing the slug and rerunning the upgrade; only un-upgraded charts get reattempted. The active staging DB is inferred from the current git branch.
When the source's shape changes substantially between versions, etl update --rename is the wrong starting point: the structure of meadow/garden/grapher needs to follow the new shape, and the rename flow will only produce confusion. Spot this fork early at the snapshot/codebook stage, before running etl update. Scaffold the new chain via the create-etl-steps skill (wraps the wizard's cookiecutter templates) or launch the wizard UI with etlwiz and use its "ETL Steps" page; both produce a consistent meadow/garden/grapher skeleton to fill in. Once scaffolded, read the v1 scripts as a reference for the source-specific logic that's still relevant (column-rename maps, status/category normalisations, country harmonisation map, sanity checks, codebook-driven structural assertions). Don't copy the v1 structure blindly, but port the bits that still apply to the new schema.
When the update is review-heavy and you need iterative back-and-forth with a topic owner over staging, see the report-indicator-changes skill for drafting the message.
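The phantom-label audit mentioned in the notes above (comparing declared sort: labels against the values that actually appear in the data) reduces to a set comparison. The sketch below is a library-free illustration. The function name audit_labels is hypothetical, and the observed values are assumed to have been extracted from the table already, for example as the unique non-null values of the categorical column.

```python
def audit_labels(declared: list[str], observed: list[str]) -> tuple[list[str], list[str]]:
    """Compare declared category labels with the labels present in the data.

    Returns (phantoms, undeclared):
      phantoms   - declared labels that never occur in the data, in declared order
      undeclared - labels found in the data but missing from the declaration, sorted
    """
    observed_set = set(observed)
    declared_set = set(declared)
    phantoms = [label for label in declared if label not in observed_set]
    undeclared = sorted(observed_set - declared_set)
    return phantoms, undeclared


if __name__ == "__main__":
    sort_labels = ["Legal", "Restricted", "Illegal", "No data"]
    values_in_data = ["Legal", "Illegal", "Legal", "Varies"]
    phantoms, undeclared = audit_labels(sort_labels, values_in_data)
    print("phantom labels:", phantoms)
    print("undeclared labels:", undeclared)
```

Running a check like this inside the garden step on every refresh is one way to notice when an upstream category has been dropped.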
workbench/<short_name>/snapshot-runner.md
workbench/<short_name>/progress.md
workbench/<short_name>/notes_to_check.md (one entry per carried-over # NOTE: / # TODO:, plus detected sanity_checks functions and their log-control flags)
workbench/<short_name>/sanity_checks.log (only if step 5b ran)
workbench/<short_name>/meadow_diff_raw.txt and meadow_diff.md
workbench/<short_name>/garden_diff_raw.txt and garden_diff.md
workbench/<short_name>/harmonization.log and harmonization_audit.md (from step 5c)
workbench/<short_name>/indicator_upgrade.json (if indicator-upgrader was used)
workbench/<short_name>/update-context.yml (canonical facts gathered during the update; consumed by data-updates-comms)
workbench/<short_name>/slack-announcement.md
workbench/<short_name>/data-update.md (public-facing post draft for OWID /latest, from step 9b)
update-dataset data://snapshot/irena/2024-11-15/renewable_power_generation_costs 2023-11-15 update-irena-costs
food_prices_for_nutrition.meta.yml)
index columns from reset_index(); ensure proper indexing with tb.format(["country", "year"]).