Name: Update Dataset
Author: owid

// End-to-end dataset update workflow with PR creation, snapshot, meadow, garden, and grapher steps. Use when user wants to update a dataset, refresh data, run ETL update, or mentions updating dataset versions.


stars:148

forks:31

updated: 20 May 2026, 16:03

package.json

"author": "owid"

"repository": "owid/etl"


name	update-dataset
description	End-to-end dataset update workflow with PR creation, snapshot, meadow, garden, and grapher steps. Use when user wants to update a dataset, refresh data, run ETL update, or mentions updating dataset versions.
metadata	{"internal":true}

Update Dataset (PR → snapshot → steps → grapher)

Use this skill to run a complete dataset update with Claude Code subagents, keep a live progress checklist, and pause for user approval only when something needs attention.

Inputs

<namespace>/<old_version>/<short_name>
Get <new_version> as today's date by running date -u +"%Y-%m-%d"

Optional trailing args:

branch: The working branch name (defaults to current branch)
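
A minimal sketch of how the inputs above might be resolved, assuming the step reference always has exactly three slash-separated parts. `StepRef`, `parse_step_ref`, and the example reference are hypothetical names for illustration, not part of the ETL codebase; `today_utc` mirrors the `date -u +"%Y-%m-%d"` command.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class StepRef(NamedTuple):
    """Hypothetical container for the three parts of the input reference."""

    namespace: str
    version: str
    short_name: str


def parse_step_ref(ref: str) -> StepRef:
    """Split '<namespace>/<old_version>/<short_name>' into its parts."""
    parts = ref.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected <namespace>/<old_version>/<short_name>, got: {ref!r}")
    return StepRef(*parts)


def today_utc() -> str:
    """Equivalent of `date -u +"%Y-%m-%d"`: today's date in UTC."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")


# Hypothetical reference, used only to show the call shape.
old = parse_step_ref("example_namespace/2024-01-01/example_dataset")
new_version = today_utc()
print(old.namespace, old.version, old.short_name, "->", new_version)
```

Taking the date in UTC keeps `<new_version>` stable regardless of the machine's local timezone.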

Assumptions:

All artifacts are written to workbench/<short_name>/.
Persist progress to workbench/<short_name>/progress.md and update it after each step.
Persist reusable update facts to workbench/<short_name>/update-context.yml as they are discovered. This is the canonical context artifact for the PR description, review handoff, and data-updates-comms.

Progress checklist (maintain, tick live, and persist to progress.md)

Persistence:

After ticking each item, update workbench/<short_name>/progress.md with the current checklist state and a timestamp.
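
The persistence rule above can be sketched as follows. This is an illustration only: the checklist is assumed to be a list of `(label, done)` pairs, and `render_progress` / `persist_progress` are hypothetical helper names. The layout of progress.md is not specified by this skill, so the format below is an assumption.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def render_progress(items: list[tuple[str, bool]], now: Optional[datetime] = None) -> str:
    """Render the checklist as markdown checkboxes under a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    lines = [f"Last updated: {now.strftime('%Y-%m-%d %H:%M:%S')} UTC", ""]
    lines += [f"- [{'x' if done else ' '}] {label}" for label, done in items]
    return "\n".join(lines) + "\n"


def persist_progress(workbench_dir: Path, items: list[tuple[str, bool]]) -> Path:
    """Overwrite <workbench_dir>/progress.md with the current checklist state."""
    workbench_dir.mkdir(parents=True, exist_ok=True)
    path = workbench_dir / "progress.md"
    path.write_text(render_progress(items), encoding="utf-8")
    return path
```

Calling `persist_progress(Path("workbench/<short_name>"), items)` after each tick keeps the file in sync, so an interrupted run can be resumed from the last recorded state.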

Checkpoints — when to pause

Default: keep going. Run through the full workflow without stopping unless one of the conditions below is met.

Stop and ask the user when:

A step fails and the fix is ambiguous (multiple reasonable approaches, or you're unsure of the correct one)
Data structure changed significantly (columns removed/renamed, large row count drops, schema changes that may affect charts)
Country harmonization has new unmatched countries that need manual decisions
The snapshot requires a manual download or credentials you don't have
Indicator upgrade had imperfect matches (< 100% similarity) that need human review
Anything that could silently break charts or lose data

Don't stop for:

Routine assertion count updates (just update them and note in the summary)
Clean step runs with only row increases
Expected warnings (SettingWithCopyWarning, known unmapped territories)
Straightforward filename/version reference updates

When you do stop, present a concise summary of the issue and what options exist.
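
As an illustration of the "data structure changed significantly" checkpoint, here is a rough sketch that compares the shape of the old and new tables. `TableShape`, `pause_reasons`, and the 5% row-drop threshold are assumptions made for this example; the skill itself does not define what counts as a "large" drop.

```python
from dataclasses import dataclass


@dataclass
class TableShape:
    """Hypothetical summary of a table: its column names and row count."""

    columns: set[str]
    n_rows: int


def pause_reasons(old: TableShape, new: TableShape, max_row_drop: float = 0.05) -> list[str]:
    """Return human-readable reasons to stop; an empty list means keep going."""
    reasons = []
    removed = sorted(old.columns - new.columns)
    if removed:
        reasons.append(f"columns removed or renamed: {removed}")
    if old.n_rows > 0:
        drop = (old.n_rows - new.n_rows) / old.n_rows
        if drop > max_row_drop:  # assumed threshold, tune per dataset
            reasons.append(f"row count dropped by {drop:.0%}")
    return reasons
```

Row increases and added columns produce no reasons, which matches the "don't stop for clean step runs with only row increases" rule.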

When the update isn't a drop-in version bump

Some updates carry structural changes that make the standard rename-only flow the wrong tool. Recognise them up front and adjust the workflow.

Triggers — any of these means you're in restructure territory, not a version bump:

short_name changes (producer rebranded the dataset).
File format/schema changes (wide → long, different file extension with a different column set, new dimensions).
Policy/indicator set changes substantially (splits, dropped composites, newly added areas).
Score semantics change (e.g. binary → continuous with subnational coverage).

Workflow adjustments:

Skip etl update. The rename-only flow copies the old step files into a new folder — useless when the schema is different. Author the new step chain by hand, using the old version as inspiration but not as a starting copy.
Add the new chain to the DAG before archiving the old. Leave both chains active while you build and validate v2; archive the v1 entries only once v2 is on staging and the chart remap is queued or done.
Decide on naming convention upfront. Ask the user whether to preserve v1 short_names where they map cleanly, or to adopt the source's fresh naming scheme. Fresh naming is cleaner but means the auto-Indicator-Upgrader can't help.
Hand-curate the v1 → v2 indicator mapping. When short_names change entirely, the auto-upgrader has nothing to match on, but the Indicator Upgrader also matches on title — so if v2 titles are descriptive (full sentences rather than the bare short_name), you can hand the user a table of v1 title → v2 title pairs and they can drive the chart remap from there. Generate this table from the v1 meta.yml + the v2 grapher catalog.
Defer the Slack and /latest announcements until charts have been remapped. Both posts depend on charts.published_count and charts.selected_views from the v2 chain. Drafting them before the remap gives the wrong count (zero) and no representative views. Tell the user to ping you when the chart remap is done, then run steps 8 / 9 / 9b.
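
To draft the v1 title → v2 title table mentioned above, something like the following could produce a first pass for the user to correct. It uses Python's `difflib` for string similarity, which is a stand-in chosen for this sketch and not the matching logic of the Indicator Upgrader; `propose_title_pairs` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Optional


def propose_title_pairs(
    v1_titles: list[str], v2_titles: list[str]
) -> list[tuple[str, Optional[str], float]]:
    """For each v1 title, suggest the most similar v2 title and a 0..1 score."""

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    rows = []
    for old in v1_titles:
        best = max(v2_titles, key=lambda new: similarity(old, new), default=None)
        score = similarity(old, best) if best is not None else 0.0
        rows.append((old, best, round(score, 2)))
    return rows
```

Low scores are the rows worth a human look; treat every suggested pair as a proposal to be confirmed, never as a final mapping.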

For the long-format-with-dimensions sub-case specifically (e.g. one row per (country, year, <dim1>, <dim2>)), use the modern OWID pattern:

Meadow + garden: tb.format(["country", "year", <dim1>, <dim2>, ...], sort_columns=True).
Aggregations: paths.regions.add_aggregates(tb, index_columns=[...full key...], regions=REGIONS, aggregations={...}).
Grapher: pass long tables through unchanged; the framework auto-expands them into per-cell variables.
Metadata: variables are keyed by the long-column name, with <% if <dim> == "X" and <dim2> == "Y" %>...<% endif %> Jinja blocks inside title, description_short, display.name. Grep this repo for tb.format(["country", "year" with more than two index entries to find current reference examples.

Workflow orchestration

Initial setup
- Check whether workbench/<short_name>/progress.md exists to determine whether you are continuing an existing update
- If starting fresh: delete workbench/<short_name> directory if it exists
- Create fresh workbench/<short_name> directory for artifacts
1) Run ETL update command (etl-update subagent)
- Inputs: <namespace>/<old_version>/<short_name> plus any required flags
- CRITICAL: Run etl update ONCE for the full step URI (e.g., data://garden/namespace/old_version/short_name). Do NOT run it separately per channel (snapshot, meadow, garden, grapher). Running it once ensures all cross-step DAG dependencies are updated together. Running it per-channel leaves stale version references in dag/main.yml (e.g., garden pointing to old meadow version).
- Perform help check, dry run, approval, then real execution; capture summary for later PR notes
- After running, always verify dag/main.yml: grep for the old version and confirm all internal references between the new steps point to the new version (e.g., garden depends on new meadow, not old meadow).
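
The DAG verification above can be approximated with a small script. This sketch assumes the DAG file lists each step as a key ending in `:` followed by `- <dependency>` lines; `stale_dependencies` is a hypothetical helper, and a plain grep for the old version remains the authoritative check.

```python
def stale_dependencies(
    dag_text: str, namespace: str, old_version: str, new_version: str
) -> list[tuple[int, str, str]]:
    """Find dependencies of new-version steps that still point at the old version.

    Returns (line_number, step, dependency) tuples.
    """
    new_marker = f"/{namespace}/{new_version}/"
    old_marker = f"/{namespace}/{old_version}/"
    current_step = None
    hits = []
    for line_no, raw in enumerate(dag_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("- "):
            # A dependency line: flag it if a new step depends on an old one.
            if current_step and new_marker in current_step and old_marker in line:
                hits.append((line_no, current_step, line[2:].strip()))
        elif line.endswith(":"):
            current_step = line[:-1].strip()
    return hits
```

An empty result for `stale_dependencies(Path("dag/main.yml").read_text(), ...)` suggests the new chain is internally consistent; any hit is the "garden pointing to old meadow" situation described above.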

1b) Check for outdated practices (check-outdated-practices skill)

After etl update creates new step files, run the /check-outdated-practices skill on the newly created files
This catches patterns like if __name__ == "__main__", geo.harmonize_countries(), dest_dir, paths.load_dependency(), etc. that were copied from old versions
Fix any findings before proceeding — this avoids propagating legacy patterns into new versions

1c) Catalog # NOTE: / # TODO: comments in the copied step files (don't resolve yet)

Run rg -n "#\s*(NOTE|TODO|FIXME|HACK|XXX):" snapshots/<namespace>/<new_version>/ etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/.
Filter out generic boilerplate (e.g. # NOTE: To learn more about the fields, hover over their names. at the top of .meta.yml).
Save the remaining actionable items to workbench/<short_name>/notes_to_check.md — one entry per annotation, recording file path, line number, which step it lives in (meadow/garden/grapher), and what the workaround does.
Don't act on them yet. Resolution requires fresh data and happens after each step's run — see step 6a.
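
A sketch of the cataloguing in step 1c, written in Python so the boilerplate filter and the per-entry fields are explicit. It reuses the same regular expression as the `rg` command; `catalog_annotations` and the entry layout are assumptions for illustration, and the "what the workaround does" field still has to be filled in by reading the code.

```python
import re

ANNOTATION = re.compile(r"#\s*(NOTE|TODO|FIXME|HACK|XXX):\s*(.*)")
# Generic boilerplate to ignore (the example given in this step).
BOILERPLATE = ("To learn more about the fields, hover over their names.",)


def catalog_annotations(path: str, text: str) -> list[dict]:
    """Return one entry per actionable annotation found in `text`."""
    # Infer the channel from the path; anything else is treated as a snapshot file.
    channel = next(
        (c for c in ("meadow", "garden", "grapher") if f"/{c}/" in path), "snapshot"
    )
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if not match:
            continue
        tag, body = match.group(1), match.group(2).strip()
        if any(phrase in body for phrase in BOILERPLATE):
            continue
        entries.append(
            {"file": path, "line": line_no, "channel": channel, "tag": tag, "text": body}
        )
    return entries
```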

1d) Detect sanity-check logic in the copied step files

Sanity checks live in two different forms; detect both:

Function form — def sanity_check… / sanity_check…( call sites. Often gated by a module-level boolean flag (DEBUG, SHOW_SANITY_CHECK_LOGS, LONG_FORMAT) that defaults to False to keep normal runs quiet. Examples: etl/steps/data/garden/wb/.../world_bank_pip.py (SHOW_SANITY_CHECK_LOGS), etl/steps/data/garden/wid/.../world_inequality_database.py (DEBUG + LONG_FORMAT), etl/steps/data/garden/lis/.../luxembourg_income_study.py (no flag; prints unconditionally via tabulate).
Inline comment form — # Sanity check / # Sanity checks / # sanity check marking an inline assertion block that isn't wrapped in a dedicated function. Very common: etl/steps/data/garden/emdat/.../natural_disasters.py, etl/steps/data/garden/emissions/.../national_contributions.py, etl/steps/data/garden/irena/.../renewable_capacity_statistics.py. These usually have no log flag — the block simply runs on every step execution and either passes or raises.

Run a combined sweep:

```
rg -n -i "def sanity_check|sanity_check\w*\(|#\s*sanity check" \
    snapshots/<namespace>/<new_version>/ \
    etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/
```

Append a "Sanity checks" section to workbench/<short_name>/notes_to_check.md listing each hit — for each, record: file path + line number, which form (function vs. inline comment), the name of any log-control flag (function form only), and a one-line description of what's being asserted (read the surrounding 5–10 lines).

Don't act yet — the review happens in step 5b once the garden step has been run on the new data.
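
For the notes entry, each hit needs its form and, for the function form, a candidate log-control flag. The sketch below classifies hits in a single file. It is a heuristic: the call-site pattern also accepts suffixed names such as `sanity_checks(`, and any module-level `UPPER_CASE = True/False` assignment is reported as a candidate flag, which a human still has to confirm. `find_sanity_checks` is a hypothetical helper.

```python
import re

FUNCTION_FORM = re.compile(r"def sanity_check|sanity_check\w*\(", re.IGNORECASE)
INLINE_FORM = re.compile(r"#\s*sanity check", re.IGNORECASE)
# Module-level boolean constants, e.g. DEBUG = False (candidates only).
BOOL_FLAG = re.compile(r"^([A-Z][A-Z0-9_]*)\s*=\s*(True|False)\b", re.MULTILINE)


def find_sanity_checks(source: str) -> dict:
    """Classify sanity-check hits by form and list candidate log-control flags."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if INLINE_FORM.search(line):
            hits.append((line_no, "inline comment"))
        elif FUNCTION_FORM.search(line):
            hits.append((line_no, "function"))
    flags = [match.group(1) for match in BOOL_FLAG.finditer(source)]
    return {"hits": hits, "candidate_flags": flags}
```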

2) Create PR and integrate update via subagent (etl-pr)
- Inputs: <namespace>/<old_version>/<short_name>
- Create or reuse draft PR, set up work branch, and incorporate the ETL update outputs
3) Snapshot run & compare (snapshot-runner subagent)
- Inputs: <namespace>/<new_version>/<short_name> and <old_version>
4) Meadow step repair/verify (step-fixer subagent, channel=meadow)
- Run, fix, re-run; produce diffs
- Save diffs and summaries
5) Garden step repair/verify (step-fixer subagent, channel=garden)
- Run, fix, re-run; produce diffs
- Save diffs and summaries

5b) Review sanity-checks output (only if step 1d catalogued any)

Handling depends on the form catalogued in step 1d.

Function form with a log-control flag (e.g. SHOW_SANITY_CHECK_LOGS, DEBUG):

Flip the flag to True at the top of the garden step file.

Re-run the garden step, capturing output:

```
.venv/bin/etlr data://garden/<namespace>/<new_version>/<short_name> --private --force --only \
    > workbench/<short_name>/sanity_checks.log 2>&1
```

Review the log: scan for AssertionError, error, warning, dropped, outliers flagged by country/year, unexpected totals. Surface actionable findings in the PR description under a "Sanity-check findings" collapsed section.
Revert the flag to its original value (usually False) before committing. Verify with git diff that the garden file has no unintended changes.
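
The log review can start from a keyword scan like the one below. The keyword list is taken from the review step above; `flagged_lines` is a hypothetical helper, and a match only marks a line as worth reading, not as a confirmed problem.

```python
import re

# Keywords named in the review step; matching is case-insensitive.
KEYWORDS = re.compile(r"AssertionError|error|warning|dropped|outlier", re.IGNORECASE)


def flagged_lines(log_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every log line that mentions a keyword."""
    return [
        (line_no, line.rstrip())
        for line_no, line in enumerate(log_text.splitlines(), start=1)
        if KEYWORDS.search(line)
    ]
```

Run it on the contents of workbench/<short_name>/sanity_checks.log and read the flagged lines in context before deciding what belongs in the PR description.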

Function form with no flag, or inline # Sanity check(s) comment blocks:

Read each catalogued block (pull 5–15 lines of context around the hit) to understand what invariant is being tested.
Important: a sanity check can enforce its finding either by raising (assert, raise) or by logging (paths.log.warning, .critical, even .fatal). Logging variants do NOT fail the step — so "step 5 passed" is not proof that every invariant held. If the block uses logging, re-run the step and scan stdout/stderr for the relevant keywords; don't trust the exit code alone.
For non-trivial invariants (monotonicity, totals, bounds), also spot-check qualitatively against the fresh garden output via a short .venv/bin/python snippet.
Record any anomalies under "Sanity-check findings" in the PR description. No log artifact to keep here since the step's own output is the evidence.

In either form: if sanity_checks raise AssertionError on the new data, stop and decide with the user whether the assertion needs a threshold bump, whether upstream data genuinely broke, or whether the invariant is obsolete. If the check only logs, treat a new/expanding set of warnings the same way — they're the signal the sanity check was written to produce.

Watch for silent-delete patterns. Some sanity_checks functions also mutate the table — e.g. world_bank_pip's sanity_checks drops rows that fail invariants and reports the count via the log-control flag. With the flag off the deletions still happen; the reviewer just never learns which rows disappeared. When reading a sanity_checks function, scan for drop, filter, tb = tb[...] — anything that removes rows — and list every deletion in the PR body, not just the warning counts. If the deletion seems newly applicable to upstream fixes (e.g. the row should no longer be anomalous in the new release), that's a candidate for removing the workaround entirely.
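
A rough way to surface candidate row-removal lines inside a sanity-check function is a pattern scan like this one. It is a heuristic sketch: it looks for `.drop…(`, `.filter(`, and self-reassignment with indexing (`tb = tb[...]`), so it will also flag harmless cases such as dropping a temporary column. `row_removal_candidates` is a hypothetical helper.

```python
import re

ROW_REMOVAL = re.compile(
    r"\.drop\w*\("               # .drop(...), .dropna(...), .drop_duplicates(...)
    r"|\.filter\("               # .filter(...)
    r"|\b(\w+)\s*=\s*\1\s*\["    # tb = tb[...]
)


def row_removal_candidates(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for lines that may remove rows."""
    return [
        (line_no, line.strip())
        for line_no, line in enumerate(source.splitlines(), start=1)
        if ROW_REMOVAL.search(line)
    ]
```

Every flagged line still needs a human read to decide whether it deletes rows, and which ones, before it is listed in the PR body.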

5c) Country harmonization audit

Run after the garden step completes (and after 5b if it ran). Verifies that the country entities reaching the garden output are canonical, and that the mappings/exclusions consumed by paths.regions.harmonize_names(...) are well-formed. Output: workbench/<short_name>/harmonization_audit.md.

Modern API. Garden steps should be calling paths.regions.harmonize_names(tb, country_col=..., countries_file=..., excluded_countries_file=...), the wrapper in etl/data_helpers/geo.py:1874. If you find a step still using the deprecated geo.harmonize_countries(...) directly, step 1b's /check-outdated-practices should already have flagged it; treat that as a separate cleanup. The audit below is API-agnostic: both call sites end up emitting the same three warning strings. Some garden steps don't use the harmonizer at all and instead assign country inline in Python (no .countries.json involved); for those, the JSON checks below have nothing to look at. The garden-output check (check #5 of this audit, not workflow step 5) is what catches non-canonical entities, so always run it.

Source of truth. Canonical names come from two datasets, both consulted by the harmonizer:

data/garden/regions/2023-01-01/regions — countries, continents, and OWID-defined aggregates. The runtime authority is paths.regions.tb_regions["name"]. This is built from etl/steps/data/garden/regions/2023-01-01/regions.yml plus a merge with regions.codes.csv and field defaults — don't parse the YAML in isolation or you'll miss the legacy entries and produce false positives.
data/garden/wb/<latest>/income_groups — the four World Bank income-group aggregates (High-income countries, Upper-middle-income countries, Lower-middle-income countries, Low-income countries). OWID treats the latest version of this dataset as the official one, so the audit must resolve the version dynamically (don't pin a date — it goes stale when WB publishes a refresh). The names live in the classification column of the income_groups_latest table.

The audit's "canonical" set is the union of these two. A .countries.json entry looks like "Source name": "Target name" — the audit checks that every target name (the value the source gets harmonized to) appears in either dataset. Anything else is flagged.

1. Capture a fresh garden log:

```
.venv/bin/etlr data://garden/<namespace>/<new_version>/<short_name> --private --force --only \
    > workbench/<short_name>/harmonization.log 2>&1
```

2. Scan the log for the three harmonization warnings. These are emitted by etl/data_helpers/geo.py (excluded list) and lib/datautils/owid/datautils/dataframes.py (mapping warnings); the wording is stable:
```
rg -n "missing values in mapping\.|unused values in mapping\.|Unknown country names in excluded countries file:" \
    workbench/<short_name>/harmonization.log
```
For each warning, the entity list follows on subsequent lines (because harmonize_countries() is called with show_full_warning=True by default). Capture them.

3. Validate .countries.json target names against canonical regions + income groups. Each entry maps a source name (key) to a target / harmonized name (value); this check looks at the values. For each garden step in this update:

```
import json
from pathlib import Path
from owid.catalog import Dataset

# Resolve the canonical regions dataset dynamically (latest built version).
# Don't pin a date: when the regions step version advances, a hard-coded path
# would validate against a stale catalog and flag valid targets as non-canonical.
regions_dirs = sorted(Path("data/garden/regions").glob("*/regions"))
if not regions_dirs:
    raise RuntimeError(
        "No data/garden/regions/<version>/regions built locally; the audit can't "
        "run without the canonical regions catalog. Build it first with "
        "`.venv/bin/etlr data://garden/regions/<latest>/regions --private`."
    )
tb_regions = Dataset(str(regions_dirs[-1]))["regions"]
canonical_regions = set(tb_regions["name"].dropna().astype(str))

# Add OWID's official income-group aggregates to the canonical set, if available.
# OWID treats the latest income_groups version as official. This artifact is
# often not built locally during a non-income-groups dataset refresh, so degrade
# gracefully (warn and skip) rather than aborting the audit.
ig_dirs = sorted(Path("data/garden/wb").glob("*/income_groups"))
if ig_dirs:
    ds_ig = Dataset(str(ig_dirs[-1]))
    canonical_income = set(ds_ig["income_groups_latest"]["classification"].dropna().astype(str).unique())
else:
    print(
        "[WARN] No data/garden/wb/<version>/income_groups built locally; "
        "skipping income-group enrichment. The four WB income-group aggregates "
        "(High/Upper-middle/Lower-middle/Low-income countries) may surface as "
        "'not in canonical' until you build that dataset."
    )
    canonical_income = set()

canonical = canonical_regions | canonical_income

mapping = json.loads(Path("etl/steps/data/garden/<namespace>/<new_version>/<short_name>.countries.json").read_text())
not_in_canonical = sorted({v for v in mapping.values() if v and v not in canonical})
print("Targets not in OWID's canonical regions or income groups:", not_in_canonical)
```

A non-empty not_in_canonical list means the mapping points at entities that aren't registered in either the regions catalog or the income-groups dataset. This isn't automatically a bug — it's a heads-up. Stop and decide with the user before proceeding — same pattern as the global "Checkpoints — when to pause" section at the top of this skill. Common causes (in order from "fix" to "accept"): typo, retired alias used as canonical, casing/whitespace mismatch, or a legitimately custom aggregate the source defines that OWID has no equivalent for (e.g. ILO's " (ILO)"-suffixed regions, World Bank's " (WB)"-suffixed sub-Saharan splits, BRICS, G7, G20). For typos/casing — fix the JSON. For legitimately custom aggregates — accept and note in the PR description that those entities live outside the canonical system and won't merge with population/regions infrastructure. For a real new historical region — add an entry to regions.yml in a separate PR.

4. Audit .excluded_countries.json. The file is optional; skip if it doesn't exist:

```
# Reuses json, Path, tb_regions and canonical from the previous check.
excluded_path = Path("etl/steps/data/garden/<namespace>/<new_version>/<short_name>.excluded_countries.json")
if excluded_path.exists():
    excluded = json.loads(excluded_path.read_text())
    suspicious_canonical = sorted(set(excluded) & canonical)
    # Also surface continents and aggregates separately for review
    aggregates = set(tb_regions[tb_regions["region_type"].isin(["continent", "aggregate"])]["name"].dropna().astype(str))
    suspicious_aggregates = sorted(set(excluded) & aggregates)
    print("Excluded entries that ARE canonical regions:", suspicious_canonical)
    print("Excluded entries that are continents/aggregates:", suspicious_aggregates)
    print("Full excluded list for review:", sorted(excluded))
```

suspicious_canonical is the actionable signal: each entry is a known country/region that we are dropping. Sometimes this is intentional (e.g. dropping "World" rows because the source double-counts them) — surface, don't auto-fix. Pause and ask the user if the list is non-empty. The full list is dumped so the LLM can also eyeball it for entities that aren't in canonical but look like real countries (typos, alternative names) we should be mapping rather than dropping.

5. Audit garden output entities. Always run this check, regardless of whether .countries.json exists or is populated: JSON mappings describe inputs to the harmonizer, but the entities that actually reach Grapher are whatever sits in the country column/index of the built garden tables. Inline country assignments (e.g. hardcoded tb["country"] = "England and Wales") and post-harmonization mutations both bypass the JSON check entirely; this is the only step that catches them.
```
from pathlib import Path

from owid.catalog import Dataset

# Reuses `canonical` (regions + income groups) built in the .countries.json check.
garden_dir = Path("data/garden/<namespace>/<new_version>/<short_name>")
ds_garden = Dataset(str(garden_dir))

entities: set[str] = set()
for tname in ds_garden.table_names:
    tb = ds_garden[tname]
    # `country` can live in the index (after .format()) or as a regular column.
    if "country" in tb.index.names:
        entities.update(tb.index.get_level_values("country").dropna().astype(str).unique())
    elif "country" in tb.columns:
        entities.update(tb["country"].dropna().astype(str).unique())
    # tables with no country column are silently skipped (e.g. reference tables)

output_not_in_canonical = sorted(entities - canonical)
print("Garden output entities not in OWID's canonical regions or income groups:",
      output_not_in_canonical)
```
Same triage rules as the JSON-targets check (Python check #3): typo / casing / alias / legitimately custom aggregate. A non-empty list means at least one entity that ships to Grapher isn't registered in either the regions catalog or the income-groups dataset. Stop and decide with the user before proceeding. Common fixes: typo or casing → patch the inline assignment (or .countries.json, whichever is the source) so the value matches the canonical name; alias → switch to the canonical name; legitimate custom aggregate → accept and note in the PR description that the entity lives outside the canonical system.
6. Write findings to workbench/<short_name>/harmonization_audit.md with six sections, populated only when non-empty. Each section must list every flagged entity, not just a count: counts alone aren't actionable, and the user (or you) needs to read the actual names to judge whether each is intentional. For long lists (>20 entries) group by pattern when the grouping is obvious (e.g. ILO's " (ILO)"-suffixed regions vs. international orgs vs. derived "World ..." aggregates) so the reviewer can scan categories instead of one flat list. Sections:
- ## Missing in mapping — countries in source data not in .countries.json (from log warning #1) — list each missing source name
- ## Unused mappings — .countries.json entries the data never used (warning #2) — list each unused source→target pair
- ## Unknown excluded entries — .excluded_countries.json entries not present in source data (warning #3) — list each
- ## Targets not in OWID's canonical regions or income groups — target names from .countries.json that aren't registered in either dataset (Python check #3) — list each target name and the source names that map to it
- ## Excluded entries matching canonical regions — possible over-exclusion (Python check #4) — list each
- ## Garden output entities not in OWID's canonical regions or income groups — distinct country values found in the built garden tables that aren't in canonical regions or income groups (Python check #5) — list each entity
7. Surface in PR. If any section was populated, add a collapsed "Harmonization audit" section to the PR description (after the per-step sections, before the Slack announcement) with the same listings, not just a summary. Empty sections can be omitted.
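
The "populated only when non-empty, list every entity" rule for harmonization_audit.md can be sketched like this. `render_audit` is a hypothetical helper, and the bullet layout is an assumption; grouping long lists by pattern is left to the reviewer.

```python
def render_audit(sections: dict[str, list[str]]) -> str:
    """Render non-empty sections as '## <title>' blocks listing every entity."""
    blocks = []
    for title, entities in sections.items():
        if not entities:
            continue  # empty sections are omitted entirely
        lines = [f"## {title}", ""] + [f"- {entity}" for entity in entities]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```

Passing the six section titles in the order listed above, each with its full entity list, yields a file that can be pasted into the PR description unchanged.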

When you report progress to the user during the workflow, never just give a count — always include the list (or grouped categories) so they can judge in one glance.

Checkpoint summary:

"Targets not in OWID's canonical regions or income groups" or "Garden output entities not in OWID's canonical regions or income groups" or "Missing in mapping" non-empty ⇒ stop, decide with user.
"Excluded entries matching canonical regions" non-empty ⇒ stop, ask whether each exclusion is intentional.
"Unused mappings" or "Unknown excluded entries" non-empty ⇒ surface in PR description; not a blocker.

6) Grapher step run/verify (step-fixer subagent, channel=grapher, add --grapher)
- Skip diff

6a) Re-evaluate # NOTE: / # TODO: items from step 1c against fresh data

Now that meadow, garden, and grapher have run on the new data, go back to workbench/<short_name>/notes_to_check.md and decide each item's fate. For each entry:

Identify what the workaround does (read the surrounding code).
Load the affected step's output with owid.catalog.Dataset (or inspect the raw snapshot) and compare corrected vs. uncorrected values. Cross-check the producer's release notes / changelog if available.
If the upstream issue is fixed → delete the workaround and its # NOTE: / # TODO: comments in the same commit, then re-run the affected step (use --force --only, add --grapher for grapher) so downstream artifacts pick up the change.
If the workaround is still needed → leave it and add a one-line status under "Phase 2 TODOs" in the PR description (e.g. "Sierra Leone ×1000 correction still required — raw value in the 2026 file is still ~1/1000 of plausible").
If you're uncertain → keep it, flag it in the PR description, and ask the user.

Do this before step 6b (metadata checks) so any re-runs triggered by comment-removal happen before the metadata sweep, not after.

6b) Metadata quality checks (run after all ETL steps are built)

Run all four checks on the newly built garden and grapher datasets so every issue surfaces together. Each skill writes results to the terminal; fix what comes up before moving on.

Typos — /check-metadata-typos scoped to the current step. Run on each of the new .meta.yml files (garden first, then grapher). Accept or skip each suggested fix.
Jinja spacing — /check-metadata-spacing on the built garden and grapher datasets. Catches template artifacts like doubled spaces or stray newlines that only appear after Jinja rendering.
Style guide — /check-metadata-style on the grapher step. Audits user-facing fields (title, subtitle, description_short, display.name, presentation.*) against OWID's Writing and Style Guide. Rules live in .claude/skills/check-metadata-style/STYLE_GUIDE.md, so no Notion access is needed — but if the guide looks out of date, refresh that file from Notion in a separate PR.
Clarity for a general audience — read every user-facing field with non-specialist eyes. The other three skills enforce structure and style; this one judges whether the text is understandable.

Clarity checklist (do manually, no skill yet)

OWID readers are not domain experts. Walk each indicator's user-facing fields and flag anything that requires inside knowledge to parse.

| Field | Clarity check |
| --- | --- |
| `title` / `presentation.title_public` | A non-specialist should know what the indicator measures from the title alone. Expand acronyms unless universally known (skip GDP; expand GWIS, MFI, SDG, IHME). Don't cram units into the title. |
| `description_short` | One or two short sentences: what the metric is and what it covers. No jargon without a gloss. Active voice. The chart subtitle is short by design: no run-on or stacked clauses. |
| `description_key` | Each bullet should land a distinct, useful fact. Skip filler ("this dataset is widely used"); prefer substantive caveats (coverage gaps, methodology limits, what counts/doesn't count). |
| `display.name` | Short legend label. Reads naturally on a chart axis/legend; doesn't restate the title. |
| `presentation.grapher_config.note` | Concise footnote, ≤1 sentence ideally. |

Flag and rewrite when you find:

Acronyms or technical terms that aren't expanded the first time they appear
Sentences that only make sense if you already know the data source
Quantitative claims with no unit context (e.g. "burned area" without "in hectares" surfacing somewhere in the user-facing text)
Inconsistent terminology between indicators in the same dataset (e.g. "wildfires" in one, "vegetation fires" in another)
Domain phrases that have a plain-English equivalent (e.g. "anthropogenic emissions" → "human-caused emissions")

When a phrasing is ambiguous, propose a concrete rewrite — don't just flag it.
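
Part of the acronym check can be pre-screened mechanically. The sketch below flags all-caps tokens that never appear in a "spelled-out form (ACRONYM)" pattern anywhere in the text. It is a crude heuristic with an assumed allowlist containing only GDP (per the table above); it does not verify that the expansion comes at the first mention, and it cannot judge clarity. `unexpanded_acronyms` is a hypothetical helper.

```python
import re


def unexpanded_acronyms(text: str, known: frozenset = frozenset({"GDP"})) -> list[str]:
    """Return acronyms (2-6 capital letters) that are never introduced as '... (ACR)'."""
    found = []
    for match in re.finditer(r"\b[A-Z]{2,6}\b", text):
        acronym = match.group()
        if acronym in known or acronym in found:
            continue
        # Treated as expanded if it appears in parentheses right after some words.
        if re.search(r"\w[\w\s-]*\(" + re.escape(acronym) + r"\)", text):
            continue
        found.append(acronym)
    return found
```

Anything it returns is a prompt to re-read the field with non-specialist eyes and propose a concrete rewrite.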

If any skill rewrites a .meta.yml, re-run the affected step so the built catalog reflects the edits. Add --grapher when the affected step is on the grapher channel — without it the local catalog is updated but staging stays stale, so the step 7 indicator upgrade sees the old text.

# garden / meadow:
.venv/bin/etlr <channel>/<namespace>/<new_version>/<short_name> --private --force --only
# grapher:
.venv/bin/etlr grapher/<namespace>/<new_version>/<short_name> --grapher --private --force --only

Then re-run the relevant check to confirm zero remaining violations.

6c) Indicator metadata coverage, dataset block, and link verification

The other quality checks catch content issues; this step catches missing fields and broken URLs before they reach review.

Mandatory fields per indicator. For every indicator in the garden .meta.yml, confirm these are set (either on definitions.common or per-indicator):

| Field | Notes |
| --- | --- |
| `title` | Per-indicator |
| `unit` | Common is fine |
| `short_unit` | Common is fine |
| `description_short` | Per-indicator |
| `description_key` | At least one bullet; usually common |
| `processing_level` | `minor` or `major` |
| `presentation.topic_tags` | At least one tag |
| `display.numDecimalPlaces` | Common is fine |
| `display.tolerance` | Common is fine; chart tolerance for missing years |
| `display.name` | Per-indicator; required for legend labels |
| `presentation.attribution_short` | Set explicitly: it does NOT inherit from the origin's `attribution_short` (verified: MySQL `variables.attributionShort` stays `NULL` if it's only on the origin). Place under `definitions.common.presentation` for the common case. |

Conditional: if processing_level: major, every indicator with that level MUST also have description_processing.

Not mandatory (skip if you don't need them): presentation.title_public, presentation.title_variant, presentation.attribution.
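The coverage check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the ETL tooling: it assumes the `.meta.yml` has already been parsed into plain dicts, and `deep_merge`, `lookup`, and `missing_fields` are hypothetical helper names. The merge rule (per-indicator values overlay `definitions.common`) is also an assumption about how inheritance behaves.

```python
from typing import Any

# Dotted paths taken from the mandatory-fields table above.
MANDATORY = [
    "title", "unit", "short_unit", "description_short", "description_key",
    "processing_level", "presentation.topic_tags", "display.numDecimalPlaces",
    "display.tolerance", "display.name", "presentation.attribution_short",
]


def deep_merge(common: dict, own: dict) -> dict:
    """Overlay per-indicator values on common ones (assumed merge rule)."""
    merged = dict(common)
    for key, value in own.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def lookup(meta: dict, dotted: str) -> Any:
    """Return the value at a dotted path, or None if any level is missing."""
    node: Any = meta
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_fields(common: dict, indicator: dict) -> list[str]:
    """List mandatory fields that are absent or empty after merging."""
    meta = deep_merge(common, indicator)
    # 0 is a valid value (e.g. numDecimalPlaces), so only None/""/[] count as empty.
    missing = [f for f in MANDATORY if lookup(meta, f) in (None, "", [])]
    # Conditional rule: processing_level "major" needs description_processing.
    if meta.get("processing_level") == "major" and not meta.get("description_processing"):
        missing.append("description_processing")
    return missing
```

Running `missing_fields` once per indicator gives a list to work through before review; an empty list means the indicator passes the coverage check under these assumptions.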

Dataset block. Garden .meta.yml MUST include update_period_days:

dataset:
  update_period_days: <N>

This controls the auto-update cadence. Even when the rest of the dataset: block is empty, never strip update_period_days — leave the block in place with just that field.

Link verification. Run a HEAD request on every URL in the new .dvc and .meta.yml files (all channels — meadow .meta.yml files matter when they exist). Anything non-2xx is a signal, not a guaranteed break — always double-check before acting:

for url in $(rg --no-filename -No "https?://[^\"' ]+" snapshots/<namespace>/<new_version>/ etl/steps/data/{meadow,garden,grapher}/<namespace>/<new_version>/ \
    | sed -E 's/[).,;:>]+$//' \
    | sort -u); do
    printf "%s  %s\n" "$(curl -sI -L -o /dev/null -w '%{http_code}' --max-time 15 -A 'Mozilla/5.0' "$url")" "$url"
done

The --no-filename flag prevents rg from prepending path: to each match (otherwise the for-loop tries to curl path:url and every check returns 000). -A 'Mozilla/5.0' sometimes coaxes a real response out of Cloudflare-fronted hosts, but it doesn't always work — see the next note.

curl non-2xx ≠ broken. Cloudflare-fronted sites (notably ourworldindata.org) can return 404 to curl on URLs that work fine in a browser, depending on edge-node routing, IP geolocation, and cached state. Before treating a 4xx as a real failure:

1. Re-check with WebFetch (the built-in tool). It uses a different code path and a Mozilla/5.0 UA that Cloudflare usually accepts. A 200 with a coherent page body is authoritative; trust it over curl.
2. If WebFetch also fails, sanity-check the Wayback Machine: https://web.archive.org/web/<year>/<url>. A recent successful snapshot means the URL is reachable on the public internet and your local route is the problem.
3. Only act on a true failure (both WebFetch and Wayback unable to reach the URL), and even then flag it and ask the user rather than silently rewriting an external link in metadata. Replacing a working link with a "safer" alternative because of a curl false-positive is worse than leaving the original. Apply the same restraint here as the global "Checkpoints — when to pause" section.

Fix any genuinely-non-2xx hit on url_main, url_download, license.url, or URLs referenced from description / description_key before continuing. The sed expression strips trailing Markdown and punctuation characters (`)`, `.`, `,`, `;`, `:`, `>`) so that URLs inside [text](url) aren't reported as broken because of a stray closing paren.
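To see what the extraction and trailing-character cleanup do to a given line, the rg pattern and the sed character class can be mirrored in Python. This is only an illustrative sketch (`extract_urls` is a hypothetical name); the shell loop above remains the actual check.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\"' ]+")  # same pattern as the rg call
TRAILING = re.compile(r"[).,;:>]+$")            # same character class as the sed call


def extract_urls(text: str) -> list[str]:
    """Return the sorted, de-duplicated URLs found in text, one line at a time."""
    found = set()
    for line in text.splitlines():
        for match in URL_PATTERN.findall(line):
            found.add(TRAILING.sub("", match))
    return sorted(found)


print(extract_urls('See [docs](https://example.org/a).\nurl_main: "https://example.org/b"'))
```

One side effect worth knowing: a URL that legitimately ends in `)` (for example a Wikipedia title with parentheses) also loses its last character, which is one more reason to double-check a non-2xx result before acting on it.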

Verification. After editing, re-run the affected step (with --grapher if grapher) so the catalog reflects the changes. Then confirm presentation.attribution_short actually landed:

from owid.catalog import Dataset
ds = Dataset("data/grapher/<ns>/<v>/<short_name>")
tb = ds["<table>"]
print(tb["<col>"].metadata.presentation.attribution_short)  # must NOT be None

Or after the staging upload:

make query SQL="SELECT shortName, attributionShort FROM variables WHERE catalogPath LIKE '%<ns>/<v>/<short_name>%'"

Indicator upgrade (optional, staging only)

First upload the new grapher dataset to the staging DB (required before the upgrader can detect it):

STAGING=<branch> .venv/bin/etlr data://grapher/<namespace>/<new_version>/<short_name> --grapher --private

Then run the automatic upgrader:

STAGING=<branch> .venv/bin/etl indicator-upgrade auto

CRITICAL: After the upgrader finishes, always verify it actually worked by querying staging:

mysql -h "staging-site-<branch>" -u owid --port 3306 -D owid -e "SELECT COUNT(*) FROM chart_dimensions cd JOIN variables v ON cd.variableId = v.id WHERE v.catalogPath LIKE '%<namespace>/<new_version>%'"

If the count is 0, the upgrade did not run — re-run it.

Update context for public announcement

Maintain workbench/<short_name>/update-context.yml as the canonical record of facts discovered during the update. Do not wait until the end if a fact is already known; append/update as each step completes.

At minimum, record:

dataset:
  namespace: <namespace>
  old_version: <old_version>
  new_version: <new_version>
  short_name: <short_name>
  title: <public dataset title, if known>
  producer: <producer, if known>
source:
  release_date: <snapshot origin date_published, if known>
  next_release: <best-effort, or null>
  url_main: <source page, if known>
  citation_full: <citation, if known>
coverage:
  year_min: <garden min year>
  year_max: <garden max year>
  countries: <distinct countries/entities>
  includes_regions: <true/false>
  sparse_recent_year_note: <note, or null>
charts:
  published_count: <published chart count>
  size_qualifier: <handful|moderate|large|massive>
  selected_views:
    - title: <chart title>
      slug: <chart slug>
      rationale: <why this represents the dataset>
update_summary:
  snapshot_diff: <short summary or artifact path>
  meadow_diff: <short summary or artifact path>
  garden_diff: <short summary or artifact path>
  notable_changes: []
  sanity_check_findings: []
  resolved_workarounds: []
editorial_context:
  why_it_matters_snippets: []
  caveat_snippets: []
  interesting_update_snippets: []

Query the staging DB for published charts using the new dataset (filter on c.publishedAt IS NOT NULL). Draft/unlisted charts must not be counted in the announcement:

SELECT c.id, cc.slug, cc.full->>'$.title' as title, cc.full->>'$.type' as type, cc.full->>'$.hasMapTab' as hasMapTab
FROM charts c
JOIN chart_configs cc ON cc.id = c.configId
JOIN chart_dimensions cd ON cd.chartId = c.id
JOIN variables v ON cd.variableId = v.id
WHERE v.catalogPath LIKE '%<namespace>/<new_version>%'
  AND c.publishedAt IS NOT NULL
GROUP BY c.id

Map the published count to size_qualifier: 1–9 = handful, 10–49 = moderate, 50–199 = large, 200+ = massive.
Pick 1–3 selected_views using these criteria (in order of preference):
- Map views — immediately visual, readers can find their own country
- Charts with punchy, standalone headlines — titles that make a clear claim work best for social sharing
- Global trend charts (StackedArea / World) — show the big picture over time
- Skip: population-weighted variants (harder to read quickly), within-regime breakdowns (too niche), country-specific views
Add snippets for the editorial prompts from source metadata, garden/grapher metadata, resolved sanity-check/workaround notes, and non-routine PR changes. Keep these as snippets/facts, not polished Slack prose.
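The count-to-qualifier mapping above is small enough to write down directly. A minimal sketch follows; the function name is hypothetical, and raising on a count below 1 is an assumption, since the mapping does not say what to record when no charts are published.

```python
def size_qualifier(published_count: int) -> str:
    """Map a published chart count to the size_qualifier value."""
    if published_count < 1:
        # Assumption: the mapping starts at 1, so 0 is treated as an error here.
        raise ValueError("no published charts; the mapping starts at 1")
    if published_count <= 9:
        return "handful"
    if published_count <= 49:
        return "moderate"
    if published_count <= 199:
        return "large"
    return "massive"
```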

Slack announcement & PR update
- Run the data-updates-comms skill with workbench/<short_name>/update-context.yml as input. data-updates-comms is the canonical owner of the Slack form wording, copy-paste format, editorial framing, search URL, and any standalone fallback gathering. Do not duplicate that rendering logic here.
- Save the rendered draft to workbench/<short_name>/slack-announcement.md.
- If data-updates-comms reports missing mechanical fields, gather them, update update-context.yml, and re-render rather than inventing values. Ask the user if a missing field requires judgment.
- Add the announcement to the PR description as a collapsed <details> section titled "Slack Announcement", with the file content embedded inside a triple-backtick markdown fence.
- Post @codex review as a separate PR comment (not in the PR description) to trigger an automated code review. Use:
```
gh pr comment <pr_number> --body "@codex review"
```
- Tell the user, with a markdown link to the saved file so they can click through to open it: "Slack announcement drafted at [workbench/<short_name>/slack-announcement.md](workbench/<short_name>/slack-announcement.md) and added to the PR description. Please review and post it to #data-updates-comms." Always render the path as a markdown link […](…), not as inline-code — the chat UI renders it as clickable that way.

9b) Data update post (for OWID /latest)

Draft the short reader-facing post that gets published on https://ourworldindata.org/latest. The team drafts these in Google Docs in the shared /Data updates Drive folder (https://drive.google.com/drive/folders/1oL0uLHKI6f2qi1rJA6-qFFRYEBw_-rfm), and OWID's CMS ingests the doc into the published feed.

The skill's job is to produce paste-ready Google Doc content in the exact CMS format the team uses (frontmatter title / excerpt / type / authors / kicker → [+body] marker → body prose with inline markdown links → {.cta} block → {.image} block → [] end marker). Don't invent your own format: every published post in the Drive folder follows the same shape.

This is separate from the Slack announcement — that one is a 10-field form for the internal channel; this one is a mini-blog-post for OWID readers, and the format is structured for CMS ingestion.

Steps:

Open .claude/skills/update-dataset/data-update-template.md and follow it — the template has the exact paste-ready format plus three worked examples (NVIDIA, H5N1, World Bank PIP) lifted verbatim from the Drive folder.
Use the facts already gathered in workbench/<short_name>/update-context.yml (step 8) — dataset.title, dataset.producer, source.url_main, source.citation_full, coverage.*, charts.published_count, charts.selected_views, and the editorial_context.* snippet lists. Also pull from workbench/<short_name>/slack-announcement.md (step 9 output) — the editorial framing already drafted there is the closest cousin. If a field needed for the post isn't yet in update-context.yml, gather it (snapshot DVC, garden .meta.yml, or url_main via WebFetch) and persist it back to the YAML so the next consumer doesn't re-do the work.
Title shape — a punchy finding/claim, a question, or an action/invitation. Not just the dataset name. See the template's "Field-by-field guidance" for examples and decision logic.
Body — 100–200 words, first-person, conversational. Sample: ATUS ~105, NVIDIA ~140, robots ~110, OECD Government at a Glance ~155, US data centers ~145, UNU-WIDER ~155, World Bank PIP ~190, ozone ~165, mobile money ~180, fertilizers ~170, H5N1 ~135. The body should give a reader a reason to care and at least one concrete number — not "I refreshed our charts".
Inline markdown links throughout the body for the producer's page, methodology pages, and related OWID articles. *italics* for emphasis, sparingly.
CTA URL choice:
- One chart focus ⇒ grapher URL https://ourworldindata.org/grapher/<slug>.
- Multiple charts (default) ⇒ search URL https://ourworldindata.org/search?datasetProducts=<URL-encoded dataset title> — value is the dataset title, resolved with this priority: (a) the dataset.title field in the garden .meta.yml if it's set there (an override), otherwise (b) the meta.origin.title field in the snapshot .dvc. Often includes a parenthetical acronym like Luxembourg Income Study (LIS) or World Bank Poverty and Inequality Platform (PIP). Not the bare producer field.
- Topic has an existing OWID explorer ⇒ https://ourworldindata.org/explorers/<name>.
- Curated topic page exists ⇒ topic URL (e.g. /sdgs).
- Do not use /collection/custom?charts=… URLs.
CTA text — descriptive: "Explore the updated data in our interactive charts" (default), "Explore all of the updated data in our interactive charts" (broad), "Explore the interactive version of this chart" (single chart), "Explore this data going back to YYYY in our interactive chart" (single chart with date depth).
Image filename — YYYY-MM-data-update-<slug>.png (e.g. 2026-04-data-update-h5n1-flu.png). The skill doesn't generate the image; the user adds it to the Doc separately.
Save the draft to workbench/<short_name>/data-update.md.
Add a collapsed <details> section titled "Data update post (for OWID /latest)" to the PR description, placed after the Slack-announcement section, with the file content embedded inside a triple-backtick markdown fence.
Tell the user, with a markdown link to the saved file so they can click through to open it: "Data update post drafted at [workbench/<short_name>/data-update.md](workbench/<short_name>/data-update.md) in the Google Docs CMS format. Please create a new Google Doc in /Data updates, paste the draft, attach the chart screenshot, and share for review." Always render workbench/<short_name>/data-update.md as a markdown link […](…) rather than as a bare path or inline-code path — the chat UI renders it as clickable that way.
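Two of the mechanical pieces in the steps above, the multi-chart search URL and the image filename, can be sketched as follows. Both function names are hypothetical, and encoding spaces as `%20` (rather than `+`) is an assumption; check a published post if the exact encoding matters.

```python
from urllib.parse import quote


def search_cta_url(dataset_title: str) -> str:
    """Build the search CTA URL from the resolved dataset title."""
    # safe="" percent-encodes every reserved character, including "/" and "()".
    return "https://ourworldindata.org/search?datasetProducts=" + quote(dataset_title, safe="")


def image_filename(year: int, month: int, slug: str) -> str:
    """Build the YYYY-MM-data-update-<slug>.png filename."""
    return f"{year:04d}-{month:02d}-data-update-{slug}.png"


print(search_cta_url("Luxembourg Income Study (LIS)"))
print(image_filename(2026, 4, "h5n1-flu"))
```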

Codex review: address comments and resolve threads

Wait ~60 seconds after posting @codex review, then poll for inline review comments:
```
gh api repos/owid/etl/pulls/<pr_number>/comments | python3 -m json.tool
```

Fetch open review thread IDs via GraphQL:

gh api graphql -f query='{ repository(owner:"owid", name:"etl") { pullRequest(number:<pr_number>) { reviewThreads(first:20) { nodes { id isResolved comments(first:1) { nodes { body } } } } } } }'

For each unresolved Codex comment:

If valid: apply the fix, commit, push, then resolve the thread:

gh api graphql -f query='mutation { resolveReviewThread(input:{threadId:"<thread_id>"}) { thread { id isResolved } } }'

If not valid / not applicable: reply explaining why, then resolve the thread:

gh api repos/owid/etl/pulls/<pr_number>/comments/<comment_id>/replies -f body="<explanation>"
gh api graphql -f query='mutation { resolveReviewThread(input:{threadId:"<thread_id>"}) { thread { id isResolved } } }'

If Codex hasn't posted yet after 60 s, wait another 60 s and retry (up to ~5 min total).
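The wait-and-retry rule can be expressed as a small loop. In this sketch `fetch` is a hypothetical callable standing in for whatever runs the `gh api` comments call and returns the parsed list; five attempts at 60 s each approximates the ~5 minute ceiling.

```python
import time
from typing import Callable, Optional


def poll_for_comments(
    fetch: Callable[[], list],
    wait_seconds: float = 60,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list]:
    """Wait, then fetch; repeat until comments appear or attempts run out."""
    for _ in range(max_attempts):
        sleep(wait_seconds)
        comments = fetch()
        if comments:
            return comments
    return None  # nothing posted within the time budget
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a `None` result means it is time to tell the user that no review has arrived rather than to keep waiting silently.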

Committing and pushing

Commit and push incrementally as you go — after each step that produces code changes. Don't wait until the end. Use descriptive commit messages with appropriate emojis (the one auto-prepended by etl pr for the chosen category + 🤖 for AI-written code).

At the end of the workflow, update the PR description with:

- A tracking-issue link as the first line of the Summary, e.g. Tracks: [owid/owid-issues#NNNN](https://github.com/owid/owid-issues/issues/NNNN). Most data updates have a corresponding owid-issues ticket; try to find it by searching the title or <short_name> first, and ask the user for the issue number if you can't locate one rather than skipping the link silently.
- A summary of key changes at the top
- Collapsed sections for each pipeline step (Snapshot, Meadow, Garden, Grapher)
- A collapsed section for the Slack announcement

Downstream dependency check

After completing the update, check if any other datasets depend on the old version of the updated dataset:

rg "<namespace>/<old_version>/<short_name>" dag/ -g "*.yml" | grep -v "^dag/archive"

Filter out the old dataset's own DAG entries (snapshot → meadow → garden → grapher chain). Any remaining references are downstream dependents that still point to the old version.
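The filtering logic can be sketched over an already-parsed DAG. This assumes the DAG has been loaded into a dict mapping each step URI to its list of dependencies; `downstream_dependents` and the example URIs are hypothetical, and archive files are assumed to have been left out of the dict.

```python
def downstream_dependents(dag: dict[str, list[str]], old_path: str) -> dict[str, list[str]]:
    """Map each step outside the old chain to the old-version steps it still uses."""
    dependents = {}
    for step, deps in dag.items():
        if old_path in step:
            continue  # part of the old dataset's own chain
        stale = [dep for dep in deps if old_path in dep]
        if stale:
            dependents[step] = stale
    return dependents


# Hypothetical DAG fragment for illustration only.
dag = {
    "data://meadow/example_ns/2023-01-01/example": ["snapshot://example_ns/2023-01-01/example.csv"],
    "data://garden/example_ns/2023-01-01/example": ["data://meadow/example_ns/2023-01-01/example"],
    "data://garden/other_ns/2024-05-01/combined": ["data://garden/example_ns/2023-01-01/example"],
}
print(downstream_dependents(dag, "example_ns/2023-01-01/example"))
```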

If downstream dependents exist:

- Tell the user which datasets depend on the old version and need updating in a follow-up PR
- Add a "Downstream dependencies" section to the PR description (not collapsed; this is important) listing the dependent datasets with a note that they should be updated to point to the new version in a follow-up PR

DAG archiving & reordering

After the ETL update, etl update appends the new version entries to the bottom of the main DAG file while the old version's entries stay in their original slot. Always ask the user whether to archive — but never skip this checklist item, and when the user agrees, always do the reorder too (not just the archive).

Workflow when the user agrees:

1. Archive the old version. Move its entries (snapshot → meadow → garden → grapher) from the main DAG file (e.g., dag/poverty_inequality.yml) to the bottom of the corresponding archive file (dag/archive/<same_file>.yml). Include the original section comment (e.g., # 1000 Binned Global Distribution (World Bank PIP)) above the archived entries.
2. Move the new entries into the old slot so the dataset stays grouped with its neighbours and section comment. The new entries should not remain at the bottom of the main DAG.
3. Preserve the original section comment (same indentation as the old block) above the new entries.
4. Verify: rg "<namespace>/<old_version>/<short_name>" dag/ -g "*.yml" | grep -v "^dag/archive" returns nothing, and rg "<namespace>/<new_version>/<short_name>" dag/ -g "*.yml" shows the entries only in the main file (under the section comment), not at the bottom.
5. Run make check and commit with 🔨🤖 Archive old <name> entries and reorder DAG.

Final QA hand-off — Anomalist + Chart Diff in Wizard

This is the last step, after the DAG archive has been committed. Don't auto-run these — they're human-judgment tools. Hand off the two staging links so the user can review and click through:

Anomalist — flags variables whose new values diverge from the old version beyond statistical thresholds. Catches accidental scale changes, base-year rebases that propagated the wrong way, and silent drops.
```
http://staging-site-<container_branch>/etl/wizard/anomalist
```
Chart Diff — shows side-by-side before/after thumbnails for every chart that uses an upgraded indicator. Catches visual regressions the schema-level checks miss (axis ranges, color steps, legend changes).
```
http://staging-site-<container_branch>/etl/wizard/chart-diff
```

Important: derive <container_branch> correctly. The staging hostname is not simply staging-site-<branch>. The container name is produced by get_container_name(branch) in etl/config.py:

1. Replace /, ., _ with - in the branch name.
2. Strip a leading staging-site- if present.
3. Truncate to the first 28 characters (Cloudflare DNS limit).
4. Strip any trailing -.

Branches over 28 chars therefore get clipped. Example: data-military-expenditure-2026 (30 chars) → container data-military-expenditure-20 → hostname staging-site-data-military-expenditure-20. The simplest way to get the correct value is to call the helper:

.venv/bin/python -c "from etl.config import get_container_name; print(get_container_name('<branch>'))"
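For intuition only, the four rules listed above can be re-implemented in a few lines. This is an illustrative sketch written from that description, not the code in etl/config.py; `container_name_sketch` is a hypothetical name, and the real helper stays authoritative if the two ever disagree.

```python
import re


def container_name_sketch(branch: str) -> str:
    """Approximate the documented branch-to-container-name rules."""
    name = re.sub(r"[/._]", "-", branch)        # replace /, ., _ with -
    if name.startswith("staging-site-"):        # strip a leading staging-site-
        name = name[len("staging-site-"):]
    return name[:28].rstrip("-")                # truncate to 28 chars, drop trailing -


print("staging-site-" + container_name_sketch("data-military-expenditure-2026"))
```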

Tell the user something like: "Final QA: please review Anomalist and Chart Diff in the Wizard. If anything looks off, let me know and I'll investigate."

These pages need a fresh staging build, so they're only meaningful after the PR's grapher upload to staging has completed and the staging server has rebuilt.

Guardrails and tips

DAG consistency: After etl update, always verify that all new steps in dag/main.yml reference each other with the new version. A common bug is garden depending on old meadow or old snapshot — this silently loads stale data.
Never return empty tables or comment out logic as a workaround — fix the parsing/transformations instead.
Column name changes: update garden processing code and metadata YAMLs (garden/grapher) to match schema changes.
Indexing: avoid leaking index columns from reset_index(); format tables with tb.format(["country", "year"]) as appropriate.
Metadata validation errors are guidance — update YAML to add/remove variables as indicated.
Mixed-type object columns at meadow: when pd.read_csv produces an object column that mixes strings and NaN (common for sparse text columns like sources/comments/punishments), the feather repacker rejects it. Cast those columns to pandas "string" dtype before tb.format(...).
paths.regions auto-resolves DAG dependencies: paths.regions.add_population(tb) and paths.regions.add_aggregates(tb, regions=[...]) pick up the population and income_groups datasets directly from the DAG. Don't call paths.load_dataset("population") and pass the result through unless the helper specifically asks for the dataset; otherwise the parameter is unused.
WB income-group aggregates: add the four classification names (High-income countries, Upper-middle-income countries, Lower-middle-income countries, Low-income countries) to your REGIONS list and add data://garden/wb/<latest>/income_groups to the DAG. paths.regions.add_aggregates(...) auto-resolves the classification.
Detect structural placeholders dynamically: when a source ships "balanced panel" rows that are zero everywhere by design (status combos that exist only for completeness), detect them at runtime (groupby(...).max() == 0) and assert the count matches the codebook. A coding change in the source then surfaces as a test failure instead of silently shipping noise.
Codebook-vs-data inconsistencies: when the codebook documents one thing but the actual CSV shows another (placeholder claimed but non-zero rows present, etc.), preserve the data as-shipped and flag it in the PR description for the producer to confirm. Don't silently force the data to match the codebook.
processing_level: major requires description_processing: keep processing_level: minor as the common default and override to major only on indicators that have a description_processing field. Don't blanket-set major on the common block and then leave country-level proportions without their own processing note.
Per-indicator description_processing reads better than a generic shared note: when an indicator is derived (combined-categorical buckets, regional aggregates, computed counts), spell out that indicator's derivation. Reusing named definitions for shared boilerplate is fine; just compose them into per-indicator sentences rather than dropping a single generic note across all indicators.
description_key in definitions.common propagates only to indicators without their own list: if you want a bullet to appear on every indicator, either keep it on common.description_key and don't define per-indicator lists (it inherits), or prepend it explicitly to each per-indicator list (treats it as a "first bullet" pattern).
Phantom-category audit on categorical indicators: after building categorical indicators, sweep every indicator and compare YAML sort: labels against the unique values that actually appear in the data. Phantom labels (declared in sort: or in a category map but never produced) clutter chart legends with empty buckets. Either drop them from sort: and description_key, or remove them from the map if they can never occur given the data shape. Re-run the audit on every data refresh — phantoms can reappear when a category is dropped upstream.
NOTE: comments for the next maintainer when behaviour is data-conditional: when something in the code holds only because of the current data shape (e.g. "only 4 indicators have an EoE=0 row", "only Brazil 2025 is a transition-year artefact"), leave a # NOTE: comment near the relevant block asking the next data update to re-audit. Helps future maintainers spot which assumptions might decay before they bite.
Indicator Upgrader CLI for one-shot chart remaps: when v1 → v2 short_names change so much that the auto-upgrader can't match them, drive the remap manually. Write a small script that calls WizardDB.add_variable_mapping(mapping={old_id: new_id, ...}, dataset_id_old=..., dataset_id_new=..., comments="...") with the explicit pairs, then run from apps.indicator_upgrade.upgrade import cli_upgrade_indicators; cli_upgrade_indicators(dry_run=True) to preview affected charts, and (dry_run=False) to apply. Mappings stay in the wizard DB until WizardDB.delete_variable_mapping() is called, so a slug-collision failure can be recovered by fixing the slug and rerunning the upgrade — only un-upgraded charts get reattempted. The active staging DB is inferred from the current git branch.
Drop-in vs restructure decision point: when the new dataset has a different shape (long vs wide, more policies, changed score semantics, dropped composite measures), etl update --rename is the wrong starting point — the structure of meadow/garden/grapher needs to follow the new shape, and the rename flow will only produce confusion. Spot this fork early at the snapshot/codebook stage, before running etl update. Scaffold the new chain via the create-etl-steps skill (wraps the wizard's cookiecutter templates) or launch the wizard UI with etlwiz and use its "ETL Steps" page — both produce a consistent meadow/garden/grapher skeleton to fill in. Once scaffolded, read the v1 scripts as a reference for the source-specific logic that's still relevant (column-rename maps, status/category normalisations, country harmonisation map, sanity checks, codebook-driven structural assertions) — don't copy the v1 structure blindly, but port the bits that still apply to the new schema.
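The phantom-category audit from the tips above reduces to a set comparison per indicator. A minimal sketch, assuming the `sort:` labels and the column's values have already been pulled out as plain lists (`phantom_labels` is a hypothetical name, and the labels are made up for illustration):

```python
def phantom_labels(sort_labels: list[str], observed_values: list[str]) -> list[str]:
    """Labels declared in sort: that never appear in the data, in declared order."""
    seen = set(observed_values)
    return [label for label in sort_labels if label not in seen]


# Made-up labels for illustration only.
declared = ["Banned", "Restricted", "Legal", "No data"]
observed = ["Legal", "Banned", "Legal"]
print(phantom_labels(declared, observed))
```

Anything the function returns is a candidate to drop from `sort:` and `description_key`, or from the category map if it can never occur.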

When the update is review-heavy and you need iterative back-and-forth with a topic owner over staging, see the report-indicator-changes skill for drafting the message.

Artifacts (expected)

- workbench/<short_name>/snapshot-runner.md
- workbench/<short_name>/progress.md
- workbench/<short_name>/notes_to_check.md (one entry per carried-over # NOTE: / # TODO:, plus detected sanity_checks functions and their log-control flags)
- workbench/<short_name>/sanity_checks.log (only if step 5b ran)
- workbench/<short_name>/meadow_diff_raw.txt and meadow_diff.md
- workbench/<short_name>/garden_diff_raw.txt and garden_diff.md
- workbench/<short_name>/harmonization.log and harmonization_audit.md (from step 5c)
- workbench/<short_name>/indicator_upgrade.json (if indicator-upgrader was used)
- workbench/<short_name>/update-context.yml (canonical facts gathered during the update; consumed by data-updates-comms)
- workbench/<short_name>/slack-announcement.md
- workbench/<short_name>/data-update.md (public-facing post draft for OWID /latest, from step 9b)

Example usage

Minimal catalog URI with explicit old version:
- update-dataset data://snapshot/irena/2024-11-15/renewable_power_generation_costs 2023-11-15 update-irena-costs

Common issues when data structure changes

SILENT FAILURES WARNING: Never return empty tables or comment out code as a workaround!
Column name changes: If columns are renamed/split (e.g., single cost → local currency + PPP), update:
- Python code references in the garden step
- Garden metadata YAML (e.g., food_prices_for_nutrition.meta.yml)
- Grapher metadata YAML (if it exists)
Index issues: Check for unwanted index columns from reset_index() — ensure proper indexing with tb.format(["country", "year"]).
Metadata validation: Use error messages as a guide — they show exactly which variables to add/remove from YAML files.
