etl-profiling

Profile and optimize ETL step performance — CPU time, memory usage, and I/O bottlenecks. Use when an ETL step is slow, uses too much memory, or when the user asks to profile, optimize, or speed up a step. Covers profiling commands, categorical dtype optimization, vectorization, SUBSET filtering for fast dev runs, and iterative diagnose→fix→reprofile workflow.

Install command

npx skills add https://github.com/owid/etl --skill etl-profiling

Copy and paste this command into Claude Code to install the skill

Source

owid/etl

Stars: 148

Forks: 31

Updated: April 16, 2026 at 11:19

name	etl-profiling
description	Profile and optimize ETL step performance — CPU time, memory usage, and I/O bottlenecks. Use when an ETL step is slow, uses too much memory, or when the user asks to profile, optimize, or speed up a step. Covers profiling commands, categorical dtype optimization, vectorization, SUBSET filtering for fast dev runs, and iterative diagnose→fix→reprofile workflow.
metadata	{"internal":true}

ETL Step Profiling & Optimization

Quick Start

# CPU profile — shows time per line in run()
.venv/bin/etl d profile --cpu garden/namespace/version/dataset

# Memory profile — shows memory per line in run()
.venv/bin/etl d profile --mem garden/namespace/version/dataset

# Profile a specific function
.venv/bin/etl d profile --cpu garden/namespace/version/dataset -f process_data

Workflow

1. Check feather schemas first: before profiling, inspect the on-disk types of large tables:

import pyarrow.feather as pf
for field in pf.read_table("data/meadow/.../table.feather").schema:
    print(f"{field.name}: {field.type}")
# large_string → should be dictionary (categorical)

2. Profile: measure, never guess
3. Identify the bottleneck: read the % column, focus on the top 3 lines
4. Diagnose: is it I/O, dtype waste, or algorithmic?
5. Fix one thing: apply the smallest targeted fix
6. Re-profile: verify improvement, repeat

Reading Profile Output

The profiler outputs a table with columns:

Line   Hits   Time          Per Hit     % Time    Line Contents
  46      1   1.46e+10     1.46e+10     41.7      tb = ds_meadow.read("population")

% Time — focus here. Sort by this mentally; anything >5% is worth investigating.
Hits — number of times the line executed. High hits on simple ops = loop to vectorize.
Per Hit — time per call. High per-hit on a single call = the operation itself is slow.
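The "sort by % Time mentally" step can also be done mechanically. The following is a minimal sketch, assuming the profiler output has been captured as text in the column layout shown above; the sample rows for lines 47 and 52 and the helper name `hot_lines` are hypothetical, added only for illustration.

```python
import re

# Hypothetical sample in the same column layout as the profiler table.
# Only the first data row comes from the document; the others are made up.
SAMPLE = """\
Line   Hits   Time          Per Hit     % Time    Line Contents
  46      1   1.46e+10     1.46e+10     41.7      tb = ds_meadow.read("population")
  47      1   2.10e+08     2.10e+08      0.6      tb = tb.dropna()
  52   5000   3.50e+09     7.00e+05     10.0      total += row["value"]
"""

# line number, hits, time, per-hit, percent, then the source text
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)\s+(.*)$")


def hot_lines(text, threshold=5.0):
    """Return (pct, line, hits, source) tuples above threshold, hottest first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header row, or a line that was never executed
        lineno, hits, _time, _per_hit, pct, source = match.groups()
        if float(pct) > threshold:
            rows.append((float(pct), int(lineno), int(hits), source.strip()))
    return sorted(rows, reverse=True)


for pct, lineno, hits, source in hot_lines(SAMPLE):
    print(f"{pct:5.1f}%  line {lineno}  hits={hits}  {source}")
```

In this sample, line 52 has high hits on a simple operation, which is the loop-to-vectorize pattern, while line 46 is a single slow call.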

Caveat: line_profiler adds overhead. Functions that run fast but are called from many instrumented lines can appear inflated. If a line shows >20% but the step runs fast in practice, the profiler overhead is distorting results. Verify with wall-clock timing:

import time

t0 = time.perf_counter()  # monotonic clock, better suited to intervals than time.time()
# ... suspect code ...
print(f"Took {time.perf_counter() - t0:.2f}s")
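When several suspect blocks need wall-clock checks, a small context manager avoids repeating the timing boilerplate. This is a sketch using only the standard library; the name `timed` and the `timings` dict are illustrative, not part of the ETL codebase.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print wall-clock time for the enclosed block; optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f}s")


timings = {}
with timed("suspect block", timings):
    # Stand-in for the code the profiler flagged.
    total = sum(i * i for i in range(100_000))
```

Comparing these numbers against the profiler's % Time column shows whether a flagged line is genuinely slow or inflated by instrumentation.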

Common Bottlenecks & Fixes

1. String Columns That Should Be Categoricals

Symptom: ds.read() takes many seconds; memory in GB for a table with <1000 unique values per column.

Diagnose:

import pyarrow.feather as pf
arrow_table = pf.read_table("data/meadow/namespace/version/dataset/table.feather")
for field in arrow_table.schema:
    print(f"{field.name}: {field.type}")
# If you see `large_string` or `string` for country/variant/age/sex → problem

Check unique counts:

for col in ['country', 'variant', 'age', 'sex']:
    unique = arrow_table.column(col).unique()
    print(f"{col}: {len(unique)} unique values out of {arrow_table.num_rows:,} rows")

Fix upstream (meadow step) — convert to categorical before .format():

for col in ["country", "variant", "sex", "age"]:
    if col in tb.columns:
        tb[col] = tb[col].astype("category")
tb = tb.format(index_columns, short_name="table_name")

Fix downstream (garden step) — read with safe_types=False to preserve categoricals:

tb = ds_meadow.read("table_name", safe_types=False)

safe_types=True (the default) converts categoricals back to string[pyarrow], losing all the savings.

Impact: Typically 90-99% memory reduction and 5-30x faster reads for tables with >1M rows.
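The memory effect is easy to reproduce on a toy column. The sketch below uses a plain pandas DataFrame rather than an OWID table, with made-up data; the exact ratio depends on the pandas version and string backend, so it prints the measurement instead of promising a number.

```python
import pandas as pd

# Toy data: 100,000 rows but only four distinct country names.
n_rows = 100_000
df = pd.DataFrame(
    {
        "country": ["France", "Germany", "Spain", "Italy"] * (n_rows // 4),
        "value": range(n_rows),
    }
)

as_strings = df["country"].memory_usage(deep=True)
as_category = df["country"].astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings / 1e6:.2f} MB")
print(f"category: {as_category / 1e6:.2f} MB")
print(f"ratio:    {as_strings / as_category:.1f}x")
```

A categorical stores each distinct value once plus a small integer code per row, which is why the saving grows with the row count and shrinks as the number of unique values rises.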

2. Slow Reads Despite Categoricals

Symptom: ds.read() is fast with safe_types=False but slow with default safe_types=True.

Fix: Use safe_types=False when you don't need the type safety guarantees. Be aware that categorical columns behave slightly differently (e.g., .replace() may warn about deprecated behavior — use .cat.rename_categories() instead).
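As a small illustration of the `.cat.rename_categories()` suggestion, the sketch below relabels a categorical column while keeping its dtype. It uses a plain pandas Series with made-up values.

```python
import pandas as pd

sex = pd.Series(["M", "F", "F", "M"], dtype="category")

# Relabel the categories themselves instead of replacing values row by row.
renamed = sex.cat.rename_categories({"M": "male", "F": "female"})

print(renamed.dtype)     # still categorical
print(renamed.tolist())
```

Only the category labels are rewritten, so the cost scales with the number of categories rather than the number of rows.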

3. Row-by-Row Operations

Symptom: .apply(lambda row: ..., axis=1) or Python loops over rows showing high time.

Fix with np.select:

# Bad — iterates row by row
tb["result"] = tb.apply(lambda row: row["a"] if row["a"] > 0 else row["b"], axis=1)

# Good — vectorized
import numpy as np
conditions = [tb["a"] > 0]
choices = [tb["a"]]
tb["result"] = np.select(conditions, choices, default=tb["b"])

Note on origins: np.where and np.select strip OWID metadata origins. To preserve them:

positive = tb["a"] > 0  # compute the mask once and reuse it
tb["result"] = tb["b"]  # default
tb.loc[positive, "result"] = tb.loc[positive, "a"]
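Before swapping a row-wise `.apply` for a vectorized version, it helps to confirm that both produce the same values on a small sample. The sketch below compares the three variants from this section on a toy pandas DataFrame; plain pandas carries no OWID origins, so it checks values only, not metadata.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3, -1, 0, 7], "b": [10, 20, 30, 40]})

# Row-by-row reference implementation (slow on large tables).
row_wise = df.apply(lambda row: row["a"] if row["a"] > 0 else row["b"], axis=1)

# Vectorized with np.select (returns a plain NumPy array).
vectorized = np.select([df["a"] > 0], [df["a"]], default=df["b"])

# Vectorized with a boolean mask and .loc (stays a pandas Series).
mask = df["a"] > 0
masked = df["b"].copy()
masked.loc[mask] = df.loc[mask, "a"]

print(row_wise.tolist(), vectorized.tolist(), masked.tolist())
```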

4. Expensive Groupby on Large Tables

Symptom: .groupby().sum() or .groupby().agg() taking seconds on millions of rows.

Fixes:

Ensure groupby columns are categorical (much faster hashing)
Use observed=True to skip unused category combinations
Use as_index=False to avoid expensive multi-index creation
Never mix lambdas with string aggregations in .agg() — a single callable forces pandas off its fast C path, causing ~10× slowdown on ALL aggregations (including the string ones like "sum"). Split into two separate groupby calls instead.

# Good
tb.groupby(["country", "year", "sex"], as_index=False, observed=True)["value"].sum()

# Bad — lambda poisons the entire agg call
tb.groupby(cols).agg({"value": "sum", "country": lambda x: check(x)})

# Good — separate the fast and slow aggregations
result = tb.groupby(cols).agg({"value": "sum"})
checks = tb.groupby(cols)["country"].apply(lambda x: check(x))

Known issue: geo.add_region_aggregates() (deprecated) injects a per-group lambda to check countries_that_must_have_data. When that list is empty (common case), the lambda is a no-op but still causes the slowdown. This was fixed in 2026-03 to skip the lambda when no checks are needed. The newer paths.regions.add_aggregates() API doesn't have this issue.
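The effect of `observed=True` is easiest to see on a tiny table with categorical keys. This is a sketch on a plain pandas DataFrame with made-up data: with `observed=False`, pandas emits one group per combination of categories, including combinations that never occur.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": pd.Categorical(["France", "France", "Germany"]),
        "sex": pd.Categorical(["female", "female", "male"]),
        "value": [1, 2, 3],
    }
)

# Every country x sex combination, including the two that never occur.
all_combos = df.groupby(["country", "sex"], observed=False)["value"].sum()

# Only the combinations actually present in the data.
observed_only = df.groupby(["country", "sex"], observed=True)["value"].sum()

print(len(all_combos), len(observed_only))
```

With many categorical keys the unobserved combinations multiply quickly, so passing `observed=True` explicitly keeps both the runtime and the result size proportional to the data.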

5. Unnecessary Full-Table Reads

Symptom: Reading a large table but only using a few columns or a subset of rows.

Fix: Filter early. Add a SUBSET env var pattern for dev runs:

import os
SUBSET = os.environ.get("SUBSET")

def run():
    tb = ds_meadow.read("big_table", safe_types=False)
    if SUBSET:
        countries = [c.strip() for c in SUBSET.split(",")]
        tb = tb[tb["country"].isin(countries)]
    # ... rest of processing

Usage: SUBSET='France,Germany' .venv/bin/etlr namespace/version/dataset --private
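The parsing of the SUBSET variable can be pulled into a helper so that stray spaces or trailing commas do not produce empty filter values. This is a sketch; `parse_subset` is a hypothetical name, not an existing ETL function.

```python
import os


def parse_subset(raw):
    """Turn 'France, Germany' into ['France', 'Germany']; None means no filter."""
    if not raw:
        return None
    values = [value.strip() for value in raw.split(",") if value.strip()]
    return values or None


# None when SUBSET is unset or empty, so a full run stays the default.
subset = parse_subset(os.environ.get("SUBSET"))
print(subset)
```

Inside `run()`, the filter would then apply only when `subset` is not None, mirroring the `if SUBSET:` check above.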

6. Expensive `create_dataset` or `ds.add`

Symptom: paths.create_dataset(tables=..., check_variables_metadata=True) showing high time in profiler.

Diagnosis: Often this is profiler overhead, not real time. Verify with wall-clock:

import time

t0 = time.perf_counter()
ds = paths.create_dataset(tables=tables, ...)
print(f"create_dataset: {time.perf_counter() - t0:.2f}s")

If it's genuinely slow, the cost is usually in update_metadata (YAML parsing) or ds.add (feather serialization for large tables). These are typically fixed costs and not worth optimizing unless the tables themselves are unnecessarily large.

Memory-Specific Profiling

.venv/bin/etl d profile --mem garden/namespace/version/dataset

Look for:

Spikes >100 MB on a single line — likely creating a large intermediate copy
Cumulative growth that never drops — objects not being freed

Quick memory check in code:

print(f"Memory: {tb.memory_usage(deep=True).sum() / 1e6:.0f} MB")
print(tb.dtypes)  # object dtype = memory hog
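For a quick check of whether one function creates a large intermediate copy, the standard library's `tracemalloc` can report peak allocation around a call. This is a sketch of a stdlib alternative, not what `etl d profile --mem` uses internally; `peak_mb` and `build_intermediate` are hypothetical names, and `tracemalloc` only sees allocations that go through Python's memory hooks.

```python
import tracemalloc


def peak_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6


def build_intermediate(n):
    squares = [i * i for i in range(n)]  # large temporary list
    return sum(squares)


total, peak = peak_mb(build_intermediate, 200_000)
print(f"peak: {peak:.1f} MB")
```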

Iteration Tips

Always use SUBSET for profiling iterations. Never run full data until you've confirmed the fix works.
Use etl d profile for measuring, not etlr — the latter has overhead from change detection, dependency resolution, and dataset saving that drowns out the signal.
Small SUBSET for correctness (2-3 values), medium SUBSET for timing (10-15 values). Only go bigger if the bottleneck doesn't show up at small scale.
Use -f function_name to drill into specific functions. It only works for functions defined in the step's main module, not imported ones.

Checklist Before Optimizing

Profiled with actual data (not guessing)
Identified top 3 bottleneck lines by % time
Checked feather schema for string vs dictionary columns
Checked safe_types setting on large table reads
Verified with wall-clock timing (not just profiler)
Re-profiled after each fix to confirm improvement
