| name | wandb-primary |
| description | Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly. Also trigger on training-curve diagnostics questions — run health, divergence, overfit/convergence/plateau, spikes, LR-schedule/grad-norm/grad-histogram reading, dead layers, step-axis choice, and run comparisons. |
W&B Primary Skill
Python environment detection (DO THIS FIRST)
Before running any Python code, you MUST determine how the user runs Python in this project. Check for clues in this order:
- Look for pyproject.toml, uv.lock, poetry.lock, Pipfile, requirements.txt, setup.py, setup.cfg, Makefile, or docker-compose.yml in the project root
- Check if there is an activated virtual environment ($VIRTUAL_ENV, .venv/, venv/)
- Check for a Dockerfile or CI config that reveals the runtime
- If the user has explicitly told you how they run Python, use that
Once you know the environment, write your finding into this skill by replacing the placeholder below so that all subsequent code blocks use the correct commands. If you cannot determine the environment from any of the above, default to uv.
Detected Python environment: not yet detected
# Run command: <not yet detected>
# Install command: <not yet detected>
Examples of what to write here:
| Environment | Run command | Install command |
|---|---|---|
| uv | uv run script.py | uv pip install pandas |
| poetry | poetry run python script.py | poetry add pandas |
| conda | conda run python script.py | conda install pandas |
| bare venv | python script.py (with venv activated) | pip install pandas |
| docker | docker exec <ctr> python script.py | docker exec <ctr> pip install pandas |
If you cannot determine the environment, write this:
# Run command: uv run script.py # always use uv run, never bare python
# Install command: uv pip install <pkg>
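The detection order above can be sketched as a small function. This is a minimal sketch, not part of the bundled helpers: `detect_python_env` and `COMMANDS` are hypothetical names, and mapping `uv.lock` to uv and `poetry.lock` to poetry is an assumption based on what those lock files usually mean. It leaves out conda (no marker file is listed above), and an explicit instruction from the user always overrides whatever it returns.

```python
import os
from pathlib import Path

# Run/install commands taken from the examples table above.
COMMANDS = {
    "uv": ("uv run script.py", "uv pip install <pkg>"),
    "poetry": ("poetry run python script.py", "poetry add <pkg>"),
    "bare venv": ("python script.py", "pip install <pkg>"),
    "docker": ("docker exec <ctr> python script.py",
               "docker exec <ctr> pip install <pkg>"),
}


def detect_python_env(root, environ=None):
    """Return (name, run_cmd, install_cmd) for the project at `root`.

    Hypothetical helper: checks lock files, then virtual environments,
    then Docker files, and falls back to uv as the documented default.
    """
    root = Path(root)
    environ = os.environ if environ is None else environ

    if (root / "uv.lock").exists():
        name = "uv"
    elif (root / "poetry.lock").exists():
        name = "poetry"
    elif (environ.get("VIRTUAL_ENV")
          or (root / ".venv").is_dir()
          or (root / "venv").is_dir()):
        name = "bare venv"
    elif (root / "Dockerfile").exists() or (root / "docker-compose.yml").exists():
        name = "docker"
    else:
        name = "uv"  # nothing detected: default to uv, never bare python

    run_cmd, install_cmd = COMMANDS[name]
    return name, run_cmd, install_cmd


name, run_cmd, install_cmd = detect_python_env(".")
print(f"Detected Python environment: {name}")
print(f"# Run command: {run_cmd}")
print(f"# Install command: {install_cmd}")
```

The printed lines match the placeholder format, so the result can be pasted into the "Detected Python environment" block directly.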
This skill covers everything an agent needs to work with Weights & Biases:
- W&B SDK (wandb): training runs, metrics, artifacts, sweeps, system metrics
- Weave SDK (weave): GenAI traces, evaluations, scorers, token usage
- Helper libraries: wandb_helpers.py and weave_helpers.py for common operations
When to use what
| I need to... | Use |
|---|---|
| Query training runs, loss curves, hyperparameters | W&B SDK (wandb.Api()) — see references/WANDB_SDK.md |
| Query GenAI traces, calls, evaluations | Weave SDK (weave.init(), client.get_calls()) — see references/WEAVE_SDK.md |
| Convert Weave wrapper types to plain Python | weave_helpers.unwrap() |
| Build a DataFrame from training runs | wandb_helpers.runs_to_dataframe() |
| Extract eval results for analysis | weave_helpers.eval_results_to_dicts() |
| Need low-level Weave filtering (CallsFilter, Query) | Raw Weave SDK (weave.init(), client.get_calls()) — see references/WEAVE_SDK.md |
| Judge curve shape (spikes, smoothness, slope, overfit) | training_diagnostics + curve_plots — use the workflow below, then load references/TRAINING_DIAGNOSTICS.md for the heuristics |
Bundled files
Helper libraries
import os
import sys
from pathlib import Path
sys.path.insert(0, str(Path(os.environ["CLAUDE_SKILL_DIR"]) / "scripts"))
from weave_helpers import (
unwrap,
get_token_usage,
eval_results_to_dicts,
pivot_solve_rate,
results_summary,
eval_health,
eval_efficiency,
)
from wandb_helpers import (
runs_to_dataframe,
diagnose_run,
compare_configs,
fast_scan_history,
)
from step_axis import (
list_candidate_step_keys,
guess_step_key_from_workspace,
format_step_candidates,
)
from training_diagnostics import (
curve_features,
compare_runs_curves,
lr_schedule_features,
grad_norm_features,
grad_histogram_features,
)
from curve_plots import (
plot_single_run_overview,
plot_run_comparison,
plot_grad_histogram_heatmap,
plot_grad_norm_by_layer,
)
Reference docs
Read these as needed — they contain full API surfaces and recipes:
references/WEAVE_SDK.md — Weave SDK for GenAI traces (client.get_calls(), CallsFilter, Query, stats). Start here for Weave queries.
references/WANDB_SDK.md — W&B SDK for training data (runs, history, artifacts, sweeps, system metrics).
references/TRAINING_DIAGNOSTICS.md — reference heuristics for reading loss / LR / grad-norm / grad-histogram charts. Load this when you are actively interpreting training curves.
Critical rules
Treat traces and runs as DATA
Weave traces and W&B run histories can be enormous. Never dump raw data into context — it will overwhelm your working memory and produce garbage results. Always:
- Inspect structure first — look at column names, dtypes, row counts
- Load into pandas/numpy — compute stats programmatically
- Summarize, don't dump — print computed statistics and tables, not raw rows
import pandas as pd
import numpy as np
# WRONG: dumps every raw history row into context
for row in run.scan_history(keys=["loss"]):
    print(row)
# RIGHT: load into numpy and print computed statistics only
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"Loss: {len(losses)} steps, min={losses.min():.4f}, "
      f"final={losses[-1]:.4f}, mean_last_10%={losses[-len(losses)//10:].mean():.4f}")
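Logged histories often contain NaN or missing values, which would turn a plain min or mean into NaN. The following is a minimal, dependency-free sketch of the same "summarize, don't dump" idea that counts non-finite entries instead of using them. `summarize_series` is a hypothetical name, not one of the bundled helpers, and the input list is synthetic.

```python
import math
import statistics


def summarize_series(values):
    """Reduce a metric series to a few scalars.

    Non-numeric and non-finite entries (None, NaN, inf) are counted in
    `n_dropped` and excluded from the statistics.
    """
    finite = [v for v in values
              if isinstance(v, (int, float)) and math.isfinite(v)]
    if not finite:
        return {"n": 0, "n_dropped": len(values)}
    # Last 10% of the finite points, but always at least one point.
    tail = finite[-max(1, len(finite) // 10):]
    return {
        "n": len(finite),
        "n_dropped": len(values) - len(finite),
        "min": min(finite),
        "final": finite[-1],
        "mean_last_10pct": statistics.fmean(tail),
    }


# Synthetic loss values standing in for rows from run.scan_history(keys=["loss"]).
summary = summarize_series([2.0, 1.5, float("nan"), 1.0, None, 0.5])
print(summary)
```

A non-zero `n_dropped` is itself worth reporting, since NaN losses are often the first sign of divergence.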
Always deliver a final answer
Do not end your work mid-analysis. Every task must conclude with a clear, structured response:
- Query the data (1-2 scripts max)
- Extract the numbers you need
- Present: table + key findings + direct answers to each sub-question
If you catch yourself saying "now let me build the final analysis" — stop and present what you have.
Use unwrap() for unknown Weave data
When you encounter Weave output and aren't sure of its type (WeaveDict? WeaveObject? ObjectRef?), unwrap it first:
from weave_helpers import unwrap
import json
output = unwrap(call.output)
print(json.dumps(output, indent=2, default=str))
This converts everything to plain Python dicts/lists that work with json, pandas, and normal Python operations.
Environment setup
The sandbox has wandb, weave, pandas, and numpy pre-installed.
import os
entity = os.environ["WANDB_ENTITY"]
project = os.environ["WANDB_PROJECT"]
Installing extra packages and running scripts
Use whichever run/install commands you wrote in the Python environment detection section above. If you haven't detected the environment yet, go back and do that first.
Quick starts
W&B SDK — training runs
import wandb
import pandas as pd
api = wandb.Api()
path = f"{entity}/{project}"
runs = api.runs(path, filters={"state": "finished"}, order="-created_at")
from wandb_helpers import runs_to_dataframe
rows = runs_to_dataframe(runs, limit=100, metric_keys=["loss", "val_loss", "accuracy"])
df = pd.DataFrame(rows)
print(df.describe())
For full W&B SDK reference (filters, history, artifacts, sweeps), read references/WANDB_SDK.md.
Weave — SDK
import weave
client = weave.init(f"{entity}/{project}")
calls = client.get_calls(limit=10)
For raw SDK patterns (CallsFilter, Query, advanced filtering), read references/WEAVE_SDK.md.
Key patterns
Weave eval inspection
Evaluation calls follow this hierarchy:
Evaluation.evaluate (root)
├── Evaluation.predict_and_score (one per dataset row x trials)
│ ├── model.predict (the actual model call)
│ ├── scorer_1.score
│ └── scorer_2.score
└── Evaluation.summarize
Extract per-task results into a DataFrame:
from weave_helpers import eval_results_to_dicts, results_summary
results = eval_results_to_dicts(pas_calls, agent_name="my-agent")
print(results_summary(results))
df = pd.DataFrame(results)
print(df.groupby("passed")["score"].mean())
Eval health and efficiency
from weave_helpers import eval_health, eval_efficiency
health = eval_health(eval_calls)
df = pd.DataFrame(health)
print(df.to_string(index=False))
efficiency = eval_efficiency(eval_calls)
print(pd.DataFrame(efficiency).to_string(index=False))
Token usage
from weave_helpers import get_token_usage
usage = get_token_usage(call)
print(f"Tokens: {usage['total_tokens']} (in={usage['input_tokens']}, out={usage['output_tokens']})")
Cost estimation
call_with_costs = client.get_call("id", include_costs=True)
costs = call_with_costs.summary.get("weave", {}).get("costs", {})
Run diagnostics
from wandb_helpers import diagnose_run
run = api.run(f"{path}/run-id")
diag = diagnose_run(run)
for k, v in diag.items():
    print(f" {k}: {v}")
Error analysis — open coding to axial coding
For structured failure analysis on eval results:
- Understand data shape: use project.summary(), calls.input_shape(), calls.output_shape()
- Open coding: write a Weave Scorer that journals what went wrong per failing call
- Axial coding: write a second Scorer that classifies notes into a taxonomy
- Summarize: count primary labels with collections.Counter
See references/WEAVE_SDK.md for the full SDK reference.
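The final "summarize" step needs only the standard library. The sketch below assumes the axial-coding scorer has already produced one primary label per failing call. The `labeled_failures` records, their field names, and the label names are all hypothetical placeholders.

```python
from collections import Counter

# Hypothetical axial-coding output: one primary label per failing call.
labeled_failures = [
    {"call_id": "a", "primary_label": "tool_error"},
    {"call_id": "b", "primary_label": "wrong_format"},
    {"call_id": "c", "primary_label": "tool_error"},
    {"call_id": "d", "primary_label": "timeout"},
    {"call_id": "e", "primary_label": "tool_error"},
]

counts = Counter(item["primary_label"] for item in labeled_failures)
total = sum(counts.values())

# Print a compact table, most frequent failure mode first.
for label, n in counts.most_common():
    print(f"{label:<15} {n:>3}  {n / total:.0%}")
```

Reporting the share alongside the raw count makes it easier to see which failure mode dominates.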
W&B Reports
Install wandb[workspaces] using the install command from the Python environment detection section.
from wandb.apis import reports as wr
import wandb_workspaces.expr as expr
report = wr.Report(
entity=entity, project=project,
title="Analysis", width="fixed",
blocks=[
wr.H1(text="Results"),
wr.PanelGrid(
runsets=[wr.Runset(entity=entity, project=project)],
panels=[wr.LinePlot(title="Loss", x="_step", y=["loss"])],
),
],
)
Use expr.Config("lr"), expr.Summary("loss"), expr.Tags().isin([...]) for runset filters — not dot-path strings.
Training curve analysis workflow
Reach for this when the user asks whether a run is healthy, why it diverged, whether it overfit, or which of several runs has the best training dynamics.
The loop is: confirm the x-axis, compute features, render a PNG, read the image, write a verdict. Numbers and pictures cross-check each other — the helpers exist so you're not hand-rolling spike detection or slope fits while also trying to interpret them.
Pin the x-axis first
Different training stacks log different step keys (_step, global_step, trainer/global_step, epoch, train/step), and picking the wrong one turns an overlay into nonsense. list_candidate_step_keys(run) scans the history for plausible columns; guess_step_key_from_workspace(entity, project) checks what the W&B workspace actually plots. If both agree on one candidate, say which you picked and move on. If they disagree or there are several plausible choices, ask the user before plotting — this is cheap and avoids silently baking _step into a verdict.
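The agree-or-ask rule can be written down explicitly. This is a sketch under assumptions: `choose_step_key` is a hypothetical function, and it assumes the history scan yields a list of key names and the workspace check yields a single key name or None. The real return types of list_candidate_step_keys and guess_step_key_from_workspace may differ.

```python
def choose_step_key(history_candidates, workspace_guess):
    """Return (step_key, ask_user).

    A key is accepted without asking only when the history scan found
    exactly one candidate and the workspace guess names the same key.
    In every other case the caller should ask the user before plotting.
    """
    # De-duplicate while keeping the original order.
    candidates = list(dict.fromkeys(history_candidates))
    if len(candidates) == 1 and candidates[0] == workspace_guess:
        return candidates[0], False
    return None, True


print(choose_step_key(["global_step"], "global_step"))
print(choose_step_key(["_step", "global_step"], "global_step"))
print(choose_step_key(["_step"], "epoch"))
```

The second call asks the user even though the workspace guess is among the candidates, because several plausible keys exist. Loosening that rule is a judgment call.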
For one run
Render plot_single_run_overview(run, step_key=step_key), Read the PNG, and pair it with a compact feature table from curve_features on the metrics that actually exist. If the run logs gradient histograms or per-layer scalar norms, add plot_grad_histogram_heatmap() or plot_grad_norm_by_layer() — they surface dead layers and vanishing-gradient signatures that the top-line grad-norm scalar hides.
For multiple runs
compare_runs_curves() gives you the ranking table; plot_run_comparison() gives you the overlay. Overlays get unreadable past ~6 runs, so rank first, then plot the shortlist — the function will refuse to render more than 6 for exactly this reason.
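"Rank first, then plot the shortlist" can be sketched as follows. The ranking input here is a hypothetical list of (run name, score) pairs where a lower score is better, for example a final validation loss. The actual shape of the compare_runs_curves() output is not assumed.

```python
MAX_OVERLAY_RUNS = 6  # overlay limit stated above


def shortlist_runs(ranked, limit=MAX_OVERLAY_RUNS):
    """Return the names of the `limit` best runs, best first.

    `ranked` is an iterable of (run_name, score) pairs; lower is better.
    """
    best_first = sorted(ranked, key=lambda item: item[1])
    return [name for name, _ in best_first[:limit]]


# Synthetic scores for eight runs.
scores = [("run-a", 0.42), ("run-b", 0.31), ("run-c", 0.55), ("run-d", 0.29),
          ("run-e", 0.61), ("run-f", 0.33), ("run-g", 0.47), ("run-h", 0.90)]
shortlist = shortlist_runs(scores)
print(shortlist)
```

Only the shortlisted runs would then be passed to the overlay plot.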
Write it up
Keep the summary compact. Don't dump raw history rows or full spike/slope payloads into the response unless you're drilling into a specific anomaly — the helpers already reduce those to scalars for a reason.
Verdict: <healthy | unstable | overfit | plateaued | diverged | converged>
Evidence:
- <specific step range> — <what the metrics and plot show>
- <specific step range> — <what changed and why it matters>
Next actions:
- <concrete hyperparameter, logging, or code change>
When the numbers and the image disagree — and they will — references/TRAINING_DIAGNOSTICS.md is where the resolution heuristics live. Load it while you're interpreting, not before.
Gotchas
Weave API
| Gotcha | Wrong | Right |
|---|---|---|
| weave.init args | weave.init(project="x") | weave.init("x") (positional) |
| Parent filter | filter={'parent_id': 'x'} | filter={'parent_ids': ['x']} (plural, list) |
| WeaveObject access | rubric.get('passed') | getattr(rubric, 'passed', None) |
| Nested output | out.get('succeeded') | out.get('output').get('succeeded') (output.output) |
| ObjectRef comparison | name_ref == "foo" | str(name_ref) == "foo" |
| CallsFilter import | from weave import CallsFilter | from weave.trace.weave_client import CallsFilter |
| Query import | from weave import Query | from weave.trace_server.interface.query import Query |
| Eval status path | summary["status"] | summary["weave"]["status"] |
| Eval success count | summary["success_count"] | summary["weave"]["status_counts"]["success"] |
| When in doubt | Guess the type | unwrap() first, then inspect |
WeaveDict vs WeaveObject
- WeaveDict: dict-like, supports .get(), .keys(), []. Used for: call.inputs, call.output, scores dict
- WeaveObject: attribute-based, use getattr(). Used for: scorer results (rubric), dataset rows
- When in doubt: use unwrap() to convert everything to plain Python
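When a field may live on either kind of object, a small accessor avoids guessing. The sketch below is illustrative only: `get_field` is a hypothetical name, and a plain dict and a `SimpleNamespace` stand in for WeaveDict and WeaveObject so the example runs without Weave installed.

```python
from types import SimpleNamespace


def get_field(obj, name, default=None):
    """Read `name` from a dict-like or an attribute-based object."""
    # Dict-like objects expose .get() and .keys(), as WeaveDict is described above.
    if callable(getattr(obj, "get", None)) and callable(getattr(obj, "keys", None)):
        return obj.get(name, default)
    # Everything else is treated as attribute-based.
    return getattr(obj, name, default)


dict_like = {"passed": True}                 # stand-in for a WeaveDict
attr_based = SimpleNamespace(passed=False)   # stand-in for a WeaveObject

print(get_field(dict_like, "passed"))
print(get_field(attr_based, "passed"))
print(get_field(attr_based, "missing", default="n/a"))
```

For deeply nested or unfamiliar structures, unwrap() remains the simpler route.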
W&B API
| Gotcha | Wrong | Right |
|---|---|---|
| Summary access | run.summary["loss"] | run.summary_metrics.get("loss") |
| Loading all runs | list(api.runs(...)) | runs[:200] (always slice) |
| History — all fields | run.history() | run.history(samples=500, keys=["loss"]) |
| scan_history — no keys | scan_history() | scan_history(keys=["loss"]) (explicit) |
| Raw data in context | print(run.history()) | Load into DataFrame, compute stats |
| Metric at step N | iterate entire history | scan_history(keys=["loss"], min_step=N, max_step=N+1) |
| Cache staleness | reading live run | api.flush() first |
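The "Metric at step N" row can be wrapped in a small function. This is a sketch: `metric_at_step` is a hypothetical name, and `FakeRun` is a stand-in so the example runs offline. It assumes, following the pattern in the table, that min_step is inclusive and max_step is exclusive.

```python
def metric_at_step(run, key, step):
    """Return the value of `key` logged at `step`, or None if absent.

    Uses the bounded scan from the table above instead of iterating
    over the whole history.
    """
    for row in run.scan_history(keys=[key], min_step=step, max_step=step + 1):
        return row.get(key)
    return None


class FakeRun:
    """Offline stand-in for a wandb run (assumption: half-open step window)."""

    def __init__(self, history):
        self._history = history

    def scan_history(self, keys=None, min_step=None, max_step=None):
        for row in self._history:
            in_window = min_step <= row["_step"] < max_step
            if in_window and all(k in row for k in keys):
                yield {k: row[k] for k in keys}


fake = FakeRun([{"_step": 0, "loss": 2.0},
                {"_step": 1, "loss": 1.5},
                {"_step": 2, "acc": 0.4}])
print(metric_at_step(fake, "loss", 1))
print(metric_at_step(fake, "loss", 2))
```

With a real run, replace `fake` with the object returned by api.run(...).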
Package management
| Gotcha | Details |
|---|---|
| Using the wrong runner | Always use the run/install commands from the Python environment detection section — never guess |
| Bare python when env unknown | If you haven't detected the environment yet, default to uv run script.py (never bare python) |
Weave logging noise
Weave prints version warnings to stderr. Suppress with:
import logging
logging.getLogger("weave").setLevel(logging.ERROR)
Quick reference
import weave
client = weave.init(f"{entity}/{project}")
calls = client.get_calls(limit=10)
best = api.runs(path, filters={"state": "finished"}, order="+summary_metrics.loss")[:1]
print(f"Best: {best[0].name}, loss={best[0].summary_metrics.get('loss')}")
losses = np.array([r["loss"] for r in run.scan_history(keys=["loss"])])
print(f"min={losses.min():.6f}, final={losses[-1]:.6f}, steps={len(losses)}")
from wandb_helpers import compare_configs
diffs = compare_configs(run_a, run_b)
print(pd.DataFrame(diffs).to_string(index=False))