-
Locate the export and the renderer.
- Export:
$ARGUMENTS[0], or the JSON you just generated.
- Renderer: the sibling session-metrics skill's script. Resolve its
path (it ships in the same plugin):
- plugin install:
../session-metrics/scripts/session-metrics.py
- dev repo:
.claude/skills/session-metrics/scripts/session-metrics.py
Use whichever exists (glob if unsure).
-
Prepare the worksheet + skeleton (preferred — you are an editor, not an
author). Run --prepare-tasks on the export: it prints a compact
one-line-per-request worksheet to stdout and writes a renderable candidate
<stem>_grouping.json next to the export, with deterministic clustering,
seeded titles, and suggested verdicts already filled in.
python3 <renderer> --prepare-tasks <export.json>
The worksheet is your single source of grouping signals — do not re-probe
the JSON with jq/Read. Each row shows the unit's candidate cluster
(cl), turns, cost, tokens, risk/reread/cbreak, idle gap, snippet, and top
tools; [cont] marks an agent-completion continuation and [blank] a
no-prompt unit (both pre-attached to the preceding cluster). Then edit
the skeleton per steps 3–5 below rather than writing it from scratch:
rename each seeded title (and drop its _auto_title field once named),
merge/split clusters where the worksheet warrants, write one-line rationales,
and fill any blank verdict the skeleton left for your judgment. Skip to
step 6 (render) when done.
(Fallback — manual authoring.) If you are not using --prepare-tasks, load
the export JSON and read request_units directly. Each unit has: unit_id
("<session_id>:<anchor_index>"), prompt_snippet, prompt_text,
turn_count, combined_cost_usd, total_tokens, tool_histogram,
risk_turn_count, reread_path_count, cache_break_count,
wall_clock_seconds, idle_gap_before_seconds, slash_command,
spawned_subagents, workflow_run_ids, multi_intent_possible.
If request_units is absent, tell the user to re-run session-metrics to
regenerate the export (the per-request breakdown is a newer feature) and stop.
-
Group into semantic tasks. Read the units in order and cluster
consecutive requests that pursue the same goal into one task. Signals, in
priority order:
- Topical/lexical continuity of
prompt_snippet/prompt_text (same
feature, file, bug, or subject) — the PRIMARY signal.
- Shared
tool_histogram / file targets across adjacent requests.
- Slash command / skill starts (
slash_command, a /debug,
/feature-dev, etc.) often begin a task.
- Idle gaps (
idle_gap_before_seconds) — a WEAK, confirming-only hint.
A long gap supports a split you already suspect topically; never split on
a gap alone (lunch breaks, overnight continuations).
- A unit flagged
multi_intent_possible may belong to two tasks — note it,
but keep the unit whole (it cannot be divided).
Most sessions yield a handful of tasks. Don't over-segment ("now fix the
test" is usually the SAME task as the feature it follows), and don't
under-segment (one giant "misc" task is useless).
At large scale (many dozens of units, e.g. a project-scope export):
group at session granularity — one titled task per coherent
session-goal — rather than attempting per-unit segmentation. Never emit a
single untitled catch-all task that swallows everything: the renderer's
collapse guard flags a blank-titled task covering the bulk of requests, and
it is a useless grouping anyway. If you cannot segment meaningfully, that is
a signal the input is too coarse for this skill (prefer a single-session
export).
-
Label each task with a verdict, using the deterministic waste signals as
evidence, NOT a guess:
worth_it — the task reached its goal at reasonable cost; low
risk_turn_count / re-read churn relative to its size.
likely_waste — high risk_turn_count, repeated reread_path_count,
many cache_break_count, or a long turn/cost run with little to show
(e.g. a debug loop that churned).
mixed — partly productive, partly churn, or you're unsure.
Bias toward mixed/worth_it when uncertain — a wrong likely_waste
damages trust more than a missed one. Keep the rationale to one honest
sentence tied to the signals ("12 turns, 4 risky re-reads of the same file
before the fix landed").
-
Write grouping.json next to the export (same directory), shape:
{
"schema_version": "1",
"scope_label": "session <id8> · <first_ts>",
"tasks": [
{
"title": "Add token-refresh to auth",
"verdict": "worth_it",
"rationale": "8 requests, one short debug detour, shipped.",
"request_unit_ids": ["<sid>:2", "<sid>:3", "<sid>:5"]
}
]
}
Cover every unit_id exactly once across all tasks. (Any you leave out
are swept into a synthetic "Ungrouped requests" task automatically, and the
renderer warns — aim for full coverage.)
-
Render the companion:
python3 <renderer> --render-tasks <export.json> <grouping.json>
The script validates the grouping (flags duplicate / unknown unit ids,
schema drift), recomputes every total from the export, and writes
<stem>_tasks.html + <stem>_tasks.md next to the export. It prints the
output paths and any validation warnings.
-
Report back to the user: the task list with each verdict, turn count and
cost (read these back from the script's stdout / the rendered *_tasks.md —
do not recompute), the paths to the two companion files, and any grouping
warnings the script surfaced. Offer to open the HTML page.