bitfab-assistant
// Iterate on a traced Bitfab function. Usage: /bitfab-assistant [all|dataset|experiment] [<trace-function-key>] [<dataset-id>]
| name | bitfab-assistant |
| description | Fix and experiment on code, or ask for guidance |
Use the local plugin MCP tools (mcp__Bitfab__list_trace_functions, mcp__Bitfab__search_traces, mcp__Bitfab__read_traces, mcp__Bitfab__update_agent_labels, mcp__Bitfab__list_datasets, mcp__Bitfab__create_dataset, mcp__Bitfab__add_traces_to_dataset, mcp__Bitfab__remove_traces_from_dataset) to find what's failing in a traced function, build a dataset of labeled traces, and iterate on the code/prompts using replay until pass rates improve.
MCP tools: This skill uses list_trace_functions, search_traces, read_traces, update_agent_labels, list_datasets, create_dataset, add_traces_to_dataset, and remove_traces_from_dataset from the local plugin MCP server (bundled with this plugin), exposed under the mcp__Bitfab__* prefix.
Always use AskUserQuestion when asking questions, reporting results, or presenting choices. Never print a question as text and wait. Rules:
This skill has three invocation modes. all walks every phase. The two sub-modes do one focused thing each - building a labeled dataset, or running experiments against an existing one - and require the trace function key as the argument because they skip the function picker (Phase 1) and instrumentation/replay verification (Phase 2).
| Invocation | Action |
|---|---|
| /bitfab-assistant or /bitfab-assistant all | Full flow: pick function → verify instrumentation → pick or create dataset → label → diagnose → iterate → wrap up |
| /bitfab-assistant dataset <key> | Build or extend a labeled dataset for one function, then stop. No experiments run. Picks an existing dataset or creates a new one |
| /bitfab-assistant experiment <key> [<dataset-id>] | Run experiments to fix failing traces against a labeled dataset, then wrap up. If <dataset-id> is omitted, you'll be asked to pick one. If the function has no datasets yet, run /bitfab-assistant dataset <key> first |
In sub-modes, grep the codebase for <key> early so labeling and experiments are grounded in the actual instrumented function (the full flow does this in Phase 2; sub-modes skip Phase 2 entirely).
Run only when mode is all.
If a traceFunctionKey was provided as an argument, skip the listing and the user prompt - but still cross-check the provided key against the local codebase before moving on. Otherwise, work through all four steps below:
Skip this step if a traceFunctionKey argument was provided - use the argument directly and continue to cross-check. Otherwise, call mcp__Bitfab__list_trace_functions to list all available trace functions. Use only the keys and metadata returned (trace counts, last activity) - do NOT invent or infer descriptions of what each function does from its key name. Key names are often ambiguous or misleading, and guessing produces hallucinated descriptions that confuse the user.
Cross-check each key against the local codebase before presenting. For each returned key, grep the repo for string-literal uses of that exact key (across *.ts, *.tsx, *.py, *.rb, *.go, *.baml). Mark each function in the presented list as:
Skip this step if a traceFunctionKey argument was provided - there's no list to present. Otherwise, present the full list in the question text showing ONLY: <key> · <trace count> · <last activity> · <instrumented-here marker + path, or not-found marker>. No invented summaries.
Skip this step if a traceFunctionKey argument was provided - the function is already chosen. Otherwise, use AskUserQuestion with 2 options: the recommended function (prefer one that is ✅ instrumented here AND has recent activity) and a free-text "Type a function key" option. If nothing is instrumented here, say so explicitly in the question - don't hide it.
Run only when mode is all.
Check that this trace function has both instrumentation and a replay script.
Search the codebase for the trace function key to find where the SDK is used:
grep -r "<traceFunctionKey>" --include="*.ts" --include="*.tsx"
grep -r "<traceFunctionKey>" --include="*.py"
grep -r "<traceFunctionKey>" --include="*.rb"
grep -r "<traceFunctionKey>" --include="*.go"

If the key is found, note the file location - this is the code you'll iterate on in later phases.
If the key is NOT found in the codebase, the function is instrumented elsewhere (the traces exist on Bitfab). Use AskUserQuestion to ask:
"I can't find <traceFunctionKey> in this codebase - it may be instrumented in another repo or under a different key."
A) Instrument now - set up tracing in this codebase (recommended)
B) Continue anyway - work with the traces even without local code
C) Pick a different function
D) Stop
If the user chooses "Instrument now", invoke /bitfab-setup instrument, then verify whether a replay script exists for this function. If "Continue anyway", skip the replay-script check and start building the dataset - there's no local code to iterate on yet.
Search for a replay script that covers this trace function:
scripts/replay.*, scripts/*replay*, or any file that imports bitfab.replay / client.replay

If a replay script exists but targets a different function key, do NOT modify the existing script or suggest changing the code's function key. Instead, treat it as "no replay script for this function" and offer to create a new one.
If no replay script exists or it doesn't cover this function, use AskUserQuestion:
"No replay script found for <traceFunctionKey>."
A) Create replay now - create the replay script inline (recommended)
B) Pick a different function
C) Stop
If the user chooses "Create replay now", invoke /bitfab-setup replay, then start building the dataset.
Run only when mode is all or dataset.
A dataset is the named bucket of labeled traces an experiment replays against. This phase picks (or creates) one for the trace function, labels candidate traces, attaches them to the dataset, then hands off to the per-dataset review page where the user approves labels and can ask the agent to add or remove traces.
In dataset mode this phase is the entry point - Phase 1 (function picker) and Phase 2 (instrumentation/replay verification) are skipped, so the trace function key comes from the argument. Before calling any MCP tools, grep the codebase for the key (e.g. grep -r "<traceFunctionKey>" --include="*.ts" --include="*.tsx" --include="*.py" --include="*.rb" --include="*.go" --include="*.baml") and note the file path - every later step ("Label them yourself", and Phase 4 "Read the code" in all mode) needs it.
Pick or create a dataset - Call mcp__Bitfab__list_datasets with the trace function key. Then branch on whether any exist. Hold the chosen datasetId in working context - every step from here on uses it.
If no datasets exist: silently call mcp__Bitfab__create_dataset with traceFunctionKey: <key> and name: <key> (just the trace function key as the name; the user can rename it later in the UI if they want). Hold the returned datasetId and continue. The first-time user shouldn't have to answer a name prompt before they've even seen the dataset.

If datasets exist: use AskUserQuestion, with one option per existing dataset (name · id · current trace count) plus a "Create new" option. Recommend the most recently used dataset that has traces. If the user picks an existing dataset, hold its id and continue. If the user picks "Create new", silently call mcp__Bitfab__create_dataset with name: "<key> #N" where N is one more than the number of existing datasets (e.g. eval-assistant #2) - don't ask for a name. Hold the new id and continue.

Ask what kinds of traces to find - Before searching, find out what the user is actually trying to surface. The trace function may have thousands of traces; "what should I label?" is the question that makes the rest of this phase useful.
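The silent-naming rule can be sketched as a tiny helper. This is illustrative only - the function name is hypothetical and not part of the Bitfab SDK:

```typescript
// Hypothetical helper: derive the silent default name for a new dataset.
// First dataset: just the trace function key. Later ones: "<key> #N",
// where N is one more than the number of existing datasets.
function nextDatasetName(key: string, existingCount: number): string {
  return existingCount === 0 ? key : `${key} #${existingCount + 1}`;
}
```

So with one dataset already present for eval-assistant, the silently created second one would be named "eval-assistant #2".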
Skip this question if the chosen dataset already has labeled traces - those labels + annotations are the prior context, and you should mirror that intent when finding more candidates. Only ask when there is no prior context: the dataset was just created (either silently, because none existed, or because the user picked "Create new"), or the picked dataset is empty.
When asking, use AskUserQuestion with these options (and a free-text fallback so the user can describe something specific):
Hold the user's answer (the chosen option and any free-text detail) in working context - the next step uses it to shape the mcp__Bitfab__search_traces filters and which traces to prioritise reading. If they pick C, default to recent + diverse + non-empty outputs.
Find unlabeled traces - Search without label filters to find unlabeled traces for the trace function. Shape the search by the intent captured in the previous step (or by the prior dataset's existing labels, if any): Option A = filter to traces matching the user's described failure pattern; Option B = filter by the user, session, or time window of the reported incidents; Option C = default sweep (recent, diverse inputs, non-empty outputs). Use mcp__Bitfab__search_traces with the relevant filters, then mcp__Bitfab__read_traces with scope: "summary" to read candidates and identify which are worth labeling - look for diverse inputs, traces that produced output (not empty), and traces that cover different scenarios under the chosen intent. Filter out near-duplicates and uninteresting traces. If every trace is already labeled and attached to this dataset, you can move straight on with no new candidates.
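The "non-empty, no near-duplicates" part of that candidate filter could look roughly like this. It's a sketch only - real candidate objects come from mcp__Bitfab__read_traces and their field names may differ:

```typescript
// Illustrative candidate shape; assumed, not the Bitfab trace schema.
interface Candidate { traceId: string; input: string; output: string; }

// Keep traces that produced output and whose input hasn't been seen yet
// (a crude near-duplicate key: normalised input text).
function filterCandidates(traces: Candidate[]): Candidate[] {
  const seenInputs = new Set<string>();
  return traces.filter(t => {
    if (t.output.trim() === "") return false; // drop empty outputs
    const key = t.input.trim().toLowerCase();
    if (seenInputs.has(key)) return false;    // drop near-duplicates
    seenInputs.add(key);
    return true;
  });
}
```

"Uninteresting" and "covers a different scenario" still need your judgment after reading the summaries; the mechanical filter only prunes the obvious cases.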
Ask how the user wants to label - Before any verdicts go on these candidate traces, use AskUserQuestion to ask how the user wants to label them. There are exactly two modes, and the answer determines whether you call mcp__Bitfab__update_agent_labels at all:
A) Agent labels first, I approve / edit - agent makes a first pass; you approve or edit each verdict in the labeling page (recommended)
B) I'll label them manually - no agent verdicts; you label every trace from scratch in the labeling page
Recommend Option A - an agent first pass turns the labeling page into a quick approve/edit review. But respect the user's choice: if they pick B, do not call mcp__Bitfab__update_agent_labels for any of these candidates. They want to label from scratch in the labeling page, with no agent verdicts pre-filled. If no new candidate traces were found in the previous step, skip this question and continue.
Agent first pass: label them yourself before opening the labeling page - Reachable only when the user picked Option A in the previous step. You label the approved candidate traces so the labeling page becomes an approve/edit review instead of a blank labeling session. Call mcp__Bitfab__read_traces with scope: "full" on the approved trace IDs (batch them - up to 10 per call), read each trace's inputs / output / spans yourself, and decide for each one whether it looks like a PASS or a FAIL. Ground your judgment in the codebase, not just the trace text. Before you start labeling, read the instrumented function in the user's source (located in Phase 2 in all mode, or via the grep step in this phase's intro in dataset mode) and any nearby code that explains intent - comments, docstrings, README sections, related tests, BAML files - so you know what the function is supposed to do and what "good" looks like for it. Apply the same context to every trace: does this output achieve the function's goal as expressed in the code? Does it match the patterns in the already-validated traces? Then call mcp__Bitfab__update_agent_labels once with an array of { traceId, label, annotation } objects - both label (true for pass, false for fail) and annotation (a one-or-two-sentence explanation written for the human reviewer, ideally referencing what the code is trying to do) are required for every trace. Commit to a verdict - if you genuinely cannot decide, you didn't read the trace or the code carefully enough. The labels you save here start unapproved; they only become part of the validated dataset once a human approves them in the labeling page.
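A minimal sketch of assembling that update_agent_labels payload, assuming the { traceId, label, annotation } shape described above (the helper name and the verdict input shape are illustrative, not SDK types):

```typescript
// Assumed payload shape, following the text above.
interface AgentLabel { traceId: string; label: boolean; annotation: string; }

// Build the array for a single update_agent_labels call, enforcing that
// every trace carries both a verdict and a non-empty annotation.
function buildLabelPayload(
  verdicts: Array<{ traceId: string; pass: boolean; why: string }>
): AgentLabel[] {
  return verdicts.map(v => {
    if (v.why.trim() === "") {
      // Both label and annotation are required for every trace.
      throw new Error(`annotation required for trace ${v.traceId}`);
    }
    return { traceId: v.traceId, label: v.pass, annotation: v.why };
  });
}
```

The throw encodes the "commit to a verdict, explain it for the reviewer" rule: a payload with a blank annotation should never reach the tool call.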
🚨 HARD RULE - DO NOT SKIP (agent-first mode only): When the user picked Option A, you MUST call mcp__Bitfab__update_agent_labels with verdicts for every approved trace BEFORE running startDataset.js to open the labeling page. Sending the user into an agent-first review with no pre-labeled verdicts is a process violation. (In manual mode this step is unreachable, and the rule does not apply.)
Made a mistake? If you realize a verdict was wrong (e.g., you mislabeled a trace or want to re-evaluate), call mcp__Bitfab__update_agent_labels again with { traceId, archive: true } for those traces. The previous label is hidden (kept for audit), and you can re-label the trace from scratch with another update_agent_labels call.
Attach candidate traces to the dataset - Call mcp__Bitfab__add_traces_to_dataset with the datasetId chosen earlier and the array of approved candidate trace IDs (in agent-first mode, the ones you just labeled; in manual mode, the candidates the user approved in find-unlabeled). The call is idempotent - re-adding traces already in the dataset is a no-op, so it's safe to include the full set. If no new candidate traces were approved (the dataset was already populated), skip this step.
🚨 HARD RULE - DO NOT SKIP: All approved candidate trace IDs MUST be attached to the dataset before opening the page. The page reviews the dataset's contents, not the trace function's label table. An empty dataset means an empty review.
Open the dataset page (background process) - Start the dataset script as a long-running background process. It will live until the user clicks Done (or cancels) and emit one JSON line on stdout per event:
node "${CURSOR_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT}}/dist/commands/startDataset.js" <functionKey> <datasetId>
Run it as a long-running terminal process (your runtime's equivalent of "start and don't wait"). Hold the process handle so you can tail its stdout in the next step.
The script first prints "Opening your browser..." plus the URL on stderr (surface the URL to the user verbatim if the auto-launch didn't open one), then begins emitting JSON event lines on stdout as the user interacts with the page:
{"event":"modify","datasetId":"...","ts":"..."} - non-terminal: the user clicked Edit with agent
{"event":"saved","status":"saved",...} - terminal: the user clicked Done
{"event":"cancelled","status":"cancelled"} - terminal: the user cancelled

The script exits 0 only after a terminal event. Modify events keep the process alive.
🚨 Stream is mixed (stdout + stderr). Most agent runtimes capture stdout and stderr together when reading a long-running process, so the lines you read in the next step are a mix of: (a) JSON event lines (what you act on), (b) browser-handoff status text like Opening your browser... and (If the browser didn't open, ...), and (c) periodic heartbeats like [bitfab] waiting for browser handoff... 30s elapsed. You MUST filter to lines that parse as JSON before routing. Skip anything that doesn't parse - never error out on non-JSON lines.
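The filter-then-route step might be sketched like this, assuming only the event names documented above (routeLine is a hypothetical helper, not plugin code):

```typescript
type Route = "modify" | "done" | "stop" | "ignore";

// Classify one captured line from the mixed stdout/stderr stream.
function routeLine(line: string): Route {
  let event: { event?: string };
  try {
    event = JSON.parse(line.trim());     // only JSON lines are routable
  } catch {
    return "ignore";                     // status text / heartbeats: skip, never error
  }
  switch (event.event) {
    case "modify":    return "modify";   // non-terminal: enter the modify loop
    case "saved":     return "done";     // terminal: confirm + summarise
    case "cancelled": return "stop";     // terminal: stop the flow
    default:          return "ignore";   // unknown JSON: skip defensively
  }
}
```

Anything that fails JSON.parse, or parses but carries no recognised event field, falls through to "ignore" rather than raising.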
Read the next event from the background script. Tail the long-running process's stdout/stderr for new lines.
The captured output is a mix of JSON event lines and free-form status text (see the warning at the end of the previous step). For each new line:
Parse each JSON line's event field and route on event:
event: modify - user clicked Edit with agent → go to the modify loop, then come back here to read the next event
event: saved - user clicked Done → dataset is finalised, move on to confirm + summarise
event: cancelled, or the process exits non-zero → stop the flow

Modify loop: add or remove traces in chat - The page is still open and the background script is still alive; the user wants you to add or remove traces. Ask in plain chat:
What would you like to add or remove? You can describe by criteria (e.g. "drop empty-output traces", "add 5 more from last week with errors") or paste explicit trace IDs.
Then wait for the user's next message. It will contain their answer. Do NOT use AskUserQuestion here (the answer is free-form and options would just add an extra step before the user can type).
Then act on it:
To add traces: find candidates with mcp__Bitfab__search_traces / mcp__Bitfab__read_traces, then respect the labeling mode the user chose earlier in this phase (the ask-labeling-mode step). In agent-first mode (Option A), label them yourself with mcp__Bitfab__update_agent_labels (same rigor as label-self - every trace gets a verdict + annotation, grounded in the code) before attaching. In manual mode (Option B), do NOT call mcp__Bitfab__update_agent_labels - the user wants no agent verdicts pre-filled, and the labeling page lets them label the new candidates from scratch alongside the existing ones. Either way, call mcp__Bitfab__add_traces_to_dataset to attach.
To remove traces: call mcp__Bitfab__remove_traces_from_dataset with the trace IDs to remove. The traces themselves aren't deleted - only their membership in the dataset.
The page reflects each add/remove live (SSE), so the user sees changes flow in as you make them. When you're done, summarise what changed in chat and return to the await-event step to read the next event - do NOT re-run the dataset script; it's still alive in the background. The user can click Edit with agent again for another modify round, or Done to finalise.
Build the dataset - You already know the trace IDs in this dataset (you attached them in earlier steps and tracked any add/remove from modify rounds). Call mcp__Bitfab__read_traces with all of them and scope: "full" to load the labels + annotations into context. This is the working set for confirm + every Phase 5 experiment.
Confirm the dataset - Present the dataset via AskUserQuestion: each entry showing (trace ID, label, annotation summary). The dataset must contain at least one validated failing label - i.e. at least one trace where a human either authored or approved a false label. To check, call mcp__Bitfab__search_traces restricted to the dataset trace IDs with validated: true and labelResult: false. Two outcomes:
Unapproved agent labels do not satisfy this gate by design - validated: true excludes them.
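The gate can be expressed as a small predicate. The trace shape is assumed for illustration - the real check goes through mcp__Bitfab__search_traces as described above:

```typescript
// Assumed minimal shape for a labeled dataset trace.
interface LabeledTrace { traceId: string; label: boolean; validated: boolean; }

// At least one human-validated FAIL label; unapproved agent labels
// (validated: false) deliberately don't count toward the gate.
function passesFailingLabelGate(traces: LabeledTrace[]): boolean {
  return traces.some(t => t.validated && t.label === false);
}
```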
Hold in-context - This approved dataset is the benchmark for all experiments in Phase 5. Keep both the datasetId and the trace IDs in your working context throughout. In dataset mode the skill stops here - surface the dataset summary (including the id) and exit so the user can pick up later with /bitfab-assistant experiment <key> <datasetId>.
Run only when mode is all.
Understand failures. Using the failed traces you read in Phase 3 (or read them now if you haven't):
Call mcp__Bitfab__read_traces on 3-5 failed traces with scope: "full". Synthesize the failure patterns - what's going wrong, what the common threads are.
Read the code (in all mode you found it in Phase 2; this step is unreachable in dataset / experiment modes), including any .baml files.
Categorize fixes based on failure annotations. Based on the failure patterns, the code, and the labeled dataset from Phase 3, categorize proposed changes into three buckets:
Bucket 1 - Code fixes: Deterministic bugs (off-by-one, type mismatch, missing null check, wrong variable). These won't recur once fixed. Bundle all code fixes into a single experiment unless they are large feature changes. These are applied first as a foundation that all subsequent experiments build on.
Bucket 2 - Judgment-based fixes: Prompt changes, context truncation, search tuning, output formatting, etc. These require the user's judgment to evaluate correctness. Each gets its own experiment.
Bucket 3 - Infrastructure proposals: Larger changes that require new infrastructure, architectural changes, or significant feature work. These are separated out because experiments become harder to compare when some include large infra changes and others don't - apples-to-apples comparison requires a consistent baseline. Do not run experiments for these. Instead, if the user has integrations (Linear, Notion, Jira), propose creating a task with a clear writeup for future work.
Present the categorized plan via AskUserQuestion:
"Based on the N traces in the dataset, here's what I see:
Code fixes (experiment #1 - bundled):
- [Fix]: [What and why, which traces it addresses]
Judgment-based experiments (#2, #3, ...):
- [Experiment]: [What change, which traces it targets, hypothesis]
Future infrastructure (not experiments):
- [Proposal]: [What it would require, which traces it would help]
I'll replay each experiment against the labeled dataset and evaluate using the annotations as acceptance criteria."
Get the user's confirmation before proceeding.
Run an iterative improvement loop. Each iteration:
Run only when mode is experiment.
The trace function key comes from the argument and no prior phase has run. Locate the code, then pick the dataset to iterate against:
Locate the code: grep the codebase for the key (e.g. grep -r "<traceFunctionKey>" --include="*.ts" --include="*.tsx" --include="*.py" --include="*.rb" --include="*.go" --include="*.baml") and note the file path. This is the code you'll iterate on.
Pick the dataset: if a <dataset-id> argument was provided, use it directly. Otherwise call mcp__Bitfab__list_datasets with the trace function key, present the result to the user via AskUserQuestion, and use their choice. Hold the chosen datasetId in working context.
Rehydrate labels: call mcp__Bitfab__read_traces with the dataset's trace IDs and scope: "full" so labels + annotations are in context.
No usable dataset: if the function has no datasets (list_datasets returned empty), or the picked dataset has no validated failing labels, tell the user the function has no usable dataset yet and recommend running /bitfab-assistant dataset <key> first; stop the flow.
Run only when mode is all or experiment.
Make the change.
Use AskUserQuestion to explain what you're changing and why, and confirm before editing.
Run only when mode is all or experiment.
Replay against the dataset. Collect the trace IDs from the labeled dataset (built in Phase 3 in all mode, or rehydrated at the start of this phase in experiment mode). Run the replay script with those specific traces.
# The exact command depends on the replay script โ adapt to what exists
# Example for TypeScript:
cd <project-dir> && npx tsx scripts/replay.ts <pipeline-name> --trace-ids <id1>,<id2>,<id3>,...
Before running: verify the replay script prints the full original and new output values to stdout for every item (not just lengths, counts, hashes, or truncated previews). If it doesn't, fix the script first - the Replay Output Contract and example script live in the SDK reference at https://docs.bitfab.ai/<language>-sdk#replay. Subagents can't evaluate an improvement from 5 → 7 (+2).
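A sketch of output formatting that satisfies the contract as described here, with assumed field names (the authoritative contract and example script live in the SDK reference, not in this snippet):

```typescript
// Assumed per-item shape coming out of a replay run.
interface ReplayResult { traceId: string; originalOutput: string; newOutput: string; }

// Print the FULL original and new values for one item - never lengths,
// counts, hashes, or truncated previews, so a subagent can compare them.
function formatReplayResult(r: ReplayResult): string {
  return [
    `=== trace ${r.traceId} ===`,
    "--- original output ---",
    r.originalOutput,
    "--- new output ---",
    r.newOutput,
  ].join("\n");
}
```

A replay script would call this per item and write the result to stdout; if your existing script only prints summary stats, this is the kind of block to add before running experiments.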
Capture the testRunId from the replay output - the SDK prints it (alongside testRunUrl) when the run completes. Track every testRunId produced across all iterations of this phase: you'll feed them to open-experiments so the user can review every experiment side-by-side in one viewer.
Run only when mode is all or experiment.
Evaluate against labels & annotations. Read the replay output. For each trace in the dataset, use the label (pass/fail) and annotation (from Phase 3, or rehydrated at the start of this phase in experiment mode) to judge whether the new output is an improvement:
Run only when mode is all or experiment.
Open the experiment viewer. Run the open-experiments command with every testRunId you've collected across iterations of this phase (comma-separated). The viewer renders each experiment as a card so the user can compare pass/fail counts and drill into individual traces side-by-side.
node "${CURSOR_PLUGIN_ROOT:-${CLAUDE_PLUGIN_ROOT}}/dist/commands/openExperiments.js" <testRunId1>,<testRunId2>,<testRunId3>
The command opens a browser window and exits immediately - do not wait for it, and do not poll. Continue straight to share-results. The viewer is a parallel review surface for the human; your textual summary in the next step is still required.
If no testRunIds were captured (e.g. the replay script didn't print them), skip this step and continue - but flag it to the user in share-results so the script can be fixed before the next iteration.
Run only when mode is all or experiment.
Share results to the user.
"After N experiments these are the results: X/Y traces now pass.
- ✅ Trace abc123: Now passes - [how the annotation's issue was resolved]
- ❌ Trace def456: Still failing - annotation said [X], output still [Y]
- ⚠️ Trace ghi789: Was passing, now failing (regression)"
Show this across the full data set, and highlight the best outcome concisely. Explain why it worked best with references to code, docs, and/or research if needed. For the best outcome:
Use AskUserQuestion to confirm whether they want to keep iterating or stop. Ensure your question includes your recommended next step.
A) Keep iterating - run another experiment from the plan (recommended)
B) Stop and wrap up - move to the final summary
Run only when mode is all or experiment.
Summary. Use AskUserQuestion to present the final results similar to this. You may expand where appropriate based on context from the user:
"Improvement summary for <traceFunctionKey>:
- Failed traces fixed: X/Y (from N% → M% pass rate on labeled failures)
- Full replay pass rate: A/B
- Changes made:
- [File]: [Description of change]
- [File]: [Description of change]
The changes are in your working tree (not committed). Review the diffs and commit when ready."