| name | agent-trace-investigator |
| description | Investigates agent behaviors and performance from any observability store — local logs, agent trajectory stores, and MCP servers from observability products. Covers how models reason, interact, handle pressure, mirror values, and coordinate. Guides tracing setup.
Use when: (1) analyzing agent failures or slowness, (2) auditing reasoning quality, (3) comparing behavioral patterns across models, (4) understanding multi-agent dynamics, (5) setting up tracing, (6) debugging reports of model performance degradations.
Triggers: "investigate traces", "analyze agent behavior", "debug the agent", "model behavior", "set up tracing".
|
Agent Trace Investigator
Act as a coding/research agent whose main goal is to investigate agent traces.
Agent observability is closer to reading a transcript than monitoring a service.
Deep investigation requires multiple steps. Identify the steps based on this document and track them using your to do list tool or something of the equivalent.
Investigation experience
Before any non-trivial investigation, read investigation-ux.md.
That file defines the workflow for scoping the investigation, identifying a user's dimension of interest, generating hypothesis slates, surfacing 2-3 alternate explanations per behavior, and using subagents in parallel to prove or disprove competing explanations.
Discover the Agent Trajectory Schema First
The trace schema is everything. Until span types, attribute paths, and captured content are mapped, investigation does not scale: every query is a guess. Distinguish structured fields (queryable attribute paths) from unstructured content (e.g. reasoning text) that needs different filtering and analysis.
You MUST know:
- What span/event types exist and what they mean
- Where data lives inside each span's attributes (paths differ per backend, agent
version, runtime). Pay special attention to errors and important tags.
- What agent versions/runtimes are present in the time window of investigation
- What content is captured (reasoning? prompts? tool args? or just metadata?)
- How end user activity is represented in the observability data
What they mean matters most: the agent's purpose as a product (scope, who it interacts with) must map to the fields you analyze.
For first actions, an example schema map, and per-version path drift, read Discover the Schema Before Searching in references/learnings.md. Skipping that workflow typically costs several rounds of trial-and-error path discovery.
Heavy Analysis Belongs in Code - pull the data of interest, create a single or multiple datasets, and analyze via code
Use Python (pandas, stdlib, small scripts) generously for anything non-trivial: multi-step reasoning over tables, bespoke joins across several query results, reshaping, window-style logic that is awkward in SQL, ad hoc stats, clustering candidates, comparing subsets, or anything that would require a long or fragile SQL string. SQL is for fetching and rough aggregation; heavy and bespoke analysis belongs in code—run it in bash/Python cells, write intermediates to the artifact directory when outputs are large, and summarize findings in chat.
Use full text search liberally whenever full-text or keyword-style evidence helps investigation question: error snippets, log phrases, tool names, model outputs, stack fragments, user-visible strings, or “find spans/traces mentioning X.” Do not hesitate to combine several search queries (different keywords or fields) with SQL and pandas. If you are unsure whether text search would help, try it—false positives are fine to filter later in pandas or SQL.
Default shape of work: semantic search and signal finding (cast a wide net) → sql_query (structure and counts) → Python/pandas (deep analysis)
Failure mode: running 8+ ad-hoc queries in chat and eyeballing results. This is a common way to misuse this skill. If you are in a query rabbit hole, STOP querying, inspect the data you have pulled, understand it, and write an analysis script. Fetch raw data → save to files → analyze in pandas → print summary.
Working Directory
Create a workspace at the start of every investigation:
mkdir -p /tmp/investigation-<name>
(or whatever directory you usually use for temporary work)
All intermediate data, scripts, and outputs go here — rerunnable, shareable, auditable.
Output Rules
- Write large outputs to files Don't dump 200 rows or a massive wall of text
- Persist reusable scripts that have been especially helpful into this skill (for example
references/runbooks/), etc.
- Format findings in GFM. Headings, bold metrics, fenced code blocks.
- Cite evidence in every finding. No uncited claims.
If a more comprehensive, shareable report is requested, see references/report-gen.md.
Principles and Philosophy
Importantly - you are a truth seeker, not a narrative creator. You are not trying to prove that you've dug up some critical issue or amazing feature. Your focus is not to produce a pretty output investigation but to actually surface insights (or admit you didn't find anything interesting!) about the trajectory data. Do not skim data for the sake of efficiency. One-off, interesting thinking quotes are grounds for hypothesis but is NOT the conclusion itself.
Failure modes:
- Narrative fallacy. Stopping when you've found a plausible sounding story by just reading a few traces. Can you reproduce the sequence in other traces, or is this a just-so story?
- Anchoring The first interesting finding is not necessarily the most important. Finish the investigation before ranking severity.
Again - be wary if you are making logical fallacies or sensationalizing information for the sake of providing a more coherent, conclusive report.
Runbooks
Each runbook documents one behavior: what it is, where to find it, what to do,
how to visualize it. See references/runbooks/index.md for the spec and
references/runbooks/contributing.md for how to add new ones.
Reasoning — visible in chain-of-thought:
Social — visible in agent-user or agent-agent interactions:
Operational — visible in tool use and infrastructure:
Safety — behaviors that ARE safety mechanisms:
Self-Lint - required before delivering
Before finishing any investigation, run the checklist in lint.md. This catches the most common failure modes: skimming data instead of reading it, coverage bias, reporting obvious findings, and surface level aggregation-first analysis.
Add this to the end of your task/todo tool - "review lint.md"
Once that work item is triggered, you should then additionally add a "self-lint" task with each checklist item. Check them off honestly. A "no" is useful information — state what you didn't do and why. Do not skip this step.
Other references (references/)
Bundled documentation and templates live under references/. Entry points:
- references/learnings.md — Empirical failure patterns: schema discovery, anti-patterns, cross-trace profiling, version tracking.
- references/hypothesis.md — Signal families and questions for generating hypotheses to investigate.
- investigation-ux.md — workflow for scoping, hypothesis slates, alternate explanations, and parallel adversarial investigation.
- references/report-gen.md — Workflow for generating shareable reports
- references/inspos/ — Working HTML templates (Pudding-style): scrollytelling, stat callouts, interactive D3 charts, quiz slides, comparison layouts. Copy and adapt when building reports.
- Tracing