| name | fix-datadog-issues |
| description | Find, triage, and fix production errors captured by Datadog Error Tracking, then open a PR. Use when asked to look at "Datadog issues/incidents/errors", "find and fix bugs from Datadog", investigate the most-frequent or newest production errors, or work a specific Datadog Error Tracking issue. |
Fix Datadog Error Tracking issues
This skill takes a brand-new agent from "look at Datadog" all the way to a reviewable PR.
You will run it repeatedly across separate, context-free sessions, so it is written to be
restartable: a fixed error simply stops receiving occurrences, and (when the tooling allows) you
leave a comment on the issue so the next agent does not redo your analysis.
The Datadog MCP server is plugin:datadog:mcp. Its tools are deferred — their schemas load on
demand. Load them with ToolSearch (e.g. select:mcp__plugin_datadog_mcp__aggregate_spans) before
calling, and run the server's skill-discovery first (see §1).
The pipeline (overview)
0. Setup & disambiguate → 1. Navigate Datadog → 2. Pick an issue → 3. Root-cause in code
→ 4. ⛔ CONFIRM PLAN WITH USER → 5. Reproduce with tests + fix → 6. Comment on the issue → 7. PR to development
Hard gate at step 4: you investigate freely, but you do not write a fix, create a branch, or open
a PR until the user has seen your plan and approved it (see §4). The only exception is explicit
pre-authorization (e.g. "just fix it and PR").
Do not try to fix every issue. Scope a PR by size/risk (see §2.4): one big/critical bug = its own
PR; a few small, independent, related bugs may share one PR (cap ~4).
0. Setup and critical disambiguations
Read these first — each one was a real wall that cost time.
- "Incidents" almost always means Error Tracking issues, not Incident Management.
Datadog Incident Management (search_datadog_incidents) is typically empty (0) here, and incidents
have no "occurrences". When the user says incidents / issues / errors / "most occurrences" / "newest",
they mean Error Tracking (errors grouped into issues, with occurrence counts and trends).
- Only v2 services are fixable from this repo. This repo (latitude-v2, trunk development) owns:
api, ingest, web, workers, workflows.
The latitude-llm-* services (latitude-llm-web, latitude-llm-workers, latitude-llm-gateway, …)
are the legacy v1 codebase (branch latitude-v1) — out of scope unless the user says otherwise.
Derive the live v2 list from ls apps/ so it never goes stale.
- Production only. Filter env:production. Error Tracking buckets by env; staging issues exist
separately (env:staging) but are out of scope unless the user hands you a specific staging issue.
- The Latitude MCP server (mcp__latitude__*) is NOT this. mcp__latitude__listIssues /
getIssue / resolveIssues operate on issues detected in Latitude customers' LLM traces — a
different product surface. Never use them to triage our own app's runtime errors. Use
plugin:datadog:mcp.
- Error Tracking toolset must be enabled. This skill needs the error-tracking toolset (stable,
not on by default). If ToolSearch for "datadog error tracking issue" finds no
mcp__plugin_datadog_mcp__*error_tracking* tools, the toolset is off: tell the user to enable it via
the /datadog:ddtoolsets skill (add error-tracking), then /reload-plugins and re-auth. Until then, the
span fallback in §1.3 still covers everything except reading or writing issue objects directly.
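The scope rules above can be sketched as a small predicate. This is an illustrative Python sketch, not code from the repo: `v2_services` and `is_in_scope` are hypothetical names, and the `latitude-llm-` prefix check only mirrors the legacy naming described above.

```python
from pathlib import Path


def v2_services(apps_dir: str = "apps") -> set[str]:
    """Derive the live v2 service list from the apps/ directory (the `ls apps/` rule)."""
    return {p.name for p in Path(apps_dir).iterdir() if p.is_dir()}


def is_in_scope(service: str, env: str, services: set[str]) -> bool:
    """True only for production errors on a v2 service owned by this repo."""
    if env != "production":
        return False  # staging is out of scope unless the user hands you a specific issue
    if service.startswith("latitude-llm-"):
        return False  # legacy v1 codebase
    return service in services
```

Passing the service set in explicitly keeps the predicate pure; call `v2_services()` once at the repo root and reuse the result for every candidate.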
1. Navigating Datadog
1.1 Always start with skill discovery
The server ships domain guides that are not visible in tool names. In parallel:
- load_datadog_skill(skill_name="datadog/traces"): span query syntax and attributes.
- list_datadog_skills(query="error tracking ..."): find the right guide.
Load datadog/logs if you pivot to logs. Skip re-loading a guide you already loaded this session.
1.2 The lay of the land
- Org/site: datadoghq.eu (UI: app.datadoghq.eu, MCP domain mcp.datadoghq.eu).
- Services: search_datadog_services lists everything (apps, DB adapters like *-postgres/*-aws-s3,
and external hosts like api.openai.com). Only the bare v2 app names matter here.
- service.version == git commit SHA. Every span tags the deployed SHA. This is gold: it maps an
error to a commit and lets you correlate with deploys via get_change_stories.
1.3 Where Error Tracking issues live, and how to read them
An issue is a fingerprinted group of error occurrences. Two ways in — use both:
(a) Error Tracking tools (preferred; needs the toolset). Discover exact names at runtime:
ToolSearch(query="datadog error tracking issue"). Expect at least get_datadog_error_tracking_issue
(by issue id) and a list/search tool; there may be an update/state and/or comment tool — confirm
their real names/schemas before relying on them (see §6).
(b) Span aggregation (always works, even with the toolset off). Issues are stamped onto error spans:
- custom.issue.id: the issue UUID (fetch with custom_attributes:["issue.*"]).
- issue.first_seen (epoch ms), issue.first_seen_version (git SHA of first occurrence), issue.age.
- Plus @error.type, @error.message, error.stack, resource_name, service, env.
Find the heavy hitters (the workhorse query):
aggregate_spans(
query = "status:error env:production",
from = "now-7d", to = "now",
computes= [{field:"*", aggregation:"COUNT", output:"count", sort:"desc"}],
group_by= {fields:["service","@error.type"], limit:40}
)
Then narrow into messages/resources for the candidates you care about:
aggregate_spans(query="service:workers status:error env:production @error.type:(TypeError OR RepositoryError)",
computes=[{field:"*", aggregation:"COUNT", output:"count", sort:"desc"}],
group_by={fields:["@error.message","resource_name"], limit:25})
Read raw detail (stack, http, issue id) for a specific group:
search_datadog_spans(
query = "service:web status:error env:production @error.type:Error resource_name:GET",
custom_attributes = ["error.*","http.*","issue.*"], max_tokens = 7000)
Link to an issue (needed in the §4 report and the §7 PR). Prefer the canonical URL the error-tracking
tool returns. Otherwise build it from the org base + issue id and open it to confirm it resolves:
https://app.datadoghq.eu/error-tracking/issues/<issue.id>. (Span search responses also return a
base_url and a traces_explorer_url you can fall back to.)
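A minimal sketch of the fallback URL construction, assuming the issue id is the UUID read from custom.issue.id. The helper name `issue_url` is hypothetical; the URL pattern is the one given above.

```python
import uuid

ERROR_TRACKING_BASE = "https://app.datadoghq.eu/error-tracking/issues"


def issue_url(issue_id: str) -> str:
    """Build the Error Tracking link for an issue UUID taken from custom.issue.id."""
    # uuid.UUID raises ValueError on malformed ids, so a bad id fails before it reaches a report.
    canonical = str(uuid.UUID(issue_id))
    return f"{ERROR_TRACKING_BASE}/{canonical}"
```

A well-formed URL is not proof that the issue exists, so still open the link once to confirm it resolves.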
1.4 Query pitfalls (these bit us)
- @error.message is not reliably wildcard/full-text searchable. @error.message:"foo*" may return 0.
Instead group by @error.message in aggregate_spans, or filter by @error.type + resource_name
and read messages from raw spans.
- Reserved attrs take no @: service, resource_name, status, type, trace_id. Span attrs take
@: @error.type, @http.status_code, @duration (nanoseconds!).
- Group multi-values: @error.type:(A OR B), not @error.type:A OR @error.type:B.
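The prefix and grouping pitfalls can be avoided by building filter terms through one helper. This is a sketch under assumptions: `span_filter` and `ms_to_ns` are hypothetical names, and `env` is treated as a bare attribute only because every query in this document writes env:production without `@`.

```python
# Attributes written without "@". The first five are the reserved attributes listed above;
# "env" is an assumption based on the bare env:production used throughout this document.
BARE_ATTRS = {"service", "resource_name", "status", "type", "trace_id", "env"}


def span_filter(attr: str, values: list[str]) -> str:
    """Render one filter term, grouping multiple values as attr:(A OR B)."""
    if not values:
        raise ValueError("at least one value is required")
    name = attr if attr in BARE_ATTRS else "@" + attr.lstrip("@")
    if len(values) == 1:
        return f"{name}:{values[0]}"
    return f"{name}:({' OR '.join(values)})"


def ms_to_ns(milliseconds: float) -> int:
    """@duration is in nanoseconds, so convert before filtering on it."""
    return int(milliseconds * 1_000_000)
```

For example, `" ".join([span_filter("service", ["workers"]), span_filter("error.type", ["TypeError", "RepositoryError"])])` yields a query fragment in the same shape as the drill-down query in §1.3.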
2. Pick an issue
2.1 If the user named an issue
Target it directly (by issue id/slug/url, or a quoted error message → resolve via the queries above).
Skip the ranking; go to §3.
2.2 Otherwise: the funnel (classification before ranking)
Occurrence count alone is a trap. Most high-count "errors" are not fixable code bugs. Filter first.
Stage A — scope gate (cheap, 1–2 aggregate calls): v2 service, env:production, status:error,
now-7d. Inspect the top ~15 by count. Ignore issues with < 5 occurrences in the window unless
they are new+rising or user-specified (sub-5 are usually non-reproducible one-offs).
Stage B — classify each candidate (read its message + stack). Bucket it:
| Class | Signatures (examples) | Default action |
|---|---|---|
| A. Genuine code bug | TypeError, null/undefined deref, validation/logic errors, data-handling (bad UTF-8 / lone surrogates, encoding, parsing) | Candidate to fix |
| B. Infra / transient | Timeout, socket hang up, ECONNRESET, deadlock detected, timeout exceeded when trying to connect, 429, 503, pool exhaustion | Usually not a code fix. Note & skip (resilience/retry work only if asked) |
| C. Deploy / version skew | "Server function info not found for <hash>", "Failed to fetch dynamically imported module" (old first_seen_version ≠ current service.version; stale client tabs) | Framework-level graceful handling, not a logic bug |
| D. Expected / not-an-error | BullMQ DelayedError, *LockUnavailableError that is retried with backoff, a NotFoundError that callers handle | Noise — ignore |
| E. Upstream / third-party | external provider 5xx, provider rate limits with correct handling | Not ours |
Only Class A proceeds. For B–E: leave a one-line verdict comment on the issue if you can (§6), then skip.
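As a rough pre-sort before reading stacks, the table's example signatures can be matched mechanically. This is only a sketch: the marker lists are copied from the table and are not exhaustive, substring checks such as "429" can misfire, Class E cannot be detected this way, and a `None` result means the message and stack must be read by hand. The function name `classify` is hypothetical.

```python
from typing import Optional

# Signature lists copied from the table above; they are examples, not an exhaustive taxonomy.
EXPECTED_TYPES = ("DelayedError",)
INFRA_MARKERS = (
    "timeout", "socket hang up", "econnreset", "deadlock detected",
    "429", "503",
)
SKEW_MARKERS = (
    "server function info not found",
    "failed to fetch dynamically imported module",
)


def classify(error_type: str, message: str) -> Optional[str]:
    """Return a first-pass class letter, or None when the stack must be read by hand."""
    text = message.lower()
    if error_type in EXPECTED_TYPES or error_type.endswith("LockUnavailableError"):
        return "D"
    if any(marker in text for marker in SKEW_MARKERS):
        return "C"
    if any(marker in text for marker in INFRA_MARKERS):
        return "B"
    if error_type == "TypeError":
        return "A"
    return None
```

Even an "A" result is a candidate, not a verdict: §3 still has to confirm it is not a Class B to E issue in disguise.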
Stage C — rank the Class-A bugs on three axes, then pick the top:
- Impact: customer-facing (web/api) > background (workers/workflows); data corruption/loss >
transient failure; silent-wrong > loud-fail; does it block a user flow?
- Volume × trend: occurrences and direction. Use the 14-day trend; rising/new beats flat/decaying.
- Fix confidence × blast radius: clear, bounded root cause + small change = high ROI; sprawling or
unknown = defer.
Pick = highest (impact × trend) among the confidently fixable.
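The pick rule can be written down directly. In this sketch the `Candidate` fields and their scales are assumptions made for illustration; the document does not define numeric scores, so treat the numbers as a way to make your own judgment explicit rather than as a formula to trust.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    issue_id: str
    impact: int        # hypothetical 1-5 scale: 5 = customer-facing data loss
    trend: float       # hypothetical multiplier: >1 rising, 1 flat, <1 decaying
    confident: bool    # clear, bounded root cause with a small change


def pick(candidates: list[Candidate]) -> Optional[Candidate]:
    """Highest impact x trend among the confidently fixable; None if nothing qualifies."""
    fixable = [c for c in candidates if c.confident]
    if not fixable:
        return None
    return max(fixable, key=lambda c: c.impact * c.trend)
```

Confidence acts as a filter rather than a weight, so a high-impact issue with an unknown root cause is deferred instead of outranking a bounded fix.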
2.3 The recency premium (catch regressions early)
Give extra weight to new + rising issues even at lower absolute count:
- "New" = issue.first_seen within ~72h, OR first_seen_version is one of the last 1–2 deploys.
- A recent first-seen usually means a fresh regression: first_seen_version + get_change_stories
often hand you the culprit commit, making the fix faster and higher-confidence, and catching it
early prevents pile-up. An ancient, flat, high-count issue is a yellow flag, not an automatic top pick.
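The "new" test and the Stage A volume floor from §2.2 combine into two small checks. A sketch under assumptions: the function names are hypothetical, `recent_deploy_shas` is assumed to be ordered newest first, and "rising" is left to the caller because the document does not define a threshold for it.

```python
NEW_WINDOW_MS = 72 * 60 * 60 * 1000  # the ~72h window from the rule above
MIN_OCCURRENCES = 5                  # Stage A floor from §2.2


def is_new(first_seen_ms: int, first_seen_version: str,
           recent_deploy_shas: list[str], now_ms: int) -> bool:
    """New = first seen within ~72h, OR first seen in one of the last 1-2 deploys."""
    within_window = now_ms - first_seen_ms <= NEW_WINDOW_MS
    return within_window or first_seen_version in recent_deploy_shas[:2]


def passes_volume_gate(count: int, new_and_rising: bool, user_specified: bool) -> bool:
    """Low-count issues survive Stage A only when new+rising or explicitly requested."""
    return count >= MIN_OCCURRENCES or new_and_rising or user_specified
```

issue.first_seen is in epoch milliseconds, so `now_ms` must use the same unit.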
2.4 How many to fix / how to batch
- Solo PR if: critical or large; touches core/shared code; or the root cause is non-trivial.
- Group 2–4 into one PR only if all are: small, independent, low-risk, and thematically
related (same subsystem → one reviewer context). Never mix a risky fix with trivial ones. Cap ~4.
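One way to apply the batching rules mechanically, sketched under assumptions: `Fix` and `plan_prs` are hypothetical, the `solo` flag stands for any of the solo-PR conditions, and every non-solo fix is assumed to have already been judged small, independent, and low-risk.

```python
from dataclasses import dataclass

MAX_PER_PR = 4  # the "cap ~4" rule


@dataclass
class Fix:
    issue_id: str
    subsystem: str
    solo: bool  # critical/large, touches core or shared code, or non-trivial root cause


def plan_prs(fixes: list[Fix]) -> list[list[str]]:
    """Return one list of issue ids per PR."""
    prs: list[list[str]] = []
    open_batch: dict[str, list[str]] = {}  # subsystem -> batch still accepting fixes
    for fix in fixes:
        if fix.solo:
            prs.append([fix.issue_id])  # never mixed with trivial fixes
            continue
        batch = open_batch.get(fix.subsystem)
        if batch is None or len(batch) >= MAX_PER_PR:
            batch = []
            open_batch[fix.subsystem] = batch
            prs.append(batch)
        batch.append(fix.issue_id)
    return prs
```

Grouping by subsystem is a stand-in for "thematically related"; two fixes in the same subsystem can still deserve separate PRs if a reviewer would need different context for each.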
3. Root-cause in code
Use the analyze-problem skill's method. Then, specific to this workflow:
- Generalize past the observed symptom. One issue is often one instance of a broader bug. Example:
a Voyage embeddings 400 invalid UTF-8 and a ClickHouse missing second part of surrogate pair were
one root cause: unsanitized lone UTF-16 surrogates hitting two sinks. Fix the source, not each
sink, and look for sibling call sites with the same flaw.
- Map the error to code from the span: service → apps/<svc>; resource_name (e.g. bullmq
process <queue>, or GET /…) → the handler/job; error.stack frames → the throwing module. Confirm
the deployed SHA (service.version) matches what you're reading.
- Find the commit that introduced it. issue.first_seen_version is the git SHA of the first
occurrence, which is usually the deploy that introduced the regression. Once you've located the
faulty line(s), run git blame or git log -S '<symbol>' -- <file> to name the culprit commit, and
get_change_stories to see that deploy in context. Capture both the file:line and the commit SHA;
§4 asks for them.
- Decide fixability honestly. If it's Class B/C/D/E in disguise, or the fix needs product/infra
decisions beyond code, say so and record it (§6) instead of forcing a fake fix.
- Respect the architecture (architecture-boundaries): fix at the right layer (domain use-case vs
platform adapter vs app boundary). Prefer the layer the codebase already uses for that concern: search
for an existing helper before writing a new one.
4. Checkpoint — confirm the plan with the user (do not skip)
Stop here. Do not write a fix, create a branch, or open a PR until the user approves. This is a hard
gate: everything up to now is read-only investigation. Report back with a concise plan and wait.
Present it in this order — be organized, not a wall of text:
- Issues found — a short list (table is fine) of the candidates from §2. For each, give:
the error signature, the service, occurrences + 14-day trend, the class (§2.2), a link to the Datadog
issue (§1.3), and a one-line description of what it actually is — not just the raw message.
Make clear which ones you ruled out and why.
- What you're focusing on — the issue(s) you chose and why (impact × trend × confidence), plus
what you deliberately skipped with the class reason (infra / version-skew / not-an-error / upstream).
- Hypothesis: for each chosen issue, the root cause in plain language, backed by concrete evidence:
  - file:line references to the code at fault (and sibling call sites if it generalizes);
  - the commit that introduced it where you can find it (issue.first_seen_version + git blame /
    git log -S, §3), linked as <repo>/commit/<sha>;
  - whether this issue is one instance of a broader bug.
- Proposed fix — what you'll change, at which layer, the blast radius, and the test plan
(the reproduction plus the novel cases you'll add).
- PR plan: single PR vs grouped (per §2.4), and the base branch (development).
- Assistance needed / blockers — anything you need from the user, asked explicitly: ambiguous intent,
a product/infra decision, missing access (e.g. the error-tracking toolset is off), an issue that looks
unfixable in code, or a fix that turned out larger/riskier than expected.
Then wait for confirmation and adjust the plan to their feedback before proceeding to §5.
Pre-authorization escape hatch: if the user already said to fix and open the PR without checking back
(e.g. "just fix it and PR", or a non-interactive/scheduled run with standing approval), state the plan
briefly and proceed — but still stop and ask if you hit a blocker, an ambiguous choice, or a fix
materially larger or riskier than what you described.
5. Reproduce with tests, then fix
Tests first, and broader than the single failure.
- Write a failing test that reproduces the bug, plus new, different inputs that exercise the same
root cause (not just the one occurrence you saw). This both pins the bug precisely and guards the
general case. Follow the testing skill (Vitest layers, PGlite/chdb testkit, /testing exports;
don't vi.mock repositories, use fakes/testkit instead).
- Confirm the test fails without the fix (sanity-check it actually targets the bug — temporarily
revert the fix or assert the pre-fix behavior).
- Apply the fix at the root cause.
- Confirm the new tests pass, and existing tests still pass: pnpm --filter <pkg> test.
- Typecheck + lint the changed packages (never run tsc): pnpm --filter <pkg> typecheck (tsgo)
and pnpm exec biome check <files>.
Environment gotcha: a fresh git worktree may have no node_modules (vitest: command not found,
"node_modules missing"). Run pnpm install at the repo root first (per toolchain-commands); if scripts
need node/pnpm in child shells, eval "$(mise env)".
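The "reproduction plus novel inputs" pattern looks like this in miniature. This is a Python analogue for illustration only: the repo's tests are Vitest and its helper is stripLoneSurrogates, neither of which is reproduced here. `strip_surrogates` is a hypothetical stand-in, and because a Python str holds code points rather than UTF-16 units, it treats every code point in the surrogate range as unpaired.

```python
import re

# Python analogue only: any code point in U+D800-U+DFFF is treated as an unpaired surrogate.
_SURROGATE = re.compile("[\ud800-\udfff]")


def strip_surrogates(text: str) -> str:
    """Hypothetical sanitizer: drop surrogate code points so the text encodes as UTF-8."""
    return _SURROGATE.sub("", text)


def test_sanitizer() -> None:
    cases = {
        "preview \ud83d": "preview ",              # the reproduction: a pair cut in half
        "\ude00 tail": " tail",                    # novel: lone low surrogate at the start
        "a\ud83db\udc00c": "abc",                  # novel: several, interleaved with text
        "intact \U0001F600": "intact \U0001F600",  # novel: a real emoji must survive
    }
    for raw, expected in cases.items():
        cleaned = strip_surrogates(raw)
        assert cleaned == expected
        cleaned.encode("utf-8")  # raises UnicodeEncodeError if a surrogate slipped through


test_sanitizer()
```

To confirm the test targets the bug, change the sanitizer body to `return text` and check that the first case fails before restoring the fix.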
6. Record on the issue (so the next agent doesn't redo it)
Triage state lives in Datadog; there is no external ledger. Policy (decided with the team):
- Comment when the PR is created. If an Error Tracking comment tool exists (discover via
ToolSearch(query="datadog error tracking ... comment") and verify its schema), post a short note on
the issue: what the root cause was and the PR link. For Class B–E issues you investigated and chose
not to fix, comment the verdict + reason ("infra timeout, not a code bug", "version skew", etc.) so
no one re-investigates.
- Do not auto-resolve. A fixed error simply stops receiving occurrences after the deploy — that is
the real signal. Marking "Resolved" before deploy would hide something still firing. (Only resolve if
the user explicitly asks.)
- If no comment/write tool is available (toolset off or tool absent): skip recording. Open issues are
re-discovered next run; fixed ones go quiet. Don't invent an external tracker.
The exact error-tracking write tool names/schemas weren't loadable when this skill was written. Always
discover and verify them at runtime before calling — don't assume a signature.
7. Create the PR
Follow the create-pr skill for the description. Specifics for this workflow:
- Base branch = development (v2 trunk). Verify ancestry before opening
(git merge-base --is-ancestor origin/development HEAD); never base on main.
- Branch first if you're on development/detached.
- In the description, include:
  - The Datadog issue(s) addressed: always a link per issue (§1.3), with its error signature and
    occurrence/trend context. Link the introducing commit too when you found it (§3).
  - Root cause in plain language, and why the fix is at this layer (note if it generalizes beyond
    the observed symptom / fixes sibling call sites).
  - The tests added and that they fail-without / pass-with the fix.
  - Verification steps to confirm the fix, typically: "after deploy, occurrences of issue <id>
    should drop to zero," plus how to reproduce locally.
- If you grouped multiple issues, list each with its own root-cause line.
Appendix A — Worked example (the surrogate bug)
- Symptoms: workers AIError "Embedding failed (voyage-4-large): 400 … input … valid UTF-8 …
special characters properly escaped" (15×) and RepositoryError "Cannot parse escape sequence:
missing second part of surrogate pair … value of key summary" (3×). Ranked ~15th by volume, not
the top occurrence count.
- Classification: the top groups were noise, namely DelayedError (D), *LockUnavailableError (D),
RepositoryError: socket hang up/Timeout/deadlock (B), web "Server function info not found" (C).
- Root cause (generalized): lone UTF-16 surrogates (from arbitrary LLM I/O and from length-sliced
previews splitting an emoji's surrogate pair) flowed unsanitized into two sinks — ClickHouse JSON
insert and the Voyage API.
- Fix: sanitize at the source (packages/domain/taxonomy/.../record-session-observation.ts) with the
existing tested helper stripLoneSurrogates (@domain/spans), covering both short- and long-session
paths, mirroring how build-trace-search-document already handles the identical ClickHouse constraint.
- Tests: added a case feeding lone surrogates, asserting the embed input and the persisted summary
are sanitized; confirmed it fails without the fix.
Appendix B — Command cheatsheet
# Confirm the toolset / discover error-tracking tools
ToolSearch "datadog error tracking issue"
ToolSearch "select:mcp__plugin_datadog_mcp__aggregate_spans,mcp__plugin_datadog_mcp__search_datadog_spans"
# Are there *incident-management* incidents? (usually 0 — then it's Error Tracking)
search_datadog_incidents(query="state:(active OR stable)")
# Heavy hitters by service + error type (last 7d, prod)
aggregate_spans(query="status:error env:production", from="now-7d",
computes=[{field:"*",aggregation:"COUNT",output:"count",sort:"desc"}],
group_by={fields:["service","@error.type"],limit:40})
# Drill into messages/resources for chosen services
aggregate_spans(query="service:web status:error env:production @error.type:(Error OR TypeError)",
computes=[{field:"*",aggregation:"COUNT",output:"count",sort:"desc"}],
group_by={fields:["@error.message","resource_name"],limit:25})
# Read stacks + issue ids
search_datadog_spans(query="service:workers status:error env:production @error.type:RepositoryError resource_name:\"process taxonomy\"",
custom_attributes=["error.*","issue.*"], max_tokens=8000)
# Correlate a fresh regression with deploys
get_change_stories(service_name="web", env="production", start_ts=..., end_ts=..., story_types=["deployment"])
# Verify a fix locally
pnpm install # if node_modules missing in the worktree
pnpm --filter <pkg> test
pnpm --filter <pkg> typecheck # tsgo — never `tsc`
pnpm exec biome check <changed-files>
Related skills
analyze-problem (root-cause method) · testing (Vitest/testkit) · create-pr (PR description) ·
architecture-boundaries (which layer to fix in) · database-clickhouse & effect-and-errors
(common error sources) · toolchain-commands (install/run/env) · /datadog:ddtoolsets (enable the
error-tracking toolset).