debug-issue-with-datadog

Name: Debug Issue With Datadog
Author: langfuse

// Debug a user-reported issue, Linear ticket, or incident report by combining Datadog (APM, logs, metrics) with the Langfuse repo to establish a root cause. Use when given a Linear issue URL/ID (e.g. LFE-XXXX), a GitHub issue, or a pasted error/report and asked to investigate, root-cause, or triage. Produces a structured analysis — error breakdown, hypothesis-by-class, suggested patches with code references.

stars: 27,778

forks: 2,843

updated: May 11, 2026 at 20:42

SKILL.md

Debug Issue with Datadog

Use this skill whenever the task is investigation rather than implementation: a user, customer, or on-call engineer has surfaced a problem and you need to figure out what is actually happening in production and where in the code it lives. The deliverable is an analysis, not a patch, though the analysis should make the right patch obvious.

When to Apply

A Linear issue (typically with an LFE-XXXX ID) describes a production failure, error spike, or customer report.
A GitHub issue or pasted incident/error report needs triage.
A monitor alerted and you need to understand why before deciding what to fix.
An existing ticket sits under the "Make monitoring useful again" project (parent LFE-8837) or a similar project; these tickets expect the structured analysis output described below.

If the task is "implement this fix" rather than "figure out what's broken", this is the wrong skill — go to backend-dev-guidelines or the relevant package guide.
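The input types listed above can be told apart mechanically before any fetching starts. The following is a minimal sketch, not part of the skill itself: the `classify_input` helper and both regular expressions are assumptions made for illustration (the skill only shows the `LFE-XXXX` shape, and GitHub issues are assumed to arrive as full issue URLs).

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: Linear IDs look like "LFE-1234"; the skill only shows "LFE-XXXX".
LINEAR_ID = re.compile(r"\bLFE-\d+\b")
# ASSUMPTION: GitHub issues are referenced by their full issue URL.
GITHUB_ISSUE = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/issues/\d+")


def classify_input(text: str) -> Tuple[str, Optional[str]]:
    """Return (kind, reference), where kind is 'linear', 'github' or 'pasted'."""
    match = LINEAR_ID.search(text)
    if match:
        return "linear", match.group(0)
    match = GITHUB_ISSUE.search(text)
    if match:
        return "github", match.group(0)
    # Anything else is treated as a pasted error or incident report.
    return "pasted", None
```

A Linear match is checked first because a Linear URL and a bare ID both contain the `LFE-` token; text with neither reference falls through to the pasted-report case.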

Workflow

Read the inputs first, then plan the Datadog sweep, then read the code, then write the analysis. Do not skip ahead to suggested patches before the data supports them.

Intake. Pull every signal already available in the report. See references/intake.md. For a Linear URL/ID, fetch the issue and its comments via the Linear MCP — the description is often updated inline as triage proceeds. For a GitHub issue, use gh issue view. For pasted text, treat it as the description.
Scope the sweep. From the intake, pick the affected subsystem and time window. Use references/repo-debug-map.md to translate "PostHog integration", "ingestion failures", "evals stuck", etc. into the Datadog filters and source files you should be looking at.
Run the broad Datadog sweep. Default to the full sweep in references/datadog-playbook.md: APM spans, error logs, metrics, and monitors — split across prod-eu and prod-us (and prod-hipaa / prod-jp when relevant). Always check regional disparity first; it usually rules whole hypotheses in or out. Use datadog-query-recipes for reusable tenant, public API, queue consumer, and cross-environment query shapes.
Cluster the errors. Group by (projectId, error.message) or (error.type, error.message). Treat each distinct cluster as its own hypothesis — Langfuse incidents commonly have multiple coexisting root causes, not one.
Map clusters to code. For each cluster, open the relevant handler file from the repo-debug map and read enough of it to confirm or refute the hypothesis. Cite specific files and line ranges in the output.
Write the analysis using references/output-template.md.
Deliver. Default: print the analysis in chat. If the user asked for it, also save it through the workflow they specified (file, Linear comment, etc.).
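The clustering step above can be sketched in a few lines. This is an illustration under stated assumptions: the events are taken to be flat dictionaries keyed by `projectId` and `error.message`, which is a hypothetical shape and not the actual Datadog response format, and `cluster_errors` is a made-up helper name.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Sequence, Tuple


def cluster_errors(
    events: Iterable[Mapping[str, str]],
    keys: Sequence[str] = ("projectId", "error.message"),
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count events per cluster key, largest cluster first."""
    counts = Counter(
        # Events lacking a key still land in a visible "<missing>" bucket.
        tuple(event.get(key, "<missing>") for key in keys)
        for event in events
    )
    return counts.most_common()


# Hypothetical sample data, for illustration only.
sample = [
    {"projectId": "p1", "error.message": "timeout"},
    {"projectId": "p1", "error.message": "timeout"},
    {"projectId": "p2", "error.message": "401 unauthorized"},
]
clusters = cluster_errors(sample)
```

Each tuple in `clusters` is then a candidate hypothesis of its own; passing `keys=("error.type", "error.message")` gives the alternative grouping named in the step.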

Datadog MCP Usage Notes

Two Datadog MCP servers are typically available: one bound to the EU site (datadoghq.eu) and one to the US site (datadoghq.com). Always run region-relevant queries against both unless intake clearly localizes the incident. The prod-eu env tag lives on the EU site and the prod-us env tag on the US site.
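The two-site fan-out can be sketched as follows. This is a rough illustration only: `run_query` is a hypothetical callable standing in for whichever MCP tool executes a count query against one site, and `sweep_regions` and `disparity` are made-up helper names. Only the two site and env-tag pairs come from the text above.

```python
from typing import Callable, Dict

# From the notes above: prod-eu lives on the EU site, prod-us on the US site.
REGIONS = {"prod-eu": "datadoghq.eu", "prod-us": "datadoghq.com"}


def sweep_regions(
    base_query: str,
    run_query: Callable[[str, str], int],  # HYPOTHETICAL: (site, query) -> count
) -> Dict[str, int]:
    """Run the same filter once per region, scoped by that region's env tag."""
    return {
        env: run_query(site, f"{base_query} env:{env}")
        for env, site in REGIONS.items()
    }


def disparity(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of the total per region; a lopsided split hints at a regional cause."""
    total = sum(counts.values())
    return {env: (n / total if total else 0.0) for env, n in counts.items()}
```

Injecting `run_query` keeps the sketch independent of any particular client; prod-hipaa and prod-jp are left out because the text does not say which site hosts them.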

Span search filter pattern: service:worker resource_name:"process posthog-integration-project" status:error
Log search filter pattern: service:worker env:prod-eu @langfuse.project.id:cm1r6u… status:error
For high-volume queries, prefer aggregate_spans / aggregate_events grouped by (error.message, projectId) over fetching individual traces.
Always link to the Datadog UI for the queries you ran (final section of the output template).

See references/datadog-playbook.md for the full set of starter queries and parameter shapes.
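The filter patterns above follow a regular shape, so they can be assembled from parts. The sketch below is illustrative: `build_filter` and `logs_ui_link` are hypothetical helpers, `proj_123` is a placeholder project ID, and the Log Explorer URL shape (`/logs` with `query`, `from_ts`, and `to_ts` in epoch milliseconds) is an assumption that should be checked against a link copied from the Datadog UI.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def build_filter(
    service: str,
    status: str = "error",
    env: Optional[str] = None,
    resource_name: Optional[str] = None,
    attrs: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a search filter in the same order as the patterns above."""
    parts = [f"service:{service}"]
    if env:
        parts.append(f"env:{env}")
    if resource_name:
        # Values containing spaces are wrapped in double quotes.
        parts.append(f'resource_name:"{resource_name}"')
    for key, value in (attrs or {}).items():
        # Custom attributes are prefixed with "@".
        parts.append(f"@{key}:{value}")
    parts.append(f"status:{status}")
    return " ".join(parts)


def logs_ui_link(site: str, query: str, from_ms: int, to_ms: int) -> str:
    # ASSUMPTION: URL shape of the Log Explorer; verify before relying on it.
    params = urlencode({"query": query, "from_ts": from_ms, "to_ts": to_ms})
    return f"https://app.{site}/logs?{params}"


span_filter = build_filter("worker", resource_name="process posthog-integration-project")
log_filter = build_filter("worker", env="prod-eu", attrs={"langfuse.project.id": "proj_123"})
```

Building the filter once and reusing the same string for both the query and the UI link keeps the links in the final report consistent with what was actually run.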

Output Expectations

From the output template:

Header: data source, time window, region split (EU vs US table).
Hotspots: per-projectId (or per-cluster) error counts.
Root cause by error class: each cluster gets a short hypothesis with reasoning, distinguishing primary causes from symptoms.
Suggested patches: P0/P1/P2 grouped, with concrete file paths and short code sketches. Reference the actual handler in worker/src/features/** or web/src/**.
Dashboards: paste the Datadog query URLs at the end.

Findings come first, recommendations last. If the data is thin, say so explicitly and propose what would need to be true to confirm each hypothesis — do not invent root causes.
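The ordering rule above (findings first, recommendations last, patches grouped by priority) can be sketched as a small renderer. This is not the contents of references/output-template.md, which is not reproduced here: the section titles are taken from the list above, while the `Patch` type, the function names, and the plain-text layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

PRIORITIES = ("P0", "P1", "P2")


@dataclass
class Patch:
    priority: str  # one of PRIORITIES
    path: str      # e.g. a handler under worker/src/features/** or web/src/**
    summary: str


def render_patches(patches: List[Patch]) -> str:
    """Group patches under P0, P1, P2 headings, skipping empty groups."""
    lines: List[str] = []
    for priority in PRIORITIES:
        group = [p for p in patches if p.priority == priority]
        if not group:
            continue
        lines.append(priority)
        lines.extend(f"  {p.path}: {p.summary}" for p in group)
    return "\n".join(lines)


def render_analysis(
    header: str,
    hotspots: str,
    root_causes: str,
    patches: List[Patch],
    dashboards: List[str],
) -> str:
    """Emit findings first, then recommendations, then the query links."""
    sections = [
        ("Header", header),
        ("Hotspots", hotspots),
        ("Root cause by error class", root_causes),
        ("Suggested patches", render_patches(patches)),
        ("Dashboards", "\n".join(dashboards)),
    ]
    return "\n\n".join(f"{title}\n{body}" for title, body in sections)
```

Fixing the section order in one place makes it harder for a report to lead with patches before the supporting data has been presented.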

Cross-References

Production telemetry query recipes, tenant/public API usage, and queue consumer measurements: datadog-query-recipes
Backend layout, queue contracts, instrumentation patterns: backend-dev-guidelines
ClickHouse-related findings (memory ceilings, JOIN spills, slow queries): clickhouse-best-practices
Once a fix is identified and you switch to implementation, hand off to the package AGENTS.md for the affected directory.

related-skills.json

same repository

pnpm-upgrade-package.md

from "langfuse/langfuse"

Use when upgrading a dependency in this pnpm workspace, including requests to bump a package to a specific version, compare the registry latest version with the latest version installable under the current minimum-release-age window, or decide whether minimumReleaseAgeExclude in pnpm-workspace.yaml must change. Ask the user for the package name or target version when either is missing.

2026-05-19 · 27.8k

agent-setup-maintenance.md

from "langfuse/langfuse"

Shared workflow for editing Langfuse's repo-owned agent setup under `.agents/`. Use when changing AGENTS files, shared skills, `.agents/config.json`, generated shim behavior, provider discovery paths, or install-time agent sync.

2026-05-14 · 27.8k

skill-creator.md

from "langfuse/langfuse"

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Codex's capabilities with specialized knowledge, workflows, or tool integrations.

2026-05-14 · 27.8k

datadog-query-recipes.md

from "langfuse/langfuse"

Langfuse-specific Datadog query recipes for production telemetry research. Use when asked to investigate tenant or project activity, public API endpoint usage, queue consumer behavior, spans, logs, metrics, or ad hoc production questions across prod-us, prod-eu, prod-hipaa, and prod-jp. This skill is for reusable query shapes and measured research; pair it with debug-issue-with-datadog when the task is an incident or root-cause analysis.

2026-05-11 · 27.8k

detect-prod-regressions.md

from "langfuse/langfuse"

Proactively detect production regressions in Langfuse by comparing recent Datadog errors, error logs, error spans, and API route latency signals against baseline benchmarks or traces across prod-us, prod-eu, prod-hipaa, and prod-jp. Use when asked to sweep production for new bugs, catch regressions early, catch low-occurrence coding bugs or edge cases, compare recent changes to Datadog measurements, or prepare measured production evidence for human review before any optional Linear handoff.

2026-05-11 · 27.8k

backend-dev-guidelines.md

from "langfuse/langfuse"

Shared backend guide for Langfuse's Next.js, tRPC, BullMQ, and TypeScript monorepo. Use when creating or reviewing tRPC routers, public REST endpoints, BullMQ queue processors, backend services, middleware, Prisma or ClickHouse data access, OpenTelemetry instrumentation, Zod validation, env configuration, or backend tests across web, worker, or packages/shared.

2026-05-11 · 27.8k

package.json

"author": "langfuse"

"repository": "langfuse/langfuse"
