Name: Distributed Tracing
Author: LiboMa

// Distributed trace analysis via Jaeger — cross-service causal chain construction, latency bottleneck identification, error propagation tracking. Provides 4 trace query tools and decision trees for investigating cascading failures across microservices.

updated: 2026-03-03 16:12

SKILL.md

name	distributed-tracing
description	Distributed trace analysis via Jaeger — cross-service causal chain construction, latency bottleneck identification, error propagation tracking. Provides 4 trace query tools and decision trees for investigating cascading failures across microservices.
metadata	{"author":"agenticops","version":"1.0","domain":"observability"}
tools	["agenticops.tools.trace_tools.query_traces","agenticops.tools.trace_tools.get_trace_detail","agenticops.tools.trace_tools.get_service_dependencies","agenticops.tools.trace_tools.find_error_traces"]

Distributed Tracing Skill

Overview

Activating this skill dynamically registers four trace query tools:

Tool	Purpose	Key Args
`query_traces`	Find traces for a service (summary list)	`service`, `lookback`, `operation`, `min_duration`, `limit`
`get_trace_detail`	Full span tree for a trace ID	`trace_id`
`get_service_dependencies`	Service-to-service call graph	`lookback`
`find_error_traces`	Error traces grouped by origin service	`service`, `lookback`, `limit`

When to Activate This Skill

Activate when the issue involves ANY of:

Service degradation (latency spikes, timeout errors, 5xx responses)
Cascading failures (multiple services affected, unclear origin)
Cross-service errors (error in service A caused by downstream service B)
Intermittent failures (some requests fail, others succeed — trace sampling reveals the pattern)

Do NOT activate for:

Single-service issues (e.g., pod OOM, node disk pressure)
Infrastructure-only issues (e.g., VPC routing, security groups)
Issues where the failing service is already clearly identified

Investigation Decision Tree

1. Service Degradation (High Latency / 5xx)

Alert: "frontend HighErrorRate" or "HighLatencyP99"
│
├── Step 1: get_service_dependencies()
│   → Understand the full call graph: who calls whom?
│
├── Step 2: query_traces(service="frontend", lookback="15m", min_duration="1s")
│   → Find slow traces — which traces are taking > 1s?
│
├── Step 3: get_trace_detail(trace_id=SLOWEST_TRACE)
│   → See the full span tree — WHERE is the time being spent?
│   → Look for the DEEPEST span that accounts for most of the total time (the root span is always the longest); that span is the bottleneck
│
│   Example span tree:
│   frontend: /checkout [5.2s] OK
│   ├── checkoutservice: /PlaceOrder [4.8s] OK
│   │   ├── cartservice: /GetCart [4.5s] ERROR  ← majority of time
│   │   │   └── redis-cart: GET [4.2s] ERROR    ← ROOT CAUSE
│   │   └── productcatalogservice: /GetProduct [50ms] OK
│   └── currencyservice: /Convert [30ms] OK
│
├── Step 4: find_error_traces(service="frontend")
│   → Confirm error pattern: which downstream service has most errors?
│
└── Conclusion: Root cause is redis-cart (4.2s timeout),
    NOT frontend (which is just the symptom)
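
The drill-down in Step 3 can be sketched in code. The following is a minimal illustration, not the skill's implementation: the `Span` class and `find_bottleneck` helper are hypothetical names, the span data is copied from the example tree above, and the 0.8 threshold is borrowed from the Duration Analysis rule later in this document.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical in-memory span shape, used only for this sketch.
    service: str
    operation: str
    duration_ms: float
    status: str = "OK"
    children: list = field(default_factory=list)


def find_bottleneck(root, threshold=0.8):
    """Follow the dominant child until no child exceeds `threshold` of its parent."""
    path = [root]
    current = root
    while current.children:
        slowest = max(current.children, key=lambda s: s.duration_ms)
        if slowest.duration_ms <= threshold * current.duration_ms:
            break  # time is spent in `current` itself or spread across children
        path.append(slowest)
        current = slowest
    return path


# The example span tree from the decision tree above.
trace = Span("frontend", "/checkout", 5200, "OK", [
    Span("checkoutservice", "/PlaceOrder", 4800, "OK", [
        Span("cartservice", "/GetCart", 4500, "ERROR", [
            Span("redis-cart", "GET", 4200, "ERROR"),
        ]),
        Span("productcatalogservice", "/GetProduct", 50, "OK"),
    ]),
    Span("currencyservice", "/Convert", 30, "OK"),
])

bottleneck_path = find_bottleneck(trace)
print(" -> ".join(s.service for s in bottleneck_path))
```

The last element of the returned path is the candidate root cause (here `redis-cart`), while the earlier elements show how the latency propagates up to the alerting service.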

2. Intermittent Failures

Alert: "Service X intermittent 5xx"
│
├── Step 1: query_traces(service="X", lookback="30m")
│   → Compare successful vs failed traces
│
├── Step 2: get_trace_detail(FAILED_TRACE_ID)
│   → Find where the error occurs in the chain
│
├── Step 3: get_trace_detail(SUCCESSFUL_TRACE_ID)
│   → Compare — what's different? Different path? Different downstream?
│
└── Common patterns:
    - Load balancer routing to unhealthy backend
    - Connection pool exhaustion (some requests get a pooled connection, others time out)
    - Retry storms (downstream overload causes more retries → more overload)
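
Steps 2 and 3 amount to diffing two traces. As a rough sketch, assuming each trace has already been flattened into a list of span dicts with `service`, `operation`, and `status` keys (a hypothetical shape, as are the `backend-a`/`backend-b` names and the helper functions), the comparison might look like this:

```python
def index_spans(spans):
    """Key each span by (service, operation); assumes one span per pair for brevity."""
    return {(s["service"], s["operation"]): s for s in spans}


def diff_traces(ok_spans, failed_spans):
    """List what differs between a successful and a failed trace."""
    ok, failed = index_spans(ok_spans), index_spans(failed_spans)
    differences = []
    for key in sorted(ok.keys() | failed.keys()):
        label = f"{key[0]}:{key[1]}"
        if key not in ok:
            differences.append(f"{label} only in failed trace")
        elif key not in failed:
            differences.append(f"{label} only in successful trace")
        elif ok[key]["status"] != failed[key]["status"]:
            differences.append(
                f"{label} status {ok[key]['status']} -> {failed[key]['status']}"
            )
    return differences


# Illustrative data: the failed request was routed to a different backend.
ok_trace = [
    {"service": "X", "operation": "/api", "status": "OK"},
    {"service": "backend-a", "operation": "/query", "status": "OK"},
]
failed_trace = [
    {"service": "X", "operation": "/api", "status": "ERROR"},
    {"service": "backend-b", "operation": "/query", "status": "ERROR"},
]

differences = diff_traces(ok_trace, failed_trace)
for line in differences:
    print(line)
```

A diff in which a downstream service appears only in the failed trace is consistent with the "load balancer routing to unhealthy backend" pattern listed above.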

3. Unknown Dependency Failure

Alert on service A, but service A pods/metrics look healthy
│
├── Step 1: get_service_dependencies()
│   → Discover: A → B → C → D (chain you didn't know about)
│
├── Step 2: find_error_traces(service="A")
│   → Error origins show: D has 95% of errors, not A
│
├── Step 3: get_trace_detail(ERROR_TRACE_ID)
│   → Confirms: D is the fault origin, errors propagate D → C → B → A
│
└── Conclusion: Investigate service D, not A
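
Grouping error traces by origin, as `find_error_traces` is described as doing, can be approximated by locating the deepest ERROR span in each trace. The sketch below is an assumption about how such grouping could work, not the tool's actual logic; the nested-dict trace shape and the `make_chain` and `error_origin` helpers are hypothetical.

```python
from collections import Counter


def error_origin(span):
    """Return the service of the deepest ERROR span on the error path, or None."""
    if span["status"] != "ERROR":
        return None
    for child in span.get("children", []):
        origin = error_origin(child)
        if origin is not None:
            return origin
    return span["service"]  # no failing child: the error starts here


def make_chain(services, origin):
    """Build a linear call chain where the error starts at `origin` and propagates up."""
    cut = services.index(origin)
    span = None
    for i in range(len(services) - 1, -1, -1):
        span = {
            "service": services[i],
            "status": "ERROR" if i <= cut else "OK",
            "children": [span] if span else [],
        }
    return span


# Illustrative sample: 19 of 20 error traces originate in D, one in A itself.
chain = ["A", "B", "C", "D"]
traces = [make_chain(chain, "D") for _ in range(19)] + [make_chain(chain, "A")]

origin_counts = Counter(error_origin(t) for t in traces)
top_service, top_count = origin_counts.most_common(1)[0]
print(f"{top_service}: {top_count}/{len(traces)} error traces")
```

Counting origins this way separates the service where errors surface (A) from the service where they start (D).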

Interpreting Span Trees

Key Signals

Signal	Meaning
SLOWEST span deep in the tree	Downstream bottleneck — root cause is the slow service
ERROR on leaf span only	Single point of failure — error originates at the leaf
ERROR propagating up the tree	Cascading failure — fix the deepest ERROR first
One branch slow, others fast	Isolated issue in one dependency path
All branches slow	Possible network issue or shared resource (DB, cache) saturation

Duration Analysis

Compare span duration to its parent — if a child is >80% of parent's duration, that child is the bottleneck
Gaps between child spans indicate processing time in the parent service
Overlapping child spans indicate parallel calls (fan-out pattern)
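
The three rules above can be combined into one pass over a parent span and its direct children. This is a minimal sketch under stated assumptions: spans are dicts with `name`, `start_ms`, and `duration_ms` (a hypothetical shape), and the start offsets in the example are invented for illustration, since the example tree in this document only gives durations.

```python
def analyze_children(parent, children, threshold=0.8):
    """Apply the duration rules to one parent span and its direct children."""
    # Rule 1: a child taking more than `threshold` of the parent is the bottleneck.
    bottlenecks = [c["name"] for c in children
                   if c["duration_ms"] > threshold * parent["duration_ms"]]

    # Rules 2 and 3: merge child intervals to find gaps and overlaps.
    intervals = sorted((c["start_ms"], c["start_ms"] + c["duration_ms"])
                       for c in children)
    covered = 0
    parallel = False
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start >= cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            parallel = True  # overlapping children indicate fan-out
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start

    return {
        "bottlenecks": bottlenecks,
        # Time not covered by any child is spent in the parent itself.
        "parent_self_ms": parent["duration_ms"] - covered,
        "parallel_calls": parallel,
    }


# Sequential calls (start offsets are assumed for illustration).
sequential = analyze_children(
    {"name": "checkoutservice", "start_ms": 0, "duration_ms": 4800},
    [
        {"name": "cartservice", "start_ms": 100, "duration_ms": 4500},
        {"name": "productcatalogservice", "start_ms": 4650, "duration_ms": 50},
    ],
)

# Fan-out: both children are in flight at the same time.
fan_out = analyze_children(
    {"name": "gateway", "start_ms": 0, "duration_ms": 100},
    [
        {"name": "a", "start_ms": 10, "duration_ms": 60},
        {"name": "b", "start_ms": 20, "duration_ms": 50},
    ],
)
print(sequential)
print(fan_out)
```

Merging intervals before subtracting avoids double-counting parallel children, which would otherwise make the parent's own processing time look negative.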

Error Tags

Jaeger marks failed spans with the error=true tag. Additional context comes from these tags:

http.status_code: HTTP response code (500, 503, 504)
otel.status_code: OpenTelemetry status (ERROR)
otel.status_description: Error message text
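
As a sketch of how these tags might be read, assume a span in the JSON shape that Jaeger's query service returns, where `tags` is a list of `{"key", "type", "value"}` objects. The `error_context` helper and the sample span values are hypothetical.

```python
def error_context(span):
    """Return error details for a Jaeger-style span dict, or None if it is not an error."""
    tags = {t["key"]: t["value"] for t in span.get("tags", [])}
    is_error = (
        tags.get("error") in (True, "true")
        or tags.get("otel.status_code") == "ERROR"
    )
    if not is_error:
        return None
    return {
        "http_status": tags.get("http.status_code"),
        "description": tags.get("otel.status_description"),
    }


# Illustrative spans; the tag values are made up for this example.
failed_span = {
    "operationName": "GET",
    "tags": [
        {"key": "error", "type": "bool", "value": True},
        {"key": "http.status_code", "type": "int64", "value": 504},
        {"key": "otel.status_code", "type": "string", "value": "ERROR"},
        {"key": "otel.status_description", "type": "string",
         "value": "context deadline exceeded"},
    ],
}
healthy_span = {
    "operationName": "/Convert",
    "tags": [{"key": "http.status_code", "type": "int64", "value": 200}],
}

failed_info = error_context(failed_span)
healthy_info = error_context(healthy_span)
print(failed_info)
print(healthy_info)
```

Checking both `error` and `otel.status_code` is a defensive choice, since which tags are present depends on the instrumentation library.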

Confidence Scoring with Traces

Evidence	Confidence Boost
Trace shows clear bottleneck (>80% of total duration in one span)	+0.3
Multiple error traces point to same downstream service	+0.2
Service dependency graph confirms the affected path	+0.1
Trace evidence correlates with metric anomaly timing	+0.2

Example: Without traces, confidence might be 0.4 (speculation). With traces showing redis-cart as the bottleneck (+0.3) across 15/20 error traces (+0.2): 0.4 + 0.3 + 0.2 = 0.9 (high confidence).
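
The scoring table can be expressed as a small lookup. This is a sketch only: the evidence keys and the `trace_confidence` helper are hypothetical names, and capping the result at 1.0 is an assumption, since the table does not say what happens when the boosts add up to more than that.

```python
# Boost values copied from the table above; the key names are hypothetical.
BOOSTS = {
    "clear_bottleneck": 0.3,           # >80% of total duration in one span
    "same_downstream_errors": 0.2,     # multiple error traces point to one service
    "dependency_graph_confirms": 0.1,  # call graph confirms the affected path
    "metric_timing_correlates": 0.2,   # trace evidence lines up with metric anomaly
}


def trace_confidence(base, evidence):
    """Add the boost for each piece of evidence to `base`, capped at 1.0 (assumed cap)."""
    total = base + sum(BOOSTS[e] for e in evidence)
    return round(min(total, 1.0), 2)


two_signals = trace_confidence(0.4, ["clear_bottleneck", "same_downstream_errors"])
all_signals = trace_confidence(0.4, list(BOOSTS))
print(two_signals)
print(all_signals)
```

Rounding to two decimals keeps floating-point noise out of the reported score.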


package.json

"author": "LiboMa"

"repository": "LiboMa/agenticops-chat"
