apm-health-summary

Name: Apm Health Summary
Author: elastic

Get a cluster-level rollup of service health from APM telemetry: the "how's my environment right now?" entry point for observability investigations. Use whenever the user asks about HEALTH, STATUS, or general wellbeing of an environment / cluster / namespace ("how's my cluster", "status of the X env", "what's broken", "any issues", "show me the health of …", "give me a status report", "what should I look at", "things feel slow"). This applies regardless of any time qualifier: "show me the health of X over the past hour" still routes here (with lookback="1h"), NOT to observe. observe is for raw-metric queries; this tool is for the rollup. Gracefully degrades: layers in Kubernetes pod data and ML anomaly context when those backends are present, but still returns useful APM-only output if they aren't. Do not use for log-only or metrics-only customers, because this tool requires Elastic APM.

stars:7

forks:6

updated: 2026-05-01 08:58

related-skills.json

Same repository

observe.md

from "elastic/example-mcp-app-observability"

The agent's Elastic-access primitive. Four modes: wait for an ML anomaly to fire, poll an ES|QL metric (live-sample or wait for a threshold), read a single-instance scalar value, or return a full ES|QL table. Use when the user says "tell me when...", "let me know if...", "wait until X drops below Y", "watch for anything unusual", "monitor for the next N minutes", "poll until stable", "what is X right now", "list …", "which … are …", or wants transient (session-scoped) monitoring or ad-hoc querying without creating a persistent Kibana rule. Also trigger for "keep an eye on" and post-remediation validation.

2026-05-07 · 7

mcp-app-dev-setup.md

from "elastic/example-mcp-app-observability"

Bootstrap or repair a development environment for the Elastic Observability MCP App with Forge as the data driver. Use when the user says "set up Forge for me", "get me ready to work on this MCP app", "run the validation suite", "I just cloned this repo, what now", or wants the dev environment refreshed after a long gap. Verifies sibling Forge clone, Python venv, cluster credentials, MCP harness, and runs a smoke test against the canonical validation suite.

2026-05-01 · 7

apm-service-dependencies.md

from "elastic/example-mcp-app-observability"

Map the application topology from APM telemetry — which services call which, over what protocols, with what call volume and latency. Use when the user asks "what calls X", "what depends on X", "show me the topology", "what are the upstream/downstream services", "where does this service fit", or is doing root-cause investigation and needs to trace how a problem propagates through the call graph. Also trigger for "service map", "dependency graph", "blast radius of service X", or "who's the dependency of Y". Requires Elastic APM — do not trigger for log-only or metrics-only customers.

2026-05-01 · 7

ml-anomalies.md

from "elastic/example-mcp-app-observability"

Query Elastic ML anomaly detection results to understand what's behaving unusually, why, and how badly. Use when the user asks "what's anomalous", "is anything unusual happening", "why is X slow/spiking", "show me the weirdness", or mentions memory growth, CPU spikes, restart patterns, unusual latency, unexpected error rates, or drift from typical behavior. Also trigger for "ML anomalies", "anomaly detection", "Elastic ML", "what does ML think", or when the user wants to understand behavior that deviates from baseline. The tool opens an inline explainer view with a severity gauge, plain-English narrative, and per-entity deviation breakdown — so the agent should USE the visualization, not just dump JSON.

2026-05-01 · 7

manage-alerts.md

from "elastic/example-mcp-app-observability"

CRUD for Kibana alerting rules — create, list, get, or delete custom-threshold rules. Use when the user says "alert me when", "create a rule for", "page me if", "set up an alert", "show me my rules", "what alerts do I have", "delete that alert", "remove the rule". Backend-agnostic — works on any metric field in any index pattern (metrics-*, logs-*, traces-apm*, custom). For transient session-scoped monitoring use `observe` instead. Requires Kibana with the Alerting feature enabled — the tool is auto-disabled when no Kibana URL is configured.

2026-04-30 · 7

k8s-blast-radius.md

from "elastic/example-mcp-app-observability"

Assess the impact of a Kubernetes node going offline — which deployments lose all replicas (full outage), which lose partial capacity (degraded), which are unaffected, and whether the cluster has enough spare capacity to reschedule the lost pods. Use when the user asks "what happens if node X goes down", "what's the blast radius of draining this node", "can I safely maintain node Y", "what's running on this node", "if I evict this node what breaks", or is planning node maintenance, a cluster upgrade, or investigating an actual node failure. Requires Kubernetes (kubeletstats metrics) and Elastic APM for downstream service impact — do not trigger for non-K8s deployments.

2026-04-30 · 7

package.json

"author": "elastic"

"repository": "elastic/example-mcp-app-observability"

$ useful --forSOC

Network and Computer Systems Administrators · Computer and Mathematical Occupations · 15-1244 · L4

APM Health Summary

This is the first tool to reach for in vague-symptom investigations — "something feels off, where should I look?" It gives you a one-shot rollup: degraded services, top resource consumers, active anomalies, and a data_coverage report showing what backends contributed. From there, you pick the right follow-up tool.

Prerequisites

| Signal | Required? | What happens without it |
| --- | --- | --- |
| Elastic APM | Required | Tool returns a warning and suggests `ml-anomalies`/`observe`/`manage-alerts` instead. |
| Kubernetes (kubeletstats) | Optional | `pods` section is replaced by a note; service health still reported. |
| ML anomaly jobs | Optional | `anomalies` section is replaced by a note; service health still reported. |

If the user is log-only or metrics-only (no APM), do not call this tool. Suggest ml-anomalies (for ML-backed anomaly detection) or observe / manage-alerts (both universal).

Tools

| Tool | Purpose |
| --- | --- |
| `apm-health-summary` | The rollup. First call in most investigations. |
| `ml-anomalies` | Drill into anomalies flagged in the summary. |
| `apm-service-dependencies` | Map topology around any degraded service. |
| `k8s-blast-radius` | If the summary implicates a node (pod resource pressure), assess node impact. |
| `observe` | Post-investigation: observe for stabilization or follow-on anomalies. |

How to call apm-health-summary

{
  "cluster": "prod-us-east",
  "namespace": "otel-demo",
  "lookback": "1h"
}

cluster: pass whenever the user names a cluster (even partially) — "the oteldemo cluster", "how's prod-us-east doing", "check the staging env". Use the user's literal phrasing; the tool fuzzy-matches it. Omit only when the user clearly wants a cross-cluster view or there's a single cluster in the env.
namespace: pass when the user scopes to a K8s namespace. Same fuzzy-match logic as cluster.

Handling disambiguation responses

When the user-supplied cluster or namespace matches multiple candidates (or none), the tool does not run the analysis. It returns a short response with disambiguation_needed set to "cluster", "namespace", or "cluster_and_namespace", plus the candidate list:

{
  "disambiguation_needed": "cluster",
  "cluster_requested": "oteldemo",
  "cluster_candidates": ["oteldemo-prod", "oteldemo-staging"],
  "cluster_match": "multiple"
}

When you see this, don't re-call the tool with a guessed cluster name. Surface the candidates to the user verbatim, ask which one they meant, then re-call the tool with the exact name they pick. Same flow for namespace_match: "none" (the requested name doesn't exist in recent telemetry) — show candidates and ask.
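The disambiguation flow above can be sketched as a small helper that turns the tool's response into a question for the user instead of a guessed re-call. This is a minimal illustration, not the tool's own code: the function name is hypothetical, and the `namespace_requested` / `namespace_candidates` keys are assumed by analogy with the `cluster_*` keys shown in the example response.

```python
from typing import Optional

# Which fields each disambiguation_needed value refers to.
FIELDS_BY_KIND = {
    "cluster": ["cluster"],
    "namespace": ["namespace"],
    "cluster_and_namespace": ["cluster", "namespace"],
}


def disambiguation_prompt(response: dict) -> Optional[str]:
    """Return a question for the user, or None if the analysis actually ran."""
    kind = response.get("disambiguation_needed")
    if kind is None:
        return None
    lines = []
    for field in FIELDS_BY_KIND.get(kind, []):
        # Key names for the namespace case are an assumption (mirroring cluster_*).
        requested = response.get(f"{field}_requested")
        candidates = response.get(f"{field}_candidates", [])
        if response.get(f"{field}_match") == "none":
            lines.append(f'No {field} named "{requested}" appears in recent telemetry.')
        else:
            lines.append(f'"{requested}" matches more than one {field}.')
        # Surface the candidates verbatim; never pick one on the user's behalf.
        lines.append(f"Which {field} did you mean? " + ", ".join(candidates))
    return "\n".join(lines)
```

Once the user answers, the tool is re-called with the exact candidate name they chose.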

lookback: default 1h for any unqualified prompt — "what's the status of X", "how is X doing", "check on X", "give me a status report". Don't drop to 15m unless the user explicitly says something time-localized like "right now / this second / this minute". Use the user's time window literally when they give one ("over the past 30 minutes" → 30m; "in the last 6 hours" → 6h; "today" → 24h). The 1h default is intentional — most cluster-state questions need a window wide enough to surface degradation patterns, and 15m hides slow-burning issues.
job_filter: optional ML-job prefix, e.g. k8s-. Rarely needed.
exclude_entities: optional wildcard to hide known noise, e.g. chaos-*.
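The lookback rules above can be illustrated with a rough sketch. The helper name and the phrase matching are assumptions made for illustration (an agent would interpret the prompt itself rather than run a regex); only the mappings (explicit window used literally, "today" → 24h, time-localized phrasing → 15m, otherwise 1h) come from the text.

```python
import re

UNIT_SUFFIX = {"minute": "m", "hour": "h"}
# Time-localized phrases that justify dropping to the short window.
IMMEDIATE_PHRASES = ("right now", "this second", "this minute")


def choose_lookback(prompt: str) -> str:
    """Pick a lookback value for apm-health-summary from the user's wording."""
    text = prompt.lower()
    # An explicit window is used literally: "past 30 minutes" -> "30m".
    match = re.search(r"(\d+)\s*(minute|hour)s?", text)
    if match:
        return f"{match.group(1)}{UNIT_SUFFIX[match.group(2)]}"
    if "past hour" in text or "last hour" in text:
        return "1h"
    if "today" in text:
        return "24h"
    if any(phrase in text for phrase in IMMEDIATE_PHRASES):
        return "15m"
    # Unqualified status questions get the intentional 1h default.
    return "1h"
```

For example, `choose_lookback("in the last 6 hours")` yields `"6h"`, while an unqualified "give me a status report" falls through to `"1h"`.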

After the tool returns

The tool renders an inline MCP App view — status badge, scope card (cluster › namespace › service/pod counts plus an applications strip), KPI tile rows, anomaly-severity donut + heatmap, top memory pods, service throughput list, and a next-step button row driven by investigation_actions. Use the view for the visual rollup; narrate findings below it.

Inspect data_coverage first — this tells you which signals contributed.

The scope field anchors what the user is looking at — start narration with it when present: "Looking at cluster prod-us-east, namespace payments, 12 services across 42 pods…". When scope.service_groups is populated the view shows clickable application chips users can toggle to filter the page client-side (throughput rows, top pods, anomaly heatmap, donut counts all recompute). Don't suggest re-running the tool when the user wants to narrow to one application — point them at the chips instead. Only re-call with a different cluster / namespace when they're crossing the scope boundary.

Ignore _setup_notice if present in the response — it's view-side chrome (welcome banner / skill-gap hint) that the UI handles. Don't echo or summarize it in chat.

Then walk the output top-down:

Overall health (healthy / degraded / critical): lead with this.
Degraded services: name them with reasons (error rate, latency). These are the investigation targets.
Pods (if present): top memory consumers — cross-reference with degraded services.
Anomalies (if present): by-severity counts + top entities. Drives the ML follow-up.
Alerts (alerts field, always emitted): active_count / recovered_count plus top_rules and active_samples. Read these before reaching for manage-alerts — the rollup already shows what fired and why. Only call manage-alerts when the user wants to create/modify rules (not just see what fired). Cross-reference active alerts with degraded services: a pod-memory alert on the same pod that's degrading is a strong signal.
SLOs (slos field, always emitted): authoritative source for "is this cluster meeting its objectives?". configured: false means no SLOs exist — surface the note once and move on. configured: true gives you violated_count, healthy_count, and top_violations[] with each violated SLO's current sli_value, target, and one_hour_burn_rate. Read burn rate hard: > 14.4× means the rolling-window error budget burns out in <2h at the current rate (page-worthy); 6–14× is degrading; < 1× is safe pace. Cross-reference top_violations[].name with degraded_services[] — services that appear in both are the priority drilldowns. Don't suggest creating SLOs if configured: true; do suggest it if configured: false.
Next-step buttons: the view surfaces investigation_actions as clickable prompts (drill into the top pod, investigate the degraded service, check blast radius). Mention them in chat so the user knows.
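The SLO guidance in the walkthrough above can be sketched as follows. Several details here are assumptions rather than documented behavior: the "elevated" label for the 1× to 6× range (which the text leaves unnamed), treating everything from 6× up to 14.4× as degrading, exact-name matching between SLO names and service names, and the shape of `degraded_services` entries (plain strings or objects with a `name` key).

```python
def classify_burn_rate(rate: float) -> str:
    """Bucket a one_hour_burn_rate using the thresholds named in the text."""
    if rate > 14.4:
        return "page-worthy"
    if rate >= 6:
        return "degrading"
    if rate < 1:
        return "safe"
    return "elevated"  # 1x to 6x is not labeled in the source; this name is an assumption


def service_names(degraded_services: list) -> set:
    """Accept either plain names or objects with a 'name' key (shape is assumed)."""
    return {svc if isinstance(svc, str) else svc.get("name") for svc in degraded_services}


def priority_drilldowns(summary: dict) -> list:
    """Violated SLOs whose name also appears in degraded_services, with a burn label."""
    slos = summary.get("slos", {})
    if not slos.get("configured"):
        return []  # no SLOs exist: surface the note once and move on
    degraded = service_names(summary.get("degraded_services", []))
    return [
        (v["name"], classify_burn_rate(v["one_hour_burn_rate"]))
        for v in slos.get("top_violations", [])
        if v["name"] in degraded
    ]
```

Entries returned by `priority_drilldowns` are the services that show up in both lists, which the text identifies as the first drilldown targets.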

Based on what you see, pick the next tool:

Degraded service named → apm-service-dependencies first. This is the highest-yield drilldown for a known-degraded service in almost every cluster. The topology map points directly at upstream/downstream root causes (slow gRPC dependency, hung leaf node, fan-out timing). Don't reach for ml-anomalies first — most clusters don't have anomaly jobs configured for arbitrary services, and you'll waste a turn on an empty result.
ml-anomalies is a complementary, not primary, drilldown. Use it when (a) the user wants anomaly detail on a known-degraded service AND data_coverage.ml_anomalies is true, OR (b) the user is investigating a vague symptom and wants detection. If data_coverage.ml_anomalies is false, skip ml-anomalies entirely — there are no jobs to query.
If you do call ml-anomalies and it returns empty / no-jobs for the entity, fall back to apm-service-dependencies for that same service immediately. Don't leave the user at a dead end.
High anomaly count in the rollup → ml-anomalies with matching lookback (this is the "lots of anomalies, what's worst?" path — different from the named-degraded-service path above).
Pod resource pressure on a specific node → k8s-blast-radius with that node name.
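The routing rules above might be expressed roughly like this. It is a sketch under stated assumptions: the function name, the `data_coverage.kubernetes` key, the `anomalies.by_severity` shape, the `high_anomaly_threshold` value, and the ordering of the last two checks are all hypothetical; only `data_coverage.ml_anomalies` and the tool names come from the text.

```python
from typing import Optional


def anomaly_total(summary: dict) -> int:
    """Sum by-severity anomaly counts; returns 0 when the section is only a note."""
    anomalies = summary.get("anomalies")
    if not isinstance(anomalies, dict):
        return 0
    return sum(anomalies.get("by_severity", {}).values())


def pick_next_tool(
    summary: dict,
    *,
    wants_anomaly_detail: bool = False,
    pressured_node: Optional[str] = None,
    high_anomaly_threshold: int = 10,  # assumed cutoff for "high anomaly count"
) -> Optional[str]:
    coverage = summary.get("data_coverage", {})
    ml_available = bool(coverage.get("ml_anomalies"))

    if summary.get("degraded_services"):
        # ml-anomalies is complementary: only when asked for AND jobs exist.
        if wants_anomaly_detail and ml_available:
            return "ml-anomalies"
        return "apm-service-dependencies"
    if ml_available and anomaly_total(summary) >= high_anomaly_threshold:
        return "ml-anomalies"
    if pressured_node is not None and coverage.get("kubernetes"):
        return "k8s-blast-radius"
    return None
```

If `ml-anomalies` is chosen and comes back empty for the entity, the fallback described above still applies: call `apm-service-dependencies` for the same service.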

Key principles

Start here, then narrow. Don't guess which service is the problem — let the rollup tell you.
Respect data_coverage. If K8s is absent, don't suggest k8s-blast-radius. If APM is absent, don't call this tool at all.
The overall health is coarse. "Healthy" doesn't mean nothing is wrong — it means nothing meets the degraded thresholds. Always scan the details.
Graceful degradation is by design. APM-only output is still useful — don't apologize for missing K8s or ML signals; just report what you have.

Investigation discipline

Multi-tool investigations should be sequential and narrated, not parallel and silent. Each tool call renders its own widget in the chat — firing 4-5 in a row after a single "yes" creates a wall of "Waiting…" placeholders that look like the system is broken.

One tool call per turn. After a tool returns, narrate what you saw — the headline number, what it implies, what it rules in or out — before making the next call. The narration is the user's only signal that you read the result.
Sequential offers, not OR offers. Don't ask "Want me to check X or Y?" — that's ambiguous and the user's "yes" turns into both calls in parallel. Phrase offers as a chain: "Want me to start with X? If it's inconclusive I can follow up with Y." The user gets the same options without the parallel-execution trap.
Commit to a plan before "yes." If a triage will need 3-4 tool calls, lay out the plan first ("I'll check anomalies for flagd, then its pod resources, then trace errors from product-reviews — pause me if you've seen enough at any point") and execute one step at a time. Don't pre-fire all 3 calls because the user agreed to "the plan."
Read the rollup before drilling. apm-health-summary already includes services, pods, anomalies, fired alerts, and SLOs in one response. If the user asks "what fired recently?" — answer from alerts.top_rules, don't call manage-alerts. If they ask "what's anomalous?" — answer from anomalies.top_entities, don't call ml-anomalies for the same data.
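One way to picture the "one tool call per turn" rule is a lazy generator: each step runs only when the caller asks for the next narration, so agreeing to a plan never fires every call at once. The names `run_plan`, `call_tool`, and `narrate` are hypothetical stand-ins, not part of the MCP app.

```python
from typing import Callable, Iterable, Iterator, Tuple


def run_plan(
    steps: Iterable[Tuple[str, dict]],
    call_tool: Callable[[str, dict], dict],
    narrate: Callable[[str, dict], str],
) -> Iterator[str]:
    """Yield one narration per step; the next tool call waits for the next turn."""
    for tool, args in steps:
        result = call_tool(tool, args)
        # Control returns to the caller here, before any further tool is called.
        yield narrate(tool, result)
```

A caller would pull one narration with `next(...)`, show it to the user, and only continue if the user has not seen enough.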
