
$pwd:

k8s-blast-radius

Name: K8s Blast Radius
Author: elastic

// Assess the impact of a Kubernetes node going offline — which deployments lose all replicas (full outage), which lose partial capacity (degraded), which are unaffected, and whether the cluster has enough spare capacity to reschedule the lost pods. Use when the user asks "what happens if node X goes down", "what's the blast radius of draining this node", "can I safely maintain node Y", "what's running on this node", "if I evict this node what breaks", or is planning node maintenance, a cluster upgrade, or investigating an actual node failure. Requires Kubernetes (kubeletstats metrics) and Elastic APM for downstream service impact — do not trigger for non-K8s deployments.


$ git log --oneline --stat

stars: 7

forks: 6

updated: April 30, 2026 at 14:19

SKILL.md


Kubernetes Blast Radius

Answers hypothetical and real node-failure questions with data. Categorizes every deployment touching the node into full-outage / degraded / unaffected, totals up memory at risk, and checks whether the remaining cluster has capacity to reschedule.

Prerequisites

| Signal | Required? | What you get without it |
| --- | --- | --- |
| Kubernetes (kubeletstats) | Required | Tool does not apply; suggest the user instrument with the kubeletstats receiver. |
| Elastic APM | Optional | Core node-impact analysis still works. The `downstream_services` section (user-facing services in affected namespaces) is omitted with a note. |

If the user is not running Kubernetes, this tool does not apply. But a Kubernetes-only customer (no APM) still gets the full pod-level impact assessment and rescheduling feasibility — the majority of the value.

Tools

| Tool | Purpose |
| --- | --- |
| `k8s-blast-radius` | Run the impact assessment for a specific node. |
| `apm-health-summary` | Before: check which services are already degraded. |
| `apm-service-dependencies` | After: map downstream ripple for affected services. |
| `ml-anomalies` | After: is unusual behavior already showing up on affected workloads? |

How to call k8s-blast-radius

{
  "node": "gke-prod-pool-1-abc123",
  "cluster": "prod-us-east",
  "layout": "summary"
}

Parameter-filling guidance:

node: must be exact. Matched literally against kubernetes.node.name. If the user describes a node ambiguously ("the noisy node", "the one running frontend"), ask them to confirm the exact node name before calling. Do not guess.
cluster: required when the same node name might exist in multiple clusters — auto-generated cloud node names (GKE / EKS) sometimes collide. Resolves fuzzily against k8s.cluster.name (OTel) / orchestrator.cluster.name (ECS); on miss the response includes cluster_candidates. Omit for single-cluster deployments.
layout: default summary (compact, collapsible sections). Use radial when the user wants a visual "impact-by-proximity" diagram.
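
The parameter guidance above can be sketched as a small argument builder. This is only an illustration: the helper name `build_blast_radius_args` and the whitespace check used to reject vague node descriptions are assumptions, not part of the tool.

```python
from typing import Optional

# Layout values named in the guidance above.
ALLOWED_LAYOUTS = {"summary", "radial"}


def build_blast_radius_args(
    node: str, cluster: Optional[str] = None, layout: str = "summary"
) -> dict:
    """Build the argument object for k8s-blast-radius (hypothetical helper)."""
    node = node.strip()
    # Kubernetes node names never contain whitespace, so a phrase such as
    # "the noisy node" is a description, not a name: stop and ask the user.
    if not node or any(ch.isspace() for ch in node):
        raise ValueError("node must be an exact node name; confirm it with the user")
    if layout not in ALLOWED_LAYOUTS:
        raise ValueError(f"layout must be one of {sorted(ALLOWED_LAYOUTS)}")
    args = {"node": node, "layout": layout}
    # cluster is omitted for single-cluster deployments.
    if cluster:
        args["cluster"] = cluster
    return args
```

Calling `build_blast_radius_args("gke-prod-pool-1-abc123", "prod-us-east")` yields the same three keys as the example request above, while an ambiguous description raises an error instead of guessing.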

After the tool returns

Response shape:

status: AT RISK (full outage), PARTIAL RISK (degraded only), or SAFE (no impact).
data_coverage: which backends contributed (always kubernetes: true; apm: true|false).
pods_at_risk: count of pods on the node.
full_outage[]: deployments losing all replicas — lead with these.
degraded[]: deployments losing partial capacity.
unaffected / unaffected_count: deployments not touching the node.
rescheduling: memory required vs available, and whether it's feasible.
downstream_services[] (only if APM present): user-facing services whose namespace is affected.
downstream_services_note (only if APM absent): explains the gap.
investigation_actions: next-step prompts surfaced as click-to-send buttons in the view (includes a SPOF callout when a single-replica deployment is implicated).
render_instructions: HTML render spec — let the inline MCP App view handle visualization (floating summary card, radial affected-deployment sweep, safe-zone arc, hover tooltips).

Ignore _setup_notice if present — it's view-side chrome (welcome banner) that the UI handles. Don't echo or summarize it in chat.
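
As a rough sketch of the handling described above, the response can be trimmed before narration. The function name `prepare_for_narration` is hypothetical, and the code assumes the response arrives as a plain dict with the top-level keys listed in the response shape.

```python
# Keys the chat narration should not echo: the view handles both of them.
VIEW_ONLY_KEYS = {"_setup_notice", "render_instructions"}


def prepare_for_narration(response: dict) -> dict:
    """Return a copy of the response with view-side keys removed."""
    cleaned = {k: v for k, v in response.items() if k not in VIEW_ONLY_KEYS}
    # data_coverage reports which backends contributed; APM is optional.
    apm_present = bool(cleaned.get("data_coverage", {}).get("apm", False))
    if not apm_present:
        # Without APM there is no downstream_services list to narrate;
        # downstream_services_note explains the gap instead.
        cleaned.pop("downstream_services", None)
    return cleaned
```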

Narrate in this order:

Headline status: "AT RISK — 3 deployments lose all replicas if gke-prod-pool-1-abc123 goes offline."
Full outage list: name the deployments. These are the critical ones.
Degraded list: name them, note surviving replica counts.
Rescheduling feasibility: "Cluster has X GB available across N nodes to absorb Y GB required — safe / not safe / marginal."
Downstream services (if APM present): name the services in affected namespaces that might be user-visible.
Recommend action: for AT RISK + infeasible reschedule, "don't drain this node without scaling up." For PARTIAL RISK + feasible, "safe to drain with PodDisruptionBudgets in place."
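
The narration order above could be assembled along these lines. The top-level keys come from the response shape, but the per-entry fields (`name`, `surviving_replicas`, `required_gb`, `available_gb`, `feasible`) and the shape of `downstream_services` as a list of names are assumptions made for illustration; check them against a real response before relying on them.

```python
from typing import List


def narrate(resp: dict, node: str) -> List[str]:
    """Build narration lines in the recommended order (illustrative only)."""
    full = resp.get("full_outage", [])
    degraded = resp.get("degraded", [])
    resched = resp.get("rescheduling")
    status = resp["status"]

    # 1. Headline status, framed as a hypothetical.
    lines = [
        f"{status}: {len(full)} deployment(s) would lose all replicas "
        f"if {node} goes offline."
    ]
    # 2. Full outages first: these are the critical ones.
    if full:
        lines.append("Full outage: " + ", ".join(d["name"] for d in full))
    # 3. Degraded deployments with surviving replica counts.
    if degraded:
        parts = [f"{d['name']} ({d['surviving_replicas']} replicas survive)" for d in degraded]
        lines.append("Degraded: " + ", ".join(parts))
    # 4. Rescheduling feasibility, with the memory-only caveat.
    if resched:
        verdict = "feasible" if resched["feasible"] else "not feasible"
        lines.append(
            f"Rescheduling: {resched['required_gb']} GB required vs "
            f"{resched['available_gb']} GB available ({verdict}; memory-only heuristic)."
        )
    # 5. Downstream services, or the note explaining why they are missing.
    if resp.get("downstream_services"):
        lines.append("Downstream services: " + ", ".join(resp["downstream_services"]))
    elif resp.get("downstream_services_note"):
        lines.append(resp["downstream_services_note"])
    # 6. Recommendation for the two cases the guidance spells out.
    if resched and status == "AT RISK" and not resched["feasible"]:
        lines.append("Recommendation: do not drain this node without scaling up.")
    elif resched and status == "PARTIAL RISK" and resched["feasible"]:
        lines.append("Recommendation: safe to drain with PodDisruptionBudgets in place.")
    return lines
```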

Key principles

Hypothetical framing. Unless the node is actually down, always present results as "if X goes offline, then Y" — not as current reality.
Rescheduling feasibility is a heuristic. It compares memory only — doesn't account for CPU, storage, affinity rules, taints, or PodDisruptionBudgets. Note this caveat.
Full-outage >> degraded. A deployment with 1 replica on the node is a full outage; a deployment with 3 replicas losing 1 is degraded. Treat them very differently in recommendations.
Downstream services matter. Even if a deployment is degraded not down, user-facing services might see tail latency. Mention the downstream APM services.
Don't conflate "at risk" with "broken." The status reflects potential impact. The node may be fine.
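
To make the full-outage versus degraded distinction concrete, here is a minimal sketch of the classification logic the principles describe. It is not the tool's actual implementation: the function names and the idea of comparing replica counts on the node against total replicas are assumptions drawn from the definitions above.

```python
from typing import Iterable


def classify(replicas_on_node: int, total_replicas: int) -> str:
    """Categorize one deployment by how many of its replicas sit on the node."""
    if replicas_on_node <= 0:
        return "unaffected"
    if replicas_on_node >= total_replicas:
        return "full_outage"  # every replica lives on the node
    return "degraded"  # some replicas survive elsewhere


def overall_status(categories: Iterable[str]) -> str:
    """Derive the headline status from the per-deployment categories."""
    seen = set(categories)
    if "full_outage" in seen:
        return "AT RISK"
    if "degraded" in seen:
        return "PARTIAL RISK"
    return "SAFE"


def reschedule_feasible(required_bytes: int, available_bytes: int) -> bool:
    """Memory-only check; ignores CPU, storage, affinity, taints, and PDBs."""
    return available_bytes >= required_bytes
```

A single-replica deployment on the node gives `classify(1, 1) == "full_outage"`, whereas a three-replica deployment losing one gives `classify(1, 3) == "degraded"`, which is why the two cases call for different recommendations.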

related-skills.json

Same repository

observe.md

from "elastic/example-mcp-app-observability"

The agent's Elastic-access primitive. Four modes: wait for an ML anomaly to fire, poll an ES|QL metric (live-sample or wait for a threshold), read a single-instance scalar value, or return a full ES|QL table. Use when the user says "tell me when...", "let me know if...", "wait until X drops below Y", "watch for anything unusual", "monitor for the next N minutes", "poll until stable", "what is X right now", "list …", "which … are …", or wants transient (session-scoped) monitoring or ad-hoc querying without creating a persistent Kibana rule. Also trigger for "keep an eye on" and post-remediation validation.

2026-05-07 (stars: 7)

mcp-app-dev-setup.md

from "elastic/example-mcp-app-observability"

Bootstrap or repair a development environment for the Elastic Observability MCP App with Forge as the data driver. Use when the user says "set up Forge for me", "get me ready to work on this MCP app", "run the validation suite", "I just cloned this repo, what now", or wants the dev environment refreshed after a long gap. Verifies sibling Forge clone, Python venv, cluster credentials, MCP harness, and runs a smoke test against the canonical validation suite.

2026-05-01 (stars: 7)

apm-health-summary.md

from "elastic/example-mcp-app-observability"

Get a cluster-level rollup of service health from APM telemetry — the "how's my environment right now?" entry point for observability investigations. Use whenever the user asks about HEALTH, STATUS, or general wellbeing of an environment / cluster / namespace ("how's my cluster", "status of the X env", "what's broken", "any issues", "show me the health of …", "give me a status report", "what should I look at", "things feel slow"). This applies regardless of any time qualifier — "show me the health of X over the past hour" still routes here (with lookback="1h"), NOT to observe. observe is for raw-metric queries; this tool is for the rollup. Gracefully degrades: layers in Kubernetes pod data and ML anomaly context when those backends are present, but still returns useful APM-only output if they aren't. Do not use for log-only or metrics-only customers — this tool requires Elastic APM.

2026-05-01 (stars: 7)

apm-service-dependencies.md

from "elastic/example-mcp-app-observability"

Map the application topology from APM telemetry — which services call which, over what protocols, with what call volume and latency. Use when the user asks "what calls X", "what depends on X", "show me the topology", "what are the upstream/downstream services", "where does this service fit", or is doing root-cause investigation and needs to trace how a problem propagates through the call graph. Also trigger for "service map", "dependency graph", "blast radius of service X", or "who's the dependency of Y". Requires Elastic APM — do not trigger for log-only or metrics-only customers.

2026-05-01 (stars: 7)

ml-anomalies.md

from "elastic/example-mcp-app-observability"

Query Elastic ML anomaly detection results to understand what's behaving unusually, why, and how badly. Use when the user asks "what's anomalous", "is anything unusual happening", "why is X slow/spiking", "show me the weirdness", or mentions memory growth, CPU spikes, restart patterns, unusual latency, unexpected error rates, or drift from typical behavior. Also trigger for "ML anomalies", "anomaly detection", "Elastic ML", "what does ML think", or when the user wants to understand behavior that deviates from baseline. The tool opens an inline explainer view with a severity gauge, plain-English narrative, and per-entity deviation breakdown — so the agent should USE the visualization, not just dump JSON.

2026-05-01 (stars: 7)

manage-alerts.md

from "elastic/example-mcp-app-observability"

CRUD for Kibana alerting rules — create, list, get, or delete custom-threshold rules. Use when the user says "alert me when", "create a rule for", "page me if", "set up an alert", "show me my rules", "what alerts do I have", "delete that alert", "remove the rule". Backend-agnostic — works on any metric field in any index pattern (metrics-*, logs-*, traces-apm*, custom). For transient session-scoped monitoring use `observe` instead. Requires Kibana with the Alerting feature enabled — the tool is auto-disabled when no Kibana URL is configured.

2026-04-30 (stars: 7)

package.json

"author": "elastic"

"repository": "elastic/example-mcp-app-observability"


$ useful --forSOC

Network and Computer Systems Administrators · Computer and Mathematical Occupations · 15-1244 · L4
