Run any Skill in Manus with one click

analyze-prod

Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow — including when the user describes a prod symptom without asking to "analyze" — to analyze the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking and diagnosing issues, supporting GCP, Azure, and AWS via the `cloud-platforms` skill and applying the SRE or DevOps role.

Run Skill in Manus

Stars1

Forks0

UpdatedMay 18, 2026 at 02:53

Source

alex-voloshin-dev

alex-voloshin-dev/ai-skills

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

Useful forSOC

Network and Computer Systems AdministratorsComputer and Mathematical Occupations15-1244L4

SKILL.md

readonly

More from this repository

same repository

knowledge-sync-init

alex-voloshin-dev/ai-skills

Use this skill when bootstrapping scheduled knowledge-base sync for a repo that has no knowledge/.knowledge-sync.yml yet — to run one-time setup that detects the knowledge_root from CLAUDE.md/AGENTS.md, maps doc areas to source globs, records opt-in external sources (Linear/Notion/WebFetch, all disabled by default), captures a baseline last_scanned_sha, sets the per-area update policy, generates or seeds knowledge/CONVENTIONS.md, provisions the L4 memory dir, and offers to register the daily routine. Routes ongoing recurring sync operations to /knowledge-sync.

2026-05-231

knowledge-sync

alex-voloshin-dev/ai-skills

Use this skill when running the recurring (daily) knowledge-base rescan for a repo that already has knowledge/.knowledge-sync.yml — the main-thread dispatcher that reads the config, computes the git delta since last_scanned_sha, maps changed paths to affected doc areas, early-exits cheaply when nothing changed, then fans out one Agent(content-writer) per affected area, applies the propose/direct update policy, advances the baseline only on success, and writes an L4 run log — all with the G1 untrusted-content choke-point, secret-scan, deny-list, and budget controls woven in. For first-time setup use /knowledge-sync-init.

2026-05-231

memory-init

alex-voloshin-dev/ai-skills

Use this skill when bootstrapping memory in a freshly cloned repo or when upgrading from a pre-memory plugin version — to initialize the .ai-skills-memory/ skeleton in the current project (directory structure, .gitignore rules, learnings.md template, .committed/ allowlist). Idempotent — safe to re-run on a project that already has memory wired.

2026-05-231

feedback

alex-voloshin-dev/ai-skills

Use this skill when reviewing how the plugin behaved across recent runs, after a release to confirm reliability, before filing a plugin bug report, or when planning the next plugin improvement cycle — to collect and analyze past Claude Code session logs for the ai-skills plugin and surface agent, subagent, skill, command, and hook errors, timeouts, unexpected exits, and other anomalies that point at plugin defects. Defaults to the last 7 days of sessions for the current project. Produces an extended Markdown report on disk plus a brief on-screen summary.

2026-05-231

ai-skills-init

alex-voloshin-dev/ai-skills

Use this skill when bootstrapping a target repository to be ai-skills-aware — on the first run of any ai-skills workflow in a fresh repo, when adopting the ai-skills plugin in an existing repo, or after upgrading to a plugin version that adds new memory paths or templates, including when the user does not say "init" but asks to "set up" or "onboard" the repo — to detect codebase type, create CLAUDE.md + AGENTS.md scaffolding, initialize the .ai-skills-memory/ directory tree from L1 templates, and configure .gitignore. Idempotent — safe to re-run. Accepts `--codebase-type <type>` and `--overwrite`. Not for re-initializing only memory — use `/memory-init` instead.

2026-05-181

analyze-local

alex-voloshin-dev/ai-skills

Use this skill when a local container won't start, a service is unreachable from the host, a local docker-compose stack is misbehaving, or as the Docker-layer diagnosis step of a local bugfix flow — including when the user describes the symptom without naming Docker — to run a Docker-specific local diagnostic that collects container status, logs, networking, and resource usage and diagnoses issues, applying the SRE or DevOps role for investigation. For multi-scope environment analysis (Docker + Kubernetes + CI runner + drift snapshot + optional auto-fix) use `/env-analyze` instead.

2026-05-181

name	analyze-prod
description	Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow — including when the user describes a prod symptom without asking to "analyze" — to analyze the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking and diagnosing issues, supporting GCP, Azure, and AWS via the `cloud-platforms` skill and applying the SRE or DevOps role.
context	fork
argument-hint	service name, symptom, or incident description
allowed-tools	Read Grep Glob Bash

Analyze Production Environment

Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.

Cloud platform detection: Read CLAUDE.md to identify the cloud platform, then load the matching reference module (GCP / Azure / AWS) from @cloud-platforms.

⚠️ SAFETY: READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.

1. Clarify the Scope

Ask the user:

Problem / question? (e.g., "pods crashlooping", "high latency on /api/orders", "DB connections exhausted", "general health check")
Affected service(s)? (deployment name, namespace, or "all")
Cloud environment? (project / subscription / account + cluster name)
When did it start? (deploy, config change, traffic spike, or unknown)
Active incident? (P1–P4 severity, or none)

If invoked as part of a bugfix / incident flow, extract context from the parent conversation instead.

2. Apply Appropriate Role

Select role by problem type:

Problem Type	Primary Role
Pod crashes, restarts, health-check failures	`Agent(sre-engineer)`
High latency, error-rate spikes, SLO burn	`Agent(sre-engineer)`
Managed DB issues (connections, replication, CPU)	`Agent(sre-engineer)`
Networking, ingress, DNS, connectivity	`Agent(sre-engineer)` + `Agent(devops-engineer)`
Deployment failures, rollback needed	`Agent(devops-engineer)`
Terraform / infra drift, resource config	`Agent(devops-engineer)`
Application errors in logs	Stack-specific (`Agent(java-engineer)` / `Agent(python-engineer)` / `Agent(frontend-engineer)`)
General / unclear	`Agent(sre-engineer)`

Announce the applied role(s). For P1/P2 incidents, always apply Agent(sre-engineer).

3. Establish Context

3a. Cloud Platform Context

Detect platform from CLAUDE.md and verify authentication via @cloud-platforms:

GCP: gcloud config get-value project + gcloud auth list
Azure: az account show + az aks list
AWS: aws sts get-caller-identity + aws eks list-clusters

Confirm the active project / subscription / account matches the user's target.

3b. Kubernetes Cluster

// turbo
kubectl config current-context
kubectl cluster-info

Run the platform-specific cluster list command from @cloud-platforms to verify cluster status.

Record: Cluster name, location, version, node count, current kubectl context.

4. Collect Production Snapshot

Run the following read-only diagnostics. Present results as a structured summary.

4a. Node Health

// turbo
kubectl get nodes -o wide
kubectl top nodes

Flag: NotReady nodes, CPU/memory >80%, version skew.

4b. Pod Status

For the affected namespace (or all namespaces if unspecified):

// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running

Flag: CrashLoopBackOff / Error / Pending / ImagePullBackOff, restart count >3 in last hour, replica count mismatch (kubectl get deployments).

4c. Pod Details (for problematic pods)

kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>

Look for: OOMKilled (exit 137), app exceptions, connection errors, startup failures, readiness probe failures.

4d. Resource Usage

// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>

Flag: Pods near memory limits, HPA at max replicas, CPU throttling.

4e. Managed Database Health

DB diagnostic commands per cloud — see @cloud-platforms. Key metrics: CPU / memory / disk utilization, active vs max connections, replication lag (HA / read replicas), failed connection count.

4f. Networking and Ingress

// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>

Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.

4g. Recent Events

// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal

Record: Warning / Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.

4h. Observability Methodology

Observability methodology (Golden Signals / RED / USE / Distributed Tracing) — see @observability-methods. Pick a named method per problem class, then cross-reference SLI metrics, error-budget burn, and active alerts against that method's signals.

4i. Production Telemetry Surface

Telemetry stack patterns + per-vendor queries — see @telemetry-stacks. Identify the stack from CLAUDE.md, helm charts, prometheus-operator CRDs, or the OTel collector config, then query directly using the vendor patterns documented there.

5. Analyze Findings

Using the applied role's expertise:

Correlate the user's problem with collected data.
Identify root cause using the role's debugging methodology.
Assess SLO impact (when Agent(sre-engineer) is active): error budget consumed, burn rate.

<common_prod_issues>

CrashLoopBackOff: OOM (check limits), app startup failure, missing config/secret, failed dependency (DB unreachable)
High latency: DB slow queries, connection-pool exhaustion, CPU throttling (requests vs limits), upstream timeout
DB connection exhaustion: too many connections from pods, app leak, pool misconfig, no pooler / proxy
Pods Pending: insufficient cluster resources (autoscaler max), affinity/taint mismatch, PVC not bound
5xx errors: app exception, upstream timeout, readiness probe failing, resource exhaustion
Deployment stuck: image pull error, readiness probe failure, PDB blocking rollout, insufficient quota
Network unreachable: NetworkPolicy block, service selector mismatch, DNS failure, VPC / firewall rule </common_prod_issues>

6. Present Diagnosis

Structure the diagnosis:

## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s: [version] | Managed DB: [instance] ([state], [tier])

## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]% | Burn rate: [Nx] | Time to exhaustion: [duration]

## Findings
### [Issue 1: title]
- **Symptom / Evidence / Root cause / Severity (P1–P4) / Blast radius**

## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]

## Environment Health: [HEALTHY | DEGRADED | OUTAGE]

7. Fix or Escalate

⚠️ All production mutations require explicit user approval.

Immediate mitigation (rollback, restart, scale): present the exact command, wait for approval, execute, verify.
Application code fix: transition to a stack-specific role. Fix, test locally (optionally /analyze-local), deploy via normal CI/CD.
Infrastructure fix (Terraform, Helm, K8s config): apply via PR with Agent(devops-engineer) — never direct apply.
Root cause unclear: propose additional diagnostics (increased log verbosity, tracing, profiling, query analysis).

After any fix, re-run relevant Step 4 commands to verify resolution.

8. Summary

Problem: original report / incident ID
Severity: P1–P4
Role(s) applied
Root cause
SLO impact: error budget consumed
Fix applied (or "investigation only — no fix applied")
Verification: confirmation the issue is resolved
Action items: postmortem (if P1/P2), prevention, monitoring improvements

Integration

Called by: /bugfix (production environment diagnostics)
Roles: Agent(sre-engineer), Agent(devops-engineer)
Skills: @cloud-platforms (platform-specific CLI commands, managed service diagnostics), @observability-methods (Golden Signals / RED / USE / Tracing), @telemetry-stacks (Prometheus / Datadog / Honeycomb / New Relic / Sentry / OTel queries)