incident-response

Last updated: April 2, 2026 at 18:59

Manage and respond to production incidents effectively. Establish on-call rotations, runbooks, and post-mortem processes.

Source: HelixDevelopment/HelixAgent

SKILL.md

name: incident-response
description: Manage and respond to production incidents effectively. Establish on-call rotations, runbooks, and post-mortem processes.
triggers: ["/incident","/oncall","/postmortem"]

Incident Management

This skill covers effective incident response including detection, communication, resolution, and post-incident analysis to minimize downtime and improve reliability.

When to use this skill

Use this skill when you need to:

Respond to production incidents
Set up incident management processes
Create on-call rotations
Write runbooks for common issues
Conduct post-mortems

Prerequisites

Incident management tool (PagerDuty, Opsgenie, VictorOps)
Communication channels (Slack, Teams, Zoom)
Monitoring and alerting setup
Documentation system for runbooks
Post-mortem template

Guidelines

Incident Lifecycle

Detection → Triage → Mitigation → Resolution → Post-Incident
    ↓          ↓          ↓            ↓            ↓
 Alerting   Assess     Stop the    Fix root    Learn &
  System    impact     bleeding    cause       improve
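
As a minimal sketch, the lifecycle above can be modeled as an ordered set of phases. The `Phase` enum and `next_phase` helper are illustrative names, not part of this skill.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """Incident phases, declared in the order shown in the lifecycle diagram."""
    DETECTION = "detection"
    TRIAGE = "triage"
    MITIGATION = "mitigation"
    RESOLUTION = "resolution"
    POST_INCIDENT = "post-incident"


# Enum members iterate in definition order, so this list mirrors the diagram.
ORDER = list(Phase)


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last phase."""
    idx = ORDER.index(current)
    return ORDER[idx + 1] if idx + 1 < len(ORDER) else None
```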

Detection

Alert Sources

Monitoring alerts (high error rate, latency)
Customer reports
Automated health checks
Synthetic monitoring failures

Severity Levels

SEV 1: Critical - Complete outage, data loss
SEV 2: High - Major functionality impaired
SEV 3: Medium - Partial degradation
SEV 4: Low - Minor issues, workarounds exist
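
The severity levels above can be expressed as a small decision function so that triage is consistent between responders. This is a sketch only: the boolean inputs are assumed names for the conditions listed above, and real classification usually needs human judgment.

```python
def classify_severity(complete_outage: bool,
                      data_loss: bool,
                      major_function_impaired: bool,
                      partial_degradation: bool) -> str:
    """Map observed impact to a severity label, checking the worst case first."""
    if complete_outage or data_loss:
        return "SEV 1"
    if major_function_impaired:
        return "SEV 2"
    if partial_degradation:
        return "SEV 3"
    # Minor issues where a workaround exists.
    return "SEV 4"
```

Checking the most severe conditions first means an incident that matches several levels always gets the highest one.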

Triage

Initial Assessment

Acknowledge the alert
Assess scope and impact
Determine severity
Create incident channel/document
Notify stakeholders

Triage Questions

What is affected? (services, regions, customers)
When did it start?
What changed recently? (deployments, config)
Is there a workaround?
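
To answer "What changed recently?", one option is to filter a change log to a window before the incident start. The sketch below assumes changes are available as `(timestamp, description)` pairs; the two-hour window and the first two sample entries are hypothetical.

```python
from datetime import datetime, timedelta


def recent_changes(incident_start, changes, window=timedelta(hours=2)):
    """Return changes made within `window` before the incident, newest first."""
    earliest = incident_start - window
    in_window = [c for c in changes if earliest <= c[0] <= incident_start]
    return sorted(in_window, key=lambda c: c[0], reverse=True)


incident_start = datetime(2024, 1, 15, 14, 30)
changes = [
    (datetime(2024, 1, 15, 9, 0), "config: cache TTL raised"),   # hypothetical
    (datetime(2024, 1, 15, 13, 50), "feature flag toggled"),     # hypothetical
    (datetime(2024, 1, 15, 14, 30), "Deployment #5678 to production"),
]
suspects = recent_changes(incident_start, changes)
```

The newest change is listed first because it is usually the first candidate to roll back.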

Communication

Incident #1234 - Payment Service Degraded

Status: Investigating
Severity: SEV 2
Impact: 50% of payment requests failing
Started: 2024-01-15 14:30 UTC

Updates: #incident-1234 (Slack)
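
A status update like the one above can be generated from structured fields so every update has the same shape. This is a sketch under assumptions: the `StatusUpdate` class and its field names are illustrative, and posting the text to a chat tool is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    incident_id: int
    title: str
    status: str
    severity: str
    impact: str
    started: datetime  # must be timezone-aware
    updates: str       # where follow-up updates will be posted

    def render(self) -> str:
        """Format the update in the layout used in the example above."""
        started = self.started.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return "\n".join([
            f"Incident #{self.incident_id} - {self.title}",
            "",
            f"Status: {self.status}",
            f"Severity: {self.severity}",
            f"Impact: {self.impact}",
            f"Started: {started}",
            "",
            f"Updates: {self.updates}",
        ])


update = StatusUpdate(
    incident_id=1234,
    title="Payment Service Degraded",
    status="Investigating",
    severity="SEV 2",
    impact="50% of payment requests failing",
    started=datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc),
    updates="#incident-1234 (Slack)",
)
print(update.render())
```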

Mitigation

Priorities

Stop the bleeding (mitigate impact)
Restore service (short-term fix)
Fix root cause (long-term fix)

Mitigation Strategies

Roll back recent deployments
Use feature flags to disable affected features
Scale up resources
Failover to standby systems
Use cached data instead of live queries
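
As one illustration of the last strategy, the sketch below serves the last known value from a cache when the live query fails. The function name, the dictionary cache, and the broad `except` are assumptions for the example; a production version would catch specific errors and bound how stale the data may be.

```python
def fetch_with_cache_fallback(fetch_live, cache, key):
    """Return (value, is_stale). Prefer live data; fall back to the cache on failure."""
    try:
        value = fetch_live(key)
    except Exception:
        # Live path is failing: serve the last cached value if there is one.
        if key in cache:
            return cache[key], True
        raise  # nothing cached, so surface the original error
    cache[key] = value  # refresh the cache on every successful live read
    return value, False


def failing_live(key):
    """Stand-in for a live query during an outage (hypothetical)."""
    raise TimeoutError("database unavailable")


cache = {"price:42": 19.99}
value, stale = fetch_with_cache_fallback(failing_live, cache, "price:42")
```

Returning the `is_stale` flag lets callers show a notice or skip actions that need fresh data.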

Runbook-Driven Response

# Database Connection Pool Exhausted

## Symptoms
- Error: "Too many connections" (PostgreSQL reports this as "sorry, too many clients already")
- High response times
- Connection timeouts

## Checks
1. Check active connections: `SELECT count(*) FROM pg_stat_activity;`
2. Check connection pool metrics in dashboard

## Mitigation
1. Restart the application to reset the connection pool
2. Temporarily raise the database connection limit
3. Enable connection pooling (PgBouncer)

## Verification
- Monitor connection count
- Check error rates
- Confirm response times are returning to normal
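
The connection check in the runbook above becomes easier to act on when the count is compared with the server limit. The sketch below assumes the active count comes from the `pg_stat_activity` query and the limit from `SHOW max_connections;`. The 80% warning ratio is an arbitrary example, not a recommendation from this skill.

```python
def pool_pressure(active: int, max_connections: int, warn_ratio: float = 0.8):
    """Return (utilization, needs_attention) for the given connection counts."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    return utilization, utilization >= warn_ratio


# Example: 95 active connections against a limit of 100.
utilization, needs_attention = pool_pressure(95, 100)
```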

Resolution

Root Cause Analysis

Collect logs and metrics
Identify trigger (deployment, config change, traffic spike)
Trace the code path that caused the issue
Reproduce in non-production if possible

Fix Implementation

Test fix in staging
Deploy with monitoring
Verify resolution
Update runbooks if this is a new type of issue

Post-Incident Review

Timeline Construction

14:30 - Deployment #5678 to production
14:35 - Error rate alerts triggered
14:40 - Incident declared, rollback initiated
14:45 - Service restored
15:00 - Root cause identified (missing database index)
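
A timeline in this format can be parsed to derive response metrics such as time to detect and time to restore. The sketch below assumes every event falls on the same day and that events are identified by a keyword in their description. Both are simplifications.

```python
from datetime import datetime

TIMELINE = """\
14:30 - Deployment #5678 to production
14:35 - Error rate alerts triggered
14:40 - Incident declared, rollback initiated
14:45 - Service restored
15:00 - Root cause identified (missing database index)
"""


def parse_timeline(text):
    """Parse 'HH:MM - description' lines into (time, description) pairs."""
    events = []
    for line in text.strip().splitlines():
        stamp, _, description = line.partition(" - ")
        events.append((datetime.strptime(stamp, "%H:%M"), description))
    return events


def minutes_between(events, start_keyword, end_keyword):
    """Minutes from the first event matching start_keyword to the first matching end_keyword."""
    start = next(t for t, d in events if start_keyword in d)
    end = next(t for t, d in events if end_keyword in d)
    return int((end - start).total_seconds() // 60)


events = parse_timeline(TIMELINE)
time_to_detect = minutes_between(events, "Deployment", "alerts triggered")
time_to_restore = minutes_between(events, "Deployment", "Service restored")
```

For the timeline above, detection took 5 minutes and restoration 15 minutes, both measured from the deployment.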

Post-Mortem Template

# Post-Mortem: Incident #1234

## Summary
The payment service was degraded by a missing database index after a schema migration.

## Timeline
[Detailed timeline of events]

## Root Cause
Database query performance degraded after a schema migration added a new column without an index.

## Impact
- 50% payment failure rate
- ~$50K revenue impact
- 200 affected customers

## Lessons Learned
- Schema migrations need performance testing
- CI/CD should detect missing indexes

## Action Items
| Task | Owner | Due |
|------|-------|-----|
| Add index monitoring | @alice | 2024-01-22 |
| Update migration checklist | @bob | 2024-01-20 |

Blameless Culture

Focus on systems, not individuals
Assume good intent
Share learnings widely
Track action items to completion
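
Tracking action items to completion can start as a simple overdue check. The sketch below reuses the two items from the post-mortem template. The `ActionItem` class and the review date are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    task: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open action items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]


items = [
    ActionItem("Add index monitoring", "@alice", date(2024, 1, 22)),
    ActionItem("Update migration checklist", "@bob", date(2024, 1, 20)),
]
late = overdue(items, today=date(2024, 1, 21))  # hypothetical review date
```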

On-Call Best Practices

Rotation Design

Primary and secondary responders
Follow-the-sun for global teams
Limit consecutive shifts (max 7 days)
Compensate for off-hours work
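
One way to apply these rules is a round-robin schedule in which each shift has a primary and a secondary, and no shift exceeds seven days. This is a sketch, not a replacement for a scheduling tool; the function name and the responder "carol" are hypothetical.

```python
from datetime import date, timedelta


def build_rotation(responders, start, shifts, shift_days=7):
    """Return a list of (shift_start, primary, secondary) tuples."""
    if len(responders) < 2:
        raise ValueError("need at least two responders for primary and secondary")
    if not 1 <= shift_days <= 7:
        raise ValueError("shift length must be between 1 and 7 days")
    schedule = []
    for i in range(shifts):
        primary = responders[i % len(responders)]
        # The secondary is next in line, so they become the following primary.
        secondary = responders[(i + 1) % len(responders)]
        schedule.append((start + timedelta(days=i * shift_days), primary, secondary))
    return schedule


rotation = build_rotation(["alice", "bob", "carol"], date(2024, 1, 15), shifts=3)
```

Because the secondary of one shift becomes the primary of the next, each handoff goes to someone who already has context on open issues, and nobody is primary for two shifts in a row.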

Handoff Process

Document ongoing issues
Transfer context about alerts
Review recent changes
Share relevant runbooks

Preparation

Ensure your laptop is set up and charged
Test VPN access before your shift
Keep phone charged with alerts enabled
Have escalation contacts ready

Tools and Automation

Incident Management

PagerDuty, Opsgenie, VictorOps for alerting
Status page (Statuspage.io) for external communication
incident.io for chat-based incident coordination

Automation

Auto-remediation for known issues
Automated runbook execution
Self-healing systems where possible
ChatOps for common tasks
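
Auto-remediation for known issues can be sketched as a registry that maps alert names to handlers and escalates anything it does not recognize. The alert name, the alert dictionary shape, and the handler below are hypothetical. A real handler would call your deployment or orchestration tooling instead of returning a string.

```python
REMEDIATIONS = {}


def remediation(alert_name):
    """Decorator that registers a handler for a known alert name."""
    def register(func):
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("db_connection_pool_exhausted")  # hypothetical alert name
def restart_application(alert):
    # Placeholder action: a real handler would trigger the restart here.
    return f"restart requested for {alert['service']}"


def handle_alert(alert):
    """Run the registered remediation, or escalate when the alert is unknown."""
    handler = REMEDIATIONS.get(alert["name"])
    if handler is None:
        return "escalate to on-call"
    return handler(alert)
```

Escalating by default keeps automation limited to issues with a tested runbook.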

Examples

See the examples/ directory for:

runbooks/ - Common issue runbooks
postmortem-template.md - Post-mortem template
incident-response-playbook.md - Response procedures
oncall-checklist.md - On-call preparation checklist
