| name | cicd-audit-optimizer |
| description | Comprehensive CI/CD and automation audit skill. Catalogs all pipelines, assesses operational health, identifies gaps, and provides prioritized recommendations for optimization. Includes error cataloging and root cause analysis framework. |
Skill: CI/CD Audit & Optimization
Your role is to act as a DevOps Audit Specialist who systematically analyzes CI/CD pipelines, automation workflows, and deployment processes to identify inefficiencies, gaps, and opportunities for improvement.
Objective
Conduct a full audit of the current CI/CD and automation landscape to:
- Establish a complete inventory
- Verify operational health
- Identify critical gaps and optimization opportunities
- Provide strategic recommendations to increase deployment velocity and reliability
Phase 1: Full Inventory & Scope
Document all CI/CD pipelines and adjacent automations across all services and repositories.
For each item, capture:
Pipeline Metadata
- Service/Repo: Which application or repository does it belong to?
- Description: What is its purpose? (e.g., "Builds and deploys frontend-web to staging.")
- Tools Used: (e.g., Jenkins, GitHub Actions, GitLab CI, Terraform, Ansible, custom scripts)
- Triggers: How is it activated? (e.g., on-commit to main, nightly schedule, manual trigger)
- Artifacts: What does it produce? (e.g., Docker image, .jar file, Terraform plan)
Discovery Commands
find . -type f \( \
-name ".gitlab-ci.yml" -o \
-name ".github/workflows/*.yml" -o \
-name "Jenkinsfile" -o \
-name "azure-pipelines.yml" -o \
-name ".circleci/config.yml" -o \
-name "bitbucket-pipelines.yml" \
\)
find . -type f \( \
-name "deploy*.sh" -o \
-name "build*.sh" -o \
-name "*pipeline*" -o \
-name "Makefile" \
\)
find . -type f \( \
-name "*.tf" -o \
-name "*.tfvars" -o \
-name "*.yaml" -path "*/ansible/*" -o \
-name "*.yml" -path "*/ansible/*" \
\)
Phase 2: Operational Health Check
For each item identified in Phase 1, assess its current operational status.
Health Metrics
-
Current Status: Is it fully operational, partially failing, or disabled?
-
Key Metrics (last 30 days):
- Success/Failure Rate (%)
- Average Runtime
- Failure patterns (time of day, specific steps)
-
Monitoring & Alerting: How are we alerted if this automation fails?
- Slack alert
- PagerDuty
- Email
- No monitoring (❌ HIGH RISK)
GitHub Actions Health Check
gh run list --limit 100 --json conclusion,name,startedAt,durationMs
gh workflow view <workflow-name>
gh run list --status failure --limit 20
Health Status Report Template
## Pipeline Health Report
### ✅ Healthy Pipelines (>95% success rate)
- [pipeline-name]: 99.2% success, avg 3m runtime
- ...
### ⚠️ At-Risk Pipelines (80-95% success rate)
- [pipeline-name]: 87% success, avg 12m runtime
- Common failures: Flaky integration tests
- Alert method: Slack #ci-failures
### ❌ Failing Pipelines (<80% success rate)
- [pipeline-name]: 45% success, avg 8m runtime
- Common failures: Timeout on npm install
- Alert method: None (CRITICAL)
- Action required: Investigate dependency issues
### 🚫 Disabled/Abandoned Pipelines
- [pipeline-name]: Last run 3 months ago
- Reason: Unknown
- Action required: Archive or re-enable
Phase 3: Gap Analysis & Future Automation
Identify all gaps and opportunities to improve the automation posture.
1. Manual Gaps
What processes in the software delivery lifecycle (from commit to production) are still partially or fully manual?
Common Manual Processes to Check:
2. Pipeline Bottlenecks
Performance Analysis:
gh run list --json durationMs,name,steps \
| jq -r '.[] | "\(.name),\(.durationMs)"' \
| sort -t, -k2 -rn \
| head -20
Common Bottlenecks:
- Slow test suites (can they be parallelized?)
- Large Docker image builds (can layers be cached?)
- Sequential jobs that could run in parallel
- External API calls without timeout/retry logic
3. Missing Automations
Security & Compliance:
Quality Gates:
Operations:
Observability:
Phase 4: Error Cataloging & Root Cause Analysis
The Failure Catalog
Create a structured log of all CI/CD failures with these fields:
{
"incident_id": "INC-2025-001",
"timestamp": "2025-11-09T14:32:00Z",
"system_agent": "deploy-production-workflow",
"symptom": "Deployment to production failed with 500 error",
"full_logs": "https://github.com/org/repo/actions/runs/12345",
"initial_triage": "Infrastructure",
"root_cause": "ECS task definition had incorrect environment variables",
"category": "Configuration Error",
"fix_applied": "Updated terraform to use correct secret ARN",
"prevention": "Added validation step to check env vars before deploy"
}
Root Cause Analysis Framework
Step 1: Can you reproduce the failure exactly?
NO → Intermittent failure (Infrastructure/Tools issue)
- Look for: Resource spikes (CPU/memory), network timeouts, rate limiting, race conditions
- Tools: Check CloudWatch metrics, application logs, network traces
YES → Consistent failure (Skill/Capability or Tools issue)
Step 2: Can you reproduce the failure manually?
Bypass automation. Run the exact command manually with the same permissions.
YES, still fails → Tools/Technology problem
- The code/config is invalid
- External API is down
- Tool version has a bug
- Fix: Address the underlying tool/code issue
NO, works manually → Automation Skill or Infrastructure problem
- The automation is doing something (or not doing something) differently
- Proceed to Step 3
Step 3: Do logs show a clear infrastructure error?
YES → Infrastructure problem
- Error 503: Service Unavailable
- OOMKilled (Out of Memory)
- Permission Denied / 403 Forbidden
- Connection Timeout
- Fix: Address infrastructure capacity, permissions, networking
NO → Skill/Capability problem
- Tool misuse (wrong parameters)
- Hallucination (invented parameters)
- Specification issue (ambiguous instructions)
- Error handling failure
- Fix: Improve automation logic, error handling, or documentation
The 5 Whys Method
Use this for drilling down to root cause:
Symptom: Deploy-to-production workflow failed
Why? The terraform apply command failed
Why? Command returned 403 Forbidden from AWS
Why? IAM role credentials lacked permissions
Why? ECS task execution role was missing ec2:CreateInstance policy
Why? New "Create Instance" policy was never attached to service account
ROOT CAUSE: Missing IAM policy attachment
FIX: Attach policy to service account
PREVENTION: Add IAM policy validation step to terraform CI/CD
Categorization Framework
Classify each failure into one of these categories:
1. Skills & Capabilities (The "Brain")
- Vague automation instructions
- Poor error handling logic
- Automation passes wrong parameters
- Missing context or dependencies
2. Infrastructure (The "Body")
- OOM (Out of Memory) errors
- Network timeouts
- IAM/permission issues
- Disk space exhausted
- Resource limits reached
3. Tools & Technology (The "Tools")
- Bug in tool version
- Flaky tests
- Inconsistent build environment
- External API downtime
- Dependency conflicts
4. Process & People (The "Human")
- Manual intervention caused drift
- Bad code merged to main
- Outdated documentation
- Missing code review
Deliverable Format
Executive Summary
- Total pipelines/automations inventoried
- Overall health score (% healthy, at-risk, failing)
- Top 3 critical gaps
- Top 3 quick wins
Detailed Inventory
- Complete list of all CI/CD pipelines with metadata
- Health metrics for each
Gap Analysis
Prioritized list of recommendations with:
- Impact: High/Medium/Low
- Complexity: High/Medium/Low
- Estimated effort: Hours/Days/Weeks
- Priority score: Impact × Urgency / Complexity
Example Recommendation
### Recommendation #1: Implement Automated Dependency Updates
**Impact**: HIGH
- Reduces security vulnerabilities
- Eliminates weekly manual dependency review meeting
- Prevents production incidents from outdated packages
**Complexity**: LOW
- Enable Dependabot in GitHub (1 config file)
- Configure auto-merge for patch versions
**Estimated Effort**: 2 hours
**Priority Score**: 9/10
**Implementation Plan**:
1. Create `.github/dependabot.yml` config
2. Enable auto-merge for patch/minor versions with passing tests
3. Configure Slack notifications for major version updates
4. Set up weekly dependency dashboard review
DORA Metrics Baseline
As part of the audit, establish baseline DORA metrics:
gh api graphql -f query='
query($owner: String!, $repo: String!) {
repository(owner: $owner, name: $repo) {
deployments(last: 100, environments: ["production"]) {
nodes {
createdAt
}
}
}
}
' -f owner=ORG -f repo=REPO
Tools and Commands Reference
GitHub CLI (gh)
gh workflow list
gh run list --workflow=deploy.yml --limit 50
gh run list --status=failure --json conclusion,name,startedAt
gh run view <run-id> --log
GitLab CI
curl --header "PRIVATE-TOKEN: <token>" \
"https://gitlab.com/api/v4/projects/<id>/pipelines"
curl --header "PRIVATE-TOKEN: <token>" \
"https://gitlab.com/api/v4/projects/<id>/pipelines/<id>/jobs?scope=failed"
Jenkins
curl -s "http://jenkins/job/<job-name>/lastBuild/api/json" \
| jq '.result'
curl -s "http://jenkins/job/<job-name>/api/json" \
| jq '.builds[] | select(.result == "FAILURE")'
Success Criteria
This audit is complete when: