| name | cicd-audit-optimizer |
| description | Comprehensive CI/CD and automation audit skill. Catalogs all pipelines, assesses operational health, identifies gaps, and provides prioritized recommendations for optimization. Includes error cataloging and root cause analysis framework. |
Your role is to act as a DevOps Audit Specialist who systematically analyzes CI/CD pipelines, automation workflows, and deployment processes to identify inefficiencies, gaps, and opportunities for improvement.
Conduct a full audit of the current CI/CD and automation landscape to:
Document all CI/CD pipelines and adjacent automations across all services and repositories.
For each item, capture:
# Find all CI/CD configuration files
# (-name matches only the base name, so patterns containing "/" need -path)
find . -type f \( \
  -name ".gitlab-ci.yml" -o \
  -path "*/.github/workflows/*.yml" -o \
  -path "*/.github/workflows/*.yaml" -o \
  -name "Jenkinsfile" -o \
  -name "azure-pipelines.yml" -o \
  -path "*/.circleci/config.yml" -o \
  -name "bitbucket-pipelines.yml" \
\)
# Find all automation scripts
find . -type f \( \
  -name "deploy*.sh" -o \
  -name "build*.sh" -o \
  -name "*pipeline*" -o \
  -name "Makefile" \
\)
# Find all IaC files
find . -type f \( \
  -name "*.tf" -o \
  -name "*.tfvars" -o \
  -name "*.yaml" -path "*/ansible/*" -o \
  -name "*.yml" -path "*/ansible/*" \
\)
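The paths returned by the CI/CD configuration search can be labeled by platform, which gives the inventory a first column. This is a minimal sketch: the function name `ci_platform` and the label strings are illustrative assumptions, and it expects paths in the `./...` form that `find .` prints.

```shell
#!/usr/bin/env bash
# ci_platform PATH: print the CI system that a configuration file belongs to.
# The function name and label strings are illustrative, not part of any tool.
ci_platform() {
  case "$1" in
    */.github/workflows/*.yml|*/.github/workflows/*.yaml) echo "GitHub Actions" ;;
    */.gitlab-ci.yml)          echo "GitLab CI" ;;
    */Jenkinsfile)             echo "Jenkins" ;;
    */azure-pipelines.yml)     echo "Azure Pipelines" ;;
    */.circleci/config.yml)    echo "CircleCI" ;;
    */bitbucket-pipelines.yml) echo "Bitbucket Pipelines" ;;
    *)                         echo "unknown" ;;
  esac
}
```

One way to use it is to pipe the search results through a loop such as `while IFS= read -r f; do printf '%s\t%s\n' "$(ci_platform "$f")" "$f"; done`, which prints one tab-separated platform and path per line.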
For each item identified in Phase 1, assess its current operational status.
Current Status: Is it fully operational, partially failing, or disabled?
Key Metrics (last 30 days):
Monitoring & Alerting: How are we alerted if this automation fails?
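# Get recent workflow runs (add --created ">=YYYY-MM-DD" to restrict to the last 30 days)
# Run duration is not a JSON field of gh run list; derive it from startedAt and updatedAt
gh run list --limit 100 --json conclusion,name,startedAt,updatedAt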
# Get specific workflow details
gh workflow view <workflow-name>
# Get failed runs
gh run list --status failure --limit 20
## Pipeline Health Report
### ✅ Healthy Pipelines (>95% success rate)
- [pipeline-name]: 99.2% success, avg 3m runtime
- ...
### ⚠️ At-Risk Pipelines (80-95% success rate)
- [pipeline-name]: 87% success, avg 12m runtime
- Common failures: Flaky integration tests
- Alert method: Slack #ci-failures
### ❌ Failing Pipelines (<80% success rate)
- [pipeline-name]: 45% success, avg 8m runtime
- Common failures: Timeout on npm install
- Alert method: None (CRITICAL)
- Action required: Investigate dependency issues
### 🚫 Disabled/Abandoned Pipelines
- [pipeline-name]: Last run 3 months ago
- Reason: Unknown
- Action required: Archive or re-enable
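The success-rate bands used in the report template can be applied mechanically once run results are exported. The sketch below assumes a hypothetical intermediate format of one `pipeline-name,conclusion` line per run (for example, derived from `gh run list --json name,conclusion`); the function name `classify_pipelines` is illustrative, and pipeline names containing commas are not handled.

```shell
#!/usr/bin/env bash
# classify_pipelines: read "pipeline-name,conclusion" lines on stdin and print
# "pipeline-name,success-rate,band" using the report thresholds:
#   >95% healthy, 80-95% at-risk, <80% failing.
# The input format and function name are assumptions made for this sketch.
classify_pipelines() {
  awk -F, '
    NF >= 2 {
      total[$1]++
      if ($2 == "success") ok[$1]++
    }
    END {
      for (name in total) {
        rate = 100 * ok[name] / total[name]
        if (rate > 95)       band = "healthy"
        else if (rate >= 80) band = "at-risk"
        else                 band = "failing"
        printf "%s,%.1f,%s\n", name, rate, band
      }
    }
  ' | sort
}
```

Disabled or abandoned pipelines do not show up in this output because they have no recent runs; they still need to be identified from the inventory.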
Identify all gaps and opportunities to improve the automation posture.
What processes in the software delivery lifecycle (from commit to production) are still partially or fully manual?
Common Manual Processes to Check:
Performance Analysis:
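# Analyze GitHub Actions timing: slowest recent runs, duration in seconds
# (gh run list has no duration or steps JSON field; derive duration from startedAt/updatedAt)
gh run list --limit 100 --json name,startedAt,updatedAt \
  | jq -r '.[] | "\(.name),\((.updatedAt | fromdateiso8601) - (.startedAt | fromdateiso8601))"' \
  | sort -t, -k2 -rn \
  | head -20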
Common Bottlenecks:
Security & Compliance:
Quality Gates:
Operations:
Observability:
Create a structured log of all CI/CD failures with these fields:
{
"incident_id": "INC-2025-001",
"timestamp": "2025-11-09T14:32:00Z",
"system_agent": "deploy-production-workflow",
"symptom": "Deployment to production failed with 500 error",
"full_logs": "https://github.com/org/repo/actions/runs/12345",
"initial_triage": "Infrastructure",
"root_cause": "ECS task definition had incorrect environment variables",
"category": "Configuration Error",
"fix_applied": "Updated terraform to use correct secret ARN",
"prevention": "Added validation step to check env vars before deploy"
}
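A quick completeness check helps keep log entries consistent with the field list shown in the example record. The sketch below is a crude textual check, not a JSON parser: it only looks for each `"key":` pattern in a file. The function name `check_incident_fields` is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# check_incident_fields FILE: print each required key that is absent from an
# incident record and return non-zero if any is missing.
# Crude textual match on "key": patterns; it does not validate JSON syntax.
check_incident_fields() {
  local file="$1" missing=0 field
  for field in incident_id timestamp system_agent symptom full_logs \
               initial_triage root_cause category fix_applied prevention; do
    if ! grep -Eq "\"${field}\"[[:space:]]*:" "$file"; then
      echo "missing: ${field}"
      missing=1
    fi
  done
  return "$missing"
}
```

Because the match is textual, a key name that appears only inside another field's value could produce a false pass; a JSON-aware tool such as `jq` is the sturdier choice where it is available.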
NO → Intermittent failure (Infrastructure/Tools issue)
YES → Consistent failure (Skill/Capability or Tools issue)
Bypass automation. Run the exact command manually with the same permissions.
YES, still fails → Tools/Technology problem
NO, works manually → Automation Skill or Infrastructure problem
YES → Infrastructure problem
NO → Skill/Capability problem
Use this for drilling down to root cause:
Symptom: Deploy-to-production workflow failed
Why? The terraform apply command failed
Why? Command returned 403 Forbidden from AWS
Why? IAM role credentials lacked permissions
Why? ECS task execution role was missing the ec2:RunInstances permission
Why? The new policy granting that permission was never attached to the service account
ROOT CAUSE: Missing IAM policy attachment
FIX: Attach policy to service account
PREVENTION: Add IAM policy validation step to terraform CI/CD
Classify each failure into one of these categories:
1. Skills & Capabilities (The "Brain")
2. Infrastructure (The "Body")
3. Tools & Technology (The "Tools")
4. Process & People (The "Human")
Prioritized list of recommendations with:
### Recommendation #1: Implement Automated Dependency Updates
**Impact**: HIGH
- Reduces security vulnerabilities
- Eliminates weekly manual dependency review meeting
- Prevents production incidents from outdated packages
**Complexity**: LOW
- Enable Dependabot in GitHub (1 config file)
- Configure auto-merge for patch versions
**Estimated Effort**: 2 hours
**Priority Score**: 9/10
**Implementation Plan**:
1. Create `.github/dependabot.yml` config
2. Enable auto-merge for patch/minor versions with passing tests
3. Configure Slack notifications for major version updates
4. Set up weekly dependency dashboard review
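Step 1 of the plan can be as small as the following sketch. The `npm` ecosystem, the root directory, and the weekly interval are assumptions chosen for illustration, and the function name `write_dependabot_config` is hypothetical. Auto-merge and Slack notifications are configured separately and are not covered by this file.

```shell
#!/usr/bin/env bash
# write_dependabot_config [REPO_DIR]: write a starter Dependabot config.
# The npm ecosystem, "/" directory, and weekly interval are assumptions;
# adjust them to the package managers the repository actually uses.
write_dependabot_config() {
  local repo_dir="${1:-.}"
  mkdir -p "$repo_dir/.github"
  cat > "$repo_dir/.github/dependabot.yml" <<'EOF'
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
EOF
}
```

Calling `write_dependabot_config` with no argument writes the file into the current repository; a repository with several package managers would need one `updates` entry per ecosystem.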
As part of the audit, establish baseline DORA metrics:
# Deployment Frequency
gh api graphql -f query='
  query($owner: String!, $repo: String!) {
    repository(owner: $owner, name: $repo) {
      deployments(last: 100, environments: ["production"]) {
        nodes {
          createdAt
        }
      }
    }
  }
' -f owner=ORG -f repo=REPO
# Lead Time for Changes
# (Calculate time from commit to production deployment)
# Change Failure Rate
# (% of deployments that required hotfix/rollback)
# Mean Time to Recovery
# (Average time from incident detection to resolution)
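Two of the baseline figures reduce to simple arithmetic once the raw counts are available. The sketch below covers deployment frequency and change failure rate only; the function names are illustrative, the caller supplies the window length and the failure counts, and the window is assumed to be greater than zero days. Lead time and mean time to recovery need commit and incident timestamps and are not covered here.

```shell
#!/usr/bin/env bash
# deploys_per_week WINDOW_DAYS: read one deployment timestamp per line on
# stdin and print the average number of deployments per week in that window.
# Assumes WINDOW_DAYS > 0 and that stdin holds only deployments in the window.
deploys_per_week() {
  awk -v days="$1" 'NF { n++ } END { printf "%.2f\n", n * 7 / days }'
}

# change_failure_rate FAILED TOTAL: percentage of deployments that required
# a hotfix or rollback; prints "n/a" when there were no deployments.
change_failure_rate() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * failed / total
  }'
}
```

For example, three deployments in a 30-day window give `0.70` per week, and 3 failed deployments out of 20 give a change failure rate of `15.0`. The timestamps could come from the deployment query by adding `--jq '.data.repository.deployments.nodes[].createdAt'` to the `gh api` call, after filtering to the chosen window.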
# List all workflows
gh workflow list
# View workflow runs
gh run list --workflow=deploy.yml --limit 50
# View failed runs
gh run list --status=failure --json conclusion,name,startedAt
# Download workflow logs
gh run view <run-id> --log
# List recent pipelines and their status
curl --header "PRIVATE-TOKEN: <token>" \
  "https://gitlab.com/api/v4/projects/<project-id>/pipelines"
# Get failed jobs for one pipeline
curl --header "PRIVATE-TOKEN: <token>" \
  "https://gitlab.com/api/v4/projects/<project-id>/pipelines/<pipeline-id>/jobs?scope=failed"
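# Get the result of the last build via API
curl -s "http://jenkins/job/<job-name>/lastBuild/api/json" \
  | jq '.result'
# Get failed builds (the default response omits build results, so request them with tree=;
# -g stops curl from treating the square brackets as a glob)
curl -sg "http://jenkins/job/<job-name>/api/json?tree=builds[number,result,url]" \
  | jq '.builds[] | select(.result == "FAILURE")'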
This audit is complete when: