| name | fault-scaffolding |
| description | Assists with creating new fault mechanisms for ITBench scenarios, from incident description through Ansible implementation. Guides brainstorming, service selection, and code implementation consistent with existing fault patterns. |
Purpose
This skill guides you through the complete fault creation workflow:
- Incident Analysis - Understanding the IT incident to reproduce
- Application Brainstorming - Mapping incident to available applications
- Service Selection - Identifying target services/components
- Ansible Implementation - Writing fault injection code
When to Use This Skill
This skill auto-activates when:
- Working with files matching **/faults/index.json
- Editing files in **/faults/tasks/inject_*.yaml
- User mentions "fault scaffold", "new fault", "incident", "reproduce fault"
- Creating or modifying fault definitions
Workflow
Step 0: Check for Existing TODOs or Create Scaffolding
First, check if there are existing fault TODOs:
If TODOs exist → Skip to Step 4 to complete them.
If NO TODOs exist → Create new fault scaffolding (Steps 1-3).
Step 1: Incident Description & Fault Input Collection
1.0 Check for Similar Existing Faults
IMPORTANT: Before creating a new fault, always check if a similar fault already exists.
Search existing faults:
- Search the fault index by keywords:
  jq '.[] | select(.name | test("(?i)configmap|image|network|memory"))' \
    scenarios/sre/roles/documentation/files/library/faults/index.json
- List all fault injection tasks:
  ls scenarios/sre/roles/faults/tasks/inject_*.yaml
- Search fault tasks by pattern:
  grep -r "ConfigMap\|Image\|NetworkPolicy" scenarios/sre/roles/faults/tasks/
If similar fault exists:
- ✅ Reuse the existing fault for your scenario
- ✅ Reference the existing implementation pattern
- ✅ Extend the existing fault if needed (add new arguments)
If no similar fault exists:
- ✅ Proceed with creating a new fault
Why check first?
- Avoids duplicate faults
- Maintains consistency across scenarios
- Saves implementation time
- Leverages tested fault mechanisms
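The keyword check can also be scripted once the index is loaded. The sketch below is a minimal Python counterpart to the jq search, assuming the index is a list of entries that each carry a name field (as the jq filter implies). The helper find_similar_faults and the second sample entry are hypothetical, not part of ITBench.

```python
import re


def find_similar_faults(index, keywords):
    """Return entries whose name contains any keyword, case-insensitively."""
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [entry for entry in index if pattern.search(entry.get("name", ""))]


# Hypothetical sample data standing in for the parsed faults/index.json.
sample_index = [
    {
        "id": "nonexistent-kubernetes-workload-container-image",
        "name": "Nonexistent Kubernetes Workload Container Image",
    },
    {"id": "example-other-fault", "name": "Example Other Fault"},
]

matches = find_similar_faults(sample_index, ["configmap", "image"])
print([entry["id"] for entry in matches])
```

An empty result suggests that no fault with a similar name exists yet. It is still worth checking the injection task files, since names do not always describe the mechanism.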
1.1 Gather Incident Information
Ask the user to provide:
- Link to incident documentation (Jira, GitHub issue, etc.), OR
- Written description of the IT incident/problem
Extract key information:
- What is failing? (service, pod, connection, etc.)
- What observable symptoms? (high latency, errors, pod crashes, etc.)
- What is the root cause? (misconfiguration, resource exhaustion, network issue, etc.)
1.2 Collect Fault Details
Gather required information (similar to scaffolding/tasks/collect_fault_inputs.yaml):
- Fault Name - Human-readable name
  - Example: "Nonexistent Kubernetes Workload Container Image"
- Fault Description - Technical explanation of the mechanism
  - Example: "This fault injects a nonexistent image into a designated Kubernetes workload's container."
- Fault Expectation - Observable behavior when fault is active
  - Example: "The faulted pod(s) will enter the Pending state due to an ImagePullBackOff error. The workload will become unable to function."
- Tags - Read available tags from schema file:
  jq '.properties.tags.items.enum' \
    scenarios/sre/roles/documentation/files/library/faults/schema.json
  Choose the most appropriate tag(s) for the fault mechanism.
- Generate Fault ID - Derive from name (lowercase, kebab-case):
  import re
  fault_id = re.sub(r"[^a-z0-9-]", "", fault_name.lower().replace(" ", "-"))
  Example: "Nonexistent Kubernetes Workload Container Image"
  → "nonexistent-kubernetes-workload-container-image"
1.3 Create Scaffolding Files
First, read the fault schema to understand required fields:
cat scenarios/sre/roles/documentation/files/library/faults/schema.json
File 1: scenarios/sre/roles/documentation/files/library/faults/index.json
Add new fault entry with fields from schema:
jq '.required' scenarios/sre/roles/documentation/files/library/faults/schema.json
jq '.properties | keys' scenarios/sre/roles/documentation/files/library/faults/schema.json
Create entry matching the schema (required fields: arguments, description, expectation, name, platform, resources, solutions, tags):
{
"alerts": "TODO",
"arguments": "TODO",
"description": "<fault description from user>",
"expectation": "<fault expectation from user>",
"id": "<generated-fault-id>",
"name": "<fault name from user>",
"platform": "Kubernetes",
"resources": "TODO",
"solutions": "TODO",
"tags": ["<tag from user>"]
}
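Before saving the entry, it can help to confirm that every required field is present and to list which fields still hold the TODO placeholder. The following is a minimal sketch, assuming the required-field list quoted above (in practice read it from schema.json). The helper names are hypothetical.

```python
# Required fields as listed above; reading them from schema.json is preferable.
REQUIRED_FIELDS = [
    "arguments", "description", "expectation", "name",
    "platform", "resources", "solutions", "tags",
]


def missing_required_fields(entry, required=REQUIRED_FIELDS):
    """Return required field names that are absent from the entry."""
    return [field for field in required if field not in entry]


def todo_fields(entry):
    """Return the names of fields still set to the "TODO" placeholder."""
    return sorted(key for key, value in entry.items() if value == "TODO")


new_entry = {
    "alerts": "TODO",
    "arguments": "TODO",
    "description": "<fault description from user>",
    "expectation": "<fault expectation from user>",
    "id": "<generated-fault-id>",
    "name": "<fault name from user>",
    "platform": "Kubernetes",
    "resources": "TODO",
    "solutions": "TODO",
    "tags": ["<tag from user>"],
}

print(missing_required_fields(new_entry))
print(todo_fields(new_entry))
```

The second list doubles as the checklist of fields to complete after the Ansible task is implemented.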
File 2: scenarios/sre/roles/faults/tasks/inject_<fault-id>.yaml
IMPORTANT: The file name uses underscores only, not hyphens, so convert the hyphens in the fault ID to underscores (e.g., fault ID my-fault-name becomes inject_my_fault_name.yaml).
Create stub injection task:
---
- name: Print message
  ansible.builtin.debug:
    msg: This fault injection (<fault-id>) is unimplemented.
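The stub can be produced mechanically from the fault ID. The sketch below assumes the naming note means that hyphens in the kebab-case ID become underscores in the file name. Both helper names are hypothetical, and the sketch only builds strings, so writing the file is left to the caller.

```python
def stub_task_filename(fault_id):
    """Build the task file name, converting hyphens to underscores."""
    return "inject_" + fault_id.replace("-", "_") + ".yaml"


def stub_task_body(fault_id):
    """Return the placeholder Ansible task shown above for the given fault ID."""
    return (
        "---\n"
        "- name: Print message\n"
        "  ansible.builtin.debug:\n"
        f"    msg: This fault injection ({fault_id}) is unimplemented.\n"
    )


fault_id = "my-fault-name"
print(stub_task_filename(fault_id))
print(stub_task_body(fault_id))
```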
Display summary after creation:
✅ Fault scaffolding created!
Name: <fault name>
ID: <fault-id>
Tags: [<tags>]
Files created:
✓ faults/index.json (entry added with TODOs)
✓ faults/tasks/inject_<fault-id>.yaml (stub created)
Next steps:
1. Complete TODO fields in fault index (arguments, alerts, resources, solutions)
2. Implement Ansible injection task
3. Test fault injection
Step 2: Brainstorm Implementation
Read available applications dynamically from:
cat scenarios/sre/roles/applications/defaults/main/managers.yaml
Extract application details:
grep -E "^ [a-z_]+:" scenarios/sre/roles/applications/defaults/main/managers.yaml | sed 's/://g' | awk '{print $1}'
grep -A 15 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml
grep -A 15 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep -E "namespace:|url:|documentation:"
For each application found, dynamically extract:
- Application key/ID
- Kubernetes namespace
- Documentation URL
- Helm chart details (if applicable)
Application Preference:
- Prefer OpenTelemetry Demo (opentelemetry_demo / Astronomy Shop) for most scenarios - it's richer, more comprehensive, and better maintained
- Use BookInfo (book_info) only if the fault specifically requires its simpler architecture
Brainstorm questions:
- Which application best represents this incident scenario?
- What Kubernetes resources would be affected? (Deployment, Service, ConfigMap, NetworkPolicy, etc.)
- What changes would reproduce the incident? (image change, env var modification, resource limits, etc.)
- Reference similar faults checked in Step 1.0 for implementation patterns
Step 3: Identify Target Services
Dynamically identify services for the chosen application:
3.1 List all available applications
grep -E "^ [a-z_]+:" scenarios/sre/roles/applications/defaults/main/managers.yaml | sed 's/://g' | awk '{print $1}'
3.2 Get application configuration
grep -A 20 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml
3.3 Extract metadata
grep -A 20 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep -E "url:|documentation:" | head -1
grep -A 20 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep "namespace:" | head -1
3.4 Fetch service architecture from documentation (REQUIRED)
Using the documentation URL from step 3.3:
- Navigate to the application's documentation URL
- Study the architecture diagram - This is critical for:
- Understanding service dependencies and relationships
- Identifying which services communicate with each other
- Planning fault propagation paths for scenarios
- Look for:
- Architecture diagrams (most important!)
- Service listings and descriptions
- Component documentation and roles
- Deployment guides
- Identify available services and their roles
3.4.1 Optional: Ground in Real Deployment
Ask the user:
Would you like to deploy the application to a live cluster to get actual deployment names, service names, and resource details? This ensures accuracy but requires a running Kubernetes cluster.
If YES:
- Ask for kubeconfig path:
  What is the path to your kubeconfig file? (e.g., ~/.kube/config)
- Set kubeconfig and deploy (outputs will be shown):
export KUBECONFIG=<path-from-user>
cd scenarios/sre
make deploy-tools
make deploy-applications
Note: Both commands will display their complete output including:
- Ansible playbook task execution
- Kubernetes resource creation status
- Any warnings or errors
- Query live cluster for actual resource names:
NAMESPACE=<namespace>
kubectl get deployments -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'
kubectl get services -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'
kubectl get pods -n $NAMESPACE --show-labels
kubectl get configmaps -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}'
kubectl get deployment <deployment-name> -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[*].name}'
- Use these actual names when implementing the fault in Step 4
If NO: Continue with documentation-based discovery
3.5 Identify target service
Consider:
- Which service exhibits the problem?
- Which service is the root cause?
- Are multiple services affected?
Step 4: Write Ansible Implementation
Create the injection task file following patterns from existing faults.
File Location
scenarios/sre/roles/faults/tasks/inject_<fault-id>.yaml
Important Guidelines
CRITICAL - Resource Naming:
- DO NOT create new resources with names that reveal the fault mechanism
- DO NOT use names like fault-injector, chaos-config, memory-leak-pod, high-cpu-workload
- DO use neutral, application-appropriate names that blend in with existing resources
- GOOD: app-config, sidecar-processor, cache-helper, data-processor
- BAD: fault-config, crash-trigger, latency-injector
Rationale: The agent solving scenarios should diagnose the issue based on symptoms and observability, not by discovering obviously-named fault injection resources.
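A quick lint over planned resource names can catch accidental giveaways. The sketch below is a hypothetical helper, not part of ITBench. Its deny-list is an assumption derived only from the bad examples in this section and is deliberately strict, so extend or relax it to fit your scenarios.

```python
import re

# Assumed deny-list, taken from the BAD examples above; not an official list.
REVEALING_TOKENS = {
    "fault", "chaos", "crash", "leak", "latency", "injector", "trigger", "cpu",
}


def revealing_tokens(resource_name):
    """Return the tokens in a resource name that hint at the fault mechanism."""
    tokens = re.split(r"[^a-z0-9]+", resource_name.lower())
    return sorted(token for token in tokens if token in REVEALING_TOKENS)


for name in ["app-config", "cache-helper", "crash-trigger", "memory-leak-pod"]:
    flagged = revealing_tokens(name)
    print(name, "->", flagged if flagged else "ok")
```

A token match is only a hint. A name can still be revealing without containing any listed word, so review the names by hand as well.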
Standard Pattern
---
- name: Retrieve [resource-type]
  kubernetes.core.k8s_info:
    kubeconfig: "{{ faults_cluster.kubeconfig }}"
    api_version: "{{ injection_task.args.kubernetesObject.apiVersion }}"
    kind: "{{ injection_task.args.kubernetesObject.kind }}"
    name: "{{ injection_task.args.kubernetesObject.metadata.name }}"
    namespace: "{{ injection_task.args.kubernetesObject.metadata.namespace }}"
  register: faults_workload

- name: Validate that [resource-type] exists
  ansible.builtin.assert:
    that:
      - faults_workload.api_found
      - faults_workload.resources | ansible.builtin.length == 1
    fail_msg: Unable to find [resource-type]
    success_msg: Found [resource-type]

- name: Wait for [resource-type] to update
  kubernetes.core.k8s_info:
    api_version: "{{ faults_workload.resources[0].apiVersion }}"
    kubeconfig: "{{ faults_cluster.kubeconfig }}"
    kind: "{{ faults_workload.resources[0].kind }}"
    name: "{{ faults_workload.resources[0].metadata.name }}"
    namespace: "{{ faults_workload.resources[0].metadata.namespace }}"
  register: faults_patched_workload
  until:
    - faults_patched_workload.api_found
    - faults_patched_workload.resources | ansible.builtin.length == 1
  delay: 15
  retries: 20
Common Fault Types - Discover Dynamically
Instead of hardcoded examples, discover existing fault patterns:
Step 1: List All Existing Fault Injection Tasks
ls scenarios/sre/roles/faults/tasks/inject_*.yaml | sort
Step 2: Search by Fault Category/Pattern
Find Image-Related Faults:
ls scenarios/sre/roles/faults/tasks/inject_*image*.yaml
grep -l "image:" scenarios/sre/roles/faults/tasks/inject_*.yaml
Find Configuration/ConfigMap Faults:
ls scenarios/sre/roles/faults/tasks/inject_*config*.yaml
grep -l "ConfigMap\|environment" scenarios/sre/roles/faults/tasks/inject_*.yaml
Find Resource Faults:
ls scenarios/sre/roles/faults/tasks/inject_*resource*.yaml
grep -l "ResourceQuota\|limits\|requests" scenarios/sre/roles/faults/tasks/inject_*.yaml
Find Network Faults:
ls scenarios/sre/roles/faults/tasks/inject_*network*.yaml
grep -l "NetworkPolicy" scenarios/sre/roles/faults/tasks/inject_*.yaml
Find Chaos Mesh Faults:
ls scenarios/sre/roles/faults/tasks/inject_*chaos*.yaml
grep -l "chaos-mesh.org" scenarios/sre/roles/faults/tasks/inject_*.yaml
Step 3: Read and Study Relevant Fault Files
cat scenarios/sre/roles/faults/tasks/inject_<fault-name>.yaml
grep -r "kind: Deployment" scenarios/sre/roles/faults/tasks/
grep -r "kind: ConfigMap" scenarios/sre/roles/faults/tasks/
grep -r "kind: NetworkPolicy" scenarios/sre/roles/faults/tasks/
Step 4: Find Faults by Tag
jq '.[] | select(.tags[] | contains("Networking"))' scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.tags[] | contains("Performance"))' scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.tags[] | contains("Deployment"))' scenarios/sre/roles/documentation/files/library/faults/index.json
Step 5: Match Fault Mechanism to Incident
Based on your incident analysis (Step 1), identify which existing faults have similar mechanisms:
- Image issues → search for image-related faults
- Config problems → search for ConfigMap/environment faults
- Network connectivity → search for NetworkPolicy faults
- Resource exhaustion → search for quota/limits faults
- Application crashes → search for code/chaos faults
Important Implementation Notes
- Always add ITBench label: app.kubernetes.io/managed-by: ITBench to created resources
- Use registered variable: Access original resource as faults_workload.resources[0]
- Validate before modifying: Assert resource exists
- Wait for manifestation: Use k8s_info with until conditions
- Reference similar faults: Search existing tasks for patterns
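When manifests for new resources are assembled programmatically before being handed to Ansible, the label can be applied in one place. The following is a minimal sketch, assuming manifests are plain dictionaries. The helper with_itbench_label and the sample ConfigMap (including its tier label) are hypothetical.

```python
import copy

ITBENCH_LABEL = {"app.kubernetes.io/managed-by": "ITBench"}


def with_itbench_label(manifest):
    """Return a copy of the manifest with the ITBench label merged in."""
    labelled = copy.deepcopy(manifest)
    labels = labelled.setdefault("metadata", {}).setdefault("labels", {})
    labels.update(ITBENCH_LABEL)
    return labelled


# Hypothetical manifest for a newly created resource.
config_map = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "app-config", "labels": {"tier": "backend"}},
}

print(with_itbench_label(config_map)["metadata"]["labels"])
```

The function works on a deep copy, so the caller's original manifest stays unchanged and existing labels are preserved.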
Completing the Fault Index - Dynamic Discovery
After implementing the Ansible task, complete the fault entry in:
File: scenarios/sre/roles/documentation/files/library/faults/index.json
Step 1: Verify Required Fields from Schema
jq '.required' scenarios/sre/roles/documentation/files/library/faults/schema.json
jq '.properties | to_entries[] | {key: .key, type: .value.type, required: .value.required}' \
scenarios/sre/roles/documentation/files/library/faults/schema.json
Step 2: Discover Argument Schema Patterns from Existing Faults
Find similar faults to use as templates:
jq '.[] | select(.arguments.jsonSchema.required[]? | contains("kubernetesObject")) | {id, required: .arguments.jsonSchema.required}' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.id == "<similar-fault-id>") | .arguments' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '[.[] | .arguments.jsonSchema.required] | unique' \
scenarios/sre/roles/documentation/files/library/faults/index.json
Common patterns discovered:
jq '.[] | select(.arguments.jsonSchema.required == ["kubernetesObject"]) | .id' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.arguments.jsonSchema.required | contains(["kubernetesObject", "container"])) | .id' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.arguments.jsonSchema.required | length > 2) | {id, required: .arguments.jsonSchema.required}' \
scenarios/sre/roles/documentation/files/library/faults/index.json
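The same grouping can be done in one pass over the loaded index. The sketch below assumes each completed entry stores its required argument names under arguments.jsonSchema.required, as the jq filters above imply. It skips scaffolded entries whose arguments field is still the "TODO" string. The helper name and the sample IDs are hypothetical.

```python
from collections import defaultdict


def group_by_required_args(index):
    """Map each distinct set of required arguments to the fault IDs using it."""
    groups = defaultdict(list)
    for entry in index:
        arguments = entry.get("arguments")
        if not isinstance(arguments, dict):
            continue  # e.g. a scaffolded entry that still says "TODO"
        required = arguments.get("jsonSchema", {}).get("required", [])
        groups[tuple(sorted(required))].append(entry["id"])
    return dict(groups)


# Hypothetical sample data standing in for the parsed faults/index.json.
sample_index = [
    {"id": "fault-a",
     "arguments": {"jsonSchema": {"required": ["kubernetesObject"]}}},
    {"id": "fault-b",
     "arguments": {"jsonSchema": {"required": ["kubernetesObject", "container"]}}},
    {"id": "fault-c", "arguments": "TODO"},
]

for required, fault_ids in group_by_required_args(sample_index).items():
    print(required, fault_ids)
```

Picking a template from the group whose required arguments match your fault keeps the new argument schema consistent with existing ones.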
Step 3: Discover Alert Types Dynamically
Get available alert types from schema:
jq '.properties.alerts.properties.application.items.enum' \
scenarios/sre/roles/documentation/files/library/faults/schema.json
jq '.properties.alerts.properties.goldenSignal.items.enum' \
scenarios/sre/roles/documentation/files/library/faults/schema.json
Find which faults use which alerts:
jq '.[] | select(.alerts.application[]? == "KubePodCrashLooping") | .id' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '[.[] | .alerts] | unique' \
scenarios/sre/roles/documentation/files/library/faults/index.json
Step 3.1: Registering New Alerts
If your fault introduces a NEW alert (not in the schema enum), you must register it in THREE locations:
- Fault Schema - Add to alert enum
- Alerts Monitoring Playbook - Add to alert detection (3 locations)
- PrometheusRules Template - Define the actual alert rule
Example: For KafkaConsumerGroupInactive alert, you would:
- Add to fault schema enum (alphabetically)
- Add to monitoring playbook (3 locations)
- Define PrometheusRule with expression:
kafka_consumergroup_members{namespace="..."} == 0
See scenario-scaffolding skill section 3.6.1 for detailed checklist and examples.
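To decide whether registration is needed at all, compare the alerts in the fault entry with the enums from the schema. The following is a minimal sketch, assuming the entry's alerts are grouped by category (application, goldenSignal) as the jq queries above suggest. The helper name and the sample enum contents are hypothetical.

```python
def unregistered_alerts(entry_alerts, schema_enums):
    """Return alerts per category that the schema enum does not list yet."""
    missing = {}
    for category, names in entry_alerts.items():
        known = set(schema_enums.get(category, []))
        new_names = [name for name in names if name not in known]
        if new_names:
            missing[category] = new_names
    return missing


# Hypothetical enum contents; read the real ones from schema.json.
schema_enums = {"application": ["KubePodCrashLooping"], "goldenSignal": []}
entry_alerts = {
    "application": ["KafkaConsumerGroupInactive", "KubePodCrashLooping"],
}

print(unregistered_alerts(entry_alerts, schema_enums))
```

Any alert reported here needs all three registration steps before the fault entry will validate against the schema.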
Step 4: Discover Solution Patterns from Existing Faults
Find common solution templates:
jq '.[] | select(.solutions.templates[].steps[].command? | contains("rollout undo")) | .id' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.id == "<similar-fault-id>") | .solutions' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '[.[] | .solutions.templates[].steps[].command] | unique' \
scenarios/sre/roles/documentation/files/library/faults/index.json
Step 5: Use Similar Fault as Template
Complete workflow:
jq '.[] | select(.name | contains("ConfigMap") or contains("Image")) | {id, name}' \
scenarios/sre/roles/documentation/files/library/faults/index.json
jq '.[] | select(.id == "<similar-fault-id>")' \
scenarios/sre/roles/documentation/files/library/faults/index.json > /tmp/template.json
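Once a template entry has been exported, the new entry can be derived from it by overwriting only the identifying fields. The sketch below assumes the template has been parsed into a dictionary. The helper entry_from_template and the sample values are hypothetical, and the inherited arguments, alerts, resources, and solutions still need manual review.

```python
import copy


def entry_from_template(template, fault_id, name, description, expectation):
    """Copy a template entry and replace its identifying fields."""
    entry = copy.deepcopy(template)
    entry.update({
        "id": fault_id,
        "name": name,
        "description": description,
        "expectation": expectation,
    })
    return entry


# Hypothetical template, standing in for the contents of /tmp/template.json.
template = {
    "id": "similar-fault-id",
    "name": "Similar Fault",
    "description": "Template description",
    "expectation": "Template expectation",
    "platform": "Kubernetes",
    "tags": ["<tag from user>"],
}

new_entry = entry_from_template(
    template, "my-new-fault", "My New Fault",
    "<fault description from user>", "<fault expectation from user>",
)
print(new_entry["id"], new_entry["platform"])
```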
Reference Examples - Discover Dynamically
Find Faults to Study
Discover simple faults (good starting points):
find scenarios/sre/roles/faults/tasks -name "inject_*.yaml" -exec wc -l {} \; | sort -n | head -10
ls scenarios/sre/roles/faults/tasks/inject_*image*.yaml
ls scenarios/sre/roles/faults/tasks/inject_*environment*.yaml
ls scenarios/sre/roles/faults/tasks/inject_*node*.yaml
Discover complex faults (advanced patterns):
find scenarios/sre/roles/faults/tasks -name "inject_*.yaml" -exec wc -l {} \; | sort -n | tail -10
grep -l "kubernetes.core.k8s:" scenarios/sre/roles/faults/tasks/inject_*.yaml | xargs grep -c "kubernetes.core.k8s:" | grep -v ":1$"
grep -l "chaos-mesh.org" scenarios/sre/roles/faults/tasks/inject_*.yaml
grep -l "node\|cordon\|drain" scenarios/sre/roles/faults/tasks/inject_*.yaml
Study faults by complexity:
for file in scenarios/sre/roles/faults/tasks/inject_*.yaml; do
echo "$(grep -c "^- name:" "$file") steps: $(basename "$file")"
done | sort -n
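The step count used in the loop above can be reproduced in Python when the task files are already being read by a script. The following is a minimal sketch with a hypothetical helper name. Like the grep pattern, it counts only tasks that start at column zero, so tasks nested inside a block are not counted.

```python
def count_top_level_tasks(yaml_text):
    """Count lines starting with "- name:" at column zero."""
    return sum(1 for line in yaml_text.splitlines() if line.startswith("- name:"))


# Hypothetical task file content with one nested task inside a block.
sample_task_file = """---
- name: Retrieve workload
  kubernetes.core.k8s_info:
    kind: Deployment
- name: Patch workload
  block:
    - name: Nested step
      ansible.builtin.debug:
        msg: nested
"""

print(count_top_level_tasks(sample_task_file))
```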
Anti-Patterns
❌ Don't:
- Skip brainstorming with available applications
- Guess service names without checking documentation
- Create faults without checking for similar existing ones
- Forget the app.kubernetes.io/managed-by: ITBench label
- Use hardcoded namespace/service names in fault index (use Jinja2 templates)
- Create resources with obvious fault-revealing names (e.g., fault-injector, chaos-config, crash-trigger)
✅ Do:
- Start with incident description
- Map to available applications and their documentation
- Reference existing similar fault implementations
- Test in a real cluster before finalizing
- Follow consistent Ansible patterns
- Use Jinja2 templates in solutions for reusability
- Use neutral, application-appropriate names for any new resources (e.g., app-config, cache-helper, data-processor)
Automatic Transition to Scenario Creation
IMPORTANT: After completing fault scaffolding, automatically proceed to scenario creation using the scenario-scaffolding skill.
Do not wait for user prompt - transition immediately to:
- Apply this fault to the identified service/component
- Populate scenario files with application, faults, and tools
- Generate groundtruth.yaml with DSL format
This creates a complete end-to-end workflow: Incident → Fault → Scenario → Ground Truth