scenario-scaffolding
// Assists with creating complete ITBench scenarios by applying fault mechanisms to specific services, populating scenario files, and generating groundtruth DSL with fault propagations and alert predictions.
| name | scenario-scaffolding |
| description | Assists with creating complete ITBench scenarios by applying fault mechanisms to specific services, populating scenario files, and generating groundtruth DSL with fault propagations and alert predictions. |
This skill guides you through the complete scenario creation workflow:
This skill auto-activates when:
- **/scenarios/index.json
- **/scenarios/files/scenario_*/
Before starting scenario scaffolding:
Dynamically identify available services for your chosen application:
# Extract all application keys from managers.yaml
grep -E "^ [a-z_]+:" scenarios/sre/roles/applications/defaults/main/managers.yaml | sed 's/://g' | awk '{print $1}'
Application Preference:
- Prefer OpenTelemetry Demo (opentelemetry_demo / Astronomy Shop) for most scenarios: it is richer, more comprehensive, and better maintained.
- Use BookInfo (book_info) only if the fault specifically requires its simpler architecture.
# Replace <app-key> with your chosen application
# Get namespace
grep -A 15 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep "namespace:" | awk '{print $2}'
# Get documentation URL
grep -A 15 "^ <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep -E "url:|documentation:"
# Get the namespace from step 1.2
NAMESPACE="<namespace>"
# Find all Deployments in the application
# (-A 5 keeps the metadata lines that follow each "kind:" match so the name can be extracted;
# the $NAMESPACE filter only keeps lines whose file path or content contains that string)
grep -r -A 5 "kind: Deployment" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'
# Find all Services
grep -r -A 5 "kind: Service" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'
# Find all StatefulSets
grep -r -A 5 "kind: StatefulSet" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'
Ask the user:
Would you like to deploy the application to a live cluster to get actual deployment names, service names, and resource details? This ensures accuracy but requires a running Kubernetes cluster.
If YES:
Ask for kubeconfig path:
What is the path to your kubeconfig file? (e.g., ~/.kube/config)
Set kubeconfig and deploy (outputs will be shown):
# Set the kubeconfig
export KUBECONFIG=<path-from-user>
# Navigate to scenarios directory
cd scenarios/sre
# Deploy tools - outputs will be displayed
make deploy-tools
# Deploy applications - outputs will be displayed
make deploy-applications
Note: Both commands will display their complete output including:
Query live cluster for actual resource names:
# Get actual deployments
kubectl get deployments -n <namespace> -o jsonpath='{.items[*].metadata.name}'
# Get actual services
kubectl get services -n <namespace> -o jsonpath='{.items[*].metadata.name}'
# Get actual pods (with labels)
kubectl get pods -n <namespace> --show-labels
# Get actual configmaps
kubectl get configmaps -n <namespace> -o jsonpath='{.items[*].metadata.name}'
Use these actual names in your scenario instead of guessing from manifests.
If NO: Continue with manifest-based discovery from Step 1.3
Consult the documentation URL from step 1.2 to understand:
IMPORTANT: Study the architecture diagram carefully - it's essential for:
Select the service that best demonstrates the fault mechanism.
The scenario consists of multiple files in scenarios/sre/roles/scenarios/files/scenario_<ID>/:
File: scenarios/sre/roles/documentation/files/library/scenarios/index.json
Structure (discovered dynamically):
First, get available tags and platforms:
# Get valid tags
jq '.properties.tags.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json
# Get valid platforms
jq '.properties.platforms.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json
# Get valid categories
jq '.properties.category.enum' scenarios/sre/roles/documentation/files/library/scenarios/schema.json
Then construct the scenario entry:
{
"id": <next-available-id>,
"category": "<category-from-schema>", // e.g., "sre", "finops", "ciso"
"complexity": "<complexity>", // "low", "medium", "high"
"description": "User-facing incident description",
"environment": {
"applications": [
{
"id": "<application-id-from-step-1>" // Prefer "opentelemetry-demo" over "book-info"
}
]
},
"platforms": ["<platform-from-schema>"],
"tags": ["<tag1>", "<tag2>"], // Match fault tags from schema
// Populated in this step:
"disruptions": [
{
"injections": [
{
"id": "<fault-id-from-faults-index>",
"args": {
// Arguments matching the fault's JSON schema
"kubernetesObject": {
"apiVersion": "apps/v1",
"kind": "<kind>",
"metadata": {
"name": "<service-name-from-step-1>",
"namespace": "<namespace-from-step-1>"
}
}
// Additional args based on fault schema
}
}
],
// Optional: post-injection hooks
"waitFor": {
"postInjection": [
{
"id": "restart-kubernetes-workload",
"args": { /* restart args */ }
}
]
}
}
],
"alerts": [], // Usually empty, unless specific alert requirements
"solutions": [
[
{
"steps": [
{
"text": "Step description",
"command": "kubectl -n <namespace> <action> <kind>/<name>"
}
]
}
]
]
}
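The `<next-available-id>` placeholder can be filled by scanning the existing index. The following is a minimal sketch, assuming the index is a JSON array of objects with integer `id` fields (as the `jq` queries in this skill suggest) and that "next available" means one more than the current maximum. `next_scenario_id` and the sample data are hypothetical and not part of ITBench.

```python
import json


def next_scenario_id(index_entries):
    """Return the highest existing integer id plus one (1 for an empty index)."""
    ids = [entry["id"] for entry in index_entries if isinstance(entry.get("id"), int)]
    return max(ids, default=0) + 1


# Inline sample standing in for the scenarios index.json (hypothetical data).
sample_index = json.loads('[{"id": 1}, {"id": 20}, {"id": 40}]')
print(next_scenario_id(sample_index))
```

In practice the list would come from `json.load` on the scenarios `index.json` file referenced above.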
Discover disruption patterns from existing scenarios:
# Find scenarios using your chosen fault
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, disruptions}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json
# Examine a specific scenario's disruptions
SCENARIO_ID="<scenario-number>"
jq --arg id "$SCENARIO_ID" '.[] | select(.id == ($id | tonumber)) | .disruptions' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json
Single Fault Injection (template):
{
"disruptions": [
{
"injections": [
{
"id": "<fault-id>",
"args": {
"kubernetesObject": {
"apiVersion": "apps/v1",
"kind": "<Deployment|StatefulSet>",
"metadata": {
"name": "<service-name>",
"namespace": "<namespace>"
}
},
"container": {
"name": "<container-name>"
}
}
}
]
}
]
}
Multiple Fault Injections (template):
{
"disruptions": [
{
"injections": [
{
"id": "<fault-id>",
"args": {
"kubernetesObject": {
"apiVersion": "apps/v1",
"kind": "Deployment",
"metadata": {"name": "<service-1>", "namespace": "<namespace>"}
}
}
},
{
"id": "<fault-id>",
"args": {
"kubernetesObject": {
"apiVersion": "apps/v1",
"kind": "Deployment",
"metadata": {"name": "<service-2>", "namespace": "<namespace>"}
}
}
}
]
}
]
}
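When the same fault is applied to several workloads, the repeated injection objects can be generated rather than copied by hand. This is a sketch only: `build_disruptions` is a hypothetical helper, and it assumes the fault needs nothing beyond the `kubernetesObject` argument shown in the template above. Check the fault's JSON schema for additional required arguments.

```python
import json


def build_disruptions(fault_id, namespace, workloads, kind="Deployment"):
    """Build one disruption that injects the same fault into each workload."""
    injections = [
        {
            "id": fault_id,
            "args": {
                "kubernetesObject": {
                    "apiVersion": "apps/v1",
                    "kind": kind,
                    "metadata": {"name": name, "namespace": namespace},
                }
            },
        }
        for name in workloads
    ]
    return {"disruptions": [{"injections": injections}]}


# Placeholder values; substitute a real fault id and real workload names.
snippet = build_disruptions("<fault-id>", "<namespace>", ["<service-1>", "<service-2>"])
print(json.dumps(snippet, indent=2))
```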
With waitFor Hooks (find examples dynamically):
# Find scenarios using waitFor patterns
jq '.[] | select(.disruptions[].waitFor != null) | {id, disruptions}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50
When to use waitFor:
Common waitFor IDs:
- restart-kubernetes-workload - Restart a deployment
- wait-kubernetes-workload-ready - Wait for pod readiness
Discover solution patterns from the fault definition:
# Get solutions from the fault entry
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.id == $fault) | .solutions' \
scenarios/sre/roles/documentation/files/library/faults/index.json
Adapt fault solutions to scenario context (replace Jinja2 templates with actual values):
Single-step solution (template):
{
"solutions": [
[
{
"steps": [
{
"text": "Revert the last change done to the manifest.",
"command": "kubectl -n <namespace> rollout undo <kind>/<name>"
}
]
}
],
[
{
"steps": [
{
"text": "Manually edit the manifest and fix the issue.",
"command": "kubectl -n <namespace> edit <kind> <name>"
}
]
}
]
]
}
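Replacing the Jinja2 templates in a fault's solutions with concrete values can be done by hand, or roughly as sketched here. This is not a Jinja2 renderer: it only handles plain `{{ name }}` placeholders, and the template text and variable names (`namespace`, `kind`, `name`) are hypothetical, since the real expressions depend on the fault definition.

```python
import re

# Matches simple placeholders such as {{ namespace }}; filters and expressions are not handled.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}")


def fill_placeholders(text, values):
    """Replace known {{ name }} placeholders and leave unknown ones untouched."""
    def substitute(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical template and values, for illustration only.
template = "kubectl -n {{ namespace }} rollout undo {{ kind }}/{{ name }}"
values = {"namespace": "demo", "kind": "deployment", "name": "checkout"}
print(fill_placeholders(template, values))
```

Leaving unknown placeholders in place makes any variable that was not substituted easy to spot during review.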
Multi-step solutions (find examples):
# Find scenarios with multi-step solutions
jq '.[] | select(.solutions[][].steps | length > 1) | {id, solutions}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -100
Based on disruptions, identify tools needed (captured by scaffolding from scenarios/sre/roles/scaffolding/tasks/generate_new_scenario_files.yaml):
Chaos Mesh Detection:
enable_chaos_mesh: |-
{{
scaffolding_skeleton_scenario.disruptions |
map(attribute="injections") |
ansible.builtin.flatten |
ansible.builtin.selectattr("id", "==", "scheduled-chaos-mesh-experiment") |
ansible.builtin.length > 0
}}
This is automatically determined when scenario files are generated.
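For readers less familiar with Jinja2 filters, the expression above amounts to the following check, written as a plain Python sketch. `needs_chaos_mesh` and the sample scenario are hypothetical; the actual decision is made by the Ansible task.

```python
def needs_chaos_mesh(scenario):
    """True if any injection uses the scheduled-chaos-mesh-experiment fault."""
    injections = [
        injection
        for disruption in scenario.get("disruptions", [])
        for injection in disruption.get("injections", [])
    ]
    return any(i.get("id") == "scheduled-chaos-mesh-experiment" for i in injections)


# Hypothetical scenario with two disruptions.
scenario = {
    "disruptions": [
        {"injections": [{"id": "<fault-id>", "args": {}}]},
        {"injections": [{"id": "scheduled-chaos-mesh-experiment", "args": {}}]},
    ]
}
print(needs_chaos_mesh(scenario))
```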
Scenarios require two ground truth files in different formats:
File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth.yaml
This is a simplified format that focuses on affected entities and solutions.
Structure:
---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
name: scenario-<ID>
spec:
# List of expected alerts
alerts:
- labels: {}
name: <alert-name> # e.g., KubePodNotReady, KubePodCrashLooping
# List of affected Kubernetes resources (usually root cause)
entities:
- apiVersion: <api-version>
kind: <kind>
metadata:
name: <resource-name>
namespace: <namespace>
# Solution paths (same as scenario index)
solutions:
- - steps:
- command: <kubectl command>
text: <step description>
Example (find real examples dynamically):
# View an existing groundtruth.yaml for reference
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth.yaml
# Or examine multiple scenarios
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth.yaml | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && cat {}'
Template:
---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
name: scenario-<ID>
spec:
alerts:
- labels: {}
name: <alert-name> # From fault's alerts.application
entities:
- apiVersion: apps/v1
kind: <Deployment|StatefulSet|etc>
metadata:
name: <service-name>
namespace: <namespace>
solutions:
- - steps:
- command: kubectl -n <namespace> rollout undo <kind>/<name>
text: Revert the last change done to the manifest.
- steps:
- command: kubectl -n <namespace> edit <kind> <name>
text: Manually edit the manifest and fix the issue.
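Because the template reuses data already present in the scenario index entry, a first draft of this file can be derived from that entry. The sketch below assumes that the affected entities are exactly the `kubernetesObject` arguments of the injections, which the structure above only describes as "usually" the root cause, so review the result by hand. `groundtruth_v2` and the sample entry are hypothetical.

```python
import json


def groundtruth_v2(scenario_id, entry, alert_names):
    """Draft the groundtruth.yaml structure from a scenario index entry."""
    entities = [
        injection["args"]["kubernetesObject"]
        for disruption in entry.get("disruptions", [])
        for injection in disruption.get("injections", [])
        if "kubernetesObject" in injection.get("args", {})
    ]
    return {
        "apiVersion": "itbench.io/v2",
        "kind": "GroundTruth",
        "metadata": {"name": f"scenario-{scenario_id}"},
        "spec": {
            "alerts": [{"labels": {}, "name": name} for name in alert_names],
            "entities": entities,
            "solutions": entry.get("solutions", []),
        },
    }


# Hypothetical index entry, reduced to the fields used here.
entry = {
    "disruptions": [{"injections": [{"id": "<fault-id>", "args": {"kubernetesObject": {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": "checkout", "namespace": "demo"}}}}]}],
    "solutions": [[{"steps": [{
        "text": "Revert the last change done to the manifest.",
        "command": "kubectl -n demo rollout undo deployment/checkout"}]}]],
}
doc = groundtruth_v2(99, entry, ["KubePodNotReady"])
print(json.dumps(doc, indent=2))
```

The resulting mapping still has to be written out as YAML; any YAML library can do that, and none is assumed here.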
Key Points:
File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml
Ground truth uses DSL format (groups) to define fault propagation chains.
---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
name: scenario-<ID>
spec:
# Predicted alerts
alerts:
- group_id: <group-id>
id: <alert-name>
metadata:
description: <alert description>
# Resource groups
groups:
- id: <unique-group-id>
kind: <Pod|Service|Deployment|ConfigMap|etc>
namespace: <namespace>
name: <resource-name> # OR filter: ["regex-pattern"]
root_cause: true # At least one group must be root cause
# Logical relationships
aliases:
- [<group-id-1>, <group-id-2>, <group-id-3>]
# Fault propagations
propagations:
- source: <group-id>
target: <group-id>
condition: <what causes propagation>
effect: <what happens>
# Optional: fault metadata
fault:
- category: Change # or Create, Delete
condition: <fault condition>
entity:
group_id: <group-id>
kind: <kind>
name: <name>
fault_mechanism: <mechanism>
# Recommended actions
recommendedActions:
- solution:
actions:
- <action description>
id: <solution-id>
Groups represent sets of Kubernetes resources.
Required fields:
- id: Unique identifier
- kind: Kubernetes kind (Pod, Service, Deployment, etc.)
- namespace: Kubernetes namespace
- name: Exact resource name, OR
- filter: List of regex patterns
Optional fields:
- root_cause: boolean (at least one group must be true)
Discover group patterns from existing scenarios:
# View groups from a specific scenario
cat scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml | grep -A 10 "^ groups:"
# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml
Pod group with filter (template):
groups:
- id: <service-name>-pod-1
kind: Pod
namespace: <namespace>
filter:
- <service-name>-.*
root_cause: true
Service group with filter (template):
groups:
- id: <service-name>-service-1
kind: Service
namespace: <namespace>
filter:
- <service-name>\b # \b = word boundary
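The word boundary in a filter deserves a quick check, because `\b` also matches before a hyphen. The sketch below uses Python's `re.match` as a stand-in; how ITBench actually evaluates `filter` patterns (match, search, or full match) is not stated here, so treat the matching semantics as an assumption. The service names are hypothetical.

```python
import re


def matching_names(pattern, names):
    """Return the names that the regex matches from the start (re.match semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.match(name)]


# Hypothetical service names chosen to show the boundary behaviour.
names = ["frontend", "frontend-proxy", "frontendx"]
print(matching_names(r"frontend\b", names))
print(matching_names(r"frontend$", names))
```

Under these assumptions `frontend\b` still selects `frontend-proxy`, because the position between `d` and `-` is a word boundary, while `frontend$` selects only `frontend`. If two services share a prefix, a stricter pattern may be needed.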
ConfigMap group (root cause) (template):
groups:
- id: <config-name>-cm
kind: ConfigMap
namespace: <namespace>
name: <configmap-name>
root_cause: true
Aliases link related groups logically.
aliases:
- [<group-id-1>, <group-id-2>, <group-id-3>]
Discover alias patterns:
# Find scenarios with aliases
grep -A 5 "^ aliases:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -20
Template:
aliases:
- - <service-name>-pod-1
- <service-name>-service-1
- - <dependent-service>-pod-1
- <dependent-service>-service-1
Propagations describe how faults spread through the system.
IMPORTANT: Use the architecture diagram from the application's documentation to:
Required fields:
- source: Group ID where propagation starts
- target: Group ID where propagation ends
- condition: What causes the propagation (based on architecture)
- effect: What observable impact results
Discover propagation patterns:
# Find propagation examples
grep -A 10 "^ propagations:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -50
Template:
propagations:
- source: <root-cause-group-id>
target: <affected-service-group-id>
condition: <what triggers propagation>
effect: <observable impact>
- source: <affected-service-group-id>
target: <dependent-service-group-id>
condition: <dependency relationship>
effect: <downstream impact>
IMPORTANT: Always read alert definitions from actual source files to get up-to-date alert rules.
Use the Architecture Diagram:
Alert Sources:
Application-Specific Alerts - Read from local PrometheusRules:
OpenTelemetry Demo:
cat scenarios/sre/roles/applications/templates/kubernetes/otel_demo/prometheusrules.j2
BookInfo:
cat scenarios/sre/roles/applications/templates/kubernetes/book_info/prometheusrules.j2
Kubernetes Platform Alerts - Check:
a. Local schema (available alerts in ITBench):
jq '.properties.alerts.properties.application.items.enum' \
scenarios/sre/roles/documentation/files/library/faults/schema.json
b. Prometheus Community Rules (canonical source):
KubePodCrashLooping, KubePodNotReady)
How to predict alerts:
Alert format (template):
alerts:
- group_id: <affected-service-group-id>
id: <alert-name-from-prometheus-rules>
metadata:
description: <alert description matching PrometheusRules>
- group_id: <affected-service-group-id>
id: <another-alert-name>
metadata:
description: <another alert description>
View real examples dynamically:
# View a complete groundtruth_v1.yaml file
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml
# Compare multiple scenarios for patterns
for scenario in 20 30 40; do
echo "=== Scenario $scenario ==="
cat "scenarios/sre/roles/scenarios/files/scenario_${scenario}/groundtruth_v1.yaml" 2>/dev/null || echo "Not found"
echo ""
done
Complete Template:
---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
name: scenario-<ID>
spec:
alerts:
- group_id: <affected-service-group-id>
id: <alert-name>
metadata:
description: <alert description from PrometheusRules>
groups:
- id: <root-cause-group-id>
kind: <Pod|Deployment|ConfigMap>
namespace: <namespace>
filter:
- <name-pattern-regex>
root_cause: true
- id: <affected-service-group-id>
kind: Service
namespace: <namespace>
filter:
- <service-name>\b
- id: <dependent-service-group-id>
kind: Service
namespace: <namespace>
filter:
- <dependent-service-name>\b
aliases:
- - <root-cause-group-id>
- <affected-service-group-id>
propagations:
- source: <root-cause-group-id>
target: <affected-service-group-id>
condition: <fault condition>
effect: <direct impact>
- source: <affected-service-group-id>
target: <dependent-service-group-id>
condition: <dependency relationship>
effect: <downstream impact>
fault:
- category: <Change|Create|Delete>
condition: <fault description>
entity:
group_id: <root-cause-group-id>
kind: <kind>
name: <name>
fault_mechanism: <mechanism>
recommendedActions:
- solution:
actions:
- <action description>
id: <solution-id>
- solution:
actions:
- <alternative action>
id: <alternative-solution-id>
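Since both ground truth files are written by hand, a quick consistency check before validation can catch dangling references. The sketch below covers only the constraints stated in this skill: unique group IDs, at least one `root_cause: true` group, a `name` or `filter` on every group, and references that point at defined groups. It takes the `spec` mapping as an already-parsed dictionary (loading the YAML is left out), and `validate_groundtruth_v1` is a hypothetical helper, not the ITBench validator.

```python
from collections import Counter


def validate_groundtruth_v1(spec):
    """Return a list of problems found in a groundtruth_v1 'spec' mapping."""
    problems = []
    groups = spec.get("groups", [])
    ids = [group.get("id") for group in groups]
    known = set(ids)

    for group_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate group id: {group_id}")
    if not any(group.get("root_cause") is True for group in groups):
        problems.append("no group has root_cause: true")
    for group in groups:
        if "name" not in group and "filter" not in group:
            problems.append(f"group {group.get('id')} has neither name nor filter")

    # Collect every place that refers to a group id, then check them together.
    references = [("alert", alert.get("group_id")) for alert in spec.get("alerts", [])]
    for propagation in spec.get("propagations", []):
        references.append(("propagation source", propagation.get("source")))
        references.append(("propagation target", propagation.get("target")))
    for alias in spec.get("aliases", []):
        references.extend(("alias", member) for member in alias)
    for where, group_id in references:
        if group_id not in known:
            problems.append(f"{where} references unknown group: {group_id}")
    return problems


# Hypothetical spec with one deliberate mistake: the propagation target is undefined.
example_spec = {
    "groups": [
        {"id": "checkout-pod-1", "kind": "Pod", "namespace": "demo",
         "filter": ["checkout-.*"], "root_cause": True},
        {"id": "checkout-service-1", "kind": "Service", "namespace": "demo",
         "filter": ["checkout\\b"]},
    ],
    "aliases": [["checkout-pod-1", "checkout-service-1"]],
    "propagations": [
        {"source": "checkout-pod-1", "target": "frontend-service-1",
         "condition": "<fault condition>", "effect": "<direct impact>"},
    ],
    "alerts": [{"group_id": "checkout-service-1", "id": "<alert-name>"}],
}
for problem in validate_groundtruth_v1(example_spec):
    print(problem)
```

For the sample above the only reported problem is the propagation target `frontend-service-1`, which has no matching group.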
IMPORTANT: You must manually create both groundtruth files before running validation:
After creating both files manually, validate with:
cd scenarios/sre
make regenerate-scenario-files
This validates and generates:
- scenario_<ID>/scenario.yaml - Scenario spec (auto-generated from index)
Both groundtruth files must be created manually - they are not auto-generated.
Discover patterns dynamically by analyzing existing scenarios:
# Get all scenarios for a specific application
APP_ID="opentelemetry-demo" # Prefer opentelemetry-demo over book-info
jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | {id, description}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json
# Analyze groundtruth patterns for that application
for scenario_id in $(jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | .id' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json); do
echo "=== Scenario $scenario_id ==="
cat "scenarios/sre/roles/scenarios/files/scenario_${scenario_id}/groundtruth_v1.yaml" | grep -E "^ (groups|propagations):" -A 20
done
Find typical propagation chains:
# Find Pod → Service propagations
grep -A 4 "source:.*pod" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10
# Find Service → Service propagations
grep -A 4 "source:.*service" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10
Generic propagation chain template:
<Root Cause Resource> (root_cause) → <Affected Service> → <Dependent Service> → Alerts
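To confirm that the propagations really form a chain starting at the root cause, the edges can be walked breadth-first. This is a sketch with hypothetical group IDs. `propagation_order` is not part of ITBench, and it assumes `groups` and `propagations` have already been parsed into Python lists.

```python
from collections import deque


def propagation_order(groups, propagations):
    """Breadth-first walk from root-cause groups along propagation edges."""
    edges = {}
    for propagation in propagations:
        edges.setdefault(propagation["source"], []).append(propagation["target"])

    queue = deque(group["id"] for group in groups if group.get("root_cause"))
    seen = list(queue)
    while queue:
        current = queue.popleft()
        for target in edges.get(current, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


# Hypothetical groups and propagations following the generic chain.
groups = [
    {"id": "app-config-cm", "root_cause": True},
    {"id": "checkout-pod-1"},
    {"id": "frontend-service-1"},
    {"id": "unrelated-service-1"},
]
propagations = [
    {"source": "app-config-cm", "target": "checkout-pod-1"},
    {"source": "checkout-pod-1", "target": "frontend-service-1"},
]
print(" -> ".join(propagation_order(groups, propagations)))
```

Any group missing from the result, such as `unrelated-service-1` here, is not reachable from a root cause and probably lacks a propagation entry.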
Discover ConfigMap scenarios:
# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml -l | \
xargs -I {} sh -c 'echo "=== {} ===" && cat {} | head -50'
Template:
groups:
- id: <config-name>-cm
kind: ConfigMap
namespace: <namespace>
name: <configmap-name>
root_cause: true
- id: <affected-workload>-pod
kind: Pod
namespace: <namespace>
filter: [<workload-name>-.*]
propagations:
- source: <config-name>-cm
target: <affected-workload>-pod
condition: ConfigMap contains <type of issue>
effect: <workload> pod <impact>
❌ Don't:
- Forget to set root_cause: true on at least one group
- Use the entities format in groundtruth_v1.yaml (use groups instead)
- Leave faultId in the final scenario (it's a scaffolding hint)
- Use resource names that give away the fault (e.g., malformed-config, fault-injector, crash-trigger, chaos-configmap)
✅ Do:
- Run make regenerate-scenario-files
- Remove faultId after using it to build disruptions
- Use realistic, neutral resource names (e.g., app-config, recommendation-features, sidecar-processor, cache-helper). Rationale: agents should diagnose issues based on symptoms and observability, not by discovering obviously named fault-injection resources.
Discover reference scenarios dynamically by complexity:
# Find simple scenarios (low complexity)
jq '.[] | select(.complexity == "low") | {id, description, faults: [.disruptions[].injections[].id]}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50
# Find complex scenarios (high complexity)
jq '.[] | select(.complexity == "high") | {id, description, faults: [.disruptions[].injections[].id]}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50
# Find scenarios using specific fault mechanisms
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, description}' \
scenarios/sre/roles/documentation/files/library/scenarios/index.json
View groundtruth files for reference:
# List all available groundtruth files
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth_v1.yaml | sort -V
# View specific scenarios
cat scenarios/sre/roles/scenarios/files/scenario_1/groundtruth_v1.yaml # Feature flag pattern
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml # Image pull error
cat scenarios/sre/roles/scenarios/files/scenario_40/groundtruth_v1.yaml # Code change pattern
- Run make regenerate-scenario-files to validate
After scenario scaffolding:
git add . && git commit -m "feat: add scenario <ID>"