Run any Skill in Manus with one click

$pwd:

scenario-scaffolding

Name: Scenario Scaffolding
Author: itbench-hub

// Assists with creating complete ITBench scenarios by applying fault mechanisms to specific services, populating scenario files, and generating groundtruth DSL with fault propagations and alert predictions.

Run Skill in Manus

$ git log --oneline --stat

stars:324

forks:35

updated:March 24, 2026 at 22:59

SKILL.md

readonly

related-skills.json

same repository

fault-scaffolding.md

from "itbench-hub/ITBench"

Assists with creating new fault mechanisms for ITBench scenarios, from incident description through Ansible implementation. Guides brainstorming, service selection, and code implementation consistent with existing fault patterns.

2026-03-24324

package.json

"author": "itbench-hub"

"repository": "itbench-hub/ITBench"

View GitHub Repository View Creator Repositories

$ install --global

$ download --local

Run Skill in Manus

$ useful --forSOC

Network and Computer Systems AdministratorsComputer and Mathematical Occupations15-1244L4

Run any Skill with one click

name	scenario-scaffolding
description	Assists with creating complete ITBench scenarios by applying fault mechanisms to specific services, populating scenario files, and generating groundtruth DSL with fault propagations and alert predictions.

Purpose

This skill guides you through the complete scenario creation workflow:

Fault Application - Apply fault mechanism to a specific service/component
Scenario File Population - Define which application, faults, and tools
Ground Truth Generation - Create DSL with groups, propagations, and alerts

When to Use This Skill

This skill auto-activates when:

Working with files matching **/scenarios/index.json
Editing files in **/scenarios/files/scenario_*/
User mentions "scenario scaffold", "ground truth", "disruptions", "DSL"
Creating or modifying scenario definitions

Prerequisites

Before starting scenario scaffolding:

✅ Fault mechanism is fully implemented (Ansible task + fault index complete)
✅ You know the target application (prefer OpenTelemetry Demo over BookInfo)
✅ You know the specific service/component to target

Workflow

Step 1: Apply Fault to Service/Component

Dynamically identify available services for your chosen application:

1.1 List Available Applications

# Extract all application keys from managers.yaml
grep -E "^  [a-z_]+:" scenarios/sre/roles/applications/defaults/main/managers.yaml | sed 's/://g' | awk '{print $1}'

Application Preference:

Prefer OpenTelemetry Demo (opentelemetry_demo / Astronomy Shop) for most scenarios - it's richer, more comprehensive, and better maintained
Use BookInfo (book_info) only if the fault specifically requires its simpler architecture

1.2 Get Application Metadata

# Replace <app-key> with your chosen application
# Get namespace
grep -A 15 "^  <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep "namespace:" | awk '{print $2}'

# Get documentation URL
grep -A 15 "^  <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep -E "url:|documentation:"

1.3 Discover Services from Manifests

# Get the namespace from step 1.2
NAMESPACE="<namespace>"

# Find all Deployments in the application
grep -r "kind: Deployment" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

# Find all Services
grep -r "kind: Service" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

# Find all StatefulSets
grep -r "kind: StatefulSet" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

1.3.1 Optional: Ground in Real Deployment

Ask the user:

Would you like to deploy the application to a live cluster to get actual deployment names, service names, and resource details? This ensures accuracy but requires a running Kubernetes cluster.

If YES:

Ask for kubeconfig path:

What is the path to your kubeconfig file? (e.g., ~/.kube/config)

Set kubeconfig and deploy (outputs will be shown):

# Set the kubeconfig
export KUBECONFIG=<path-from-user>

# Navigate to scenarios directory
cd scenarios/sre

# Deploy tools - outputs will be displayed
make deploy-tools

# Deploy applications - outputs will be displayed
make deploy-applications

Note: Both commands will display their complete output including:

Ansible playbook task execution
Kubernetes resource creation status
Any warnings or errors

Query live cluster for actual resource names:

# Get actual deployments
kubectl get deployments -n <namespace> -o jsonpath='{.items[*].metadata.name}'

# Get actual services
kubectl get services -n <namespace> -o jsonpath='{.items[*].metadata.name}'

# Get actual pods (with labels)
kubectl get pods -n <namespace> --show-labels

# Get actual configmaps
kubectl get configmaps -n <namespace> -o jsonpath='{.items[*].metadata.name}'

Use these actual names in your scenario instead of guessing from manifests

If NO: Continue with manifest-based discovery from Step 1.3

1.4 Verify from Documentation

Consult the documentation URL from step 1.2 to understand:

Service architecture diagram - Shows how services connect and depend on each other
Component roles and responsibilities
Dependencies between services (who calls whom)

IMPORTANT: Study the architecture diagram carefully - it's essential for:

Understanding service dependencies (used in propagations)
Identifying which downstream services will be affected
Predicting which alerts will fire based on service relationships

Select the service that best demonstrates the fault mechanism.

Step 2: Populate Scenario Files

The scenario consists of multiple files in scenarios/sre/roles/scenarios/files/scenario_<ID>/:

2.1 Scenario Index Entry

File: scenarios/sre/roles/documentation/files/library/scenarios/index.json

Structure (discovered dynamically):

First, get available tags and platforms:

# Get valid tags
jq '.properties.tags.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json

# Get valid platforms
jq '.properties.platforms.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json

# Get valid categories
jq '.properties.category.enum' scenarios/sre/roles/documentation/files/library/scenarios/schema.json

Then construct the scenario entry:

{
  "id": <next-available-id>,
  "category": "<category-from-schema>",  // e.g., "sre", "finops", "ciso"
  "complexity": "<complexity>",  // "low", "medium", "high"
  "description": "User-facing incident description",
  "environment": {
    "applications": [
      {
        "id": "<application-id-from-step-1>"  // Prefer "opentelemetry-demo" over "book-info"
      }
    ]
  },
  "platforms": ["<platform-from-schema>"],
  "tags": ["<tag1>", "<tag2>"],  // Match fault tags from schema

  // Populated in this step:
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id-from-faults-index>",
          "args": {
            // Arguments matching the fault's JSON schema
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "<kind>",
              "metadata": {
                "name": "<service-name-from-step-1>",
                "namespace": "<namespace-from-step-1>"
              }
            }
            // Additional args based on fault schema
          }
        }
      ],
      // Optional: post-injection hooks
      "waitFor": {
        "postInjection": [
          {
            "id": "restart-kubernetes-workload",
            "args": { /* restart args */ }
          }
        ]
      }
    }
  ],

  "alerts": [],  // Usually empty, unless specific alert requirements

  "solutions": [
    [
      {
        "steps": [
          {
            "text": "Step description",
            "command": "kubectl -n <namespace> <action> <kind>/<name>"
          }
        ]
      }
    ]
  ]
}

2.2 Understanding Disruptions

Discover disruption patterns from existing scenarios:

# Find scenarios using your chosen fault
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, disruptions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

# Examine a specific scenario's disruptions
SCENARIO_ID="<scenario-number>"
jq --arg id "$SCENARIO_ID" '.[] | select(.id == ($id | tonumber)) | .disruptions' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

Single Fault Injection (template):

{
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "<Deployment|StatefulSet>",
              "metadata": {
                "name": "<service-name>",
                "namespace": "<namespace>"
              }
            },
            "container": {
              "name": "<container-name>"
            }
          }
        }
      ]
    }
  ]
}

Multiple Fault Injections (template):

{
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "Deployment",
              "metadata": {"name": "<service-1>", "namespace": "<namespace>"}
            }
          }
        },
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "Deployment",
              "metadata": {"name": "<service-2>", "namespace": "<namespace>"}
            }
          }
        }
      ]
    }
  ]
}

With waitFor Hooks (find examples dynamically):

# Find scenarios using waitFor patterns
jq '.[] | select(.disruptions[].waitFor != null) | {id, disruptions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

2.3 waitFor Patterns

When to use waitFor:

ConfigMap/Secret changes → Restart affected workloads
Chaos Mesh experiments → Pause/delete schedules
Multi-step setup → Wait for readiness

Common waitFor IDs:

restart-kubernetes-workload - Restart a deployment
wait-kubernetes-workload-ready - Wait for pod readiness

2.4 Solutions Format

Discover solution patterns from the fault definition:

# Get solutions from the fault entry
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.id == $fault) | .solutions' \
  scenarios/sre/roles/documentation/files/library/faults/index.json

Adapt fault solutions to scenario context (replace Jinja2 templates with actual values):

Single-step solution (template):

{
  "solutions": [
    [
      {
        "steps": [
          {
            "text": "Revert the last change done to the manifest.",
            "command": "kubectl -n <namespace> rollout undo <kind>/<name>"
          }
        ]
      }
    ],
    [
      {
        "steps": [
          {
            "text": "Manually edit the manifest and fix the issue.",
            "command": "kubectl -n <namespace> edit <kind> <name>"
          }
        ]
      }
    ]
  ]
}

Multi-step solutions (find examples):

# Find scenarios with multi-step solutions
jq '.[] | select(.solutions[][].steps | length > 1) | {id, solutions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -100

2.5 Determine Required Tools

Based on disruptions, identify tools needed (captured by scaffolding from scenarios/sre/roles/scaffolding/tasks/generate_new_scenario_files.yaml):

Chaos Mesh Detection:

enable_chaos_mesh: |-
  {{
    scaffolding_skeleton_scenario.disruptions |
    map(attribute="injections") |
    ansible.builtin.flatten |
    ansible.builtin.selectattr("id", "==", "scheduled-chaos-mesh-experiment") |
    ansible.builtin.length > 0
  }}

This is automatically determined when scenario files are generated.

Step 3: Generate Ground Truth Files

Scenarios require two ground truth files in different formats:

groundtruth.yaml - v2 API (simplified entity-based format)
groundtruth_v1.yaml - v1 API (DSL format with groups and propagations)

3.0 Create groundtruth.yaml (v2 API)

File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth.yaml

This is a simplified format that focuses on affected entities and solutions.

Structure:

---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  # List of expected alerts
  alerts:
    - labels: {}
      name: <alert-name>  # e.g., KubePodNotReady, KubePodCrashLooping

  # List of affected Kubernetes resources (usually root cause)
  entities:
    - apiVersion: <api-version>
      kind: <kind>
      metadata:
        name: <resource-name>
        namespace: <namespace>

  # Solution paths (same as scenario index)
  solutions:
    - - steps:
          - command: <kubectl command>
            text: <step description>

Example (find real examples dynamically):

# View an existing groundtruth.yaml for reference
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth.yaml

# Or examine multiple scenarios
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth.yaml | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && cat {}'

Template:

---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  alerts:
    - labels: {}
      name: <alert-name>  # From fault's alerts.application
  entities:
    - apiVersion: apps/v1
      kind: <Deployment|StatefulSet|etc>
      metadata:
        name: <service-name>
        namespace: <namespace>
  solutions:
    - - steps:
          - command: kubectl -n <namespace> rollout undo <kind>/<name>
            text: Revert the last change done to the manifest.
      - steps:
          - command: kubectl -n <namespace> edit <kind> <name>
            text: Manually edit the manifest and fix the issue.

Key Points:

alerts: List primary alerts that will fire (from fault index)
entities: List the root cause resource(s) being modified by the fault
solutions: Copy from scenario index (remove Jinja2 templates, use actual values)

3.1 Create groundtruth_v1.yaml (v1 API - DSL Format)

File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml

Ground truth uses DSL format (groups) to define fault propagation chains.

3.2 DSL Structure

---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  # Predicted alerts
  alerts:
    - group_id: <group-id>
      id: <alert-name>
      metadata:
        description: <alert description>

  # Resource groups
  groups:
    - id: <unique-group-id>
      kind: <Pod|Service|Deployment|ConfigMap|etc>
      namespace: <namespace>
      name: <resource-name>  # OR filter: ["regex-pattern"]
      root_cause: true  # At least one group must be root cause

  # Logical relationships
  aliases:
    - [<group-id-1>, <group-id-2>, <group-id-3>]

  # Fault propagations
  propagations:
    - source: <group-id>
      target: <group-id>
      condition: <what causes propagation>
      effect: <what happens>

  # Optional: fault metadata
  fault:
    - category: Change  # or Create, Delete
      condition: <fault condition>
      entity:
        group_id: <group-id>
        kind: <kind>
        name: <name>
      fault_mechanism: <mechanism>

  # Recommended actions
  recommendedActions:
    - solution:
        actions:
          - <action description>
        id: <solution-id>

3.3 Defining Groups

Groups represent sets of Kubernetes resources.

Required fields:

id: Unique identifier
kind: Kubernetes kind (Pod, Service, Deployment, etc.)
namespace: Kubernetes namespace
One of:
- name: Exact resource name, OR
- filter: List of regex patterns

Optional fields:

root_cause: boolean (at least one group must be true)

Discover group patterns from existing scenarios:

# View groups from a specific scenario
cat scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml | grep -A 10 "^  groups:"

# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml

Pod group with filter (template):

groups:
  - id: <service-name>-pod-1
    kind: Pod
    namespace: <namespace>
    filter:
      - <service-name>-.*
    root_cause: true

Service group with filter (template):

groups:
  - id: <service-name>-service-1
    kind: Service
    namespace: <namespace>
    filter:
      - <service-name>\b  # \b = word boundary

ConfigMap group (root cause) (template):

groups:
  - id: <config-name>-cm
    kind: ConfigMap
    namespace: <namespace>
    name: <configmap-name>
    root_cause: true

3.4 Defining Aliases

Aliases link related groups logically.

aliases:
  - [<group-id-1>, <group-id-2>, <group-id-3>]

Discover alias patterns:

# Find scenarios with aliases
grep -A 5 "^  aliases:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -20

Template:

aliases:
  - - <service-name>-pod-1
    - <service-name>-service-1
  - - <dependent-service>-pod-1
    - <dependent-service>-service-1

3.5 Defining Propagations

Propagations describe how faults spread through the system.

IMPORTANT: Use the architecture diagram from the application's documentation to:

Identify which services depend on the affected service
Map the propagation path (root cause → affected services → downstream services)
Understand the call chain and service relationships

Required fields:

source: Group ID where propagation starts
target: Group ID where propagation ends
condition: What causes the propagation (based on architecture)
effect: What observable impact results

Discover propagation patterns:

# Find propagation examples
grep -A 10 "^  propagations:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -50

Template:

propagations:
  - source: <root-cause-group-id>
    target: <affected-service-group-id>
    condition: <what triggers propagation>
    effect: <observable impact>
  - source: <affected-service-group-id>
    target: <dependent-service-group-id>
    condition: <dependency relationship>
    effect: <downstream impact>

3.6 Predicting Alerts

IMPORTANT: Always read alert definitions from actual source files to get up-to-date alert rules.

Use the Architecture Diagram:

Review the application's architecture diagram from documentation
Identify which services are affected by the fault (directly and indirectly)
Map fault symptoms to services in the diagram to predict which alerts will fire

Alert Sources:

Application-Specific Alerts - Read from local PrometheusRules:

OpenTelemetry Demo:

cat scenarios/sre/roles/applications/templates/kubernetes/otel_demo/prometheusrules.j2

BookInfo:

cat scenarios/sre/roles/applications/templates/kubernetes/book_info/prometheusrules.j2

Kubernetes Platform Alerts - Check:

a. Local schema (available alerts in ITBench):
```
jq '.properties.alerts.properties.application.items.enum' \
  scenarios/sre/roles/documentation/files/library/faults/schema.json
```
b. Prometheus Community Rules (canonical source):
- Reference: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14
- Search for alerts matching your fault symptoms (e.g., KubePodCrashLooping, KubePodNotReady)

How to predict alerts:

Read PrometheusRules files for the target application
Match fault symptoms to alert expressions
Identify which alerts will trigger based on fault behavior

Alert format (template):

alerts:
  - group_id: <affected-service-group-id>
    id: <alert-name-from-prometheus-rules>
    metadata:
      description: <alert description matching PrometheusRules>
  - group_id: <affected-service-group-id>
    id: <another-alert-name>
    metadata:
      description: <another alert description>

3.7 Complete Example (groundtruth_v1.yaml)

View real examples dynamically:

# View a complete groundtruth_v1.yaml file
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml

# Compare multiple scenarios for patterns
for scenario in 20 30 40; do
  echo "=== Scenario $scenario ==="
  cat "scenarios/sre/roles/scenarios/files/scenario_${scenario}/groundtruth_v1.yaml" 2>/dev/null || echo "Not found"
  echo ""
done

Complete Template:

---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  alerts:
    - group_id: <affected-service-group-id>
      id: <alert-name>
      metadata:
        description: <alert description from PrometheusRules>

  groups:
    - id: <root-cause-group-id>
      kind: <Pod|Deployment|ConfigMap>
      namespace: <namespace>
      filter:
        - <name-pattern-regex>
      root_cause: true

    - id: <affected-service-group-id>
      kind: Service
      namespace: <namespace>
      filter:
        - <service-name>\b

    - id: <dependent-service-group-id>
      kind: Service
      namespace: <namespace>
      filter:
        - <dependent-service-name>\b

  aliases:
    - - <root-cause-group-id>
      - <affected-service-group-id>

  propagations:
    - source: <root-cause-group-id>
      target: <affected-service-group-id>
      condition: <fault condition>
      effect: <direct impact>
    - source: <affected-service-group-id>
      target: <dependent-service-group-id>
      condition: <dependency relationship>
      effect: <downstream impact>

  fault:
    - category: <Change|Create|Delete>
      condition: <fault description>
      entity:
        group_id: <root-cause-group-id>
        kind: <kind>
        name: <name>
      fault_mechanism: <mechanism>

  recommendedActions:
    - solution:
        actions:
          - <action description>
        id: <solution-id>
    - solution:
        actions:
          - <alternative action>
        id: <alternative-solution-id>

Validation & Generation

IMPORTANT: You must manually create both groundtruth files before running validation:

Create groundtruth.yaml (v2 API - simplified format)
Create groundtruth_v1.yaml (v1 API - DSL format)

After creating both files manually, validate with:

cd scenarios/sre
make regenerate-scenario-files

This validates and generates:

scenario_<ID>/scenario.yaml - Scenario spec (auto-generated from index)

Both groundtruth files must be created manually - they are not auto-generated.

Common Patterns

Discover patterns dynamically by analyzing existing scenarios:

Identify Application-Specific Patterns

# Get all scenarios for a specific application
APP_ID="opentelemetry-demo"  # Prefer opentelemetry-demo over book-info
jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | {id, description}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

# Analyze groundtruth patterns for that application
for scenario_id in $(jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | .id' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json); do
  echo "=== Scenario $scenario_id ==="
  cat "scenarios/sre/roles/scenarios/files/scenario_${scenario_id}/groundtruth_v1.yaml" | grep -E "^  (groups|propagations):" -A 20
done

Common Propagation Patterns

Find typical propagation chains:

# Find Pod → Service propagations
grep -A 4 "source:.*pod" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10

# Find Service → Service propagations
grep -A 4 "source:.*service" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10

Generic propagation chain template:

<Root Cause Resource> (root_cause) → <Affected Service> → <Dependent Service> → Alerts

ConfigMap Fault Patterns

Discover ConfigMap scenarios:

# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml -l | \
  xargs -I {} sh -c 'echo "=== {} ===" && cat {} | head -50'

Template:

groups:
  - id: <config-name>-cm
    kind: ConfigMap
    namespace: <namespace>
    name: <configmap-name>
    root_cause: true

  - id: <affected-workload>-pod
    kind: Pod
    namespace: <namespace>
    filter: [<workload-name>-.*]

propagations:
  - source: <config-name>-cm
    target: <affected-workload>-pod
    condition: ConfigMap contains <type of issue>
    effect: <workload> pod <impact>

Anti-Patterns

❌ Don't:

Forget to set root_cause: true on at least one group
Use duplicate group IDs
Create propagations with undefined group IDs
Skip alert prediction
Use entities format (use groups instead)
Leave faultId in final scenario (it's a scaffolding hint)
Use obvious fault-revealing names for injected resources (e.g., malformed-config, fault-injector, crash-trigger, chaos-configmap)

✅ Do:

Mark the actual fault injection point as root cause
Create groups for all affected resources
Define propagations showing fault spread
Predict alerts based on PrometheusRules
Use regex filters for pod groups
Test scenario generation with make regenerate-scenario-files
Remove faultId after using it to build disruptions
Use non-obvious names for injected resources (e.g., app-config, recommendation-features, sidecar-processor, cache-helper) - Rationale: Agents should diagnose issues based on symptoms and observability, not by discovering obviously-named fault injection resources

Reference Examples

Discover reference scenarios dynamically by complexity:

# Find simple scenarios (low complexity)
jq '.[] | select(.complexity == "low") | {id, description, faults: [.disruptions[].injections[].id]}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

# Find complex scenarios (high complexity)
jq '.[] | select(.complexity == "high") | {id, description, faults: [.disruptions[].injections[].id]}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

# Find scenarios using specific fault mechanisms
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, description}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

View groundtruth files for reference:

# List all available groundtruth files
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth_v1.yaml | sort -V

# View specific scenarios
cat scenarios/sre/roles/scenarios/files/scenario_1/groundtruth_v1.yaml   # Feature flag pattern
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml  # Image pull error
cat scenarios/sre/roles/scenarios/files/scenario_40/groundtruth_v1.yaml  # Code change pattern

Tips

Start with the architecture diagram - Review the application's architecture diagram from documentation to understand service relationships
Identify the fault mechanism - What resource is being modified?
Identify the root cause group - The resource directly affected by the fault
Map dependencies using the diagram - Follow the architecture to see which services call the affected service
Predict observables - Use the diagram to identify which services will be impacted and what alerts will fire
Define propagation chain - Trace the fault spread through the architecture (root cause → dependencies → downstream effects)
Test generation - Always run make regenerate-scenario-files to validate

Next Steps

After scenario scaffolding:

Test the complete scenario in a development cluster
Verify alerts fire as predicted
Validate solutions actually work
Commit changes: git add . && git commit -m "feat: add scenario <ID>"

name	scenario-scaffolding
description	Assists with creating complete ITBench scenarios by applying fault mechanisms to specific services, populating scenario files, and generating groundtruth DSL with fault propagations and alert predictions.

Purpose

This skill guides you through the complete scenario creation workflow:

Fault Application - Apply fault mechanism to a specific service/component
Scenario File Population - Define which application, faults, and tools
Ground Truth Generation - Create DSL with groups, propagations, and alerts

When to Use This Skill

This skill auto-activates when:

Working with files matching **/scenarios/index.json
Editing files in **/scenarios/files/scenario_*/
User mentions "scenario scaffold", "ground truth", "disruptions", "DSL"
Creating or modifying scenario definitions

Prerequisites

Before starting scenario scaffolding:

✅ Fault mechanism is fully implemented (Ansible task + fault index complete)
✅ You know the target application (prefer OpenTelemetry Demo over BookInfo)
✅ You know the specific service/component to target

Workflow

Step 1: Apply Fault to Service/Component

Dynamically identify available services for your chosen application:

1.1 List Available Applications

# Extract all application keys from managers.yaml
grep -E "^  [a-z_]+:" scenarios/sre/roles/applications/defaults/main/managers.yaml | sed 's/://g' | awk '{print $1}'

Application Preference:

Prefer OpenTelemetry Demo (opentelemetry_demo / Astronomy Shop) for most scenarios - it's richer, more comprehensive, and better maintained
Use BookInfo (book_info) only if the fault specifically requires its simpler architecture

1.2 Get Application Metadata

# Replace <app-key> with your chosen application
# Get namespace
grep -A 15 "^  <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep "namespace:" | awk '{print $2}'

# Get documentation URL
grep -A 15 "^  <app-key>:" scenarios/sre/roles/applications/defaults/main/managers.yaml | grep -E "url:|documentation:"

1.3 Discover Services from Manifests

# Get the namespace from step 1.2
NAMESPACE="<namespace>"

# Find all Deployments in the application
grep -r "kind: Deployment" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

# Find all Services
grep -r "kind: Service" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

# Find all StatefulSets
grep -r "kind: StatefulSet" scenarios/sre/roles/applications/templates/kubernetes/ | grep "$NAMESPACE" | grep -oP 'name: \K[a-z0-9-]+'

1.3.1 Optional: Ground in Real Deployment

Ask the user:

Would you like to deploy the application to a live cluster to get actual deployment names, service names, and resource details? This ensures accuracy but requires a running Kubernetes cluster.

If YES:

Ask for kubeconfig path:

What is the path to your kubeconfig file? (e.g., ~/.kube/config)

Set kubeconfig and deploy (outputs will be shown):

# Set the kubeconfig
export KUBECONFIG=<path-from-user>

# Navigate to scenarios directory
cd scenarios/sre

# Deploy tools - outputs will be displayed
make deploy-tools

# Deploy applications - outputs will be displayed
make deploy-applications

Note: Both commands will display their complete output including:

Ansible playbook task execution
Kubernetes resource creation status
Any warnings or errors

Query live cluster for actual resource names:

# Get actual deployments
kubectl get deployments -n <namespace> -o jsonpath='{.items[*].metadata.name}'

# Get actual services
kubectl get services -n <namespace> -o jsonpath='{.items[*].metadata.name}'

# Get actual pods (with labels)
kubectl get pods -n <namespace> --show-labels

# Get actual configmaps
kubectl get configmaps -n <namespace> -o jsonpath='{.items[*].metadata.name}'

Use these actual names in your scenario instead of guessing from manifests

If NO: Continue with manifest-based discovery from Step 1.3

1.4 Verify from Documentation

Consult the documentation URL from step 1.2 to understand:

Service architecture diagram - Shows how services connect and depend on each other
Component roles and responsibilities
Dependencies between services (who calls whom)

IMPORTANT: Study the architecture diagram carefully - it's essential for:

Understanding service dependencies (used in propagations)
Identifying which downstream services will be affected
Predicting which alerts will fire based on service relationships

Select the service that best demonstrates the fault mechanism.

Step 2: Populate Scenario Files

The scenario consists of multiple files in scenarios/sre/roles/scenarios/files/scenario_<ID>/:

2.1 Scenario Index Entry

File: scenarios/sre/roles/documentation/files/library/scenarios/index.json

Structure (discovered dynamically):

First, get available tags and platforms:

# Get valid tags
jq '.properties.tags.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json

# Get valid platforms
jq '.properties.platforms.items.enum' scenarios/sre/roles/documentation/files/library/faults/schema.json

# Get valid categories
jq '.properties.category.enum' scenarios/sre/roles/documentation/files/library/scenarios/schema.json

Then construct the scenario entry:

{
  "id": <next-available-id>,
  "category": "<category-from-schema>",  // e.g., "sre", "finops", "ciso"
  "complexity": "<complexity>",  // "low", "medium", "high"
  "description": "User-facing incident description",
  "environment": {
    "applications": [
      {
        "id": "<application-id-from-step-1>"  // Prefer "opentelemetry-demo" over "book-info"
      }
    ]
  },
  "platforms": ["<platform-from-schema>"],
  "tags": ["<tag1>", "<tag2>"],  // Match fault tags from schema

  // Populated in this step:
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id-from-faults-index>",
          "args": {
            // Arguments matching the fault's JSON schema
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "<kind>",
              "metadata": {
                "name": "<service-name-from-step-1>",
                "namespace": "<namespace-from-step-1>"
              }
            }
            // Additional args based on fault schema
          }
        }
      ],
      // Optional: post-injection hooks
      "waitFor": {
        "postInjection": [
          {
            "id": "restart-kubernetes-workload",
            "args": { /* restart args */ }
          }
        ]
      }
    }
  ],

  "alerts": [],  // Usually empty, unless specific alert requirements

  "solutions": [
    [
      {
        "steps": [
          {
            "text": "Step description",
            "command": "kubectl -n <namespace> <action> <kind>/<name>"
          }
        ]
      }
    ]
  ]
}

2.2 Understanding Disruptions

Discover disruption patterns from existing scenarios:

# Find scenarios using your chosen fault
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, disruptions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

# Examine a specific scenario's disruptions
SCENARIO_ID="<scenario-number>"
jq --arg id "$SCENARIO_ID" '.[] | select(.id == ($id | tonumber)) | .disruptions' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

Single Fault Injection (template):

{
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "<Deployment|StatefulSet>",
              "metadata": {
                "name": "<service-name>",
                "namespace": "<namespace>"
              }
            },
            "container": {
              "name": "<container-name>"
            }
          }
        }
      ]
    }
  ]
}

Multiple Fault Injections (template):

{
  "disruptions": [
    {
      "injections": [
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "Deployment",
              "metadata": {"name": "<service-1>", "namespace": "<namespace>"}
            }
          }
        },
        {
          "id": "<fault-id>",
          "args": {
            "kubernetesObject": {
              "apiVersion": "apps/v1",
              "kind": "Deployment",
              "metadata": {"name": "<service-2>", "namespace": "<namespace>"}
            }
          }
        }
      ]
    }
  ]
}

With waitFor Hooks (find examples dynamically):

# Find scenarios using waitFor patterns
jq '.[] | select(.disruptions[].waitFor != null) | {id, disruptions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

2.3 waitFor Patterns

When to use waitFor:

ConfigMap/Secret changes → Restart affected workloads
Chaos Mesh experiments → Pause/delete schedules
Multi-step setup → Wait for readiness

Common waitFor IDs:

restart-kubernetes-workload - Restart a deployment
wait-kubernetes-workload-ready - Wait for pod readiness

2.4 Solutions Format

Discover solution patterns from the fault definition:

# Get solutions from the fault entry
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.id == $fault) | .solutions' \
  scenarios/sre/roles/documentation/files/library/faults/index.json

Adapt fault solutions to scenario context (replace Jinja2 templates with actual values):

Single-step solution (template):

{
  "solutions": [
    [
      {
        "steps": [
          {
            "text": "Revert the last change done to the manifest.",
            "command": "kubectl -n <namespace> rollout undo <kind>/<name>"
          }
        ]
      }
    ],
    [
      {
        "steps": [
          {
            "text": "Manually edit the manifest and fix the issue.",
            "command": "kubectl -n <namespace> edit <kind> <name>"
          }
        ]
      }
    ]
  ]
}

Multi-step solutions (find examples):

# Find scenarios with multi-step solutions
jq '.[] | select(.solutions[][].steps | length > 1) | {id, solutions}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -100

2.5 Determine Required Tools

Based on disruptions, identify tools needed (captured by scaffolding from scenarios/sre/roles/scaffolding/tasks/generate_new_scenario_files.yaml):

Chaos Mesh Detection:

enable_chaos_mesh: |-
  {{
    scaffolding_skeleton_scenario.disruptions |
    map(attribute="injections") |
    ansible.builtin.flatten |
    ansible.builtin.selectattr("id", "==", "scheduled-chaos-mesh-experiment") |
    ansible.builtin.length > 0
  }}

This is automatically determined when scenario files are generated.

Step 3: Generate Ground Truth Files

Scenarios require two ground truth files in different formats:

groundtruth.yaml - v2 API (simplified entity-based format)
groundtruth_v1.yaml - v1 API (DSL format with groups and propagations)

3.0 Create groundtruth.yaml (v2 API)

File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth.yaml

This is a simplified format that focuses on affected entities and solutions.

Structure:

---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  # List of expected alerts
  alerts:
    - labels: {}
      name: <alert-name>  # e.g., KubePodNotReady, KubePodCrashLooping

  # List of affected Kubernetes resources (usually root cause)
  entities:
    - apiVersion: <api-version>
      kind: <kind>
      metadata:
        name: <resource-name>
        namespace: <namespace>

  # Solution paths (same as scenario index)
  solutions:
    - - steps:
          - command: <kubectl command>
            text: <step description>

Example (find real examples dynamically):

# View an existing groundtruth.yaml for reference
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth.yaml

# Or examine multiple scenarios
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth.yaml | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && cat {}'

Template:

---
apiVersion: itbench.io/v2
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  alerts:
    - labels: {}
      name: <alert-name>  # From fault's alerts.application
  entities:
    - apiVersion: apps/v1
      kind: <Deployment|StatefulSet|etc>
      metadata:
        name: <service-name>
        namespace: <namespace>
  solutions:
    - - steps:
          - command: kubectl -n <namespace> rollout undo <kind>/<name>
            text: Revert the last change done to the manifest.
      - steps:
          - command: kubectl -n <namespace> edit <kind> <name>
            text: Manually edit the manifest and fix the issue.

Key Points:

alerts: List primary alerts that will fire (from fault index)
entities: List the root cause resource(s) being modified by the fault
solutions: Copy from scenario index (remove Jinja2 templates, use actual values)

3.1 Create groundtruth_v1.yaml (v1 API - DSL Format)

File: scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml

Ground truth uses DSL format (groups) to define fault propagation chains.

3.2 DSL Structure

---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  # Predicted alerts
  alerts:
    - group_id: <group-id>
      id: <alert-name>
      metadata:
        description: <alert description>

  # Resource groups
  groups:
    - id: <unique-group-id>
      kind: <Pod|Service|Deployment|ConfigMap|etc>
      namespace: <namespace>
      name: <resource-name>  # OR filter: ["regex-pattern"]
      root_cause: true  # At least one group must be root cause

  # Logical relationships
  aliases:
    - [<group-id-1>, <group-id-2>, <group-id-3>]

  # Fault propagations
  propagations:
    - source: <group-id>
      target: <group-id>
      condition: <what causes propagation>
      effect: <what happens>

  # Optional: fault metadata
  fault:
    - category: Change  # or Create, Delete
      condition: <fault condition>
      entity:
        group_id: <group-id>
        kind: <kind>
        name: <name>
      fault_mechanism: <mechanism>

  # Recommended actions
  recommendedActions:
    - solution:
        actions:
          - <action description>
        id: <solution-id>

3.3 Defining Groups

Groups represent sets of Kubernetes resources.

Required fields:

id: Unique identifier
kind: Kubernetes kind (Pod, Service, Deployment, etc.)
namespace: Kubernetes namespace
One of:
- name: Exact resource name, OR
- filter: List of regex patterns

Optional fields:

root_cause: boolean (at least one group must be true)

Discover group patterns from existing scenarios:

# View groups from a specific scenario
cat scenarios/sre/roles/scenarios/files/scenario_<ID>/groundtruth_v1.yaml | grep -A 10 "^  groups:"

# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml

Pod group with filter (template):

groups:
  - id: <service-name>-pod-1
    kind: Pod
    namespace: <namespace>
    filter:
      - <service-name>-.*
    root_cause: true

Service group with filter (template):

groups:
  - id: <service-name>-service-1
    kind: Service
    namespace: <namespace>
    filter:
      - <service-name>\b  # \b = word boundary

ConfigMap group (root cause) (template):

groups:
  - id: <config-name>-cm
    kind: ConfigMap
    namespace: <namespace>
    name: <configmap-name>
    root_cause: true

3.4 Defining Aliases

Aliases link related groups logically.

aliases:
  - [<group-id-1>, <group-id-2>, <group-id-3>]

Discover alias patterns:

# Find scenarios with aliases
grep -A 5 "^  aliases:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -20

Template:

aliases:
  - - <service-name>-pod-1
    - <service-name>-service-1
  - - <dependent-service>-pod-1
    - <dependent-service>-service-1

3.5 Defining Propagations

Propagations describe how faults spread through the system.

IMPORTANT: Use the architecture diagram from the application's documentation to:

Identify which services depend on the affected service
Map the propagation path (root cause → affected services → downstream services)
Understand the call chain and service relationships

Required fields:

source: Group ID where propagation starts
target: Group ID where propagation ends
condition: What causes the propagation (based on architecture)
effect: What observable impact results

Discover propagation patterns:

# Find propagation examples
grep -A 10 "^  propagations:" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | head -50

Template:

propagations:
  - source: <root-cause-group-id>
    target: <affected-service-group-id>
    condition: <what triggers propagation>
    effect: <observable impact>
  - source: <affected-service-group-id>
    target: <dependent-service-group-id>
    condition: <dependency relationship>
    effect: <downstream impact>

3.6 Predicting Alerts

IMPORTANT: Always read alert definitions from actual source files to get up-to-date alert rules.

Use the Architecture Diagram:

Review the application's architecture diagram from documentation
Identify which services are affected by the fault (directly and indirectly)
Map fault symptoms to services in the diagram to predict which alerts will fire

Alert Sources:

Application-Specific Alerts - Read from local PrometheusRules:

OpenTelemetry Demo:

cat scenarios/sre/roles/applications/templates/kubernetes/otel_demo/prometheusrules.j2

BookInfo:

cat scenarios/sre/roles/applications/templates/kubernetes/book_info/prometheusrules.j2

Kubernetes Platform Alerts - Check:

a. Local schema (available alerts in ITBench):
```
jq '.properties.alerts.properties.application.items.enum' \
  scenarios/sre/roles/documentation/files/library/faults/schema.json
```
b. Prometheus Community Rules (canonical source):
- Reference: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack/templates/prometheus/rules-1.14
- Search for alerts matching your fault symptoms (e.g., KubePodCrashLooping, KubePodNotReady)

How to predict alerts:

Read PrometheusRules files for the target application
Match fault symptoms to alert expressions
Identify which alerts will trigger based on fault behavior

Alert format (template):

alerts:
  - group_id: <affected-service-group-id>
    id: <alert-name-from-prometheus-rules>
    metadata:
      description: <alert description matching PrometheusRules>
  - group_id: <affected-service-group-id>
    id: <another-alert-name>
    metadata:
      description: <another alert description>

3.7 Complete Example (groundtruth_v1.yaml)

View real examples dynamically:

# View a complete groundtruth_v1.yaml file
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml

# Compare multiple scenarios for patterns
for scenario in 20 30 40; do
  echo "=== Scenario $scenario ==="
  cat "scenarios/sre/roles/scenarios/files/scenario_${scenario}/groundtruth_v1.yaml" 2>/dev/null || echo "Not found"
  echo ""
done

Complete Template:

---
apiVersion: itbench.io/v1
kind: GroundTruth
metadata:
  name: scenario-<ID>
spec:
  alerts:
    - group_id: <affected-service-group-id>
      id: <alert-name>
      metadata:
        description: <alert description from PrometheusRules>

  groups:
    - id: <root-cause-group-id>
      kind: <Pod|Deployment|ConfigMap>
      namespace: <namespace>
      filter:
        - <name-pattern-regex>
      root_cause: true

    - id: <affected-service-group-id>
      kind: Service
      namespace: <namespace>
      filter:
        - <service-name>\b

    - id: <dependent-service-group-id>
      kind: Service
      namespace: <namespace>
      filter:
        - <dependent-service-name>\b

  aliases:
    - - <root-cause-group-id>
      - <affected-service-group-id>

  propagations:
    - source: <root-cause-group-id>
      target: <affected-service-group-id>
      condition: <fault condition>
      effect: <direct impact>
    - source: <affected-service-group-id>
      target: <dependent-service-group-id>
      condition: <dependency relationship>
      effect: <downstream impact>

  fault:
    - category: <Change|Create|Delete>
      condition: <fault description>
      entity:
        group_id: <root-cause-group-id>
        kind: <kind>
        name: <name>
      fault_mechanism: <mechanism>

  recommendedActions:
    - solution:
        actions:
          - <action description>
        id: <solution-id>
    - solution:
        actions:
          - <alternative action>
        id: <alternative-solution-id>

Validation & Generation

IMPORTANT: You must manually create both groundtruth files before running validation:

Create groundtruth.yaml (v2 API - simplified format)
Create groundtruth_v1.yaml (v1 API - DSL format)

After creating both files manually, validate with:

cd scenarios/sre
make regenerate-scenario-files

This validates and generates:

scenario_<ID>/scenario.yaml - Scenario spec (auto-generated from index)

Both groundtruth files must be created manually - they are not auto-generated.

Common Patterns

Discover patterns dynamically by analyzing existing scenarios:

Identify Application-Specific Patterns

# Get all scenarios for a specific application
APP_ID="opentelemetry-demo"  # Prefer opentelemetry-demo over book-info
jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | {id, description}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

# Analyze groundtruth patterns for that application
for scenario_id in $(jq --arg app "$APP_ID" '.[] | select(.environment.applications[].id == $app) | .id' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json); do
  echo "=== Scenario $scenario_id ==="
  cat "scenarios/sre/roles/scenarios/files/scenario_${scenario_id}/groundtruth_v1.yaml" | grep -E "^  (groups|propagations):" -A 20
done

Common Propagation Patterns

Find typical propagation chains:

# Find Pod → Service propagations
grep -A 4 "source:.*pod" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10

# Find Service → Service propagations
grep -A 4 "source:.*service" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml | grep "target:" | head -10

Generic propagation chain template:

<Root Cause Resource> (root_cause) → <Affected Service> → <Dependent Service> → Alerts

ConfigMap Fault Patterns

Discover ConfigMap scenarios:

# Find scenarios with ConfigMap root causes
grep -r "kind: ConfigMap" scenarios/sre/roles/scenarios/files/*/groundtruth_v1.yaml -l | \
  xargs -I {} sh -c 'echo "=== {} ===" && cat {} | head -50'

Template:

groups:
  - id: <config-name>-cm
    kind: ConfigMap
    namespace: <namespace>
    name: <configmap-name>
    root_cause: true

  - id: <affected-workload>-pod
    kind: Pod
    namespace: <namespace>
    filter: [<workload-name>-.*]

propagations:
  - source: <config-name>-cm
    target: <affected-workload>-pod
    condition: ConfigMap contains <type of issue>
    effect: <workload> pod <impact>

Anti-Patterns

❌ Don't:

Forget to set root_cause: true on at least one group
Use duplicate group IDs
Create propagations with undefined group IDs
Skip alert prediction
Use entities format (use groups instead)
Leave faultId in final scenario (it's a scaffolding hint)
Use obvious fault-revealing names for injected resources (e.g., malformed-config, fault-injector, crash-trigger, chaos-configmap)

✅ Do:

Mark the actual fault injection point as root cause
Create groups for all affected resources
Define propagations showing fault spread
Predict alerts based on PrometheusRules
Use regex filters for pod groups
Test scenario generation with make regenerate-scenario-files
Remove faultId after using it to build disruptions
Use non-obvious names for injected resources (e.g., app-config, recommendation-features, sidecar-processor, cache-helper) - Rationale: Agents should diagnose issues based on symptoms and observability, not by discovering obviously-named fault injection resources

Reference Examples

Discover reference scenarios dynamically by complexity:

# Find simple scenarios (low complexity)
jq '.[] | select(.complexity == "low") | {id, description, faults: [.disruptions[].injections[].id]}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

# Find complex scenarios (high complexity)
jq '.[] | select(.complexity == "high") | {id, description, faults: [.disruptions[].injections[].id]}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json | head -50

# Find scenarios using specific fault mechanisms
FAULT_ID="<your-fault-id>"
jq --arg fault "$FAULT_ID" '.[] | select(.disruptions[].injections[].id == $fault) | {id, description}' \
  scenarios/sre/roles/documentation/files/library/scenarios/index.json

View groundtruth files for reference:

# List all available groundtruth files
ls scenarios/sre/roles/scenarios/files/scenario_*/groundtruth_v1.yaml | sort -V

# View specific scenarios
cat scenarios/sre/roles/scenarios/files/scenario_1/groundtruth_v1.yaml   # Feature flag pattern
cat scenarios/sre/roles/scenarios/files/scenario_20/groundtruth_v1.yaml  # Image pull error
cat scenarios/sre/roles/scenarios/files/scenario_40/groundtruth_v1.yaml  # Code change pattern

Tips

Start with the architecture diagram - Review the application's architecture diagram from documentation to understand service relationships
Identify the fault mechanism - What resource is being modified?
Identify the root cause group - The resource directly affected by the fault
Map dependencies using the diagram - Follow the architecture to see which services call the affected service
Predict observables - Use the diagram to identify which services will be impacted and what alerts will fire
Define propagation chain - Trace the fault spread through the architecture (root cause → dependencies → downstream effects)
Test generation - Always run make regenerate-scenario-files to validate

Next Steps

After scenario scaffolding:

Test the complete scenario in a development cluster
Verify alerts fire as predicted
Validate solutions actually work
Commit changes: git add . && git commit -m "feat: add scenario <ID>"

scenario-scaffolding

More from this repository

More from this repository

Purpose

When to Use This Skill

Prerequisites

Workflow

Step 1: Apply Fault to Service/Component

1.1 List Available Applications

1.2 Get Application Metadata

1.3 Discover Services from Manifests

1.3.1 Optional: Ground in Real Deployment

1.4 Verify from Documentation

Step 2: Populate Scenario Files

2.1 Scenario Index Entry

2.2 Understanding Disruptions

2.3 waitFor Patterns

2.4 Solutions Format

2.5 Determine Required Tools

Step 3: Generate Ground Truth Files

3.0 Create groundtruth.yaml (v2 API)

3.1 Create groundtruth_v1.yaml (v1 API - DSL Format)

3.2 DSL Structure

3.3 Defining Groups

3.4 Defining Aliases

3.5 Defining Propagations

3.6 Predicting Alerts

3.7 Complete Example (groundtruth_v1.yaml)

Validation & Generation

Common Patterns

Identify Application-Specific Patterns

Common Propagation Patterns

ConfigMap Fault Patterns

Anti-Patterns

Reference Examples

Tips

Next Steps

Purpose

When to Use This Skill

Prerequisites

Workflow

Step 1: Apply Fault to Service/Component

1.1 List Available Applications

1.2 Get Application Metadata

1.3 Discover Services from Manifests

1.3.1 Optional: Ground in Real Deployment

1.4 Verify from Documentation

Step 2: Populate Scenario Files

2.1 Scenario Index Entry

2.2 Understanding Disruptions

2.3 waitFor Patterns

2.4 Solutions Format

2.5 Determine Required Tools

Step 3: Generate Ground Truth Files

3.0 Create groundtruth.yaml (v2 API)

3.1 Create groundtruth_v1.yaml (v1 API - DSL Format)

3.2 DSL Structure

3.3 Defining Groups

3.4 Defining Aliases

3.5 Defining Propagations

3.6 Predicting Alerts

3.7 Complete Example (groundtruth_v1.yaml)

Validation & Generation

Common Patterns

Identify Application-Specific Patterns

Common Propagation Patterns

ConfigMap Fault Patterns

Anti-Patterns

Reference Examples

Tips

Next Steps