| name | debug-buttercup |
| description | All pods run in namespace crs. Use when pods in the crs namespace are in CrashLoopBackOff, OOMKilled, or restarting, multiple services restart simultaneously (cascade failure), or redis is unresponsive or showing AOF warnings. |
| risk | unknown |
| source | community |
Debug Buttercup
When to Use
- Pods in the crs namespace are in CrashLoopBackOff, OOMKilled, or restarting
- Multiple services restart simultaneously (cascade failure)
- Redis is unresponsive or showing AOF warnings
- Queues are growing but tasks are not progressing
- Nodes show DiskPressure, MemoryPressure, or PIDPressure
- Build-bot cannot reach the Docker daemon (DinD failures)
- Scheduler is stuck and not advancing task state
- Health check probes are failing unexpectedly
- Deployed Helm values don't match actual pod configuration
When NOT to Use
- Deploying or upgrading Buttercup (use Helm and deployment guides)
- Debugging issues outside the crs Kubernetes namespace
- Performance tuning that doesn't involve a failure symptom
Namespace and Services
All pods run in namespace crs. Key services:
| Layer | Services |
|---|---|
| Infra | redis, dind, litellm, registry-cache |
| Orchestration | scheduler, task-server, task-downloader, scratch-cleaner |
| Fuzzing | build-bot, fuzzer-bot, coverage-bot, tracer-bot, merger-bot |
| Analysis | patcher, seed-gen, program-model, pov-reproducer |
| Interface | competition-api, ui |
Triage Workflow
Always start with triage. Run these three commands first:
kubectl get pods -n crs -o wide
kubectl get events -n crs --sort-by='.lastTimestamp'
kubectl get events -n crs --field-selector type=Warning --sort-by='.lastTimestamp'
Then narrow down:
kubectl describe pod -n crs <pod-name> | grep -A8 'Last State:'
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
kubectl logs -n crs <pod-name> --previous --tail=200
kubectl logs -n crs <pod-name> --tail=200
Historical vs Ongoing Issues
High restart counts don't necessarily mean an issue is ongoing -- restarts accumulate over a pod's lifetime. Always distinguish:
- --tail shows the end of the log buffer, which may contain old messages. Use --since=300s to confirm issues are actively happening now (see the sketch below).
- --timestamps on log output helps correlate events across services.
- Check Last State timestamps in describe pod to see when the most recent crash actually occurred.
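A minimal sketch of that check, assuming a single suspect pod (replace <pod-name>): compare the stale tail of the buffer with a time-bounded, timestamped view.

```bash
# The tail of the buffer may still show old errors:
kubectl logs -n crs <pod-name> --tail=50 --timestamps

# Restrict to the last 5 minutes; empty output means the error is
# historical rather than ongoing:
kubectl logs -n crs <pod-name> --since=300s --timestamps
```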
Cascade Detection
When many pods restart around the same time, check for a shared-dependency failure before investigating individual pods. The most common cascade: Redis goes down -> every service gets ConnectionError/ConnectionRefusedError -> mass restarts. Look for the same error across multiple --previous logs -- if they all say redis.exceptions.ConnectionError, debug Redis, not the individual services.
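One way to spot a shared error quickly is to grep the previous-container logs of every pod for Redis connection failures; a sketch, assuming those exception names appear verbatim in the logs:

```bash
# Count Redis connection errors in each pod's previous container logs.
# If most pods report hits, debug Redis rather than the individual services.
for pod in $(kubectl get pods -n crs -o jsonpath='{.items[*].metadata.name}'); do
  hits=$(kubectl logs -n crs "$pod" --previous --tail=200 2>/dev/null \
    | grep -cE 'redis\.exceptions\.ConnectionError|ConnectionRefusedError')
  [ "${hits:-0}" -gt 0 ] && echo "$pod: $hits connection errors"
done
```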
Log Analysis
kubectl logs -n crs -l app=fuzzer-bot --tail=100 --prefix
kubectl logs -n crs -l app.kubernetes.io/name=redis -f
bash deployment/collect-logs.sh
Resource Pressure
kubectl top pods -n crs
kubectl top nodes
kubectl describe node <node> | grep -A5 Conditions
kubectl exec -n crs <pod> -- df -h
kubectl exec -n crs <pod> -- sh -c 'du -sh /corpus/* 2>/dev/null'
kubectl exec -n crs <pod> -- sh -c 'du -sh /scratch/* 2>/dev/null'
Redis Debugging
Redis is the backbone. When it goes down, everything cascades.
kubectl get pods -n crs -l app.kubernetes.io/name=redis
kubectl logs -n crs -l app.kubernetes.io/name=redis --tail=200
Open an interactive redis-cli session and run the commands below inside it:
kubectl exec -it -n crs <redis-pod> -- redis-cli
INFO memory
INFO persistence
INFO clients
INFO stats
CLIENT LIST
DBSIZE
CONFIG GET appendonly
CONFIG GET appendfsync
kubectl exec -n crs <redis-pod> -- mount | grep /data
kubectl exec -n crs <redis-pod> -- du -sh /data/
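If an interactive session is awkward (for example from a script), the same checks can be run non-interactively; a minimal sketch, assuming no Redis auth is required:

```bash
# Each command is passed as arguments to redis-cli; $cmd is left unquoted
# on purpose so "INFO memory" splits into two arguments.
for cmd in "INFO memory" "INFO persistence" "INFO clients" "CLIENT LIST" "DBSIZE"; do
  echo "== $cmd =="
  kubectl exec -n crs <redis-pod> -- redis-cli $cmd
done
```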
Queue Inspection
Buttercup uses Redis streams with consumer groups. Queue names:
| Queue | Stream Key |
|---|---|
| Build | fuzzer_build_queue |
| Build Output | fuzzer_build_output_queue |
| Crash | fuzzer_crash_queue |
| Confirmed Vulns | confirmed_vulnerabilities_queue |
| Download Tasks | orchestrator_download_tasks_queue |
| Ready Tasks | tasks_ready_queue |
| Patches | patches_queue |
| Index | index_queue |
| Index Output | index_output_queue |
| Traced Vulns | traced_vulnerabilities_queue |
| POV Requests | pov_reproducer_requests_queue |
| POV Responses | pov_reproducer_responses_queue |
| Delete Task | orchestrator_delete_task_queue |
kubectl exec -n crs <redis-pod> -- redis-cli XLEN fuzzer_build_queue
kubectl exec -n crs <redis-pod> -- redis-cli XINFO GROUPS fuzzer_build_queue
kubectl exec -n crs <redis-pod> -- redis-cli XPENDING fuzzer_build_queue build_bot_consumers - + 10
kubectl exec -n crs <redis-pod> -- redis-cli HLEN tasks_registry
kubectl exec -n crs <redis-pod> -- redis-cli SCARD cancelled_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD succeeded_tasks
kubectl exec -n crs <redis-pod> -- redis-cli SCARD errored_tasks
Consumer groups: build_bot_consumers, orchestrator_group, patcher_group, index_group, tracer_bot_group.
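For a one-shot view of every queue depth, a sketch that iterates over the stream keys from the table above (assumes no Redis auth; XLEN returns 0 for keys that do not exist yet):

```bash
for q in fuzzer_build_queue fuzzer_build_output_queue fuzzer_crash_queue \
         confirmed_vulnerabilities_queue orchestrator_download_tasks_queue \
         tasks_ready_queue patches_queue index_queue index_output_queue \
         traced_vulnerabilities_queue pov_reproducer_requests_queue \
         pov_reproducer_responses_queue orchestrator_delete_task_queue; do
  echo "$q: $(kubectl exec -n crs <redis-pod> -- redis-cli XLEN "$q")"
done
```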
Health Checks
Pods write timestamps to /tmp/health_check_alive. The liveness probe checks file freshness.
kubectl exec -n crs <pod> -- stat /tmp/health_check_alive
kubectl exec -n crs <pod> -- cat /tmp/health_check_alive
If a pod is restart-looping, the health check file is likely going stale because the main process is blocked (e.g. waiting on Redis, stuck on I/O).
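To see exactly how stale the file is, a sketch assuming the container has a POSIX sh and a stat that supports -c %Y (GNU coreutils or BusyBox):

```bash
# Seconds since the health file was last touched; a large value while the
# pod is Running usually means the main loop is blocked.
kubectl exec -n crs <pod> -- sh -c \
  'echo "stale for $(( $(date +%s) - $(stat -c %Y /tmp/health_check_alive) ))s"'
```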
Telemetry (OpenTelemetry / Signoz)
All services export traces and metrics via OpenTelemetry. If Signoz is deployed (global.signoz.deployed: true), use its UI for distributed tracing across services.
kubectl exec -n crs <pod> -- env | grep OTEL
kubectl get pods -n platform -l app.kubernetes.io/name=signoz
Traces are especially useful for diagnosing slow task processing, identifying which service in a pipeline is the bottleneck, and correlating events across the scheduler -> build-bot -> fuzzer-bot chain.
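To reach the Signoz UI locally, a port-forward sketch; the service name is a placeholder and the port is an assumption (the Signoz frontend commonly listens on 3301), so check your release first:

```bash
# Find the frontend service name, then forward it:
kubectl get svc -n platform
kubectl port-forward -n platform svc/<signoz-frontend-svc> 3301:3301
```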
Volume and Storage
kubectl get pvc -n crs
kubectl exec -n crs <pod> -- mount | grep corpus_tmpfs
kubectl exec -n crs <pod> -- df -h /corpus_tmpfs 2>/dev/null
kubectl exec -n crs <pod> -- env | grep CORPUS
kubectl exec -n crs <pod> -- df -h
CORPUS_TMPFS_PATH is set when global.volumes.corpusTmpfs.enabled: true. This affects fuzzer-bot, coverage-bot, seed-gen, and merger-bot.
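A sketch to confirm the tmpfs is present and sized as expected in the affected pods; it assumes each service is selectable with an app=<name> label (as used elsewhere in this document), which may not hold for every deployment:

```bash
for app in fuzzer-bot coverage-bot seed-gen merger-bot; do
  pod=$(kubectl get pods -n crs -l app=$app -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
  [ -z "$pod" ] && continue
  echo "== $app ($pod) =="
  kubectl exec -n crs "$pod" -- sh -c \
    'echo "CORPUS_TMPFS_PATH=${CORPUS_TMPFS_PATH:-unset}"; df -h "${CORPUS_TMPFS_PATH:-/corpus_tmpfs}" 2>/dev/null'
done
```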
Deployment Config Verification
When behavior doesn't match expectations, verify Helm values actually took effect:
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.volumes}'
Typos in the Helm values template (e.g. wrong key names) are silently ignored, so the chart falls back to its defaults. If the deployed resources don't match the values template, check for key-name mismatches.
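To compare intent against reality, a sketch assuming the Helm release name is known (replace <release>); helm get values shows what was supplied, and --all merges in the chart defaults:

```bash
# Values actually supplied to the release:
helm get values <release> -n crs
# Same, merged with chart defaults:
helm get values <release> -n crs --all
# What the pod is really running with:
kubectl get pod -n crs <pod-name> -o jsonpath='{.spec.containers[0].resources}'
```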
Service-Specific Debugging
For detailed per-service symptoms, root causes, and fixes, see references/failure-patterns.md.
Quick reference:
- DinD: kubectl logs -n crs -l app=dind --tail=100 -- look for docker daemon crashes, storage driver errors
- Build-bot: check build queue depth, DinD connectivity (see the sketch after this list), OOM during compilation
- Fuzzer-bot: corpus disk usage, CPU throttling, crash queue backlog
- Patcher: LiteLLM connectivity, LLM timeout, patch queue depth
- Scheduler: the central brain -- kubectl logs -n crs -l app=scheduler --tail=-1 --prefix | grep "WAIT_PATCH_PASS\|ERROR\|SUBMIT"
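For the build-bot to DinD path specifically, a reachability sketch; it assumes the build-bot image ships a docker CLI, that DOCKER_HOST points at the dind service, and that the pod carries an app=build-bot label, all of which should be verified first:

```bash
pod=$(kubectl get pods -n crs -l app=build-bot -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n crs "$pod" -- sh -c \
  'echo "DOCKER_HOST=${DOCKER_HOST:-unset}"; docker info >/dev/null 2>&1 && echo "DinD reachable" || echo "DinD unreachable"'
```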
Diagnostic Script
Run the automated triage snapshot:
bash {baseDir}/scripts/diagnose.sh
Pass --full to also dump recent logs from all pods:
bash {baseDir}/scripts/diagnose.sh --full
This collects pod status, events, resource usage, Redis health, and queue depths in one pass.
Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.