graph-loop
// TDD iteration loop across 4 environments (local Kind, custom HyperShift, CI Kind, CI HyperShift) with test matrix tracking and log analysis
| name | graph-loop |
| description | TDD iteration loop across 4 environments (local Kind, custom HyperShift, CI Kind, CI HyperShift) with test matrix tracking and log analysis |
Iterate on OpenShell E2E tests across all 4 environments until the test matrix is green. Track iterations, detect regressions, and report status tables.
This skill is designed for /loop — it MUST be idempotent and always progress forward.
$LOG_DIR/test-matrix-tracking.md
(or create it). Know which iteration you're on and what passed last time.

The graph loop has two modes: use the quick debug loop to fix individual failures fast, then switch to the full iteration to verify everything.
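A minimal sketch for resuming idempotently, assuming the tracking file uses the `## Iteration N — YYYY-MM-DD HH:MM` heading format shown in the template later in this skill:

```shell
# Determine the next iteration number from the tracking file and set up logs.
TRACK=${TRACK:-/tmp/kagenti/test-matrix-tracking.md}
mkdir -p "$(dirname "$TRACK")" && touch "$TRACK"   # create on first run
LAST=$(grep -oE '^## Iteration [0-9]+' "$TRACK" | tail -1 | grep -oE '[0-9]+$')
ITER=$(( ${LAST:-0} + 1 ))
export LOG_DIR=/tmp/kagenti/tdd-iter$ITER
mkdir -p "$LOG_DIR"
echo "iteration $ITER, logs in $LOG_DIR"
```

An empty or missing file yields iteration 1, so the loop always progresses forward.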
For fixing specific failing tests on a LIVE cluster. No full redeploy.
# LiteLLM config change:
kubectl apply -f - <<EOF ... EOF && kubectl rollout restart deploy/litellm-model-proxy -n team1
# Test code change (no redeploy needed — pytest reads from disk):
# just edit and rerun
# Agent manifest change:
kubectl apply -f deployments/openshell/agents/<agent>.yaml -n team1
# Gateway change:
kubectl delete sts openshell-gateway -n openshell-system --wait=false
kubectl apply -k deployments/openshell/
OPENSHELL_LLM_AVAILABLE=true uv run pytest \
kagenti/tests/e2e/openshell/test_12_litellm_claude_sandbox.py \
-v --tb=short -k "test_name_pattern" \
> $LOG_DIR/quick-debug.log 2>&1; echo "EXIT:$?"
OPENSHELL_LLM_AVAILABLE=true uv run pytest \
kagenti/tests/e2e/openshell/test_12_litellm_claude_sandbox.py \
kagenti/tests/e2e/openshell/test_07_skill_execution.py \
-v --tb=short -k "claude or litellm or waypoint" \
> $LOG_DIR/quick-regression.log 2>&1; echo "EXIT:$?"
Runs the complete openshell-full-test.sh end-to-end. Use AFTER quick debug
fixes are committed. Produces the matrix row with all categories.
The flow:
Quick debug (fix A) → Quick debug (fix B) → Commit → Full iteration → Matrix update
      ↑                                                                    │
      └──────────────────────── if regression detected ────────────────────┘
| ID | Environment | How to run | Credentials |
|---|---|---|---|
| kind | Local Kind | `openshell-full-test.sh --skip-cluster-create --skip-cluster-destroy` | `.env.maas` |
| hcp | Custom HyperShift | Same script with `--platform ocp`, uses `KUBECONFIG=~/clusters/hcp/<cluster>/auth/kubeconfig` | `.env.kagenti-hypershift-custom` + `.env.maas` |
| ci-kind | CI Kind | `/run-e2e-openshell` comment on PR | `OPENAI_API_KEY` GH secret |
| ci-hcp | CI HyperShift | Same trigger, runs `e2e-openshell-hypershift.yaml` | `OPENAI_API_KEY` GH secret |
The e2e-openshell-kind.yaml workflow fires on both pull_request and issue_comment.
The pull_request run has NO secrets (fork PRs) so LLM tests skip (~79/0/65).
The issue_comment run (from /run-e2e-openshell) has full secrets (~114/0/30).
Always analyze the issue_comment-triggered run, not the pull_request one:
# Find the CORRECT CI Kind run (issue_comment with secrets, not pull_request without)
CORRECT_RUN=$(gh run list --workflow e2e-openshell-kind.yaml --limit 10 \
--json event,conclusion,databaseId \
-q '[.[] | select(.event=="issue_comment" and .conclusion=="success")][0].databaseId')
gh run view "$CORRECT_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed"
The pull_request run is useful only for infra-only validation (no LLM). For agent
coverage (Claude Code, OpenCode, skills), the issue_comment run is authoritative.
Every agent MUST be tested for the same baseline capabilities. The graph-loop matrix is organized as Capability × Agent × Model, not by test file. If an agent is missing a capability test, that is a gap to fix — not an expected skip.
Agents under test:
- `openshell_claude` — CLI sandbox
- `openshell_opencode` — CLI sandbox
- `nemoclaw_openclaw` — NemoClaw gateway

Every agent must have tests for each of these capabilities (23 capabilities, 4 tiers):
Tier 1: Infrastructure (no LLM)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 1 | Connectivity | test_connectivity__<agent> | Agent responds to basic request |
| 2 | Credential security | test_credential_security__<agent> | No hardcoded secrets |
| 3 | Sandbox lifecycle | test_sandbox_lifecycle__<agent> | Create, list, delete sandbox |
| 4 | Workspace | test_workspace__<agent> | Data persists across pod restarts |
| 5 | Resource limits | test_resource_limits__<agent> | Respects CPU/memory budgets |
Tier 2: Capabilities (requires LLM)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 6 | Multiturn | test_multiturn__<agent> | Stateful 3+ turn conversation |
| 7 | Context isolation | test_context_isolation__<agent> | Sessions don't leak |
| 8 | Session resume | test_session_resume__<agent> | Survives pod restart |
| 9 | Cross-session memory | test_cross_session_memory__<agent> | Remembers previous sessions |
| 10 | Streaming | test_streaming__<agent> | Real-time response delivery |
| 11 | Tool calling | test_tool_calling__<agent> | Invokes tools (function calling) |
| 12 | MCP direct | test_mcp_direct__<agent> | Calls MCP server directly |
| 13 | MCP via gateway | test_mcp_gateway__<agent> | Calls MCP through gateway proxy |
| 14 | MCP discovery | test_mcp_discovery__<agent> | Discovers available MCP servers |
| 15 | Concurrent sessions | test_concurrent_sessions__<agent> | Multiple users don't interfere |
Tier 3: Skills (per-model parametrized)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 16 | Skill: PR review | test_skill_pr_review__<agent>__<model> | LLM reviews code |
| 17 | Skill: RCA | test_skill_rca__<agent>__<model> | LLM diagnoses failures |
| 18 | Skill: Security | test_skill_security__<agent>__<model> | LLM finds vulnerabilities |
| 19 | Skill: GitHub PR | test_skill_github_pr__<agent>__<model> | Clones and reviews live PR |
Tier 4: Security & Policy
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 20 | HITL: Network | test_hitl_network__<agent> | Unauthorized egress blocked |
| 21 | HITL: Tool approval | test_hitl_tool_approval__<agent> | Requires permission before tool use |
| 22 | HITL: MCP approval | test_hitl_mcp__<agent> | MCP server requires approval before executing |
| 23 | Audit logging | test_audit_logging__<agent> | Actions produce OTel spans |
MISS = test doesn't exist yet (gap to file as issue). SKIP with clear reason = acceptable temporarily. SKIP without reason = failure.
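A mechanical check for the SKIP-without-reason rule, assuming pytest was run with `-rs` so the short summary lists each skip as `SKIPPED [n] file.py:LINE: reason` (the sample log below is illustrative):

```shell
# Flag SKIPPED entries that carry no reason (policy: SKIP without reason = failure).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
SKIPPED [1] kagenti/tests/e2e/openshell/test_07_skill_execution.py:88: LLM not available
SKIPPED [1] kagenti/tests/e2e/openshell/test_03_workspace.py:41:
EOF
# Lines with no ": reason" suffix split into a single awk field.
NO_REASON=$(grep '^SKIPPED' "$LOG" | awk -F': ' 'NF < 2')
echo "$NO_REASON"
```

Anything printed here is a failure to fix, not an acceptable skip.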
Design spec: docs/superpowers/specs/2026-05-02-agent-capability-test-matrix-design.md
All skill tests (PR review, RCA, security review, real GitHub PR) MUST be parametrized across the configured LLM models. The matrix shows results per model so we catch model-specific regressions.
Current models:
- `llama-scout-17b` (primary, via LiteMaaS)
- `deepseek-r1` (deepseek-r1-distill-qwen-14b, via LiteMaaS)
- `mistral-small` (mistral-small-24b, via LiteMaaS)

The test fixture receives the model name and passes it to the LLM call.
LiteLLM proxy routes to the correct backend. Test names look like:
test_skill_pr_review__openshell_claude__llama_scout_17b
test_skill_pr_review__openshell_claude__deepseek_r1
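MISS gaps can be found mechanically by enumerating every expected Capability × Agent × Model ID and diffing against `pytest --collect-only -q` output. A sketch, using the agent and model names listed in this document and the `test_skill_` prefix from the Tier 3 table:

```shell
# Enumerate expected Tier 3 test IDs: skill x agent x model.
AGENTS="openshell_claude openshell_opencode nemoclaw_openclaw"
MODELS="llama_scout_17b deepseek_r1 mistral_small"
SKILLS="pr_review rca security github_pr"
EXPECTED=$(for S in $SKILLS; do for A in $AGENTS; do for M in $MODELS; do
  echo "test_skill_${S}__${A}__${M}"
done; done; done)
echo "$EXPECTED" | wc -l    # 4 skills x 3 agents x 3 models = 36 combinations
```

Any ID in this list that is absent from the collected tests is a MISS row for the matrix.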
Every iteration MUST end with these 7 tables. Use — for environments not run.
Legend: P=pass, F=fail, S=skip, —=miss, FL=flaky, CI=CI Kind, CH=CI HyperShift, LK=Local Kind, HCP=Custom HCP
Run on each available environment. Use background tasks for independence.
Local Kind (requires running Kind cluster with agents deployed):
export LOG_DIR=/tmp/kagenti/tdd-iter<N> && mkdir -p $LOG_DIR
.github/scripts/local-setup/openshell-full-test.sh \
--skip-cluster-create --skip-cluster-destroy \
> $LOG_DIR/kind-fulltest.log 2>&1; echo "EXIT:$?"
Custom HyperShift (requires ospoc or similar cluster):
cd /path/to/main/repo # NOT worktree — credentials live here
source .env.kagenti-hypershift-custom
export KUBECONFIG=~/clusters/hcp/kagenti-hypershift-custom-ospoc/auth/kubeconfig
/path/to/worktree/.github/scripts/local-setup/openshell-full-test.sh \
--platform ocp --skip-cluster-create --skip-cluster-destroy \
> $LOG_DIR/hcp-fulltest.log 2>&1; echo "EXIT:$?"
CI (push + comment):
git push
gh pr comment <PR> --body "/run-e2e-openshell"
# Wait for completion, then find the CORRECT run:
# IMPORTANT: CI Kind has two triggers. The pull_request run has NO secrets
# (LLM tests skip). Always use the issue_comment run for full coverage.
CI_KIND_RUN=$(gh run list --workflow e2e-openshell-kind.yaml --limit 10 \
--json event,conclusion,databaseId \
-q '[.[] | select(.event=="issue_comment" and .conclusion=="success")][0].databaseId')
gh run view "$CI_KIND_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed" > $LOG_DIR/ci-kind-results.log
CI_HCP_RUN=$(gh run list --workflow e2e-openshell-hypershift.yaml --limit 5 \
--json event,conclusion,databaseId \
-q '[.[] | select(.conclusion=="success")][0].databaseId')
gh run view "$CI_HCP_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed" > $LOG_DIR/ci-hcp-results.log
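To turn a results log into the matrix's `X/Y/Z` cell, parse the final pytest summary line. A sketch with an illustrative sample line (in practice, take the last matching line of `$LOG_DIR/ci-kind-results.log`):

```shell
# Parse passed/failed/skipped totals out of a pytest summary line.
SUMMARY='===== 114 passed, 2 failed, 30 skipped in 612.48s ====='
PASSED=$(echo "$SUMMARY" | grep -oE '[0-9]+ passed'  | grep -oE '[0-9]+')
FAILED=$(echo "$SUMMARY" | grep -oE '[0-9]+ failed'  | grep -oE '[0-9]+')
SKIPPED=$(echo "$SUMMARY" | grep -oE '[0-9]+ skipped' | grep -oE '[0-9]+')
CELL="${PASSED:-0}/${FAILED:-0}/${SKIPPED:-0}"   # matrix "X/Y/Z" cell
echo "$CELL"
```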
Use a subagent per log file:
Agent(subagent_type='Explore'):
"Read $LOG_DIR/<env>-fulltest.log. Report:
1. Final pytest summary (passed/failed/skipped)
2. ALL FAILED test names
3. ALL tests matching: claude|opencode|adk|litellm|waypoint|gateway
Use grep -E, do NOT read full file. Under 200 words."
Fill in the iteration row in the tracking file. Mark each category:
Check if we improved:
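A sketch of the regression gate, with hypothetical hard-coded totals (wire these to the parsed `X/Y/Z` values from the tracking file in practice):

```shell
# Regression gate: fail counts must not rise, pass counts must not drop.
PREV_PASS=110; PREV_FAIL=4    # hypothetical totals from the previous iteration
CUR_PASS=114;  CUR_FAIL=2     # hypothetical totals from this iteration
if [ "$CUR_FAIL" -gt "$PREV_FAIL" ] || [ "$CUR_PASS" -lt "$PREV_PASS" ]; then
  VERDICT="regression"        # do not commit; return to the quick debug loop
else
  VERDICT="ok"                # improved or stable; safe to commit
fi
echo "$VERDICT"
```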
Fix the root cause of failures. Commit with descriptive message. Do NOT commit if tests regressed from previous iteration.
Maintain at /tmp/kagenti/test-matrix-tracking.md (or docs/plans/ for persistence).
Format:
# OpenShell E2E Test Matrix
## Iteration N — YYYY-MM-DD HH:MM
Commits: `<short-sha> <message>`
| Category | Local Kind | Custom HCP | CI Kind | CI HCP |
|----------|-----------|-----------|---------|--------|
| Waypoint | PASS | PASS | PASS | PASS |
| LiteLLM secure | PASS | SKIP | SKIP | SKIP |
| Anthropic passthrough | PASS | PASS | SKIP | SKIP |
| Claude Code sandbox | PASS | PASS | PASS | PASS |
| Claude Code skills | PASS(3/4) | SKIP | SKIP | SKIP |
| OpenCode sandbox | PASS | SKIP | SKIP | SKIP |
| ADK agent | PASS | PASS | PASS | SKIP |
| Claude SDK agent | PASS | PASS | PASS | PASS |
| Gateway | PASS | PASS | PASS | FAIL(5) |
| Platform | PASS | PASS | PASS | PASS |
| **Total** | **X/Y/Z** | **X/Y/Z** | **X/Y/Z** | **X/Y/Z** |
Changes from previous iteration:
- [+] Claude Code sandbox: SKIP → PASS (shared pod fix)
- [-] Gateway: PASS → FAIL (image pull issue on HCP)
Common causes and fixes:
| Symptom | Cause | Fix |
|---|---|---|
| All LLM tests skip | OPENSHELL_LLM_AVAILABLE not set | Script doesn't detect OPENAI_API_KEY from GH secrets — need .env.maas file fallback |
| Sandbox tests skip | sandbox_crd_installed() returns False | CRD not applied before pytest collection |
| OpenCode/Claude skip | run_*_in_sandbox() returns None | Sandbox pod failed to start — check image pull, namespace, secrets |
| ADK skip on HCP | Port-forward fails | Supervisor netns blocks — use port-bridge sidecar |
| Gateway tests fail on HCP | Gateway pod not running | Image pull auth, SCC, or StatefulSet immutable field |
Run at least 5 iterations before giving up on aligning the matrix. Each iteration runs environments in parallel at their natural speed (the CI environments are triggered by a `/run-e2e-openshell` comment). If progress stalls, use `superpowers:brainstorming` or `superpowers:systematic-debugging` to rethink the approach before the next iteration.

The "graph" in graph-loop means reconstructing the flow of events across components and validating the expected sequence occurred. This goes beyond pass/fail — it verifies the ARCHITECTURE is working correctly.
Claude Code → ANTHROPIC_BASE_URL → LiteLLM /v1/messages
→ hosted_vllm translation → LiteMaaS /v1/chat/completions
→ response back through LiteLLM → Anthropic format → Claude Code
Expected events: POST /v1/messages received, model resolved, upstream call made.

Enable debug logging:
kubectl set env deploy/litellm-model-proxy -n team1 LITELLM_LOG=DEBUG
kubectl set env deploy/claude-sdk-agent -n team1 -c agent LOG_LEVEL=DEBUG
[litellm] INFO POST /v1/messages model=claude-sonnet-4-20250514
[litellm] INFO Using chat_completions path (anthropic messages translation)
[litellm] INFO Upstream call to hosted_vllm/llama-scout-17b
[litellm] INFO Response 200 tokens=N
If any line is missing → the test should flag it even if the response was correct.
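A sketch of that flag: check that each expected event appears in the log, in order. The sample log reproduces the expected lines above; the grep patterns are substrings of them.

```shell
# Validate the request-flow "graph": every expected event present, in order.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[litellm] INFO POST /v1/messages model=claude-sonnet-4-20250514
[litellm] INFO Using chat_completions path (anthropic messages translation)
[litellm] INFO Upstream call to hosted_vllm/llama-scout-17b
[litellm] INFO Response 200 tokens=812
EOF
BROKEN=0; PREV=0
for PAT in 'POST /v1/messages' 'chat_completions path' 'Upstream call' 'Response 200'; do
  LINE=$(grep -nF "$PAT" "$LOG" | head -1 | cut -d: -f1)
  if [ -z "$LINE" ] || [ "$LINE" -le "$PREV" ]; then
    echo "graph broken at: $PAT"; BROKEN=1
  else
    PREV=$LINE
  fi
done
echo "BROKEN=$BROKEN"
```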
Every iteration must also analyze component logs for errors and warnings. Target: 0 errors, minimum warnings.
for COMP in openshell-gateway litellm-model-proxy; do
kubectl logs deploy/$COMP -n ${NS:-team1} --tail=500 > $LOG_DIR/${COMP}.log 2>&1 || true
done
for COMP in claude-sdk-agent adk-agent-supervised weather-agent-supervised; do
kubectl logs deploy/$COMP -n team1 -c agent --tail=200 > $LOG_DIR/${COMP}.log 2>&1 || true
done
kubectl logs -n istio-system -l app=ztunnel --tail=100 > $LOG_DIR/ztunnel.log 2>&1 || true
kubectl logs deploy/waypoint -n team1 --tail=100 > $LOG_DIR/waypoint.log 2>&1 || true
Agent(subagent_type='Explore'):
"Grep $LOG_DIR/*.log for ERROR|WARN|error|warn|panic|fatal.
Categorize by component and severity.
Exclude known noise: 'deprecated', 'liveness probe'.
Report: component, count of errors, count of warnings, sample messages.
Under 200 words."
| Component | Errors | Warnings | Notes |
|---|---|---|---|
| openshell-gateway | 0 | 2 | deprecation warnings (known) |
| litellm-model-proxy | 0 | 0 | clean |
| claude-sdk-agent | 0 | 1 | reconnect warning |
| ztunnel | 0 | 0 | clean |
| waypoint | 0 | 0 | clean |
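A deterministic cross-check for the subagent analysis: count errors and warnings per component log and emit table rows directly. The sample log stands in for a real `$LOG_DIR`; the noise filter mirrors the exclusions above.

```shell
# Count errors/warnings per component log, excluding known noise.
LOG_DIR=$(mktemp -d)            # sample log; point at your real $LOG_DIR
printf 'level=warn msg="flag x is deprecated"\nlevel=warn msg="reconnecting"\n' \
  > "$LOG_DIR/claude-sdk-agent.log"
echo "| Component | Errors | Warnings |"
echo "|---|---|---|"
for F in "$LOG_DIR"/*.log; do
  COMP=$(basename "$F" .log)
  ERRS=$(grep -ciE 'error|panic|fatal' "$F")
  WARNS=$(grep -iE 'warn' "$F" | grep -viE 'deprecated|liveness probe' | grep -c .)
  echo "| $COMP | ${ERRS:-0} | ${WARNS:-0} |"
done
```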
Verify agents emit structured JSON logs with OTel fields:
- `trace_id`, `span_id` in log entries (when tracing enabled)
- `level`, `msg`, `component` fields

Verify this on all 4 environments.
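A minimal spot-check on one log line; the sample JSON is hypothetical, and the field names are the ones listed above:

```shell
# Check a JSON log line for the required structured fields.
LINE='{"level":"info","msg":"session started","component":"claude-sdk-agent","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}'
MISSING=""
for FIELD in level msg component trace_id span_id; do
  echo "$LINE" | grep -q "\"$FIELD\"" || MISSING="$MISSING $FIELD"
done
[ -z "$MISSING" ] && echo "structured log fields OK" || echo "missing:$MISSING"
```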
After 5 iterations or when progress stalls, present: final matrix, resolved items, remaining blockers with options, and batched questions for user to unblock the next cycle. User answers all at once → clear direction.