graph-loop
TDD iteration loop across 4 environments (local Kind, custom HyperShift, CI Kind, CI HyperShift) with test matrix tracking and log analysis
| name | graph-loop |
| description | TDD iteration loop across 4 environments (local Kind, custom HyperShift, CI Kind, CI HyperShift) with test matrix tracking and log analysis |
Iterate on OpenShell E2E tests across all 4 environments until the test matrix is green. Track iterations, detect regressions, and report status tables.
This skill is designed for /loop — it MUST be idempotent and always progress forward.
Start each run by reading `$LOG_DIR/test-matrix-tracking.md` (or create it). Know which iteration you're on and what passed last time.

The graph loop has two modes: use the quick debug loop to fix individual failures fast, then switch to the full iteration to verify everything.
For fixing specific failing tests on a LIVE cluster. No full redeploy.
# LiteLLM config change:
kubectl apply -f - <<EOF ... EOF && kubectl rollout restart deploy/litellm-model-proxy -n team1
# Test code change (no redeploy needed — pytest reads from disk):
# just edit and rerun
# Agent manifest change:
kubectl apply -f deployments/openshell/agents/<agent>.yaml -n team1
# Gateway change:
kubectl delete sts openshell-gateway -n openshell-system --wait=false
kubectl apply -k deployments/openshell/
# Run only the failing test:
OPENSHELL_LLM_AVAILABLE=true uv run pytest \
  kagenti/tests/e2e/openshell/test_12_litellm_claude_sandbox.py \
  -v --tb=short -k "test_name_pattern" \
  > $LOG_DIR/quick-debug.log 2>&1; echo "EXIT:$?"
# Quick regression check across the related test files:
OPENSHELL_LLM_AVAILABLE=true uv run pytest \
  kagenti/tests/e2e/openshell/test_12_litellm_claude_sandbox.py \
  kagenti/tests/e2e/openshell/test_07_skill_execution.py \
  -v --tb=short -k "claude or litellm or waypoint" \
  > $LOG_DIR/quick-regression.log 2>&1; echo "EXIT:$?"
Runs the complete openshell-full-test.sh end-to-end. Use AFTER quick debug
fixes are committed. Produces the matrix row with all categories.
The flow:
Quick debug (fix A) → Quick debug (fix B) → Commit → Full iteration → Matrix update
     ↑                                                                      |
     └──────────── if regression detected ──────────────────────────────────┘
| ID | Environment | How to run | Credentials |
|---|---|---|---|
| kind | Local Kind | `openshell-full-test.sh --skip-cluster-create --skip-cluster-destroy` | `.env.maas` |
| hcp | Custom HyperShift | Same script with `--platform ocp`, uses `KUBECONFIG=~/clusters/hcp/<cluster>/auth/kubeconfig` | `.env.kagenti-hypershift-custom` + `.env.maas` |
| ci-kind | CI Kind | `/run-e2e-openshell` comment on PR | `OPENAI_API_KEY` GH secret |
| ci-hcp | CI HyperShift | Same trigger, runs `e2e-openshell-hypershift.yaml` | `OPENAI_API_KEY` GH secret |
The e2e-openshell-kind.yaml workflow fires on both pull_request and issue_comment.
The pull_request run has NO secrets (fork PRs), so LLM tests skip (~79 passed / 0 failed / 65 skipped).
The issue_comment run (from /run-e2e-openshell) has full secrets (~114 passed / 0 failed / 30 skipped).
Always analyze the issue_comment-triggered run, not the pull_request one:
# Find the CORRECT CI Kind run (issue_comment with secrets, not pull_request without)
CORRECT_RUN=$(gh run list --workflow e2e-openshell-kind.yaml --limit 10 \
--json event,conclusion,databaseId \
-q '[.[] | select(.event=="issue_comment" and .conclusion=="success")][0].databaseId')
gh run view "$CORRECT_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed"
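The jq filter above can also be expressed in Python when the run list needs further processing. This is a minimal sketch, assuming the JSON emitted by `gh run list --json event,conclusion,databaseId` has been captured as text; the function name `pick_run` and the sample run IDs are hypothetical.

```python
import json
from typing import Optional


def pick_run(runs_json: str, event: str = "issue_comment") -> Optional[int]:
    """Return the databaseId of the first successful run triggered by `event`.

    Mirrors the jq filter: select(.event == ... and .conclusion == "success"), then [0].
    """
    for run in json.loads(runs_json):
        if run.get("event") == event and run.get("conclusion") == "success":
            return run["databaseId"]
    return None  # no matching run in the listed window


# Hypothetical sample of `gh run list --json event,conclusion,databaseId` output
sample_runs = json.dumps([
    {"event": "pull_request", "conclusion": "success", "databaseId": 103},
    {"event": "issue_comment", "conclusion": "failure", "databaseId": 102},
    {"event": "issue_comment", "conclusion": "success", "databaseId": 101},
])
print(pick_run(sample_runs))
```

Like the jq filter, this ignores runs whose conclusion is not `success`, so a `None` result means no successful comment-triggered run is in the listed window.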
The pull_request run is useful only for infra-only validation (no LLM). For agent
coverage (Claude Code, OpenCode, skills), the issue_comment run is authoritative.
Every agent MUST be tested for the same baseline capabilities. The graph-loop matrix is organized as Capability × Agent × Model, not by test file. If an agent is missing a capability test, that is a gap to fix — not an expected skip.
- `openshell_claude`: CLI sandbox
- `openshell_opencode`: CLI sandbox
- `nemoclaw_openclaw`: NemoClaw gateway

Every agent must have tests for each of these (19 capabilities, 4 tiers):
Tier 1: Infrastructure (no LLM)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 1 | Connectivity | test_connectivity__<agent> | Agent responds to basic request |
| 2 | Credential security | test_credential_security__<agent> | No hardcoded secrets |
| 3 | Sandbox lifecycle | test_sandbox_lifecycle__<agent> | Create, list, delete sandbox |
| 4 | Workspace | test_workspace__<agent> | Data persists across pod restarts |
| 5 | Resource limits | test_resource_limits__<agent> | Respects CPU/memory budgets |
Tier 2: Capabilities (requires LLM)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 6 | Multiturn | test_multiturn__<agent> | Stateful 3+ turn conversation |
| 7 | Context isolation | test_context_isolation__<agent> | Sessions don't leak |
| 8 | Session resume | test_session_resume__<agent> | Survives pod restart |
| 9 | Cross-session memory | test_cross_session_memory__<agent> | Remembers previous sessions |
| 10 | Streaming | test_streaming__<agent> | Real-time response delivery |
| 11 | Tool calling | test_tool_calling__<agent> | Invokes tools (function calling) |
| 12 | MCP direct | test_mcp_direct__<agent> | Calls MCP server directly |
| 13 | MCP via gateway | test_mcp_gateway__<agent> | Calls MCP through gateway proxy |
| 14 | MCP discovery | test_mcp_discovery__<agent> | Discovers available MCP servers |
| 15 | Concurrent sessions | test_concurrent_sessions__<agent> | Multiple users don't interfere |
Tier 3: Skills (per-model parametrized)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 1 | Skill: PR review | test_skill_pr_review__<agent>__<model> | LLM reviews code |
| 2 | Skill: RCA | test_skill_rca__<agent>__<model> | LLM diagnoses failures |
| 3 | Skill: Security | test_skill_security__<agent>__<model> | LLM finds vulnerabilities |
| 4 | Skill: GitHub PR | test_skill_github_pr__<agent>__<model> | Clones and reviews live PR |
Tier 4: Security & Policy
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 1 | HITL: Network | test_hitl_network__<agent> | Unauthorized egress blocked |
| 2 | HITL: Tool approval | test_hitl_tool_approval__<agent> | Requires permission before tool use |
| 3 | HITL: MCP approval | test_hitl_mcp__<agent> | MCP server requires approval before executing |
| 4 | Audit logging | test_audit_logging__<agent> | Actions produce OTel spans |
Tier 7: Teleport (requires LLM + Sandbox CRD)
| # | Capability | Test pattern | Validates |
|---|---|---|---|
| 1 | Teleport: Package | test_teleport__package | Context bundled into ConfigMap |
| 2 | Teleport: Deploy | test_teleport__deploy | Sandbox created with context |
| 3 | Teleport: Context | test_teleport__context_unpacked | CLAUDE.md unpacked in pod |
| 4 | Teleport: Prompt | test_teleport__prompt_with_context | Claude uses teleported context |
| 5 | Teleport: Cleanup | test_teleport__cleanup | Resources cleaned up |
MISS = test doesn't exist yet (gap to file as issue). SKIP with clear reason = acceptable temporarily. SKIP without reason = failure.
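One way to surface MISS entries is to compare the expected test names against the names pytest actually collected. This is a minimal sketch, assuming names follow the `test_<capability>__<agent>` patterns from the tables above; `expected_tests` and `find_missing` are hypothetical helpers, and the capability list is cut down to three Tier 1 entries for brevity.

```python
from typing import Iterable, List

# Agent IDs as listed earlier; capabilities are a small Tier 1 subset for illustration.
AGENTS = ["openshell_claude", "openshell_opencode", "nemoclaw_openclaw"]
CAPABILITIES = ["connectivity", "credential_security", "sandbox_lifecycle"]


def expected_tests(capabilities: Iterable[str], agents: Iterable[str]) -> List[str]:
    """Build every test name the matrix expects (Capability x Agent)."""
    agent_list = list(agents)
    return [f"test_{cap}__{agent}" for cap in capabilities for agent in agent_list]


def find_missing(collected: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return expected test names that were not collected (MISS entries)."""
    have = set(collected)
    return [name for name in expected if name not in have]


# Hypothetical collection result: one agent lacks the sandbox lifecycle test.
all_expected = expected_tests(CAPABILITIES, AGENTS)
collected = [t for t in all_expected if t != "test_sandbox_lifecycle__nemoclaw_openclaw"]
print(find_missing(collected, all_expected))
```

Each name returned is a gap to file as an issue, not a skip.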
Design spec: docs/superpowers/specs/2026-05-02-agent-capability-test-matrix-design.md
All skill tests (PR review, RCA, security review, real GitHub PR) MUST be parametrized across the configured LLM models. The matrix shows results per model so we catch model-specific regressions.
Current models:
- llama-scout-17b (primary, via LiteMaaS)
- deepseek-r1 (deepseek-r1-distill-qwen-14b, via LiteMaaS)
- mistral-small (mistral-small-24b, via LiteMaaS)

The test fixture receives the model name and passes it to the LLM call. The LiteLLM proxy routes to the correct backend. Test names look like:
test_pr_review__openshell_claude__llama_scout_17b
test_pr_review__openshell_claude__deepseek_r1
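The naming convention can be derived mechanically from the model list. This is a minimal sketch, assuming the model suffix is the model name with hyphens replaced by underscores (consistent with the two names above); `model_suffix` and `skill_test_id` are hypothetical helpers, not functions from the test suite.

```python
MODELS = ["llama-scout-17b", "deepseek-r1", "mistral-small"]


def model_suffix(model: str) -> str:
    """Turn a model name into an identifier-safe suffix, e.g. deepseek-r1 -> deepseek_r1."""
    return model.replace("-", "_").replace(".", "_")


def skill_test_id(skill: str, agent: str, model: str) -> str:
    """Compose a per-model test name: test_<skill>__<agent>__<model>."""
    return f"test_{skill}__{agent}__{model_suffix(model)}"


for model in MODELS:
    print(skill_test_id("pr_review", "openshell_claude", model))
```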
Every iteration MUST end with these 7 tables. Use — for environments not run.
Legend: P=pass, F=fail, S=skip, —=miss, FL=flaky, CI=CI Kind, CH=CI HyperShift, LK=Local Kind, HCP=Custom HCP
Run on each available environment. Use background tasks for independence.
Local Kind (requires running Kind cluster with agents deployed):
export LOG_DIR=/tmp/kagenti/tdd-iter<N> && mkdir -p $LOG_DIR
.github/scripts/local-setup/openshell-full-test.sh \
--skip-cluster-create --skip-cluster-destroy \
> $LOG_DIR/kind-fulltest.log 2>&1; echo "EXIT:$?"
Custom HyperShift (requires ospoc or similar cluster):
cd /path/to/main/repo # NOT worktree — credentials live here
source .env.kagenti-hypershift-custom
export KUBECONFIG=~/clusters/hcp/kagenti-hypershift-custom-ospoc/auth/kubeconfig
/path/to/worktree/.github/scripts/local-setup/openshell-full-test.sh \
--platform ocp --skip-cluster-create --skip-cluster-destroy \
> $LOG_DIR/hcp-fulltest.log 2>&1; echo "EXIT:$?"
CI (push + comment):
git push
gh pr comment <PR> --body "/run-e2e-openshell"
# Wait for completion, then find the CORRECT run:
# IMPORTANT: CI Kind has two triggers. The pull_request run has NO secrets
# (LLM tests skip). Always use the issue_comment run for full coverage.
CI_KIND_RUN=$(gh run list --workflow e2e-openshell-kind.yaml --limit 10 \
--json event,conclusion,databaseId \
-q '[.[] | select(.event=="issue_comment" and .conclusion=="success")][0].databaseId')
gh run view "$CI_KIND_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed" > $LOG_DIR/ci-kind-results.log
CI_HCP_RUN=$(gh run list --workflow e2e-openshell-hypershift.yaml --limit 5 \
--json event,conclusion,databaseId \
-q '[.[] | select(.conclusion=="success")][0].databaseId')
gh run view "$CI_HCP_RUN" --log 2>&1 | grep -E "PASSED|FAILED|SKIPPED|=====.*passed" > $LOG_DIR/ci-hcp-results.log
Use a subagent per log file:
Agent(subagent_type='Explore'):
"Read $LOG_DIR/<env>-fulltest.log. Report:
1. Final pytest summary (passed/failed/skipped)
2. ALL FAILED test names
3. ALL tests matching: claude|opencode|adk|litellm|waypoint|gateway
Use grep -E, do NOT read full file. Under 200 words."
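The same extraction can be scripted when a subagent is not available. This is a minimal sketch, assuming standard pytest output (a final `=== N failed, M passed ... in Xs ===` line and `FAILED <nodeid>` lines in the short test summary); `summarize` is a hypothetical helper and the sample paths are made up.

```python
import re
from typing import Dict, List, Tuple

SUMMARY_RE = re.compile(r"^=+ (.*) in [\d.]+s.*=+$")
COUNT_RE = re.compile(r"(\d+) (passed|failed|skipped)")
FAILED_RE = re.compile(r"^FAILED (\S+)")


def summarize(log_text: str) -> Tuple[Dict[str, int], List[str]]:
    """Return (counts from the final summary line, node IDs of failed tests)."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    failed: List[str] = []
    for raw in log_text.splitlines():
        line = raw.strip()
        summary = SUMMARY_RE.match(line)
        if summary:
            for number, kind in COUNT_RE.findall(summary.group(1)):
                counts[kind] = int(number)
        failure = FAILED_RE.match(line)
        if failure:
            failed.append(failure.group(1))
    return counts, failed


# Hypothetical log excerpt in pytest -v format
sample_log = """\
tests/test_a.py::test_connectivity__openshell_claude PASSED [ 33%]
tests/test_a.py::test_workspace__openshell_claude FAILED [ 66%]
tests/test_a.py::test_streaming__openshell_claude SKIPPED (no LLM) [100%]
=========================== short test summary info ===========================
FAILED tests/test_a.py::test_workspace__openshell_claude - AssertionError
================== 1 failed, 1 passed, 1 skipped in 12.34s ==================
"""
print(summarize(sample_log))
```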
Fill in the iteration row in the tracking file. Mark each category:
Check if we improved:
Fix the root cause of failures. Commit with a descriptive message. Do NOT commit if tests regressed from the previous iteration.
Maintain at /tmp/kagenti/test-matrix-tracking.md (or docs/plans/ for persistence).
Format:
# OpenShell E2E Test Matrix
## Iteration N — YYYY-MM-DD HH:MM
Commits: `<short-sha> <message>`
| Category | Local Kind | Custom HCP | CI Kind | CI HCP |
|----------|-----------|-----------|---------|--------|
| Waypoint | PASS | PASS | PASS | PASS |
| LiteLLM secure | PASS | SKIP | SKIP | SKIP |
| Anthropic passthrough | PASS | PASS | SKIP | SKIP |
| Claude Code sandbox | PASS | PASS | PASS | PASS |
| Claude Code skills | PASS(3/4) | SKIP | SKIP | SKIP |
| OpenCode sandbox | PASS | SKIP | SKIP | SKIP |
| ADK agent | PASS | PASS | PASS | SKIP |
| Claude SDK agent | PASS | PASS | PASS | PASS |
| Gateway | PASS | PASS | PASS | FAIL(5) |
| Platform | PASS | PASS | PASS | PASS |
| **Total** | **X/Y/Z** | **X/Y/Z** | **X/Y/Z** | **X/Y/Z** |
Changes from previous iteration:
- [+] Claude Code sandbox: SKIP → PASS (shared pod fix)
- [-] Gateway: PASS → FAIL (image pull issue on HCP)
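The change list can be computed by comparing two iteration rows, which also gives a mechanical check for the rule against committing regressions. This is a minimal sketch, assuming each iteration is stored as a mapping from category to status and that only statuses starting with PASS count as green; `diff_iterations` and `has_regression` are hypothetical helpers.

```python
from typing import Dict, List


def diff_iterations(prev: Dict[str, str], curr: Dict[str, str]) -> List[str]:
    """List status changes between iterations: [+] improvement, [-] regression."""
    changes = []
    for category, now in curr.items():
        before = prev.get(category)
        if before is None or before == now:
            continue  # new category or unchanged status
        if now.startswith("PASS") and not before.startswith("PASS"):
            mark = "[+]"
        elif before.startswith("PASS") and not now.startswith("PASS"):
            mark = "[-]"
        else:
            mark = "[~]"  # other change, e.g. SKIP to FAIL or PASS to PASS(3/4)
        changes.append(f"{mark} {category}: {before} → {now}")
    return changes


def has_regression(changes: List[str]) -> bool:
    """True if any category moved from PASS to a non-PASS status."""
    return any(change.startswith("[-]") for change in changes)


prev_row = {"Claude Code sandbox": "SKIP", "Gateway": "PASS", "Platform": "PASS"}
curr_row = {"Claude Code sandbox": "PASS", "Gateway": "FAIL(5)", "Platform": "PASS"}
changes = diff_iterations(prev_row, curr_row)
print("\n".join(changes))
print("regression:", has_regression(changes))
```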
Common causes and fixes:
| Symptom | Cause | Fix |
|---|---|---|
| All LLM tests skip | OPENSHELL_LLM_AVAILABLE not set | Script doesn't detect OPENAI_API_KEY from GH secrets — need .env.maas file fallback |
| Sandbox tests skip | sandbox_crd_installed() returns False | CRD not applied before pytest collection |
| OpenCode/Claude skip | run_*_in_sandbox() returns None | Sandbox pod failed to start — check image pull, namespace, secrets |
| ADK skip on HCP | Port-forward fails | Supervisor netns blocks — use port-bridge sidecar |
| Gateway tests fail on HCP | Gateway pod not running | Image pull auth, SCC, or StatefulSet immutable field |
Run at least 5 iterations before giving up on aligning the matrix. Each iteration runs environments in parallel at their natural speed:
/run-e2e-openshell comment)
superpowers:brainstorming or superpowers:systematic-debugging to rethink the approach before the next iteration

The "graph" in graph-loop means reconstructing the flow of events across components and validating that the expected sequence occurred. This goes beyond pass/fail: it verifies that the ARCHITECTURE is working correctly.
Claude Code → ANTHROPIC_BASE_URL → LiteLLM /v1/messages
→ hosted_vllm translation → LiteMaaS /v1/chat/completions
→ response back through LiteLLM → Anthropic format → Claude Code
POST /v1/messages received, model resolved, upstream call made
kubectl set env deploy/litellm-model-proxy -n team1 LITELLM_LOG=DEBUG
kubectl set env deploy/claude-sdk-agent -n team1 -c agent LOG_LEVEL=DEBUG
[litellm] INFO POST /v1/messages model=claude-sonnet-4-20250514
[litellm] INFO Using chat_completions path (anthropic messages translation)
[litellm] INFO Upstream call to hosted_vllm/llama-scout-17b
[litellm] INFO Response 200 tokens=N
If any line is missing → the test should flag it even if the response was correct.
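The sequence check can be sketched as an ordered match over the collected log lines. This is a minimal sketch, assuming the four expected messages above appear as substrings of the real log lines; `missing_events` is a hypothetical helper, and the markers are shortened forms of those lines.

```python
from typing import List, Sequence

EXPECTED = [
    "POST /v1/messages",
    "Using chat_completions path",
    "Upstream call to hosted_vllm/",
    "Response 200",
]


def missing_events(log_lines: Sequence[str], expected: Sequence[str] = EXPECTED) -> List[str]:
    """Return expected markers not found in order; an empty list means the flow is intact."""
    missing = []
    pos = 0  # index of the next log line to search from
    for marker in expected:
        hit = next((i for i in range(pos, len(log_lines)) if marker in log_lines[i]), None)
        if hit is None:
            missing.append(marker)
        else:
            pos = hit + 1  # later markers must appear after this line
    return missing


sample_lines = [
    "[litellm] INFO POST /v1/messages model=claude-sonnet-4-20250514",
    "[litellm] INFO Using chat_completions path (anthropic messages translation)",
    "[litellm] INFO Upstream call to hosted_vllm/llama-scout-17b",
    "[litellm] INFO Response 200 tokens=N",
]
print(missing_events(sample_lines))
```

A non-empty result names the step that did not happen, or happened out of order, even when the test's response looked correct.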
Every iteration must also analyze component logs for errors and warnings. Target: 0 errors, minimum warnings.
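# openshell-gateway runs as a StatefulSet in openshell-system, not as a Deployment in team1
kubectl logs sts/openshell-gateway -n openshell-system --tail=500 > $LOG_DIR/openshell-gateway.log 2>&1 || true
kubectl logs deploy/litellm-model-proxy -n ${NS:-team1} --tail=500 > $LOG_DIR/litellm-model-proxy.log 2>&1 || true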
for COMP in claude-sdk-agent adk-agent-supervised weather-agent-supervised; do
kubectl logs deploy/$COMP -n team1 -c agent --tail=200 > $LOG_DIR/${COMP}.log 2>&1 || true
done
kubectl logs -n istio-system -l app=ztunnel --tail=100 > $LOG_DIR/ztunnel.log 2>&1 || true
kubectl logs deploy/waypoint -n team1 --tail=100 > $LOG_DIR/waypoint.log 2>&1 || true
Agent(subagent_type='Explore'):
"Grep $LOG_DIR/*.log for ERROR|WARN|error|warn|panic|fatal.
Categorize by component and severity.
Exclude known noise: 'deprecated', 'liveness probe'.
Report: component, count of errors, count of warnings, sample messages.
Under 200 words."
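The categorization can also be scripted directly. This is a minimal sketch, assuming one log text per component and case-insensitive keyword matching; `count_issues` is a hypothetical helper, and the noise list mirrors the two known patterns named above.

```python
import re
from typing import Dict

ERROR_RE = re.compile(r"error|panic|fatal", re.IGNORECASE)
WARN_RE = re.compile(r"warn", re.IGNORECASE)
NOISE = ("deprecated", "liveness probe")


def count_issues(logs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Count error and warning lines per component, skipping known noise."""
    report = {}
    for component, text in logs.items():
        errors = warnings = 0
        for line in text.splitlines():
            lowered = line.lower()
            if any(noise in lowered for noise in NOISE):
                continue  # known noise is excluded from both counts
            if ERROR_RE.search(line):
                errors += 1
            elif WARN_RE.search(line):
                warnings += 1
        report[component] = {"errors": errors, "warnings": warnings}
    return report


# Hypothetical log contents keyed by component name
sample_logs = {
    "openshell-gateway": "WARN field x is deprecated\nINFO started\n",
    "claude-sdk-agent": "WARN reconnecting to upstream\nERROR connection refused\nINFO ok\n",
}
print(count_issues(sample_logs))
```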
| Component | Errors | Warnings | Notes |
|---|---|---|---|
| openshell-gateway | 0 | 2 | deprecation warnings (known) |
| litellm-model-proxy | 0 | 0 | clean |
| claude-sdk-agent | 0 | 1 | reconnect warning |
| ztunnel | 0 | 0 | clean |
| waypoint | 0 | 0 | clean |
Verify agents emit structured JSON logs with OTel fields:
- trace_id, span_id in log entries (when tracing enabled)
- level, msg, component fields

All 4 environments show:
After 5 iterations or when progress stalls, present: final matrix, resolved items, remaining blockers with options, and batched questions for user to unblock the next cycle. User answers all at once → clear direction.
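The stopping rule can be made explicit. This is a minimal sketch, assuming progress is measured by the total pass count per iteration and that "stalled" means no gain over the last two iterations; both the metric and the window are assumptions, and `should_escalate` is a hypothetical helper.

```python
from typing import Sequence


def should_escalate(pass_counts: Sequence[int], min_iterations: int = 5, window: int = 2) -> bool:
    """True when it is time to present the summary and questions to the user."""
    if len(pass_counts) >= min_iterations:
        return True  # iteration budget reached
    if len(pass_counts) > window:
        recent = pass_counts[-window:]
        baseline = pass_counts[-window - 1]
        return max(recent) <= baseline  # no improvement within the window
    return False  # too few iterations to judge


print(should_escalate([100, 105, 110]))       # still improving
print(should_escalate([100, 110, 110, 109]))  # stalled
```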