debug-openshell-cluster

Name: Debug Openshell Cluster
Author: NVIDIA

// Debug why an OpenShell gateway deployment is unhealthy, unreachable, or unable to create sandboxes. Use when the user has a gateway health failure, Docker/Podman runtime issue, Helm install failure, Kubernetes scheduling issue, TLS secret issue, VM driver issue, or sandbox startup problem. Trigger keywords - debug gateway, gateway failing, deployment failing, helm install failing, cluster health, gateway health, gateway not starting, health check failed, sandbox pending, docker driver, podman driver, vm driver.

stars:6,193

forks:734

updated:May 22, 2026 at 00:58

package.json

"author": "NVIDIA"

"repository": "NVIDIA/OpenShell"

Debug OpenShell Gateway Deployment

Diagnose a gateway and its selected compute platform. Do not assume OpenShell provisions Kubernetes or runs a k3s container. OpenShell targets a reachable gateway endpoint backed by Docker, Podman, Kubernetes, or the experimental VM driver.

Use openshell first to identify the active endpoint. Then use the platform tools that match the gateway's compute driver: docker, podman, kubectl/helm, or VM driver logs.

Overview

The target deployment flow is:

Operator starts or deploys the gateway with system packages, systemd, Helm, or a development task. The CLI does not start, stop, or destroy gateway services.
Operator configures the compute driver.
Operator provides TLS and SSH relay material for the deployment mode.
The CLI registers a reachable gateway endpoint with openshell gateway add.
The gateway creates sandboxes through the selected compute driver.

For local evaluation only, TLS may be disabled and the gateway can be reached through http://127.0.0.1:<port>.

Prerequisites

The openshell CLI must be available for endpoint checks.
Know the active gateway name and endpoint, or be able to inspect local gateway metadata.
Know the compute platform: Docker, Podman, Kubernetes, or VM.
For Kubernetes: kubectl must target the cluster that hosts OpenShell, and Helm 3 or later must be available.
For Docker or Podman: the runtime socket must be reachable from the gateway host.

Workflow

Run diagnostics in order and stop once the root cause is clear.

Step 1: Check CLI Reachability

openshell gateway info
openshell status

Common findings:

No active gateway: register one with openshell gateway add <endpoint>.
Connection refused: gateway process is not running, service exposure is wrong, or a port-forward/proxy is not active.
TLS/certificate errors: CLI mTLS bundle does not match the gateway CA, or the gateway is running with unexpected TLS settings.
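
The findings above can be turned into a small triage helper. This is a minimal sketch: the function name classify_status_error is hypothetical, and the matched substrings are assumptions about what the CLI prints, not confirmed openshell output.

```shell
# Map the text of a failed `openshell status` call to one of the findings above.
# ASSUMPTION: the matched substrings are guesses at typical error wording.
classify_status_error() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"no active gateway"*)
      echo "register: openshell gateway add <endpoint>" ;;
    *"connection refused"*)
      echo "unreachable: check gateway process, service exposure, port-forward" ;;
    *certificate*|*tls*)
      echo "tls: compare CLI mTLS bundle with gateway CA" ;;
    *)
      echo "unknown: read gateway logs" ;;
  esac
}

# Possible usage against a live CLI:
#   classify_status_error "$(openshell status 2>&1)"
```

Matching on lowercase text keeps the patterns short; adjust them once the real error strings are known.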

Step 2: Identify the Compute Platform

Use gateway metadata, deployment values, or the user's setup notes to identify the driver.

Platform	Primary checks
Docker	Gateway process logs, Docker daemon health, sandbox containers, image pulls.
Podman	Podman socket, rootless networking, sandbox containers, image pulls.
Kubernetes	Helm release, StatefulSet, service, secrets, sandbox pods, events.
VM	VM driver logs, rootfs availability, host virtualization support.
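
As a rough illustration, the table can be encoded as a lookup that prints the first commands to run for a given driver. The function name platform_checks and the accepted driver spellings are hypothetical; the commands themselves come from the steps that follow.

```shell
# Print the first checks for a compute driver name (case-insensitive).
# ASSUMPTION: the driver is spelled docker, podman, kubernetes (or k8s), or vm.
platform_checks() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    docker)
      echo "docker info; docker ps --filter name=openshell" ;;
    podman)
      echo "podman info; podman ps --filter name=openshell" ;;
    kubernetes|k8s)
      echo "helm -n openshell status openshell; kubectl -n openshell get statefulset,pod,svc,pvc" ;;
    vm)
      echo "VM driver logs; rootfs availability; host virtualization support" ;;
    *)
      echo "unknown platform: $1" >&2
      return 1 ;;
  esac
}
```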

Step 3: Check Docker-Backed Gateways

docker info
docker ps --filter name=openshell
docker logs <container> --tail=200
docker run --rm --entrypoint /openshell-sandbox "${OPENSHELL_DOCKER_SUPERVISOR_IMAGE:-ghcr.io/nvidia/openshell/supervisor:latest}" --version
openshell status

For Docker GPU failures, check CDI support and NVIDIA CDI discovery separately:

docker info --format '{{json .CDISpecDirs}}'
docker info --format '{{json .DiscoveredDevices}}'
for dir in /etc/cdi /var/run/cdi; do
  if [ -d "$dir" ]; then
    find "$dir" -maxdepth 1 -type f \( -name '*.yaml' -o -name '*.json' \) -print
  else
    echo "$dir missing"
  fi
done
systemctl is-enabled nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
systemctl is-active nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true
systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path --no-pager --lines=50
journalctl -u nvidia-cdi-refresh.service --no-pager --lines=100

When the NVIDIA Container Toolkit CDI refresh units are not enabled or no NVIDIA CDI spec has been generated, enable them and trigger a refresh:

sudo systemctl enable --now nvidia-cdi-refresh.path
sudo systemctl enable --now nvidia-cdi-refresh.service
sudo systemctl restart nvidia-cdi-refresh.service
docker info --format '{{json .DiscoveredDevices}}'

Common findings:

Docker daemon unavailable: start Docker Desktop or Docker Engine.
Gateway process stopped: inspect exit status and logs.
Sandbox image missing or pull denied: verify image reference and registry credentials.
Docker driver cannot initialize because it cannot find openshell-sandbox: verify OPENSHELL_DOCKER_SUPERVISOR_BIN, check for the sibling binary next to openshell-gateway, or confirm that the configured supervisor image contains /openshell-sandbox.
Sandbox never registers: check gateway logs and supervisor callback endpoint.
Supervisor image exits before printing openshell-sandbox --version: the image should be the scratch supervisor image from deploy/docker/Dockerfile.supervisor and must contain a static executable at /openshell-sandbox.
mise run e2e:docker:gpu fails with "docker info --format json did not report any discovered NVIDIA CDI GPU devices": Docker may report CDISpecDirs while still having no generated NVIDIA CDI specs. Verify that .DiscoveredDevices contains entries such as nvidia.com/gpu=all, that /etc/cdi or /var/run/cdi contains a generated NVIDIA spec, and that nvidia-cdi-refresh.service and nvidia-cdi-refresh.path from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so inactive (dead) can be normal after a successful run; use systemctl status and journalctl to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units and restarting nvidia-cdi-refresh.service to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check docker info.
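
The .DiscoveredDevices check can be scripted so it gives a yes or no answer. The sketch below only searches the JSON text for an nvidia.com/gpu= device ID; the function name has_nvidia_cdi_device is hypothetical, and the sample JSON shape in the comment is an assumption about how Docker renders the field.

```shell
# Succeed when the JSON text passed as $1 mentions an NVIDIA CDI GPU device.
# ASSUMPTION: device IDs appear as plain strings such as "nvidia.com/gpu=all",
# for example [{"Source":"cdi","ID":"nvidia.com/gpu=all"}].
has_nvidia_cdi_device() {
  printf '%s' "$1" | grep -q 'nvidia\.com/gpu='
}

# Possible usage on the gateway host:
#   devices=$(docker info --format '{{json .DiscoveredDevices}}')
#   has_nvidia_cdi_device "$devices" || echo "no NVIDIA CDI devices discovered"
```

A plain text search avoids a dependency on jq, at the cost of not validating the JSON structure.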

For source checkout development, restart the local gateway with:

mise run gateway:docker

Step 4: Check Podman-Backed Gateways

podman info
podman ps --filter name=openshell
podman logs <container> --tail=200
openshell status

Common findings:

Podman socket unavailable: start or expose the user socket.
Rootless networking unavailable: inspect Podman network configuration.
Sandbox image missing or pull denied: verify image reference and registry credentials.
Supervisor cannot call back: check callback endpoint and gateway logs.
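
For the socket finding, a quick existence check can rule out the simplest cause. This sketch assumes a rootless setup that uses Podman's default user socket location under $XDG_RUNTIME_DIR; the function names and the /run/user/<uid> fallback are illustrative, not taken from OpenShell.

```shell
# Print the default rootless Podman API socket path.
# ASSUMPTION: falls back to /run/user/<uid> when XDG_RUNTIME_DIR is unset.
podman_user_socket() {
  echo "${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/podman/podman.sock"
}

# Succeed when the given path exists and is a Unix socket.
socket_ready() {
  [ -S "$1" ]
}

# Possible usage on a systemd host:
#   socket_ready "$(podman_user_socket)" || systemctl --user enable --now podman.socket
```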

Step 5: Check Kubernetes Helm Gateways

helm -n openshell status openshell
helm -n openshell get values openshell
kubectl -n openshell get statefulset,pod,svc,pvc
kubectl -n openshell logs statefulset/openshell --tail=200
kubectl -n openshell rollout status statefulset/openshell

Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.

Check required Helm deployment secrets:

kubectl -n openshell get secret \
  openshell-server-tls \
  openshell-server-client-ca \
  openshell-client-tls \
  openshell-jwt-keys

If the gateway exits with "failed to read sandbox JWT signing key from /etc/openshell-jwt/signing.pem", verify that the openshell-jwt-keys secret contains signing.pem, public.pem, and kid, and that the StatefulSet mounts the sandbox-jwt secret at /etc/openshell-jwt. The sandbox JWT mount is required even when local Helm values disable TLS.
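
One way to confirm the secret's contents is to list its data keys and compare them with the three required names. The helper below is a sketch; missing_jwt_keys is a hypothetical name, and it only checks that keys exist, not that their values are valid.

```shell
# Print the required openshell-jwt-keys entries that are absent from the
# key names given as arguments; succeed only when none are missing.
missing_jwt_keys() {
  present=" $* "
  missing=""
  for key in signing.pem public.pem kid; do
    case "$present" in
      *" $key "*) ;;                      # key is present
      *) missing="$missing $key" ;;       # key is absent
    esac
  done
  echo "${missing# }"
  [ -z "$missing" ]
}

# Possible usage against the cluster:
#   keys=$(kubectl -n openshell get secret openshell-jwt-keys \
#     -o go-template='{{range $k, $v := .data}}{{$k}} {{end}}')
#   missing_jwt_keys $keys || echo "secret is incomplete"
```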

Check the image references currently used by the gateway deployment:

kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'

The gateway image built from deploy/docker/Dockerfile.gateway and the scratch supervisor image built from deploy/docker/Dockerfile.supervisor should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
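
A tag mismatch is easy to miss by eye, so the comparison can be scripted. This sketch extracts the tag from each image reference and compares them; the function names are hypothetical, the gateway image name in the usage comment is illustrative, and a reference without a tag is treated as latest, following the usual container runtime default.

```shell
# Print the tag of an image reference, ignoring any digest and registry port.
image_tag() {
  ref=${1%%@*}      # drop an @sha256:... digest if present
  last=${ref##*/}   # last path component, so host:port is not read as a tag
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# Succeed when both image references carry the same tag.
same_build_tag() {
  [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}

# Possible usage with values read from the StatefulSet:
#   same_build_tag "$gateway_image" "$supervisor_image" || echo "tags differ"
```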

For local/external pull mode (the default local path via mise run cluster), local images are tagged to the configured local registry base, pushed to that registry, and pulled by k3s via the registries.yaml mirror endpoint. The cluster task pushes prebuilt local tags (openshell/*:dev, falling back to localhost:5000/openshell/*:dev or 127.0.0.1:5000/openshell/*:dev).

Gateway image builds stage a partial Rust workspace from deploy/docker/Dockerfile.images. If cargo fails with a missing manifest under /build/crates/..., or an imported symbol exists locally but is missing in the image build, verify that every current gateway dependency crate, including openshell-driver-docker, openshell-driver-kubernetes, and openshell-ocsf, is copied into the staged workspace there.

For plaintext local evaluation, confirm the chart has:

helm -n openshell get values openshell | grep -E 'disableTls|grpcEndpoint'

Expected shape:

server:
  disableTls: true
  grpcEndpoint: http://openshell.openshell.svc.cluster.local:8080
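
The expected shape can be checked mechanically. The sketch below reads Helm values YAML on stdin and succeeds only when disableTls is true and grpcEndpoint uses the http scheme; plaintext_values_ok is a hypothetical name, and the line-based match does not verify that both keys sit under server.

```shell
# Succeed when the YAML on stdin sets disableTls: true and an http:// grpcEndpoint.
# ASSUMPTION: each key appears on its own line, as in the expected shape above.
plaintext_values_ok() {
  values=$(cat)
  printf '%s\n' "$values" |
    grep -Eq '^[[:space:]]*disableTls:[[:space:]]*true[[:space:]]*$' &&
  printf '%s\n' "$values" |
    grep -Eq '^[[:space:]]*grpcEndpoint:[[:space:]]*http://'
}

# Possible usage:
#   helm -n openshell get values openshell --all | plaintext_values_ok \
#     || echo "chart is not configured for plaintext evaluation"
```

Note that helm get values prints only user-supplied values unless --all is passed, so a missing key may simply mean the chart default is in effect.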

Check service exposure:

kubectl -n openshell get svc openshell -o wide
kubectl -n openshell get endpoints openshell

For local port-forward testing:

kubectl -n openshell port-forward svc/openshell 8080:8080
openshell gateway add http://127.0.0.1:8080 --local --name local
openshell status

If the gateway is healthy but sandbox creation fails:

kubectl -n openshell get pods
kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
kubectl -n openshell logs statefulset/openshell --tail=200

Check the configured sandbox namespace:

helm -n openshell get values openshell | grep sandboxNamespace

Then inspect sandbox resources in that namespace.

Check the configured sandbox service account when TokenReview bootstrap or sandbox registration fails. Helm creates a dedicated sandbox service account by default and writes it to [openshell.drivers.kubernetes].service_account_name; the gateway rejects projected tokens from other service accounts.

helm -n openshell get values openshell | grep -A3 sandboxServiceAccount
kubectl -n <sandbox-namespace> get serviceaccount openshell-sandbox
kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\.toml}'
kubectl -n <sandbox-namespace> get sandbox <sandbox-name> -o jsonpath='{.spec.template.spec.serviceAccountName}{"\n"}'
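
To compare the two service account names without reading the whole config, the gateway.toml output can be filtered for the relevant key. This is a minimal sketch using awk; the function name is hypothetical, and it assumes the key is written as a simple quoted string on one line inside the [openshell.drivers.kubernetes] table.

```shell
# Read gateway.toml on stdin and print service_account_name from the
# [openshell.drivers.kubernetes] table, with surrounding quotes removed.
toml_sandbox_service_account() {
  awk '
    /^\[/ { in_section = ($0 == "[openshell.drivers.kubernetes]") }
    in_section && /^[[:space:]]*service_account_name[[:space:]]*=/ {
      sub(/^[^=]*=[[:space:]]*/, "")   # keep only the value
      gsub(/"/, "")                    # strip the quotes
      print
      exit
    }
  '
}

# Possible usage:
#   kubectl -n openshell get configmap openshell-config \
#     -o jsonpath='{.data.gateway\.toml}' | toml_sandbox_service_account
```

If the printed name differs from the serviceAccountName on the sandbox resource, that mismatch is a likely reason for the gateway rejecting the projected token.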

Step 6: Check VM-Backed Gateways

Use the VM driver logs and host diagnostics available in the user's environment. Verify:

The VM driver process is running and reachable by the gateway.
The runtime rootfs exists and matches the expected architecture.
Host virtualization support is enabled.
The sandbox supervisor can establish its callback connection to the gateway.
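
On an x86 Linux host, one rough way to check virtualization support is to look for the vmx (Intel VT-x) or svm (AMD-V) CPU flag. This sketch is an assumption about the host, not a documented OpenShell check: the VM driver's actual hypervisor requirements are not stated here, and other platforms such as ARM or macOS need a different test.

```shell
# Read /proc/cpuinfo-style text on stdin and print vmx or svm if either
# hardware virtualization flag is listed; print nothing otherwise.
# ASSUMPTION: x86 Linux host; the flag names do not exist on other architectures.
cpu_virt_flag() {
  grep -Eow 'vmx|svm' | sort -u | head -n 1
}

# Possible usage:
#   cpu_virt_flag < /proc/cpuinfo
#   [ -c /dev/kvm ] || echo "/dev/kvm is missing"   # only relevant if the driver uses KVM
```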

Then run:

openshell status
openshell logs <sandbox-name>

Common Failure Patterns

Symptom	Likely cause	Check
`openshell status` fails	Gateway endpoint unreachable or auth mismatch	`openshell gateway info`, gateway logs
Gateway starts but sandbox create fails	Compute driver cannot reach runtime	Docker/Podman/Kubernetes/VM driver logs
Docker or Podman sandbox never registers	Wrong callback endpoint or supervisor startup failure	Gateway logs and sandbox container logs
Docker GPU e2e fails before GPU sandbox comparison	NVIDIA CDI specs are missing or Docker has not discovered them	`docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service`
Kubernetes gateway pod pending	PVC unbound, taint, selector, or insufficient resources	`kubectl -n openshell describe pod <pod>`
Kubernetes gateway pod crash loops	Missing secret, bad DB URL, bad TLS config	`kubectl -n openshell logs statefulset/openshell`
CLI TLS error	Local mTLS bundle does not match server cert/CA	Check `~/.config/openshell/gateways/<name>/mtls/`
Image pull failure	Gateway or sandbox image cannot be pulled	Runtime events and image pull credentials
`K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource`	Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait	Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled

Reporting

When handing results back to the user, include:

Active gateway endpoint and auth mode.
Compute platform and driver.
Gateway process or workload status.
Recent gateway log summary.
Missing or malformed TLS or SSH relay material.
Service exposure status.
Sandbox workload status.
The exact command that failed and the shortest fix.
