| name | Debugging Omnistrate Deployments |
| description | Systematically debug failed Omnistrate instance deployments using a progressive workflow that identifies root causes efficiently while avoiding token limits. Applies to deployment failures, probe issues, and helm-based resources. |
Tool: mcp__omnistrate-platform__omnistrate-ctl_instance_describe
Flags: --deployment-status --output json
Extract: overall instance status and per-resource deployment status.
Key Benefit: Returns concise status, significantly reduces token usage vs full describe
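A minimal invocation (the omctl subcommand is assumed to mirror the MCP tool name):
omctl instance describe <instance-id> --deployment-status --output json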
Tool: mcp__omnistrate-platform__omnistrate-ctl_workflow_list
Flags: --instance-id <id> --output json
Extract workflow IDs, types, and start/end times for failed deployments.
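For example (subcommand assumed to mirror the MCP tool name):
omctl workflow list --instance-id <instance-id> --output json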
Phase 1 - Summary (Always Start Here):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json
Extract: step names, statuses, and timestamps; identify which steps failed so Phase 2 can target them.
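A sketch of isolating failed steps with jq; the .steps and .status field names are assumptions about the JSON shape and should be checked against real output:
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> --output json \
  | jq '.steps[] | select(.status == "FAILED")'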
Phase 2 - Detail (Only for Failed Steps):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
--resource-key <name> --step-types <type> --detail --output json
Use parameters:
--resource-key: Target a specific resource
--step-types: Filter to a specific step type (Bootstrap, Compute, Deployment, Network, Storage, Monitoring)
--detail: Include full event details (use sparingly)
--since/--until: Time-bound queries
Extract from detail view: error messages and pod-level events, which feed the timeline visualization below.
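For example, drilling into a failed Deployment step (the resource key "cluster" is a hypothetical placeholder; substitute your own):
omctl workflow events <workflow-id> --service-id <id> --environment-id <id> \
  --resource-key cluster --step-types Deployment --detail --output json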
Pod Event Timeline: Create ASCII visualizations showing deployment progression:
HH:MM:SS ┬─── ✗ FailedScheduling
│ pod/app-0: Insufficient memory
│
HH:MM:SS ├─── ⚡ TriggeredScaleUp
│ nodegroup-1: adding 2 nodes
│
HH:MM:SS ├─── 📥 Pulling image:latest
│ (duration: 2m15s)
│
HH:MM:SS └─── ✅ Started
3/3 pods Running
Symbols: ✗ failed, ✅ success, ⚡ autoscaler, 💾 storage, 📥 image, 🚀 runtime, ⚠️ warning
When: Resource DEPLOYING with probe failures, containers Running but not Ready, no conclusive evidence from previous steps
Tool: mcp__omnistrate-platform__omnistrate-ctl_deployment-cell_update-kubeconfig + kubectl
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig
kubectl get pods -n <instance-id> --kubeconfig /tmp/kubeconfig
kubectl logs <pod-name> -c service -n <instance-id> --kubeconfig /tmp/kubeconfig --tail=50
Look for: probe failure or CrashLoopBackOff events, application startup errors, and repeated container restarts.
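When containers are Running but not Ready, kubectl describe surfaces the readiness/liveness probe events directly:
kubectl describe pod <pod-name> -n <instance-id> --kubeconfig /tmp/kubeconfig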
When: Helm resources with conflicting status, need application credentials, deployment state unclear
Tool: Same kubeconfig setup with --role cluster-admin + helm
omctl deployment-cell update-kubeconfig <cell-id> --kubeconfig /tmp/kubeconfig --role cluster-admin
helm list -n <instance-id> --kubeconfig /tmp/kubeconfig
helm status <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
Extract: release status and revision, rendered notes (often containing application credentials and endpoints), and any failed hooks.
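If the status output is inconclusive, helm get can pull the rendered notes and supplied values (these often include application credentials, so handle them carefully):
helm get notes <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig
helm get values <release-name> -n <instance-id> --kubeconfig /tmp/kubeconfig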
Tool: mcp__omnistrate-platform__omnistrate-ctl_operations_events
Use the time windows identified during workflow analysis, filter by relevant event types, and request --output json.
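A sketch of a time-bounded query; the --since/--until flags are assumed to behave like the workflow-events flags and should be verified against the command's --help output:
omctl operations events --since <workflow-start> --until <workflow-end> --output json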
See OMNISTRATE_SRE_REFERENCE.md for additional reference material.