debug-inference


Updated: May 26, 2026 at 11:17

Troubleshoot failed or slow InferenceService deployments on OpenShift AI. Use when: - "My InferenceService won't start" - "Model deployment is stuck" - "Inference endpoint returns errors" - "Model is slow / high latency" - "GPU scheduling failed for my model" Progressive diagnosis: status conditions, events, pod logs, GPU health, and observability analysis. NOT for deploying models (use /model-deploy). NOT for creating runtimes (use /serving-runtime-config).


Source: RHEcosystemAppEng/agentic-collections



name	debug-inference
description	Troubleshoot failed or slow InferenceService deployments on OpenShift AI. Use when: - "My InferenceService won't start" - "Model deployment is stuck" - "Inference endpoint returns errors" - "Model is slow / high latency" - "GPU scheduling failed for my model" Progressive diagnosis: status conditions, events, pod logs, GPU health, and observability analysis. NOT for deploying models (use /model-deploy). NOT for creating runtimes (use /serving-runtime-config).
model	inherit
color	yellow
license	Apache-2.0
allowed-tools	resources_get resources_list pods_list pods_log events_list list_inference_services get_inference_service get_model_endpoint get_deployment_info analyze_vllm chat_vllm get_gpu_info analyze_openshift query_tempo_tool get_trace_details_tool execute_promql korrel8r_get_correlated

/debug-inference Skill

Troubleshoot failed, stuck, or slow InferenceService deployments on Red Hat OpenShift AI. Performs progressive diagnosis through status conditions, events, pod logs, related resources, and optional observability analysis. Follows a 6-step diagnosis pattern with human-in-the-loop confirmation at each step.

Prerequisites

Required MCP Server: openshift (OpenShift MCP Server)

Required MCP Tools (from openshift):

resources_get - Get ServingRuntime, NIM Account CR, InferenceService details
resources_list - List InferenceServices (OpenShift fallback)
pods_list - Find predictor/transformer pods
pods_log - Retrieve container logs
events_list - Check events for errors

Preferred MCP Server: rhoai (RHOAI MCP Server) — used when available, automatic OpenShift fallback on failure

Preferred MCP Tools (from rhoai):

list_inference_services - List deployed models with structured status data
get_inference_service - Get detailed deployment status (conditions, endpoint, ready state)
get_model_endpoint - Quick check if endpoint is available (early diagnostic)

Optional MCP Server: ai-observability (AI Observability MCP)

Optional MCP Tools (from ai-observability):

get_deployment_info - Check model initialization status
analyze_vllm - Analyze vLLM performance bottlenecks (latency, throughput, errors, token rates)
chat_vllm - Conversational follow-up on vLLM metrics during diagnosis
get_gpu_info - GPU inventory and utilization
analyze_openshift - Check GPU health with "GPU & Accelerators" category
query_tempo_tool - Trace request latency by service/operation/time range
get_trace_details_tool - Get detailed span-level info for a specific trace ID
execute_promql - Run custom PromQL queries for metrics not covered by standard analysis
korrel8r_get_correlated - Correlate signals (logs, traces, metrics, alerts) across a pod/namespace for root cause analysis

Common prerequisites (KUBECONFIG, OpenShift+RHOAI cluster, KServe, verification protocol): See skill-conventions.md.

Fallback templates: See openshift-fallback-templates.md for OpenShift YAML templates used when RHOAI tools are unavailable.

Additional cluster requirements:

An existing InferenceService deployment to debug

When to Use This Skill

Use this skill when you need to:

Troubleshoot an InferenceService that won't start, is stuck, or shows errors
Diagnose slow inference latency or high error rates
Investigate GPU scheduling failures or OOMKilled pods
Perform root cause analysis on model deployment issues

Do NOT use this skill when:

You want to deploy a new model (use /model-deploy)
You want to analyze ongoing model performance (use /ai-observability)
You need to create or fix a ServingRuntime (use /serving-runtime-config)
You need to set up NIM credentials (use /nim-setup)

Workflow

Step 1: Identify Target InferenceService

Ask the user:

Which InferenceService is having issues? (name or "list all")
What namespace is it in?
What is the symptom? (won't start / slow / errors / other)

If the user says "list all" or is unsure:

MCP Tool: list_inference_services (from rhoai)

Parameters:

namespace: user-specified namespace - REQUIRED
verbosity: "standard" - OPTIONAL

If rhoai is unavailable or returns an error: Use resources_list (from openshift) with apiVersion: serving.kserve.io/v1beta1, kind: InferenceService, namespace: [namespace].
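The preferred-then-fallback pattern above can be sketched as follows. This is a minimal illustration, not the skill's implementation: `call_tool` is a hypothetical callable standing in for whatever mechanism invokes an MCP tool by name, and the sketch assumes a failed rhoai call surfaces as an exception.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: a callable that invokes an MCP tool by name.
ToolCaller = Callable[[str, Dict[str, Any]], Any]


def list_inference_services(call_tool: ToolCaller, namespace: str) -> Any:
    """Try the rhoai tool first; fall back to the openshift tool on any error."""
    try:
        return call_tool(
            "list_inference_services",
            {"namespace": namespace, "verbosity": "standard"},
        )
    except Exception:
        # Fallback parameters as given in the text above.
        return call_tool(
            "resources_list",
            {
                "apiVersion": "serving.kserve.io/v1beta1",
                "kind": "InferenceService",
                "namespace": namespace,
            },
        )
```

The same shape applies to the Step 2 fallbacks, with `get_inference_service` and `resources_get` in place of the list tools.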

Present InferenceServices with their status:

| Name | Runtime | Ready | URL | Age |
|------|---------|-------|-----|-----|
| [name] | [runtime] | [True/False/Unknown] | [url or "N/A"] | [age] |

WAIT for user to select which InferenceService to debug.

Step 2: Status Overview

MCP Tool: get_inference_service (from rhoai)

Parameters:

name: the InferenceService name - REQUIRED
namespace: user-specified namespace - REQUIRED
verbosity: "full" - REQUIRED

If rhoai is unavailable or returns an error: Use resources_get (from openshift) with apiVersion: serving.kserve.io/v1beta1, kind: InferenceService, name: [name], namespace: [namespace]. Extract status from .status.conditions.

Early endpoint check:

MCP Tool: get_model_endpoint (from rhoai)

Parameters:

name: the InferenceService name
namespace: user-specified namespace

If rhoai is unavailable or returns an error: Extract the endpoint from .status.url of the InferenceService obtained via resources_get (from openshift).

An empty or error URL indicates deployment issues. Report endpoint availability status.

Present status conditions:

| Condition | Status | Reason | Message |
|-----------|--------|--------|---------|
| Ready | [True/False/Unknown] | [reason] | [message] |
| PredictorReady | [True/False/Unknown] | [reason] | [message] |
| IngressReady | [True/False/Unknown] | [reason] | [message] |

Quick Assessment: Based on conditions, provide initial assessment (e.g., "PredictorReady is False -- the model container is not running. Likely a pod-level issue.")
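As an illustration of how the quick assessment could be derived, the sketch below walks `.status.conditions` and reports every watched condition that is not `True`. It assumes the InferenceService has been fetched as a plain dictionary with the standard Kubernetes condition fields (`type`, `status`, `reason`); the function name and output wording are hypothetical.

```python
from typing import Any, Dict, List

WATCHED = ("Ready", "PredictorReady", "IngressReady")


def quick_assessment(isvc: Dict[str, Any]) -> List[str]:
    """Return one line per watched condition that is missing or not True."""
    conditions = isvc.get("status", {}).get("conditions", [])
    by_type = {c.get("type"): c for c in conditions}
    notes = []
    for name in WATCHED:
        cond = by_type.get(name)
        if cond is None:
            notes.append(f"{name}: not reported yet")
        elif cond.get("status") != "True":
            reason = cond.get("reason", "no reason given")
            notes.append(f"{name} is {cond.get('status')} ({reason})")
    return notes
```

An empty result means all three conditions are `True`, in which case the symptom is more likely performance-related than a startup failure.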

Ask: "Continue with deep analysis of events and pods? (yes/no)"

WAIT for user confirmation.

Step 3: Events and Pod Analysis

MCP Tool: events_list (from openshift)

Parameters:

namespace: user-specified namespace - REQUIRED

Filter events related to the InferenceService name.
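One way to do this filtering is sketched below. It assumes events arrive as dictionaries shaped like Kubernetes core/v1 Event objects (with `involvedObject.name` and `type`); the actual output shape of `events_list` may differ, and substring matching on the name is a heuristic rather than an exact ownership check.

```python
from typing import Any, Dict, List


def events_for_isvc(events: List[Dict[str, Any]], isvc_name: str) -> List[Dict[str, Any]]:
    """Keep events whose involved object name contains the InferenceService
    name, with Warning events sorted ahead of Normal ones."""
    related = [
        e for e in events
        if isvc_name in e.get("involvedObject", {}).get("name", "")
    ]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(related, key=lambda e: e.get("type") != "Warning")
```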

MCP Tool: pods_list (from openshift)

Parameters:

namespace: user-specified namespace - REQUIRED
labelSelector: "serving.kserve.io/inferenceservice=[isvc-name]" - REQUIRED

Present findings:

Events:

| Time | Type | Reason | Message |
|------|------|--------|---------|
| [time] | [Normal/Warning] | [reason] | [message] |

Predictor Pods:

| Pod | Status | Restarts | Node | GPU |
|-----|--------|----------|------|-----|
| [pod-name] | [status] | [count] | [node] | [gpu-count] |

Issues Found:

[Issue from events or pod status]
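The pod table columns can be filled from a Pod object roughly as follows. This sketch assumes each pod is a dictionary in the standard Kubernetes Pod shape and that GPUs are requested through the `nvidia.com/gpu` resource limit; the `oom_killed` flag is an extra illustrative field, not a column the skill defines.

```python
from typing import Any, Dict


def summarize_pod(pod: Dict[str, Any]) -> Dict[str, Any]:
    """Collect status, restart count, node, and GPU count for one pod."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    containers = pod.get("spec", {}).get("containers", [])
    restarts = sum(s.get("restartCount", 0) for s in statuses)
    # A previous termination with reason OOMKilled points at memory limits.
    oom = any(
        s.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"
        for s in statuses
    )
    gpus = sum(
        int(c.get("resources", {}).get("limits", {}).get("nvidia.com/gpu", 0))
        for c in containers
    )
    return {
        "pod": pod.get("metadata", {}).get("name"),
        "status": pod.get("status", {}).get("phase"),
        "restarts": restarts,
        "node": pod.get("spec", {}).get("nodeName"),
        "gpu": gpus,
        "oom_killed": oom,
    }
```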

Ask: "Continue to view pod logs? (yes/no)"

WAIT for user confirmation.

Step 4: Pod Logs Review

MCP Tool: pods_log (from openshift)

Parameters:

namespace: user-specified namespace - REQUIRED
name: predictor pod name from Step 3 - REQUIRED
container: "kserve-container" - REQUIRED (main serving container)

If the container has restarted, also retrieve previous logs.

Present log analysis:

Log Analysis:

[Error pattern identified, e.g., "CUDA out of memory", "S3 access denied", "Model not found"]
[Relevant log line with explanation]
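A simple pattern scan over the retrieved logs might look like the sketch below. The patterns come from the examples in this document; the category labels and the function name are illustrative assumptions, and a real scan would need a longer pattern list.

```python
import re
from typing import List, Tuple

# Patterns taken from the examples in the text; category labels are illustrative.
PATTERNS = [
    (re.compile(r"CUDA out of memory", re.I), "GPU memory exhausted"),
    (re.compile(r"access denied|NoSuchBucket", re.I), "Storage access problem"),
    (re.compile(r"model not found", re.I), "Model path or name problem"),
]


def scan_logs(log_text: str) -> List[Tuple[str, str]]:
    """Return (category, line) pairs for lines matching a known pattern."""
    hits = []
    for line in log_text.splitlines():
        for pattern, category in PATTERNS:
            if pattern.search(line):
                hits.append((category, line.strip()))
                break  # one category per line is enough for a first pass
    return hits
```

An empty result on a failing pod corresponds to the unrecognized-error case, which is where the live doc lookup applies.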

For NIM-specific deployments, also check:

NGC authentication errors in logs
TensorRT engine compilation status
GPU compatibility messages

If the error is unrecognized, trigger a live doc lookup:

Action: Read live-doc-lookup.md using the Read tool
Use WebFetch to look up the error message in RHOAI documentation
Output to user: "I looked up this error on [source]: [explanation and fix]"

Ask: "Continue to check related resources and observability? (yes/no)"

WAIT for user confirmation.

Step 5: Related Resources and Observability

Check ServingRuntime:

MCP Tool: resources_get (from openshift)

Parameters:

apiVersion: "serving.kserve.io/v1alpha1" - REQUIRED
kind: "ServingRuntime" - REQUIRED
namespace: user-specified namespace - REQUIRED
name: runtime name from the InferenceService spec - REQUIRED

Verify the runtime exists and its model format matches the InferenceService.
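The format check can be sketched as a comparison between the two resources. This assumes both are available as dictionaries, that the InferenceService declares its format under `spec.predictor.model.modelFormat.name`, and that the ServingRuntime lists formats under `spec.supportedModelFormats`; the function name is hypothetical, and version matching is left out.

```python
from typing import Any, Dict


def runtime_supports_model(isvc: Dict[str, Any], runtime: Dict[str, Any]) -> bool:
    """True if the runtime lists the model format the InferenceService asks for."""
    model = isvc.get("spec", {}).get("predictor", {}).get("model", {})
    wanted = model.get("modelFormat", {}).get("name")
    supported = {
        f.get("name")
        for f in runtime.get("spec", {}).get("supportedModelFormats", [])
    }
    return wanted is not None and wanted in supported
```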

For NIM deployments -- Check Account CR:

MCP Tool: resources_get (from openshift)

Parameters:

apiVersion: "nim.opendatahub.io/v1alpha1" - REQUIRED
kind: "Account" - REQUIRED
namespace: user-specified namespace - REQUIRED
name: "nim-account" - REQUIRED

If ai-observability MCP is available:

get_deployment_info: Check if the model appears in monitoring and its initialization status
analyze_vllm: Analyze performance metrics for slow inference (latency, throughput, errors, token rates)
chat_vllm: Ask follow-up questions about analyzed metrics (e.g., "Why is latency spiking?")
analyze_openshift with category "GPU & Accelerators": Check GPU health and utilization
query_tempo_tool: Trace request latency if the symptom is slow responses
get_trace_details_tool: Drill into a specific trace ID to see span-level timing
execute_promql: Run custom PromQL queries for deeper metric investigation (e.g., vllm:request_success:ratio, GPU memory utilization)
korrel8r_get_correlated: Correlate signals across the inference stack -- find related logs, traces, metrics, and alerts for the failing pod/namespace (query example: k8s:Pod:{"namespace":"[ns]","name":"[pod-name]"}, goals: ["log:application", "metric:metric", "trace:span"])
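The korrel8r query string from the example above can be assembled with a small helper. This is a sketch: the helper name is hypothetical, and it only reproduces the query and goals format shown in this document rather than the full korrel8r query syntax.

```python
import json
from typing import List

# Goals as listed in the example above.
DEFAULT_GOALS: List[str] = ["log:application", "metric:metric", "trace:span"]


def korrel8r_pod_query(namespace: str, pod_name: str) -> str:
    """Build a k8s:Pod query selector for the given pod."""
    selector = json.dumps(
        {"namespace": namespace, "name": pod_name},
        separators=(",", ":"),  # compact form, matching the example
    )
    return f"k8s:Pod:{selector}"
```

Using `json.dumps` rather than string formatting keeps the selector valid if a name ever contains characters that need escaping.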

If ai-observability is not available: skip this part with the note "Observability analysis skipped (ai-observability MCP not configured)."

Present findings:

ServingRuntime status and compatibility
NIM Account CR status (if applicable)
Observability insights (if available)

Ask: "Continue to diagnosis summary? (yes/no)"

WAIT for user confirmation.

Step 6: Diagnosis Summary

Present a structured diagnosis:

## Diagnosis Summary: [isvc-name]

### Root Cause

**Primary Issue:** [Categorized root cause]

| Category | Status | Details |
|----------|--------|---------|
| ServingRuntime | [OK/FAIL] | [details] |
| Pod Scheduling | [OK/FAIL] | [details] |
| Container Start | [OK/FAIL] | [details] |
| Model Loading | [OK/FAIL] | [details] |
| GPU Access | [OK/FAIL] | [details] |
| Endpoint Health | [OK/FAIL] | [details] |

### Evidence

- [Evidence 1 from events/logs/status]
- [Evidence 2]

### Recommended Actions

1. **[Action 1]** - [description]
2. **[Action 2]** - [description]
3. **[Action 3]** - [description]

### Verification Steps

After applying fixes:
1. Check InferenceService status: `resources_get` for the InferenceService
2. Verify pod is running: `pods_list` with label selector
3. Test endpoint: curl command to the inference URL

End with options:

Would you like me to:
1. Execute a recommended fix
2. Dig deeper into a specific area
3. Debug a related resource (ServingRuntime, pod, NIM Account)
4. Invoke /serving-runtime-config to fix runtime issues
5. Exit debugging

WAIT for user to select next action.

Common Issues

For common issues (GPU scheduling, OOMKilled, image pull errors, RBAC), see common-issues.md.

Issue 1: S3 Storage Access Denied

Error: Pod logs show "Access Denied" or "NoSuchBucket" when loading model weights

Cause: S3 credentials are missing, expired, or the bucket/path is incorrect.

Solution:

Verify the storageUri in the InferenceService spec
Check that the S3 credential Secret exists in the namespace
Verify the Secret is referenced by the ServiceAccount or data connection
Test S3 access independently to confirm credentials are valid
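For the first check, the bucket and path can be pulled out of the storageUri before comparing them with what actually exists in S3. The sketch below handles only `s3://` URIs, since this issue concerns S3; other storage schemes are deliberately rejected, and the function name is hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_storage_uri(storage_uri: str) -> Tuple[str, str]:
    """Split an s3:// storageUri into (bucket, key prefix)."""
    parsed = urlparse(storage_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI with a bucket: {storage_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

A missing bucket or an unexpected prefix found at this stage explains a "NoSuchBucket" error without touching credentials at all.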

Issue 2: NIM Authentication / GPU Incompatibility

Error: NIM pod logs show NGC authentication failure, or TensorRT engine fails to compile for the available GPU

Cause: NGC API key is invalid/expired, or the GPU type is not supported by the NIM model profile.

Solution:

Check Account CR status for credential errors: resources_get for accounts.nim.opendatahub.io
Verify NGC API key is valid at https://ngc.nvidia.com
Check the GPU support matrix for the NIM model via live doc lookup (NVIDIA NIM supported models)
Re-run /nim-setup to refresh credentials if expired

Dependencies

MCP Tools

See Prerequisites for the complete list of required and optional MCP tools.

Related Skills

/model-deploy - Redeploy or modify the InferenceService after fixing issues
/serving-runtime-config - Fix or create ServingRuntime if runtime is the issue
/nim-setup - Re-run NIM platform setup if NIM credentials are the issue
/model-monitor - Check if TrustyAI monitoring detected issues before they became failures

Reference Documentation

known-model-profiles.md - Correct resource sizing for common models
supported-runtimes.md - Runtime capabilities and known limitations
live-doc-lookup.md - Protocol for looking up unrecognized errors

Critical: Human-in-the-Loop Requirements

See skill-conventions.md for general HITL and security conventions.

Skill-specific checkpoints:

After identifying target (Step 1): confirm which InferenceService to debug
After status overview (Step 2): confirm before deep analysis
After events/pod analysis (Step 3): confirm before viewing logs
After log review (Step 4): confirm before checking related resources
After related resources and observability (Step 5): confirm before the diagnosis summary
After diagnosis summary (Step 6): present options, wait for user decision
NEVER auto-delete or auto-modify InferenceService resources without user confirmation
NEVER execute remediation actions without presenting the plan and getting explicit approval
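The checkpoints above all reduce to the same gate: proceed only on an explicit "yes". A minimal sketch of such a gate follows; the function name is hypothetical, and the `ask` parameter exists so the prompt source can be swapped out (for example in tests).

```python
from typing import Callable


def confirmed(prompt: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only when the user answers exactly 'yes' (case-insensitive)."""
    answer = ask(f"{prompt} (yes/no) ").strip().lower()
    return answer == "yes"
```

Treating anything other than "yes" as a refusal, including an empty answer, keeps the default on the safe side for remediation actions.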