gke-ai-troubleshooting-skill-creation-guide
// Expert instructions for building high-quality GKE troubleshooting skills. Codifies Step 0 context rules, zero-hallucination signatures, and explicit LQL/PromQL query requirements.
| name | gke-ai-troubleshooting-skill-creation-guide |
| description | Expert instructions for building high-quality GKE troubleshooting skills. Codifies Step 0 context rules, zero-hallucination signatures, and explicit LQL/PromQL query requirements. |
Use this guide to build high-quality troubleshooting skills that enable AI agents to diagnose complex failures in GKE workloads.
- `SKILL.md`: The core diagnostic and resolution workflow.
- `README.md`: Public-facing overview and "When to use" guide.
- `references/failure_signatures.md`: Authentic log/metric signatures.
- `scripts/validate_queries.sh`: Automatic syntax validator for all queries.
- `TEST.md`: Manual verification plan for humans.
- `EVAL.textproto`: Evaluation suite for performance tracking.
- `BUILD`: Build definition.

Skill names use kebab-case (e.g., `gke-ai-troubleshooting-tpu-vbar-oom`).

Every skill MUST begin with a "Step 0" to acquire necessary context.

- `<project_id>`, `<location>`, `<cluster_name>`, `<timestamp>`.
- `<node_name>`, `<workload_name>`, `<workload_namespace>`, `<nodepool_name>`.
- Time window: `[T - 30m]` to `[T + 30m]`.
- Use angle-bracket placeholders such as `<project_id>` instead of curly braces to avoid template resolution errors.
- Include a validation script (`scripts/validate_queries.sh`) that uses `query_logs` or `gcloud logging read ... --limit=1` to verify the skill's LQL queries.
- Reference `references/failure_signatures.md` in relevant diagnostic steps.
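
The Step 0 rules above (angle-bracket placeholders, a `[T - 30m]` to `[T + 30m]` window, and a `--limit=1` check of each LQL query) can be sketched in Python as follows. This is a minimal illustration, not the guide's own tooling: the helper names `time_window`, `render`, and `validation_command`, the `<start>` and `<end>` placeholders, and the example filter are all hypothetical, and the guide itself leaves the exact `gcloud logging read` arguments unspecified.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone

# Matches angle-bracket placeholders such as <project_id>.
PLACEHOLDER = re.compile(r"<([a-z_]+)>")


def time_window(timestamp: str, minutes: int = 30) -> tuple[str, str]:
    """Return UTC bounds [T - minutes, T + minutes] around an ISO 8601 timestamp T."""
    t = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if t.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        t = t.replace(tzinfo=timezone.utc)
    t = t.astimezone(timezone.utc)
    delta = timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (t - delta).strftime(fmt), (t + delta).strftime(fmt)


def render(template: str, context: dict[str, str]) -> str:
    """Substitute <name> placeholders, failing loudly if Step 0 context is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in context})
    if missing:
        raise ValueError("missing Step 0 context: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: context[m.group(1)], template)


def validation_command(query: str, project_id: str) -> list[str]:
    """Build (but do not run) a gcloud call that fetches at most one entry."""
    return ["gcloud", "logging", "read", query, f"--project={project_id}", "--limit=1"]


# Hypothetical LQL template; <start> and <end> are illustrative placeholders.
TEMPLATE = (
    'resource.type="k8s_node" '
    'resource.labels.project_id="<project_id>" '
    'resource.labels.location="<location>" '
    'resource.labels.cluster_name="<cluster_name>" '
    'timestamp>="<start>" timestamp<="<end>"'
)

if __name__ == "__main__":
    start, end = time_window("2024-05-01T12:00:00Z")
    context = {
        "project_id": "my-project",  # example values only
        "location": "us-central1",
        "cluster_name": "my-cluster",
        "start": start,
        "end": end,
    }
    query = render(TEMPLATE, context)
    print(validation_command(query, context["project_id"]))
```

Raising on a missing placeholder, rather than leaving `<node_name>` in the rendered filter, keeps an incomplete Step 0 from silently producing a query that matches nothing.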