service-mesh-observability

Stars: 37,983

Forks: 4,081

Updated: May 22, 2026 at 12:18

Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Source

wshobson

wshobson/agents

name	service-mesh-observability
description	Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.

Service Mesh Observability

A guide to observability patterns for Istio, Linkerd, and other service mesh deployments.

When to Use This Skill

Setting up distributed tracing across services
Implementing service mesh metrics and dashboards
Debugging latency and error issues
Defining SLOs for service communication
Visualizing service dependencies
Troubleshooting mesh connectivity

Core Concepts

1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

2. Golden Signals for Mesh

Signal	Description	Example Alert Threshold
Latency	Request duration at P50 and P99	P99 > 500ms
Traffic	Requests per second	Deviation from baseline (anomaly detection)
Errors	Share of requests returning 5xx	> 1%
Saturation	Resource utilization	> 80%
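The thresholds in the table can be checked mechanically once per evaluation window. The following is a minimal sketch, not a mesh API: `WindowStats`, `percentile`, and `breached_signals` are hypothetical names, and the input is assumed to be already-collected per-service numbers. Traffic is left out because the table calls for anomaly detection rather than a fixed threshold.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical per-service stats for one evaluation window."""
    latencies_ms: list[float]  # observed request durations in milliseconds
    total_requests: int
    errors_5xx: int
    utilization: float         # fraction in 0.0-1.0, e.g. CPU usage


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breached_signals(stats: WindowStats) -> list[str]:
    """Return the golden signals that exceed the example thresholds."""
    breached = []
    if percentile(stats.latencies_ms, 99) > 500:  # P99 > 500ms
        breached.append("latency")
    if stats.total_requests and stats.errors_5xx / stats.total_requests > 0.01:  # > 1%
        breached.append("errors")
    if stats.utilization > 0.80:  # > 80%
        breached.append("saturation")
    return breached
```

In practice these checks usually live in alerting rules on the metrics backend; the sketch only makes the arithmetic behind each threshold explicit.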

Templates and detailed worked examples

Full template library and detailed worked examples live in references/details.md. Read that file when you need the concrete templates.

Best Practices

Do's

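Sample appropriately - Trace 100% of requests in dev and 1-10% in prod
Propagate trace context - Forward trace headers on every outbound call so spans join into one trace
Set up alerts - Cover each golden signal
Correlate metrics and traces - Attach exemplars so a metric spike links to a sample trace
Retain strategically - Keep recent data in hot storage and move older data to cold tiers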
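Sampling and context propagation interact: the sampling decision is made once where the trace starts, then carried along with the trace ID on every hop. The sketch below assumes the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`); the function names are hypothetical, and a real deployment would normally rely on the mesh proxy and a tracing SDK rather than hand-built headers.

```python
import secrets

SAMPLED_FLAG = 0x01  # "sampled" bit in the traceparent flags field


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: one trace ID always gets one decision."""
    # Treat the low 64 bits of the trace ID as a uniform number in [0, 2**64).
    return int(trace_id[-16:], 16) < rate * 2**64


def new_traceparent(rate: float) -> str:
    """Start a trace and record the sampling decision in the flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = SAMPLED_FLAG if should_sample(trace_id, rate) else 0
    return f"00-{trace_id}-{span_id}-{flags:02x}"


def child_traceparent(incoming: str) -> str:
    """Header for an outbound call: same trace ID and flags, fresh span ID."""
    version, trace_id, _parent_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

For example, `new_traceparent(1.0)` matches the 100% rate suggested for dev and `new_traceparent(0.05)` falls in the 1-10% range suggested for prod. Because `child_traceparent` copies the flags, every downstream service agrees on whether the trace is recorded.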

Don'ts

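Don't over-sample - Trace storage costs grow with the sampling rate
Don't ignore cardinality - Keep unbounded values such as user IDs and raw paths out of metric labels
Don't skip dashboards - Visualize service dependencies
Don't forget costs - Track what the observability stack itself costs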
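One common source of label cardinality is the request path, where every user or order ID creates a new time series. As a rough illustration of limiting label values, the sketch below collapses ID-like path segments into a placeholder before the path is used as a label. The names `normalize_path` and `label_cardinality`, the `:id` placeholder, and the patterns treated as IDs are all assumptions made for this example.

```python
import re

# Segments that look like numeric IDs, UUIDs, or long hex strings.
_ID_SEGMENT = re.compile(
    r"^(\d+"
    r"|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|[0-9a-fA-F]{16,})$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like segments so the label has a bounded set of values."""
    segments = path.split("?", 1)[0].split("/")  # drop the query string first
    return "/".join(":id" if _ID_SEGMENT.match(s) else s for s in segments)


def label_cardinality(paths: list[str]) -> tuple[int, int]:
    """Return (distinct raw paths, distinct normalized paths)."""
    return len(set(paths)), len({normalize_path(p) for p in paths})
```

Comparing the two counts from `label_cardinality` on a sample of real traffic gives a quick estimate of how many series the normalization would save.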
