sre-dashboards

Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.

Stars: 30

Forks: 4

Updated: May 22, 2026 at 13:02

Source: BagelHole/DevOps-Security-Agent-Skills

name	sre-dashboards
description	Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
license	MIT
metadata	{"author":"devops-skills","version":"1.0"}

SRE Dashboards

Build dashboards that help teams detect, triage, and prevent reliability incidents.

When to Use This Skill

Use this skill when:

Defining service-level dashboards for production systems
Tracking SLO health and error-budget burn
Creating incident command-center views
Standardizing dashboard patterns across teams

Prerequisites

Metrics pipeline (Prometheus, OpenTelemetry, or vendor equivalent)
Logs/traces linked to services and environments
Agreed service taxonomy (team, service, tier, environment)

Dashboard Architecture

Structure dashboards in layers:

Executive Reliability View: SLO attainment, incident counts, MTTR trends.
Service Health View: RED/USE metrics, dependency health, release markers.
Deep-Dive View: Per-endpoint latency, resource saturation, error categories.

Keep each view answer-oriented:

Are customers impacted?
What changed?
Where is the bottleneck?

Core SRE Panels

Golden Signals

Latency: p50/p95/p99 request duration by endpoint
Traffic: request throughput and queue depth
Errors: 5xx rate, failed jobs, timeout ratio
Saturation: CPU, memory, disk I/O, thread/connection pool exhaustion

SLO Panels

Current SLI value (rolling windows: 5m, 1h, 24h, 30d)
Error-budget remaining (%)
Burn-rate panels (fast and slow windows)
Multi-window burn alert status
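The arithmetic behind these SLO panels can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names are hypothetical, the SLO is assumed to be an availability target expressed as a fraction (for example 0.999), and the alert threshold is whatever your alerting policy defines (14.4 for a 1h/5m window pair is a commonly cited value for a 30-day SLO, used here only as an example).

```python
def _error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return budget


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / _error_budget(slo_target)


def budget_remaining_pct(bad_events: int, total_events: int, slo_target: float) -> float:
    """Percentage of the error budget left in the SLO window (negative if overspent)."""
    if total_events == 0:
        return 100.0  # no traffic, so nothing has been spent
    allowed_bad = _error_budget(slo_target) * total_events
    return 100.0 * (1.0 - bad_events / allowed_bad)


def multi_window_alert(long_ratio: float, short_ratio: float,
                       slo_target: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold.

    The short window makes the alert reset quickly once the problem stops.
    """
    return (burn_rate(long_ratio, slo_target) >= threshold
            and burn_rate(short_ratio, slo_target) >= threshold)
```

For example, a 0.2% error ratio against a 99.9% SLO is a burn rate of about 2: the budget would be exhausted in roughly half the SLO window if that rate persisted.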

Change Correlation

Deployment markers and config-change annotations
Feature flag state overlays
Upstream/downstream dependency error rates

Example PromQL Snippets

# API error rate (%)
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency by route
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

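# Short-window vs. long-window error ratio (5m / 1h).
# Values above 1 mean errors are accelerating relative to the last hour.
# Note: this is a trend indicator, not an SLO burn rate; a burn rate
# divides the error ratio by the error budget (1 - SLO target).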
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
)
/
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
)
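To make the p95 query above less of a black box, here is a simplified Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It follows my understanding of the general approach behind `histogram_quantile`, but it is an illustration under stated assumptions rather than Prometheus's actual implementation: the input format (a list of `(upper_bound, cumulative_count)` pairs ending in a `+Inf` bucket) and the function itself are hypothetical stand-ins.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: iterable of (upper_bound, cumulative_count); must include +Inf.
    Returns NaN when there is not enough data to estimate anything.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket:
                # fall back to the last known finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan
```

The practical takeaway for dashboards is that the estimate can never be more precise than the bucket boundaries: if the buckets jump from 0.5s to 1s, a p95 panel cannot distinguish 0.6s from 0.9s reliably.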

Operational Guidelines

Use consistent color semantics (green=healthy, yellow=degrading, red=breach)
Label units explicitly (ms, req/s, %, cores)
Default time windows to incident-friendly ranges (15m, 1h, 6h, 24h)
Minimize panel count per dashboard to reduce cognitive load
Add runbook links directly in panel descriptions
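Several of these guidelines can be checked mechanically. The sketch below lints a dashboard definition loaded as a Python dict. It assumes a Grafana-style JSON layout (a top-level `panels` list, with the unit under `fieldConfig.defaults.unit` and free text in `description`); the `lint_dashboard` name and the default limit of 12 panels are arbitrary choices for illustration, not values from this skill.

```python
import re
from typing import List


def lint_dashboard(dashboard: dict, max_panels: int = 12) -> List[str]:
    """Return human-readable problems found in a dashboard definition."""
    problems = []
    panels = dashboard.get("panels", [])

    # Guideline: keep the panel count low to reduce cognitive load.
    if len(panels) > max_panels:
        problems.append(f"{len(panels)} panels exceeds limit of {max_panels}")

    for panel in panels:
        title = panel.get("title", "<untitled>")

        # Guideline: label units explicitly.
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"{title}: no unit set")

        # Guideline: put a runbook link in the panel description.
        description = panel.get("description") or ""
        if not re.search(r"https?://", description):
            problems.append(f"{title}: no runbook link in description")

    return problems
```

A check like this could run in CI against exported dashboard JSON so that new panels without units or runbook links are flagged before they reach production views.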

Troubleshooting

Panel appears flat or empty

Verify that label filters (service, env, region) match the label names and values the metric actually carries
Confirm scrape/ingest latency is within expected range
Check whether the metric was renamed in a recent instrumentation update

High cardinality slows dashboards

Aggregate by stable dimensions (service, route_group) instead of raw IDs
Use recording rules for expensive percentile and ratio queries
Split deep-dive dashboards from NOC summary dashboards
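One way to aggregate by a stable dimension is to normalize request paths into a bounded `route_group` label before the metric is emitted. The Python sketch below shows the idea; the `route_group` function, the `:id` and `:uuid` placeholders, and the rules for what counts as an identifier are all assumptions for illustration, and real services may already expose a route template through their web framework.

```python
import re

# Path segments that look like raw identifiers and would explode cardinality.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_group(path: str) -> str:
    """Collapse raw IDs in a URL path into placeholders for use as a label."""
    path = path.split("?", 1)[0]  # query strings never belong in a label
    segments = []
    for segment in path.split("/"):
        if _NUMERIC.match(segment):
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this in place, `/users/12345/orders/987?expand=items` and `/users/67890/orders/12` both map to `/users/:id/orders/:id`, so the number of time series grows with the number of routes rather than the number of users.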

Related Skills

prometheus-grafana - Dashboard implementation and PromQL
opentelemetry - Standardized telemetry instrumentation
alerting-oncall - Reliability alert routing and escalation
agent-observability - AI workload reliability telemetry