alertguardian-intelligent-alert-life-cycle

Name: Alertguardian Intelligent Alert Life Cycle
Author: ndpvt-web

Build intelligent alert lifecycle management systems for cloud infrastructure using graph-based denoising, RAG-powered summarization, and multi-agent rule refinement.

Trigger phrases:
- "reduce alert fatigue in our monitoring system"
- "deduplicate and correlate alerts"
- "summarize alerts for on-call engineers"
- "refine our alerting rules automatically"
- "build an alert denoising pipeline"
- "too many alerts, help me triage"

stars:4

forks:0

updated: February 13, 2026, 13:35

SKILL.md

readonly

related-skills.json

same repository

a2rag-adaptive-agentic-graph.md

from "ndpvt-web/arxiv-claude-skills"

Build adaptive, cost-aware Graph-RAG pipelines that route queries through escalating retrieval stages (local -> bridge -> global) with triple-check verification and provenance map-back. Use when: 'build a graph RAG pipeline', 'implement adaptive retrieval for knowledge graphs', 'cost-aware multi-hop question answering', 'add evidence verification to RAG', 'handle mixed-difficulty queries efficiently', 'graph retrieval with source text grounding'.

2026-02-134

adaptbpe-general-purpose-specialized.md

from "ndpvt-web/arxiv-claude-skills"

Adapt general-purpose BPE tokenizers into domain- or language-specialized tokenizers using the AdaptBPE post-training strategy. Replaces low-utility tokens with high-frequency domain-specific tokens to improve tokenization efficiency without retraining from scratch. Trigger phrases: "adapt tokenizer to domain", "specialize BPE for medical text", "optimize tokenizer for French", "reduce token fertility for code", "adapt vocabulary for legal documents", "domain-specific tokenizer"

2026-02-134

addressing-explainability-generative-ai.md

from "ndpvt-web/arxiv-claude-skills"

Explain generative AI outputs using the gSMILE perturbation-based attribution framework. Builds local surrogate models from controlled input perturbations and Wasserstein distance to produce token-level or word-level importance scores for LLM and diffusion model outputs. Triggers: 'explain why the model generated this', 'token attribution for prompt', 'which words in my prompt matter most', 'interpret generative model output', 'build explainability for my LLM pipeline', 'debug prompt influence on generation'

2026-02-134

agent-based-software-artifact-evaluation.md

from "ndpvt-web/arxiv-claude-skills"

Automatically evaluate software research artifacts (code repositories with READMEs) by constructing dependency-aware command graphs, building containerized environments, and executing instructions with structured error recovery. Use when asked to: 'evaluate this artifact', 'reproduce this paper's results', 'run this repo's README instructions', 'check if this artifact builds and runs', 'automate artifact evaluation', 'verify research reproducibility'.

2026-02-134

agentcgroup-understanding-controlling-os.md

from "ndpvt-web/arxiv-claude-skills"

Design and implement OS-level resource controls for sandboxed AI agents using hierarchical cgroups, eBPF enforcement, and tool-call-level resource management. Use when: 'set up cgroups for AI agent containers', 'control memory for coding agents', 'isolate tool-call resources with eBPF', 'manage multi-tenant agent resource limits', 'prevent OOM kills in agent sandboxes', 'configure agent resource policies with cgroup v2'.

2026-02-134

ai-agent-systems-supply.md

from "ndpvt-web/arxiv-claude-skills"

Build LLM-based multi-agent systems for supply chain inventory management using structured decision prompts and memory-retrieval (AIM-RM). Implements the beer game multi-echelon supply chain simulation with per-stage agents that use stepwise ordering prompts, safety-stock calculations, and Euclidean-distance memory retrieval of similar historical episodes. Use when asked to: "build a supply chain agent", "implement inventory management with LLMs", "create a beer game simulation with AI agents", "multi-agent ordering system", "AIM-RM memory retrieval agent", "supply chain decision prompt design".

2026-02-134

package.json

"author": "ndpvt-web"

"repository": "ndpvt-web/arxiv-claude-skills"

Useful for (SOC): Network and Computer Systems Administrators; Computer and Mathematical Occupations; 15-1244; L4

AlertGuardian: Intelligent Alert Life-Cycle Management

This skill enables Claude to design and implement alert lifecycle management systems inspired by the AlertGuardian framework (ASE 2025). The approach combines three complementary techniques — graph-based alert denoising with virtual noise injection, RAG-powered alert summarization, and multi-agent iterative rule refinement — to reduce alert volumes by up to 94.8% while maintaining 90.5% fault diagnosis accuracy. Use this skill when building or improving alerting pipelines for cloud-native infrastructure, Kubernetes clusters, microservice architectures, or any system drowning in monitoring noise.

When to Use

When the user has a monitoring system (Prometheus, Datadog, PagerDuty, Grafana, OpsGenie) producing too many alerts and wants to reduce noise
When building an alert correlation or deduplication pipeline that groups related alerts into incidents
When the user wants to generate actionable on-call summaries from raw alert streams using LLMs
When designing a system that automatically evaluates and refines alerting rules (PromQL, Datadog monitors, custom thresholds)
When the user asks to build a graph model that captures temporal and topological relationships between alerts
When implementing a multi-agent feedback loop where agents propose, validate, and optimize detection rules
When migrating from static threshold alerting to intelligent, adaptive alert management

Key Technique

AlertGuardian treats alert management as a three-phase lifecycle rather than a single-point problem. Phase 1 (Alert Denoise) constructs a heterogeneous alert graph where nodes are individual alert instances and edges encode temporal co-occurrence (alerts firing within a configurable time window) and topological affinity (alerts from the same service dependency chain). A graph neural network (GNN) propagates information across this graph to learn which alerts are correlated and should be grouped. The critical innovation is virtual noise injection: during training, synthetic noise alerts are injected into the graph to force the model to distinguish genuine correlations from spurious co-occurrences, improving generalization without requiring perfectly labeled training data.
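The graph construction in Phase 1 can be illustrated with a minimal sketch. It is not the paper's implementation: the `Alert` dataclass, the `depends_on` mapping format, and the 300-second window are assumptions made for illustration, and topological affinity is simplified here to "same service or a direct dependency" rather than a full dependency chain.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Alert:
    alert_id: str
    timestamp: float      # seconds since epoch
    source_service: str

def build_alert_graph(alerts, depends_on, window_s=300.0):
    """Return (temporal_edges, topological_edges) as sets of alert-id pairs.

    depends_on maps a service to the set of services it calls directly
    (a hypothetical format; a real topology would come from K8s or a mesh).
    """
    temporal, topological = set(), set()
    ordered = sorted(alerts, key=lambda a: a.timestamp)
    for a, b in combinations(ordered, 2):
        pair = (a.alert_id, b.alert_id)
        # Temporal edge: both alerts fired within the same window.
        if abs(a.timestamp - b.timestamp) <= window_s:
            temporal.add(pair)
        # Topological edge: same service, or one service calls the other.
        linked = (
            a.source_service == b.source_service
            or b.source_service in depends_on.get(a.source_service, set())
            or a.source_service in depends_on.get(b.source_service, set())
        )
        if linked:
            topological.add(pair)
    return temporal, topological

alerts = [
    Alert("a1", 0.0, "checkout"),
    Alert("a2", 60.0, "payments-db"),
    Alert("a3", 900.0, "search"),
]
deps = {"checkout": {"payments-db"}}
temporal, topological = build_alert_graph(alerts, deps)
print(temporal)     # {('a1', 'a2')}
print(topological)  # {('a1', 'a2')}
```

Keeping the two edge types in separate sets makes it straightforward to feed them to a heterogeneous GNN later, or to merge them into a single `edge_index` if a homogeneous model is enough.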

Phase 2 (Alert Summary) applies Retrieval-Augmented Generation to produce concise, actionable summaries for on-call engineers. When a cluster of denoised alerts arrives, the system retrieves semantically similar historical incidents from a vector-indexed knowledge base, then prompts an LLM with the current alert context plus retrieved examples to generate a structured summary containing: root cause hypothesis, affected components, recommended actions, and severity justification. This replaces the manual work of reading dozens of individual alerts.
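The retrieve-then-prompt flow of Phase 2 might look like the sketch below. It uses hand-written toy vectors and a brute-force cosine ranking in place of a real embedding model and vector store; the incident IDs, field names, and the `retrieve` and `build_prompt` helpers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, incidents, k=3):
    """Return the k incidents whose embedding is most similar to query_vec."""
    ranked = sorted(
        incidents,
        key=lambda inc: cosine(query_vec, inc["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(cluster_alerts, retrieved):
    """Assemble the summary prompt from active alerts and retrieved incidents."""
    alerts_txt = "\n".join(f"- {a}" for a in cluster_alerts)
    past_txt = "\n".join(f"- {i['id']}: {i['root_cause']}" for i in retrieved)
    return (
        f"Given these active alerts:\n{alerts_txt}\n"
        f"Similar past incidents:\n{past_txt}\n"
        "Generate a structured summary with:\n"
        "- Root cause hypothesis\n"
        "- Affected components and blast radius\n"
        "- Recommended immediate actions\n"
        "- Severity assessment (P1-P4)"
    )

# Toy knowledge base; real embeddings would come from an embedding model.
incidents = [
    {"id": "INC-1", "root_cause": "connection pool exhaustion", "embedding": [1.0, 0.0]},
    {"id": "INC-2", "root_cause": "expired TLS certificate", "embedding": [0.0, 1.0]},
    {"id": "INC-3", "root_cause": "slow queries after migration", "embedding": [0.9, 0.1]},
]
top = retrieve([1.0, 0.05], incidents, k=2)
prompt = build_prompt(["ConnPoolExhausted on payments-db"], top)
print([i["id"] for i in top])  # ['INC-1', 'INC-3']
```

The resulting `prompt` string would then be sent to whichever LLM client the team already uses; that call is left out because it depends on the provider.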

Phase 3 (Alert Rule Refinement) uses a multi-agent loop with four specialized roles — Rule Generator, Validator, Optimizer, and Reviewer — that iterate on alerting rule definitions. The Generator proposes candidate rules from observed alert patterns; the Validator tests them against held-out data measuring precision and recall; the Optimizer adjusts thresholds and conditions to reduce false positives; the Reviewer audits edge cases and interpretability. This loop cycles until convergence criteria are met, then presents refined rules for human approval. In production, 32% of refined rules were accepted by SREs, meaning the system does meaningful work while keeping humans in the loop.
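The control flow of the Phase 3 loop can be sketched with deterministic stand-ins for the four roles. In the described system these roles are LLM-backed agents; here the "rule" is reduced to a single numeric threshold, and the step size, recall floor, and sample data are assumptions chosen only to make the loop runnable.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    precision: float
    recall: float

def validate(threshold, samples):
    """Validator: replay a threshold rule over (metric_value, is_real_incident) pairs."""
    fired = [real for value, real in samples if value > threshold]
    total_real = sum(1 for _, real in samples if real)
    true_pos = sum(fired)
    precision = true_pos / len(fired) if fired else 0.0
    recall = true_pos / total_real if total_real else 0.0
    return Metrics(precision, recall)

def refine(samples, start=0.85, step=0.05, min_recall=0.9, max_iters=5):
    """Iterate until precision stops improving or recall drops too far."""
    best, best_m = start, validate(start, samples)  # Generator: initial candidate
    for _ in range(max_iters):
        candidate = round(best + step, 2)           # Optimizer: tighten threshold
        m = validate(candidate, samples)            # Validator: measure on held-out data
        # Reviewer: reject candidates that lose recall or do not improve precision.
        if m.recall < min_recall or m.precision <= best_m.precision:
            break
        best, best_m = candidate, m
    return best, best_m

# Held-out samples: (observed metric value, did it correspond to a real incident?)
samples = [(0.86, False), (0.88, False), (0.92, True), (0.97, True), (0.91, False)]
threshold, metrics = refine(samples)
print(threshold, metrics)
```

On this toy data the loop accepts the move from 0.85 to 0.90 and rejects 0.95 because recall would fall to 50%. The returned rule is a proposal only and would still go to a human for approval.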

Step-by-Step Workflow

1. Ingest and normalize alerts. Parse alert data from the monitoring source (Prometheus Alertmanager JSON, PagerDuty webhooks, Datadog events) into a uniform schema: {alert_id, timestamp, source_service, severity, labels, description, status}. Store in a time-series-friendly format (e.g., sorted by timestamp with service indexing).
2. Build the alert correlation graph. For each time window (e.g., 5 minutes), create nodes for each alert instance. Add temporal edges between alerts that fire within the window. Add topological edges between alerts whose source services share a dependency (use a service dependency map from Kubernetes labels, Consul, or a static topology file). Encode node features as vectors: [severity_one_hot, alert_type_embedding, source_service_embedding, time_delta_normalized].
3. Train or apply the denoising GNN with virtual noise. Implement a 2-3 layer Graph Attention Network (GAT) or GraphSAGE model. During training, inject virtual noise by randomly adding synthetic alert nodes with shuffled features and random edges (10-20% noise ratio). Train with binary classification: real correlated alerts vs. noise. At inference, the model scores each alert; low-scoring alerts are suppressed, high-scoring alerts are grouped into incident clusters.
4. Index historical incidents for RAG retrieval. Build a vector store (FAISS, Pinecone, pgvector) of past incident reports, postmortems, and resolved alert clusters. Each entry includes: alert signatures, root cause, resolution steps, and affected services. Use an embedding model to index these documents.
5. Generate structured alert summaries. For each incident cluster from step 3, retrieve the top-k (k=3-5) most similar historical incidents. Construct a prompt:

   Given these active alerts: {cluster_alerts}
   Similar past incidents: {retrieved_incidents}
   Generate a structured summary with:
   - Root cause hypothesis
   - Affected components and blast radius
   - Recommended immediate actions
   - Severity assessment (P1-P4)

   Parse the LLM response into a structured format for downstream consumption.

6. Extract candidate alerting rules from alert patterns. Analyze the denoised alert clusters to identify recurring patterns: which metric thresholds trigger most frequently, which conditions produce false positives. Use a Rule Generator agent to propose new or modified rules in the target query language (PromQL, Datadog monitor JSON, etc.).
7. Validate rules against historical data. Replay proposed rules against a held-out dataset of historical alerts with known ground-truth labels. Compute precision (what fraction of triggered alerts correspond to real incidents) and recall (what fraction of real incidents are detected). Flag rules below configurable thresholds.
8. Optimize rules through multi-agent iteration. Run the Optimizer agent to adjust thresholds, add exclusion conditions, or modify time windows to improve precision without sacrificing recall. Pass back to Validator. Repeat for 3-5 iterations or until metrics plateau.
9. Review and surface rules for human approval. The Reviewer agent checks edge cases, evaluates rule interpretability, and generates a human-readable diff showing old rule vs. proposed rule with expected impact metrics. Present to SREs for approval via PR, Slack message, or dashboard.
10. Deploy and establish feedback loops. Ship approved rules to the monitoring system. Log all suppressed alerts for periodic audit. Retrain the denoising model on a weekly or monthly cadence as system behavior evolves. Track alert volume reduction ratio and false-negative rate as ongoing KPIs.
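The normalization step that opens the workflow could be sketched as follows for an Alertmanager webhook payload. The top-level `alerts` list and the per-alert `status`, `labels`, `annotations`, `startsAt`, and `fingerprint` fields follow the Alertmanager webhook format; the `service` and `severity` label names are assumptions, since labels are defined by each team's own rules. The sketch also assumes second-precision timestamps.

```python
from datetime import datetime

def normalize_alertmanager(payload):
    """Map an Alertmanager webhook payload onto the uniform alert schema."""
    rows = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # fromisoformat() only accepts a trailing "Z" on newer Python versions.
        starts_at = alert["startsAt"].replace("Z", "+00:00")
        rows.append({
            "alert_id": alert.get("fingerprint", ""),
            "timestamp": datetime.fromisoformat(starts_at).timestamp(),
            # "service" is an assumed label name; adapt to your labeling scheme.
            "source_service": labels.get("service", "unknown"),
            "severity": labels.get("severity", "unknown"),
            "labels": labels,
            "description": annotations.get("description", annotations.get("summary", "")),
            "status": alert.get("status", "firing"),
        })
    # Sorted by timestamp, as the schema description suggests.
    return sorted(rows, key=lambda r: r["timestamp"])

payload = {
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighLatency", "service": "checkout", "severity": "critical"},
            "annotations": {"summary": "p99 latency above 2s"},
            "startsAt": "2025-01-15T10:01:00Z",
            "fingerprint": "b2",
        },
        {
            "status": "firing",
            "labels": {"alertname": "ConnPoolExhausted", "service": "payments-db", "severity": "warning"},
            "annotations": {"description": "pool at 100%"},
            "startsAt": "2025-01-15T10:00:00Z",
            "fingerprint": "a1",
        },
    ]
}
rows = normalize_alertmanager(payload)
print([r["alert_id"] for r in rows])  # ['a1', 'b2']
```

PagerDuty and Datadog payloads would need their own small adapters that emit the same dictionary shape, so that everything downstream only sees the uniform schema.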

Concrete Examples

Example 1: Building an alert denoising pipeline for Kubernetes

User: "We run 200 microservices on Kubernetes and get 5,000+ alerts per day from Prometheus. Most are noise. Help me build a denoising system."

Approach:

Parse Prometheus AlertManager webhook payloads into normalized alert objects
Extract the Kubernetes service dependency graph from kube-state-metrics labels and Istio service mesh topology
Build a temporal-topological alert graph per 5-minute sliding window
Implement a GAT-based classifier in PyTorch Geometric with virtual noise injection during training
Deploy as a sidecar service that sits between AlertManager and PagerDuty, passing only high-confidence alert clusters

Output structure:

# alert_graph.py
import torch
from torch_geometric.nn import GATConv
from torch_geometric.data import Data

class AlertDenoiser(torch.nn.Module):
    def __init__(self, num_features, hidden_dim=64):
        super().__init__()
        self.conv1 = GATConv(num_features, hidden_dim, heads=4)
        self.conv2 = GATConv(hidden_dim * 4, hidden_dim, heads=1)
        self.classifier = torch.nn.Linear(hidden_dim, 2)  # noise vs. real

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        return self.classifier(x)

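def inject_virtual_noise(data: Data, noise_ratio=0.15, edges_per_noise=2):
    """Add synthetic noise nodes with shuffled features and random edges.

    Assumes data.y holds per-node labels (1 = real alert). edges_per_noise is
    an illustrative parameter, not a value taken from the paper.
    """
    n, device = data.num_nodes, data.x.device
    num_noise = int(n * noise_ratio)
    # Noise features are copied from randomly chosen real nodes
    noise_features = data.x[torch.randperm(n, device=device)[:num_noise]]
    noise_labels = torch.zeros(num_noise, dtype=torch.long, device=device)  # label 0 = noise
    # Connect each noise node to random original nodes, in both directions
    src = torch.arange(n, n + num_noise, device=device).repeat_interleave(edges_per_noise)
    dst = torch.randint(0, n, (num_noise * edges_per_noise,), device=device)
    noise_edges = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
    return Data(
        x=torch.cat([data.x, noise_features]),
        edge_index=torch.cat([data.edge_index, noise_edges], dim=1),
        y=torch.cat([data.y, noise_labels]),
    )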

Example 2: RAG-powered alert summarization for on-call

User: "When an incident fires, our on-call engineer has to read 30+ alerts manually. Build a summarizer that gives them a one-paragraph actionable brief."

Approach:

Index historical postmortems and incident reports into a vector store using sentence-transformer embeddings
When an alert cluster arrives, embed the cluster signature (concatenated alert names + services + severity)
Retrieve top-3 similar past incidents
Prompt an LLM with the structured template to generate an actionable summary
Post the summary to the incident Slack channel or PagerDuty note

Output:

Incident Summary (auto-generated):
- Root Cause Hypothesis: Database connection pool exhaustion on payments-db
  triggered cascading timeouts in checkout-service and order-service.
- Affected Components: payments-db, checkout-service, order-service,
  api-gateway (degraded)
- Blast Radius: ~12% of checkout requests failing (estimated from
  error rate alerts)
- Recommended Actions:
  1. Check payments-db connection pool metrics and recent deployment changes
  2. Restart checkout-service pods if connection pool is stuck
  3. Verify no recent schema migrations on payments-db
- Severity: P2 (customer-facing degradation, not full outage)
- Similar Past Incident: INC-2847 (2024-09-14) - resolved by rolling
  back payments-db connection pool config change
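To make a summary like the one above usable by downstream tooling, the text can be parsed back into fields. The sketch below assumes the LLM keeps the "- Key: value" bullet layout shown in the output, with wrapped lines and numbered sub-items indented; `parse_summary` is a hypothetical helper, and a production system might instead ask the model for JSON directly.

```python
import re

BULLET = re.compile(r"^- ([^:]+):\s*(.*)$")
NUMBERED = re.compile(r"^\s+\d+\.\s+(.*)$")

def parse_summary(text):
    """Parse '- Key: value' bullets into a dict; numbered sub-items become lists."""
    fields, key = {}, None
    for line in text.splitlines():
        bullet, numbered = BULLET.match(line), NUMBERED.match(line)
        if bullet:
            key = bullet.group(1).strip().lower().replace(" ", "_")
            fields[key] = bullet.group(2).strip()
        elif key is None or not line.strip():
            continue  # title line before the first bullet, or a blank line
        elif numbered:
            if not isinstance(fields[key], list):
                fields[key] = []  # switch this field to a list of steps
            fields[key].append(numbered.group(1).strip())
        elif isinstance(fields[key], list):
            fields[key][-1] += " " + line.strip()  # wrapped numbered item
        else:
            fields[key] = (fields[key] + " " + line.strip()).strip()  # wrapped bullet
    return fields

sample = """Incident Summary (auto-generated):
- Root Cause Hypothesis: Database connection pool exhaustion on payments-db
  triggered cascading timeouts in checkout-service and order-service.
- Recommended Actions:
  1. Check payments-db connection pool metrics and recent deployment changes
  2. Restart checkout-service pods if connection pool is stuck
- Severity: P2 (customer-facing degradation, not full outage)
"""
summary = parse_summary(sample)
print(summary["severity"])
print(len(summary["recommended_actions"]))
```

Because the parser keys off the layout rather than the content, it is worth validating that the expected fields are present and falling back to posting the raw text when they are not.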

Example 3: Multi-agent alert rule refinement

User: "Our HighCPUUsage alert fires 200 times a day but only 5 are real. Help me refine the rule automatically."

Approach:

Pull the current PromQL rule: non-idle CPU fraction per instance above 0.85 for 5m (the "Before" rule in the output below)
Query historical alert firings and correlate with actual incidents (labeled data)
Run the multi-agent loop:
- Generator proposes: add an unless on(instance) clause on the per-instance iowait rate (> 0.30) to exclude IO-bound spikes
- Validator: precision improves from 2.5% to 34%, recall stays at 100%
- Optimizer: tighten threshold to 0.90, extend window to 10m
- Validator: precision now 78%, recall 95%
- Reviewer: flags that 5% recall loss corresponds to 1 missed incident per month; recommends accepting with documentation
Present the refined rule as a PR diff

Output:

# Before
- alert: HighCPUUsage
  # Busy fraction per instance = 1 - idle fraction, averaged over CPUs
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
  for: 5m

# After (proposed)
- alert: HighCPUUsage
  expr: |
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.90
    unless on(instance)
    avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.30
  for: 10m
  labels:
    refinement_id: "AG-2025-0142"
    expected_precision: "78%"
    expected_recall: "95%"
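A before/after view like the one above can be produced with the standard library's `difflib`. This is only a sketch of the "human-readable diff" idea: the `rule_diff` helper is hypothetical, and `cpu_busy` stands in for the full PromQL expression to keep the example short.

```python
import difflib

def rule_diff(old_rule, new_rule, name="HighCPUUsage"):
    """Return a unified diff between two rule definitions as a single string."""
    lines = difflib.unified_diff(
        old_rule.splitlines(),
        new_rule.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)

# cpu_busy is a placeholder metric name used only for this illustration.
old_rule = "- alert: HighCPUUsage\n  expr: cpu_busy > 0.85\n  for: 5m"
new_rule = "- alert: HighCPUUsage\n  expr: cpu_busy > 0.90\n  for: 10m"

diff_text = rule_diff(old_rule, new_rule)
print(diff_text)
```

The resulting text can be pasted into a PR description or a Slack message alongside the expected precision and recall figures.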

Best Practices

Do: Build the service dependency graph from actual infrastructure metadata (K8s labels, service mesh, Terraform state) rather than hardcoding it. The graph quality directly determines denoising quality.
Do: Use virtual noise injection during GNN training even if you have labeled data. It acts as a regularizer and improves generalization to unseen alert patterns.
Do: Structure RAG retrieval around alert signatures (the combination of alert name + source service + co-occurring alerts) rather than raw text similarity, which picks up irrelevant lexical matches.
Do: Always keep humans in the loop for rule refinement — present diffs with expected impact metrics, never auto-deploy rule changes to production.
Avoid: Suppressing alerts without logging them. Always persist suppressed alerts to a cold store for periodic audit and false-negative detection.
Avoid: Running the multi-agent rule refinement loop without held-out validation data. Without ground truth, the optimizer will overfit to noise patterns and degrade recall.
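The signature-based retrieval recommended above can be illustrated with a set-overlap score. This swaps in Jaccard similarity over (alert name, service) pairs as a simple, embedding-free stand-in; it is not the retrieval method from the paper, and the `signature` and `jaccard` helpers and field names are hypothetical.

```python
def signature(cluster):
    """Alert signature: the set of (alert name, source service) pairs in a cluster."""
    return frozenset((a["name"], a["service"]) for a in cluster)

def jaccard(sig_a, sig_b):
    """Overlap between two signatures, from 0.0 (disjoint) to 1.0 (identical)."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0

current = signature([
    {"name": "HighLatency", "service": "checkout"},
    {"name": "ConnPoolExhausted", "service": "payments-db"},
])
past = signature([
    {"name": "HighLatency", "service": "checkout"},
    {"name": "DiskFull", "service": "search"},
])
score = jaccard(current, past)
print(round(score, 3))  # 0.333
```

A score like this can be used on its own for small knowledge bases, or as a filter applied before embedding-based ranking so that lexically similar but structurally unrelated incidents are dropped.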

Error Handling

Cold start (no historical data): Start with rule-based deduplication (exact match on alert name + service within a time window) while accumulating data for GNN training. Switch to graph-based denoising after collecting 2-4 weeks of labeled incidents.
GNN over-suppression: If the denoiser starts suppressing legitimate alerts, reduce the virtual noise ratio from 15% to 5% and verify that your ground-truth labels are accurate. Monitor the false-negative rate weekly.
RAG retrieval misses: When no similar historical incident exists (novel failure mode), fall back to a zero-shot summary prompt without retrieved context. Flag these incidents for postmortem indexing after resolution.
Rule refinement loop non-convergence: If the Validator and Optimizer oscillate without improving metrics after 5 iterations, surface the current best candidate to a human reviewer and stop iterating. Some rules require domain knowledge the agents lack.
Monitoring system API rate limits: Batch rule validation queries and use historical data exports rather than live API queries for replay testing.
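The cold-start fallback described above (exact match on alert name and service within a time window) fits in a few lines. The field names and the 300-second default are assumptions; the function also returns the suppressed alerts so they can be persisted for audit rather than silently dropped.

```python
def deduplicate(alerts, window_s=300.0):
    """Keep the first alert per (name, service); suppress repeats within window_s.

    Returns (kept, suppressed) so suppressed alerts can still be logged.
    """
    last_kept = {}          # (name, service) -> timestamp of the last kept alert
    kept, suppressed = [], []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["name"], alert["service"])
        previous = last_kept.get(key)
        if previous is None or alert["timestamp"] - previous > window_s:
            kept.append(alert)
            last_kept[key] = alert["timestamp"]
        else:
            suppressed.append(alert)
    return kept, suppressed

alerts = [
    {"name": "HighCPU", "service": "api", "timestamp": 0},
    {"name": "HighCPU", "service": "api", "timestamp": 100},
    {"name": "HighCPU", "service": "db", "timestamp": 50},
    {"name": "HighCPU", "service": "api", "timestamp": 400},
]
kept, suppressed = deduplicate(alerts)
print([a["timestamp"] for a in kept])        # [0, 50, 400]
print([a["timestamp"] for a in suppressed])  # [100]
```

The window is measured from the last kept alert rather than the last seen one, so a continuously flapping alert still resurfaces once per window instead of being suppressed indefinitely.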

Limitations

The graph-based denoiser requires a service dependency graph. In environments with no service mesh or infrastructure-as-code, building this graph manually is a significant upfront cost.
Virtual noise injection assumes that real alert correlations are structurally different from random noise. In systems where unrelated services frequently fail simultaneously (e.g., shared infrastructure failures), the model may under-suppress.
The 32% SRE acceptance rate for refined rules means 68% of suggestions are rejected. Treat rule refinement as an assistant, not an oracle. Human judgment remains essential.
RAG summarization quality depends heavily on the postmortem knowledge base. Teams with poor documentation practices will see weaker summaries.
The full pipeline (GNN + RAG + multi-agent) is complex to operate. For smaller systems (<50 services, <100 alerts/day), simpler approaches (label-based grouping + static templates) may deliver better ROI.

Reference

Paper: AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems (ASE 2025)
Key insight: Treating alert management as a three-phase lifecycle (denoise -> summarize -> refine rules) with graph learning, RAG, and multi-agent iteration achieves 94.8% alert reduction while maintaining diagnostic accuracy. Look for the virtual noise injection technique in Section 4 and the multi-agent convergence criteria in Section 6.
