| name | llm-aiops-guide |
| description | Papers on LLMs for IT operations and AIOps research |
| metadata | {"openclaw":{"emoji":"🖥️","category":"domains","subcategory":"cs","keywords":["AIOps","LLM operations","IT automation","log analysis","incident management","DevOps AI"],"source":"https://github.com/Jun-jie-Huang/awesome-LLM-AIOps"}} |
A curated collection of research on applying LLMs to IT Operations (AIOps) — log analysis, anomaly detection, incident management, root cause analysis, and automated remediation. Tracks how foundation models are transforming traditional rule-based operations tooling into intelligent, adaptive systems. Relevant for CS researchers at the intersection of systems, NLP, and operations.
LLM for AIOps
├── Log Analysis
│ ├── Log parsing (template extraction)
│ ├── Anomaly detection (from log sequences)
│ ├── Log summarization
│ └── Root cause from logs
├── Incident Management
│ ├── Incident triage and routing
│ ├── Severity classification
│ ├── Similar incident retrieval
│ └── Resolution recommendation
├── Root Cause Analysis
│ ├── Topology-aware diagnosis
│ ├── Multi-signal correlation
│ └── Causal inference
├── Monitoring & Alerting
│ ├── Metric anomaly detection
│ ├── Alert correlation
│ ├── Noise reduction
│ └── Capacity planning
└── Automated Remediation
    ├── Runbook generation
    ├── Script generation
    ├── Self-healing systems
    └── Change impact analysis
Production LLM monitoring dimensions:
QUALITY MONITORING
- Output quality scores: automated evaluation (LLM-as-judge, BERTScore, ROUGE)
- Hallucination rate: factual grounding checks against retrieval context
- Refusal rate: track over-cautious or under-cautious safety filters
- Latency percentiles: p50, p95, p99 for time-to-first-token and total generation
- Token usage: input/output token distributions, context window utilization
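The latency percentiles listed above can be computed from raw per-request timings. The following is a minimal sketch using the nearest-rank method; the sample values and the function names `percentile` and `latency_summary` are illustrative assumptions, not part of any monitoring tool.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Return the p50, p95, and p99 of a list of latencies in milliseconds."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}

# Hypothetical time-to-first-token measurements (ms) from one time window.
ttft_ms = [120, 135, 140, 150, 155, 160, 180, 210, 450, 900]
summary = latency_summary(ttft_ms)
print(summary)
```

With only ten samples, p95 and p99 both collapse onto the slowest request, which is why tail percentiles are usually reported over windows with many more observations.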
DRIFT DETECTION
- Input drift: embedding-space distribution shift (cosine distance, MMD)
- Output drift: topic/style distribution changes over time windows
- Performance drift: sliding-window accuracy on held-out evaluation sets
- Concept drift: monitor for domain vocabulary shifts in user queries
- Baseline comparison: periodically re-evaluate against golden test suites
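One simple way to approximate the input-drift check is to compare the centroid of a reference window of prompt embeddings with the centroid of the current window. The sketch below assumes embeddings are already available as plain lists of floats; the 0.1 threshold and all names are hypothetical and would need tuning. Centroid distance misses changes in spread, so a kernel test such as MMD is likely a better fit when sensitivity matters.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def input_drift(reference, current, threshold=0.1):
    """Return (distance, drifted) for two windows of embeddings."""
    distance = cosine_distance(centroid(reference), centroid(current))
    return distance, distance > threshold

# Toy 2-D embeddings standing in for real prompt embeddings.
reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
similar = [[0.95, 0.05], [1.0, 0.0]]
shifted = [[0.0, 1.0], [0.1, 0.9]]

print(input_drift(reference, similar))
print(input_drift(reference, shifted))
```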
OPERATIONAL HEALTH
- GPU utilization and memory pressure (per-device, per-replica)
- Request queue depth and timeout rates
- Cache hit rates (KV cache, semantic cache, prompt cache)
- Error rates by error category (OOM, context overflow, timeout, malformed output)
- Throughput: tokens/second per deployment, requests/minute
Designing valid A/B tests for LLM systems:
CHALLENGES UNIQUE TO LLMs
- High output variance: same prompt can produce different outputs
- Evaluation subjectivity: many tasks lack clear ground truth
- Latency-quality tradeoff: larger models often score higher on quality but respond more slowly
- Cost confound: better model may cost 10x more per query
RECOMMENDED APPROACH
1. Define metrics BEFORE experiment:
- Primary: task-specific quality (accuracy, user satisfaction, resolution rate)
- Secondary: latency, cost per query, token efficiency
- Guardrail: safety violations, hallucination rate
2. Traffic splitting strategy:
- User-level randomization (not request-level) so each user sees one variant consistently
- Minimum 1-2 weeks for stable estimates
- Stratify by user segment (power users vs. new users)
3. Evaluation methods:
- Automated scoring with LLM-as-judge (calibrated against human raters)
- Blind human evaluation on sampled outputs (inter-rater agreement > 0.7)
- Downstream business metrics (ticket resolution time, user retention)
4. Statistical rigor:
- Bootstrap confidence intervals for LLM quality scores
- Account for multiple comparisons when testing many variants
- Report effect sizes, not just p-values
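Two of the steps above lend themselves to a short sketch: sticky user-level assignment and a bootstrap confidence interval for the difference in mean quality score. This is an illustration under stated assumptions, not a complete experimentation framework. The experiment name, the scores, and the function names are hypothetical, and the scores are assumed to be already aggregated to one value per user so that resampling respects the unit of randomization.

```python
import hashlib
import random
import statistics

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministic user-level assignment: the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))      # resample with replacement
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-user LLM-as-judge scores on a 0-1 scale.
control_scores = [0.60, 0.65, 0.58, 0.62, 0.61, 0.66, 0.59, 0.63]
treatment_scores = [0.72, 0.75, 0.70, 0.74, 0.71, 0.76, 0.73, 0.69]

lower, upper = bootstrap_diff_ci(control_scores, treatment_scores)
effect = statistics.fmean(treatment_scores) - statistics.fmean(control_scores)
print(f"effect size: {effect:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
print(assign_variant("user-123", "judge-prompt-v2"))
```

Reporting the interval together with the point estimate covers the "effect sizes, not just p-values" recommendation. When several variants are tested, the `alpha` passed in would need a multiple-comparison adjustment.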
| Tool | Focus | Key Capabilities |
|---|---|---|
| MLflow | End-to-end ML lifecycle | Experiment tracking, model registry, deployment, LLM evaluation |
| Weights & Biases | Experiment tracking + LLM monitoring | Traces, prompt versioning, evaluation tables, sweeps |
| LangSmith | LLM application observability | Trace visualization, prompt playground, dataset management, online evaluation |
| Comet ML | Experiment management | Model comparison, artifact tracking, LLM prompt tracking |
| Tool | Focus | Key Capabilities |
|---|---|---|
| vLLM | High-throughput serving | PagedAttention, continuous batching, tensor parallelism, speculative decoding |
| TGI (Text Generation Inference) | Production serving | Quantization, streaming, multi-LoRA, watermarking |
| Ollama | Local model running | Easy setup, model library, OpenAI-compatible API |
| TensorRT-LLM | NVIDIA-optimized inference | FP8 quantization, in-flight batching, custom kernels |
| SGLang | Structured generation serving | RadixAttention, constrained decoding, multi-modal support |
| Tool | Focus | Key Capabilities |
|---|---|---|
| LangChain / LangGraph | LLM application framework | Chains, agents, tool use, stateful multi-actor workflows |
| Haystack | NLP pipeline framework | RAG pipelines, document processing, evaluation |
| Prefect / Airflow | Workflow orchestration | DAG scheduling, retry logic, observability |
| Ray Serve | Distributed serving | Auto-scaling, multi-model composition, batch inference |
End-to-end LLMOps pipeline:
┌─────────────────────────────────────────────────────────────────┐
│ DATA PREPARATION │
│ Raw data → Cleaning → Annotation → Train/Eval split │
│ Tools: Label Studio, Argilla, Lilac, DVC │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ MODEL DEVELOPMENT │
│ Base model selection → Fine-tuning (LoRA/QLoRA) → Evaluation │
│ Tools: Hugging Face Transformers, Axolotl, LLaMA-Factory │
│ Eval: lm-evaluation-harness, HELM, custom domain benchmarks │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ MODEL REGISTRY & CI │
│ Version control → Automated testing → Approval gates │
│ Tools: MLflow Registry, W&B Model Registry, HF Hub │
│ Tests: regression suite, safety checks, latency benchmarks │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ DEPLOYMENT │
│ Quantization → Containerization → Canary rollout → Full deploy │
│ Tools: vLLM, TGI, Docker, Kubernetes, Terraform │
│ Strategy: blue-green or canary with automatic rollback │
└──────────────────────────┬──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION MONITORING │
│ Quality monitoring → Drift detection → Alerting → Feedback │
│ Tools: LangSmith, W&B Weave, Prometheus + Grafana, PagerDuty │
│ Loop: degradation detected → trigger re-evaluation → retrain │
└─────────────────────────────────────────────────────────────────┘
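The approval gate in the registry stage can be pictured as a function that turns test results into a pass/fail decision. The sketch below is a hypothetical illustration: the `EvalReport` fields and the default thresholds are assumptions, not values prescribed by MLflow, W&B, or any other tool named in the diagram.

```python
from dataclasses import dataclass

@dataclass
class EvalReport:
    """Hypothetical summary produced by the automated testing stage."""
    regression_pass_rate: float  # fraction of regression cases passed, 0.0 to 1.0
    safety_violations: int       # count of failed safety checks
    p95_latency_ms: float        # measured on the latency benchmark

def approval_gate(report, min_pass_rate=0.98, max_p95_latency_ms=1500.0):
    """Return (approved, reasons); any single failed check blocks promotion."""
    failures = []
    if report.regression_pass_rate < min_pass_rate:
        failures.append("regression suite below threshold")
    if report.safety_violations > 0:
        failures.append("safety check violations")
    if report.p95_latency_ms > max_p95_latency_ms:
        failures.append("p95 latency over budget")
    return len(failures) == 0, failures

candidate = EvalReport(regression_pass_rate=0.99, safety_violations=0, p95_latency_ms=1200.0)
slow_candidate = EvalReport(regression_pass_rate=0.99, safety_violations=0, p95_latency_ms=2400.0)

print(approval_gate(candidate))
print(approval_gate(slow_candidate))
```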
Reducing model size and inference cost:
QUANTIZATION METHODS
- GPTQ: Post-training quantization, good quality at 4-bit, widely supported
- AWQ (Activation-aware Weight Quantization): often reported to preserve quality better than GPTQ at 4-bit
- GGUF: CPU-friendly format, variable bit-width (Q4_K_M, Q5_K_M, Q8_0)
- FP8: native on NVIDIA H100/B200, minimal quality loss, up to ~2x throughput vs FP16
- AQLM: Additive quantization, state-of-the-art at 2-bit
PRACTICAL GUIDANCE
- 8-bit: negligible quality loss for most tasks (~0.1% accuracy drop)
- 4-bit: slight quality loss, acceptable for many production uses (~1-3% accuracy drop)
- 2-3 bit: noticeable degradation, use only when cost is critical
- Always evaluate on YOUR task after quantization (general benchmarks can be misleading)
- Combine quantization with speculative decoding for further speedup
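The size reduction from quantization follows from simple arithmetic: weight memory is roughly parameter count times bits per weight divided by 8. The sketch below applies that formula to a hypothetical 70B-parameter model. It deliberately ignores the KV cache, activations, and the per-group scales and zero points that real quantization formats store, so actual footprints will be somewhat higher.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Hypothetical 70B-parameter model at common bit widths.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70, bits):.1f} GB")
# 16-bit: 140.0 GB
#  8-bit: 70.0 GB
#  4-bit: 35.0 GB
#  2-bit: 17.5 GB
```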
Multi-layer caching for LLM systems:
EXACT MATCH CACHE
- Hash the full prompt, return cached response for identical queries
- Hit rate: typically 5-15% for general-purpose, 30-60% for structured queries
- Tools: Redis, DragonflyDB, in-memory LRU
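An in-memory version of the exact-match cache might look like the sketch below. The class name, the choice to include the model name in the key, and the capacity are assumptions for illustration; a shared deployment would more likely sit on Redis or a similar store. Exact-match caching is generally only appropriate when identical prompts are expected to yield the same answer, for example with deterministic decoding settings.

```python
import hashlib
from collections import OrderedDict

class ExactMatchCache:
    """LRU cache keyed by a hash of (model, prompt)."""

    def __init__(self, max_entries=1024):
        self._store = OrderedDict()
        self._max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt):
        # A NUL separator keeps ("a", "bc") and ("ab", "c") from colliding.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, model, prompt, response):
        key = self._key(model, prompt)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = ExactMatchCache(max_entries=2)
cache.put("small-model", "What is AIOps?", "AI for IT operations.")
print(cache.get("small-model", "What is AIOps?"))   # cached response
print(cache.get("small-model", "what is aiops?"))   # None: different bytes, different key
```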
SEMANTIC CACHE
- Embed the prompt, return cached response for semantically similar queries
- Similarity threshold: 0.95+ cosine similarity (tune per use case)
- Tools: GPTCache, Redis with vector search, Qdrant
- Risk: semantically similar prompts may require different answers
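The semantic cache lookup can be sketched as a nearest-neighbor search gated by the similarity threshold. The example below uses a linear scan and a hand-written lookup table as a stand-in embedding function, both of which are assumptions for illustration; a real system would call an embedding model and a vector index such as those listed above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Returns a cached response when the closest stored prompt is similar enough."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: prompt -> list of floats (assumed)
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = -1.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

# Toy embeddings standing in for a real embedding model.
TOY_VECTORS = {
    "reset my password": [1.0, 0.0, 0.1],
    "how do I reset my password": [0.98, 0.05, 0.12],
    "delete my account": [0.0, 1.0, 0.1],
}

semantic_cache = SemanticCache(embed=TOY_VECTORS.__getitem__)
semantic_cache.put("reset my password", "Use the account settings page.")
print(semantic_cache.get("how do I reset my password"))  # similar enough: cached answer
print(semantic_cache.get("delete my account"))           # below threshold: None
```

The risk noted above shows up directly in the threshold: lowering it raises the hit rate but also the chance of returning an answer to a question the user did not ask.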
KV CACHE OPTIMIZATION
- PagedAttention (vLLM): eliminates memory waste from pre-allocated KV cache
- Prefix caching: reuse KV cache for shared system prompts across requests
- Quantized KV cache: FP8 or INT8 KV values (FP8 on H100-class GPUs or newer; roughly 2x context capacity vs. FP16)
PROMPT CACHING (API providers)
- Anthropic prompt caching: cache static prefix, pay reduced rate for cached tokens
- OpenAI prompt caching: applied automatically to repeated prompt prefixes
- Design prompts with static prefix (system prompt, examples) + dynamic suffix (user query)
Cost-quality optimization through model routing:
TIERED MODEL ROUTING
- Simple queries → small/fast model (e.g., GPT-4o-mini, Claude Haiku, Llama-8B)
- Complex queries → large/capable model (e.g., GPT-4o, Claude Sonnet, Llama-70B)
- Critical queries → frontier model (e.g., o3, Claude Opus)
ROUTING STRATEGIES
1. Classifier-based: Train a small classifier on query complexity
- Features: query length, vocabulary complexity, domain signals
- Labels: which model tier produces acceptable quality
- Cost: classifier inference is negligible (<1ms, <$0.001)
2. Cascade (try-small-first):
- Route to cheapest model first
- Check output quality with a verifier
- Escalate to larger model if quality is insufficient
- Effective when >50% of queries are simple
3. Task-based routing:
- Summarization, translation → mid-tier model
- Code generation, math reasoning → high-tier model
- Classification, extraction → small model or fine-tuned specialist
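The cascade strategy can be sketched in a few lines. In this illustration the tiers are plain Python callables and the verifier is a trivial non-empty check. Both are hypothetical placeholders for real model clients and a real quality verifier, such as a judge model or a schema validator.

```python
def cascade(prompt, tiers, verifier):
    """Try tiers from cheapest to most capable; stop at the first accepted output.

    tiers: list of (name, generate) pairs ordered cheapest first.
    verifier: callable (prompt, output) -> bool.
    Returns (output, attempted_tier_names). If no tier passes, the last
    tier's output is returned as a best effort.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    attempted = []
    output = None
    for name, generate in tiers:
        output = generate(prompt)
        attempted.append(name)
        if verifier(prompt, output):
            break
    return output, attempted

# Hypothetical stand-ins: the small model gives up on long prompts.
def small_model(prompt):
    return "4" if len(prompt) < 20 else ""

def large_model(prompt):
    return "answer from large model"

def non_empty(prompt, output):
    return bool(output.strip())

tiers = [("small", small_model), ("large", large_model)]
print(cascade("2 + 2?", tiers, non_empty))
print(cascade("Prove that the sum of two even integers is even.", tiers, non_empty))
```

Every escalation pays for the small model's attempt as well as the larger one, which is why the approach works best when most queries are accepted at the first tier.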
EXPECTED SAVINGS
- Typical 40-70% cost reduction vs. routing everything to the best model
- Quality degradation: <5% when routing thresholds are properly calibrated
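The savings estimate can be checked with simple arithmetic: the blended cost is the traffic-weighted average of per-tier costs, compared against sending everything to the most expensive tier. The traffic shares and relative costs below are made-up numbers for illustration only, and the sketch ignores the extra cost of failed first attempts in a cascade.

```python
def blended_cost(routing):
    """routing: list of (traffic_share, cost_per_query); shares must sum to 1."""
    total_share = sum(share for share, _ in routing)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in routing)

def savings_vs_best(routing):
    """Fractional cost reduction relative to routing all traffic to the priciest tier."""
    best_cost = max(cost for _, cost in routing)
    return 1.0 - blended_cost(routing) / best_cost

# Hypothetical split: 60% of queries to a small model at relative cost 1,
# 40% to a large model at relative cost 10.
routing = [(0.6, 1.0), (0.4, 10.0)]
print(f"blended cost: {blended_cost(routing):.2f}")
print(f"savings: {savings_vs_best(routing):.0%}")
```

Under these assumed numbers the blended cost is 4.6 against a baseline of 10, a 54% reduction, which falls inside the range quoted above.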
| Paper | Year | Focus |
|---|---|---|
| LogPPT | 2023 | Few-shot log parsing with prompt tuning |
| OpsEval | 2024 | Benchmark for evaluating LLMs in AIOps |
| D-Bot | 2024 | LLM-based database diagnosis |
| RCAgent | 2024 | Agent for root cause analysis |
| LogAgent | 2024 | Autonomous log analysis agent |
| AIOpsLab | 2024 | Holistic benchmark suite for AIOps agents |
| MonitorAssistant | 2024 | LLM-based alert correlation and noise reduction |
| LLM4Ops Survey | 2024 | Comprehensive survey of LLMs for IT operations |