package.json

"author": "wentorai"

"repository": "wentorai/research-plugins"


name	llm-aiops-guide
description	Papers on LLMs for IT operations and AIOps research
metadata	{"openclaw":{"emoji":"🖥️","category":"domains","subcategory":"cs","keywords":["AIOps","LLM operations","IT automation","log analysis","incident management","DevOps AI"],"source":"https://github.com/Jun-jie-Huang/awesome-LLM-AIOps"}}

LLM for AIOps Guide

Overview

A curated collection of research on applying LLMs to IT operations (AIOps): log analysis, anomaly detection, incident management, root cause analysis, and automated remediation. It tracks how foundation models are turning traditional rule-based operations tooling into adaptive systems, and is relevant for CS researchers working at the intersection of systems, NLP, and operations.

Research Areas

LLM for AIOps
├── Log Analysis
│   ├── Log parsing (template extraction)
│   ├── Anomaly detection (from log sequences)
│   ├── Log summarization
│   └── Root cause from logs
├── Incident Management
│   ├── Incident triage and routing
│   ├── Severity classification
│   ├── Similar incident retrieval
│   └── Resolution recommendation
├── Root Cause Analysis
│   ├── Topology-aware diagnosis
│   ├── Multi-signal correlation
│   └── Causal inference
├── Monitoring & Alerting
│   ├── Metric anomaly detection
│   ├── Alert correlation
│   ├── Noise reduction
│   └── Capacity planning
└── Automated Remediation
    ├── Runbook generation
    ├── Script generation
    ├── Self-healing systems
    └── Change impact analysis

Key Practices for LLM Operations

Model Monitoring

Production LLM monitoring dimensions:

QUALITY MONITORING
- Output quality scores: automated evaluation (LLM-as-judge, BERTScore, ROUGE)
- Hallucination rate: factual grounding checks against retrieval context
- Refusal rate: track over-cautious or under-cautious safety filters
- Latency percentiles: p50, p95, p99 for time-to-first-token and total generation
- Token usage: input/output token distributions, context window utilization
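As an illustration of the latency percentiles item, here is a minimal sketch of a nearest-rank percentile summary. The sample values and the function names `percentile` and `latency_summary` are hypothetical; a production system would more likely read these figures from a metrics backend than compute them by hand.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Return the p50/p95/p99 summary typically shown on latency dashboards."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}

# Hypothetical time-to-first-token measurements in milliseconds.
ttft_ms = [120, 135, 128, 410, 131, 126, 140, 980, 133, 129]
print(latency_summary(ttft_ms))
```

With only ten samples the p95 and p99 values both fall on the slowest request, which is why tail percentiles are usually computed over large windows.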

DRIFT DETECTION
- Input drift: embedding-space distribution shift (cosine distance, MMD)
- Output drift: topic/style distribution changes over time windows
- Performance drift: sliding-window accuracy on held-out evaluation sets
- Concept drift: monitor for domain vocabulary shifts in user queries
- Baseline comparison: periodically re-evaluate against golden test suites
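The input drift item can be sketched as follows, assuming each query has already been embedded into a fixed-length vector. This version compares only the centroids of a reference window and a current window by cosine distance, so it will miss changes in spread that a kernel test such as MMD would catch. The threshold of 0.1 and all names are assumptions, not values from the guide.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero-norm vector has no direction")
    return 1.0 - dot / (norm_a * norm_b)

def input_drift(reference, current, threshold=0.1):
    """Return (distance, drifted). The threshold is an assumed value to tune per system."""
    distance = cosine_distance(centroid(reference), centroid(current))
    return distance, distance > threshold

# Toy 2-D "embeddings": the current window points in a different direction.
reference_window = [[1.0, 0.0], [0.9, 0.1]]
current_window = [[0.0, 1.0], [0.1, 0.9]]
print(input_drift(reference_window, current_window))
```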

OPERATIONAL HEALTH
- GPU utilization and memory pressure (per-device, per-replica)
- Request queue depth and timeout rates
- Cache hit rates (KV cache, semantic cache, prompt cache)
- Error rates by error category (OOM, context overflow, timeout, malformed output)
- Throughput: tokens/second per deployment, requests/minute

A/B Testing for LLMs

Designing valid A/B tests for LLM systems:

CHALLENGES UNIQUE TO LLMs
- High output variance: same prompt can produce different outputs
- Evaluation subjectivity: many tasks lack clear ground truth
- Latency-quality tradeoff: larger models often score higher on quality but respond more slowly
- Cost confound: better model may cost 10x more per query

RECOMMENDED APPROACH
1. Define metrics BEFORE experiment:
   - Primary: task-specific quality (accuracy, user satisfaction, resolution rate)
   - Secondary: latency, cost per query, token efficiency
   - Guardrail: safety violations, hallucination rate

2. Traffic splitting strategy:
   - User-level randomization (not request-level), so each user sees one variant consistently
   - Minimum 1-2 weeks for stable estimates
   - Stratify by user segment (power users vs. new users)

3. Evaluation methods:
   - Automated scoring with LLM-as-judge (calibrated against human raters)
   - Blind human evaluation on sampled outputs (target inter-rater agreement > 0.7, e.g., Cohen's kappa)
   - Downstream business metrics (ticket resolution time, user retention)

4. Statistical rigor:
   - Bootstrap confidence intervals for LLM quality scores
   - Account for multiple comparisons when testing many variants
   - Report effect sizes, not just p-values
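Two of the steps above, user-level traffic splitting and bootstrap confidence intervals, can be sketched with the standard library alone. This is one possible approach, not a prescribed implementation: the hash-based assignment, the percentile bootstrap, and every name and default (`assign_variant`, `bootstrap_diff_ci`, 2000 resamples) are assumptions. Scores are assumed to be aggregated per user, since the user is the unit of randomization.

```python
import hashlib
import random
import statistics

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministic user-level assignment: the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control).

    Inputs are per-user quality scores. Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))      # resample users with replacement
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(treatment) - statistics.fmean(control)
    return point, lower, upper

# Hypothetical per-user LLM-as-judge scores for each arm.
control_scores = [0.4, 0.5, 0.6, 0.5]
treatment_scores = [0.7, 0.8, 0.9, 0.8]
print(assign_variant("user-123", "prompt-v2"))
print(bootstrap_diff_ci(control_scores, treatment_scores))
```

Reporting the point estimate together with the interval covers the "effect sizes, not just p-values" recommendation. When several variants are compared, the alpha passed here would need a multiple-comparison adjustment.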

Toolchain Overview

Experiment Tracking and Model Registry

Tool	Focus	Key Capabilities
MLflow	End-to-end ML lifecycle	Experiment tracking, model registry, deployment, LLM evaluation
Weights & Biases	Experiment tracking + LLM monitoring	Traces, prompt versioning, evaluation tables, sweeps
LangSmith	LLM application observability	Trace visualization, prompt playground, dataset management, online evaluation
Comet ML	Experiment management	Model comparison, artifact tracking, LLM prompt tracking

Serving and Inference

Tool	Focus	Key Capabilities
vLLM	High-throughput serving	PagedAttention, continuous batching, tensor parallelism, speculative decoding
TGI (Text Generation Inference)	Production serving	Quantization, streaming, multi-LoRA, watermarking
Ollama	Local model running	Easy setup, model library, OpenAI-compatible API
TensorRT-LLM	NVIDIA-optimized inference	FP8 quantization, in-flight batching, custom kernels
SGLang	Structured generation serving	RadixAttention, constrained decoding, multi-modal support

Orchestration and Pipelines

Tool	Focus	Key Capabilities
LangChain / LangGraph	LLM application framework	Chains, agents, tool use, stateful multi-actor workflows
Haystack	NLP pipeline framework	RAG pipelines, document processing, evaluation
Prefect / Airflow	Workflow orchestration	DAG scheduling, retry logic, observability
Ray Serve	Distributed serving	Auto-scaling, multi-model composition, batch inference

Typical LLMOps Pipeline Architecture

End-to-end LLMOps pipeline:

┌─────────────────────────────────────────────────────────────────┐
│                     DATA PREPARATION                            │
│  Raw data → Cleaning → Annotation → Train/Eval split            │
│  Tools: Label Studio, Argilla, Lilac, DVC                       │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   MODEL DEVELOPMENT                             │
│  Base model selection → Fine-tuning (LoRA/QLoRA) → Evaluation   │
│  Tools: Hugging Face Transformers, Axolotl, LLaMA-Factory       │
│  Eval: lm-evaluation-harness, HELM, custom domain benchmarks    │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   MODEL REGISTRY & CI                           │
│  Version control → Automated testing → Approval gates           │
│  Tools: MLflow Registry, W&B Model Registry, HF Hub             │
│  Tests: regression suite, safety checks, latency benchmarks     │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   DEPLOYMENT                                    │
│  Quantization → Containerization → Canary rollout → Full deploy │
│  Tools: vLLM, TGI, Docker, Kubernetes, Terraform                │
│  Strategy: blue-green or canary with automatic rollback         │
└──────────────────────────┬──────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                   PRODUCTION MONITORING                         │
│  Quality monitoring → Drift detection → Alerting → Feedback     │
│  Tools: LangSmith, W&B Weave, Prometheus + Grafana, PagerDuty   │
│  Loop: degradation detected → trigger re-evaluation → retrain   │
└─────────────────────────────────────────────────────────────────┘

Pipeline Design Principles

Reproducibility: Every model version must be traceable to its training data, hyperparameters, and base model. Use deterministic seeds and pin library versions.
Evaluation-first: Define evaluation criteria before training. Include both automated metrics and human evaluation protocols.
Gradual rollout: Never switch 100% traffic to a new model instantly. Use canary deployments (1% -> 10% -> 50% -> 100%) with automatic rollback on quality regression.
Feedback loops: Collect user feedback (explicit thumbs up/down, implicit engagement metrics) and route it back to evaluation datasets.
Safety gates: Automated checks for toxic output, PII leakage, and prompt injection before any model promotion.
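The gradual rollout principle might look like the sketch below. It assumes a hypothetical `quality_at` hook that returns the candidate model's measured quality score while it serves a given share of traffic; the regression tolerance of 0.02 is likewise an assumed value.

```python
from typing import Callable, List, Sequence, Tuple

def canary_rollout(
    quality_at: Callable[[int], float],
    baseline_quality: float,
    stages: Sequence[int] = (1, 10, 50, 100),
    max_regression: float = 0.02,
) -> Tuple[int, List[int]]:
    """Advance through traffic stages and roll back to 0% on a quality regression.

    Returns (final traffic percent, stages actually attempted).
    """
    attempted: List[int] = []
    for percent in stages:
        attempted.append(percent)
        if baseline_quality - quality_at(percent) > max_regression:
            return 0, attempted  # automatic rollback
    return stages[-1], attempted

# Hypothetical candidate whose quality degrades once it takes half the traffic.
def degrading_candidate(percent: int) -> float:
    return 0.90 if percent < 50 else 0.80

print(canary_rollout(degrading_candidate, baseline_quality=0.90))
```

In a real deployment each stage would also wait for enough traffic to make the quality estimate stable before advancing.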

Cost Optimization Strategies

Quantization

Reducing model size and inference cost:

QUANTIZATION METHODS
- GPTQ: Post-training quantization, good quality at 4-bit, widely supported
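- AWQ (Activation-aware Weight Quantization): Post-training, protects activation-salient weights; often reported to match or beat GPTQ quality at 4-bit
- GGUF: llama.cpp file format rather than a method in itself; CPU-friendly, variable bit-width quantization types (Q4_K_M, Q5_K_M, Q8_0)
- FP8: Native on NVIDIA H100/B200, minimal quality loss, up to ~2x throughput vs FP16
- AQLM: Additive quantization, among the strongest reported results at ~2-bit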

PRACTICAL GUIDANCE
- 8-bit: negligible quality loss for most tasks (~0.1% accuracy drop)
- 4-bit: slight quality loss, acceptable for many production uses (~1-3% accuracy drop)
- 2-3 bit: noticeable degradation, use only when cost is critical
- Always evaluate on YOUR task after quantization (general benchmarks can be misleading)
- Combine quantization with speculative decoding for further speedup
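The size reduction behind these bit-widths is simple arithmetic: weight storage is roughly parameters × bits / 8 bytes. The sketch below applies that to a hypothetical 7-billion-parameter model. It counts weights only and ignores quantization metadata (scales, zero points), the KV cache, and activations, so real footprints are somewhat larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): params * bits / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical 7B-parameter model at several bit-widths.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.2f} GB")
```

Under these assumptions the weights shrink from about 14 GB at 16-bit to about 3.5 GB at 4-bit.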

Caching Strategies

Multi-layer caching for LLM systems:

EXACT MATCH CACHE
- Hash the full prompt, return cached response for identical queries
- Hit rate: typically 5-15% for general-purpose, 30-60% for structured queries
- Tools: Redis, DragonflyDB, in-memory LRU
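A minimal in-memory version of the exact match cache could look like this. The class name, the choice to include the model name in the key, and the default capacity are assumptions for illustration; a shared deployment would more likely back this with Redis or a similar store.

```python
import hashlib
from collections import OrderedDict

class ExactMatchCache:
    """In-memory LRU cache keyed by a hash of the model name and the full prompt."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def _key(model, prompt):
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        key = self._key(model, prompt)
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, model, prompt, response):
        key = self._key(model, prompt)
        self._store[key] = response
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry

cache = ExactMatchCache(max_entries=2)
cache.put("small-model", "What is AIOps?", "cached answer")
print(cache.get("small-model", "What is AIOps?"))
```

Reusing a cached response only makes sense when identical prompts are expected to yield interchangeable answers, for example with deterministic decoding settings.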

SEMANTIC CACHE
- Embed the prompt, return cached response for semantically similar queries
- Similarity threshold: 0.95+ cosine similarity (tune per use case)
- Tools: GPTCache, Redis with vector search, Qdrant
- Risk: semantically similar prompts may require different answers
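The semantic cache lookup can be sketched as a nearest-neighbour search with a similarity threshold. The `embed` callable is a hypothetical stand-in for whatever embedding model is in use, and the linear scan is only reasonable for small caches; a vector index would replace it at scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, or 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Linear-scan semantic cache with a cosine similarity threshold."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # hypothetical callable: str -> list of floats
        self.threshold = threshold
        self._entries = []          # list of (embedding, response) pairs

    def get(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self._entries.append((self.embed(prompt), response))

# Toy embedding table standing in for a real embedding model.
toy_embeddings = {
    "reset password": [1.0, 0.0],
    "reset my password": [0.99, 0.05],
    "delete account": [0.0, 1.0],
}
cache = SemanticCache(lambda prompt: toy_embeddings[prompt])
cache.put("reset password", "Use the account settings page.")
print(cache.get("reset my password"))
print(cache.get("delete account"))
```

The risk noted above shows up directly in the threshold: lowering it raises the hit rate but also the chance of returning an answer to a different question.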

KV CACHE OPTIMIZATION
- PagedAttention (vLLM): eliminates memory waste from pre-allocated KV cache
- Prefix caching: reuse KV cache for shared system prompts across requests
- Quantized KV cache: FP8 or INT8 KV values (H100+, ~2x context capacity)

PROMPT CACHING (API providers)
- Anthropic prompt caching: cache static prefix, pay reduced rate for cached tokens
- OpenAI prompt caching: applied automatically to repeated prompt prefixes above a minimum length
- Design prompts with static prefix (system prompt, examples) + dynamic suffix (user query)

Intelligent Routing

Cost-quality optimization through model routing:

TIERED MODEL ROUTING
- Simple queries → small/fast model (e.g., GPT-4o-mini, Claude Haiku, Llama-8B)
- Complex queries → large/capable model (e.g., GPT-4o, Claude Sonnet, Llama-70B)
- Critical queries → frontier model (e.g., o3, Claude Opus)

ROUTING STRATEGIES
1. Classifier-based: Train a small classifier on query complexity
   - Features: query length, vocabulary complexity, domain signals
   - Labels: which model tier produces acceptable quality
   - Cost: classifier inference is negligible (<1ms, <$0.001)

2. Cascade (try-small-first):
   - Route to cheapest model first
   - Check output quality with a verifier
   - Escalate to larger model if quality is insufficient
   - Effective when >50% of queries are simple

3. Task-based routing:
   - Summarization, translation → mid-tier model
   - Code generation, math reasoning → high-tier model
   - Classification, extraction → small model or fine-tuned specialist
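The cascade strategy (try-small-first) can be sketched as a loop over tiers ordered from cheapest to most capable. The models and the verifier below are stub functions, and all names are hypothetical; in practice the verifier might be a rule check, a small classifier, or an LLM-as-judge call.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]

def cascade(
    query: str,
    tiers: Sequence[Tuple[str, Model]],
    is_acceptable: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try tiers from cheapest to most capable and return (tier name, answer).

    The last tier's answer is returned even if the verifier rejects it,
    because there is nothing left to escalate to.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer = model(query)
        if is_acceptable(query, answer):
            break
    return name, answer

# Stub models: the small one only handles a trivial query.
def small_model(query: str) -> str:
    return "4" if query == "2+2" else "unsure"

def large_model(query: str) -> str:
    return "answer from large model"

def verifier(query: str, answer: str) -> bool:
    return answer != "unsure"

tiers = [("small", small_model), ("large", large_model)]
print(cascade("2+2", tiers, verifier))
print(cascade("explain this outage", tiers, verifier))
```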

EXPECTED SAVINGS
- Typical 40-70% cost reduction vs. routing everything to the best model
- Quality degradation: <5% when routing thresholds are properly calibrated
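To see where savings in this range can come from, the expected cost per query under routing is the traffic-weighted sum of per-query prices. The sketch below uses hypothetical relative prices (small model = 1, large model = 10) and assumes 60% of queries are simple. It ignores the cost of the classifier or verifier itself.

```python
def blended_cost(shares_and_prices):
    """Expected cost per query: sum of traffic share times per-query price."""
    total_share = sum(share for share, _ in shares_and_prices)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price for share, price in shares_and_prices)

def savings(routed_cost, baseline_cost):
    """Fractional cost reduction relative to the baseline."""
    return 1.0 - routed_cost / baseline_cost

# Hypothetical relative prices and a 60/40 split between simple and complex queries.
SMALL, LARGE = 1.0, 10.0
routed = blended_cost([(0.6, SMALL), (0.4, LARGE)])  # classifier-based routing
cascaded = SMALL + 0.4 * LARGE                       # cascade: every query pays SMALL first
print(f"routing saves {savings(routed, LARGE):.0%}, cascade saves {savings(cascaded, LARGE):.0%}")
```

With these assumed numbers, classifier routing costs 4.6 units per query against 10 for the large model alone, a saving of about 54%; the cascade costs 5.0 units, a saving of 50%, because escalated queries pay for both models.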

Key Papers

Paper	Year	Focus
LogPPT	2023	Few-shot log parsing with prompt tuning
OpsEval	2024	Benchmark for evaluating LLMs in AIOps
D-Bot	2024	LLM-based database diagnosis
RCAgent	2024	Agent for root cause analysis
LogAgent	2024	Autonomous log analysis agent
AIOpsLab	2024	Holistic benchmark suite for AIOps agents
MonitorAssistant	2024	LLM-based alert correlation and noise reduction
LLM4Ops Survey	2024	Comprehensive survey of LLMs for IT operations

Use Cases

Literature tracking: Follow LLM-AIOps research evolution
System design: Learn intelligent operations patterns
Benchmark comparison: Evaluate AIOps approaches
Research planning: Identify under-explored AIOps problems
Industry applications: Bridge research to production AIOps
Cost modeling: Design cost-efficient LLM serving architectures
Pipeline design: Architect end-to-end LLMOps workflows
