llm-serving-capacity-planner

Name: LLM Serving Capacity Planner
Author: BBuf

// Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.


stars:483

forks:41

updated:May 20, 2026 at 12:13

name	llm-serving-capacity-planner
description	Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

LLM Serving Capacity Planner

Overview

Use this skill when a serving startup log contains enough memory lines to explain where GPU HBM went. The analyzer reads SGLang/vLLM startup logs (only SGLang patterns are fully supported today; see Known Limitations), extracts the weight-load, KV-pool, CUDA-graph, framework-overhead, and token-capacity lines, and then estimates how many concurrent requests fit at common token lengths.

Confirmation Required

Before running analysis, collect or verify these inputs:

Item	Why it matters	How to obtain	Default if user skips
Log file path	Primary input; all memory data comes from here	Ask user for the serving startup log	— (required)
GPU type	Determines total HBM for decomposition validation	Ask user or infer from log	Auto-detected from log if possible
nvidia-smi output	Provides per-rank actual memory for cross-validation	Capture with `nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt`	— (optional, but recommended)
Model config.json	Enables theoretical KV cache byte calculation and replication factor analysis	Ask user for the model's config.json path	— (optional, log data used instead)
Request token length	Determines concurrency estimate denominator	Ask user	4096, 6144, 8192

Workflow

Step 1: Collect the serving log

The user should provide the startup log from an SGLang or vLLM serving instance. Key log lines that the analyzer needs:

Load weight begin. avail mem=XX GB
Memory profiling: available_gpu_memory=XX GB, ... (newer sglang)
SW KV memory calculation: bytes_per_full_token=XX, available_bytes=XX GB, full_token=XX (SWA models like DeepSeek-V4)
Memory pool end. avail mem=XX GB
Capture cuda graph end. ... mem usage=XX GB. avail mem=XX GB.
max_total_num_tokens=XX, ... max_running_requests=XX, ... available_gpu_mem=XX GB
server_args=ServerArgs(...) (for serving parameters)

These lines are printed during startup, so redirect stdout/stderr to a file at launch time. For an instance that is already running, use whatever startup output was saved when it was launched.
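As a rough illustration of what extracting these lines involves, the sketch below pulls a few fields out of a log with regular expressions. The sample text is synthetic (it follows the templates above, with made-up numbers and the elided parts left as `...`), and the pattern names and `extract_memory_fields` are hypothetical; this is not the code in scripts/capacity_analyzer.py.

```python
import re

# Synthetic sample following the line templates above; all numbers are made up.
SAMPLE_LOG = """\
Load weight begin. avail mem=138.50 GB
Memory pool end. avail mem=20.25 GB
Capture cuda graph end. ... mem usage=2.50 GB. avail mem=17.75 GB.
max_total_num_tokens=500000, ... max_running_requests=256, ... available_gpu_mem=17.75 GB
"""

# Hypothetical field names mapped to (pattern, converter).
PATTERNS = {
    "avail_before_weights_gb": (r"Load weight begin\. avail mem=([\d.]+) GB", float),
    "avail_after_pool_gb": (r"Memory pool end\. avail mem=([\d.]+) GB", float),
    "cuda_graph_gb": (r"Capture cuda graph end\..*?mem usage=([\d.]+) GB", float),
    "max_total_num_tokens": (r"max_total_num_tokens=(\d+)", int),
    "max_running_requests": (r"max_running_requests=(\d+)", int),
}


def extract_memory_fields(log_text):
    """Return the fields found in the log; missing lines are simply absent."""
    fields = {}
    for name, (pattern, convert) in PATTERNS.items():
        match = re.search(pattern, log_text)
        if match:
            fields[name] = convert(match.group(1))
    return fields


if __name__ == "__main__":
    print(extract_memory_fields(SAMPLE_LOG))
```

Leaving missing fields out of the result, rather than defaulting them to zero, keeps it obvious which log lines were absent.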

Step 2: Optionally capture nvidia-smi data

For per-rank memory comparison:

docker exec <container> nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader > smi.txt
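The resulting smi.txt holds one `index, memory.used, memory.free` row per GPU, with values such as `130000 MiB`. A minimal sketch of reading it for a per-rank comparison follows; the function names and sample numbers are illustrative only.

```python
def parse_smi_csv(text):
    """Parse rows produced by --query-gpu=index,memory.used,memory.free --format=csv,noheader."""
    ranks = []
    for line in text.splitlines():
        if not line.strip():
            continue
        index, used, free = [part.strip() for part in line.split(",")]
        ranks.append({
            "index": int(index),
            "used_mib": int(used.split()[0]),  # "130000 MiB" -> 130000
            "free_mib": int(free.split()[0]),
        })
    return ranks


def used_spread_mib(ranks):
    """Difference between the most and least loaded GPU, in MiB."""
    used = [rank["used_mib"] for rank in ranks]
    return max(used) - min(used)


if __name__ == "__main__":
    sample = "0, 130000 MiB, 13000 MiB\n1, 130512 MiB, 12488 MiB\n"  # made-up values
    ranks = parse_smi_csv(sample)
    print(ranks, used_spread_mib(ranks))
```

A large spread between ranks is a hint that one rank carries extra allocations worth inspecting separately.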

Step 3: Run the analyzer

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --nvidia-smi-file /path/to/smi.txt \
  --gpu h200 \
  --config-json /path/to/config.json

For JSON output (automation):

python3 skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py \
  --log-file /path/to/sglang.log \
  --format json
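For automation, the JSON mode can be wrapped in a small helper. The sketch below only uses the flags shown above and assumes the report is written to stdout. The JSON schema is not described here, so the parsed object is returned as-is. `build_command` and `run_analyzer` are hypothetical names.

```python
import json
import subprocess

SCRIPT = "skills/llm-serving-capacity-planner/scripts/capacity_analyzer.py"


def build_command(log_file, smi_file=None, gpu=None, config_json=None):
    """Assemble the analyzer command line, adding optional inputs only when given."""
    cmd = ["python3", SCRIPT, "--log-file", log_file, "--format", "json"]
    if smi_file:
        cmd += ["--nvidia-smi-file", smi_file]
    if gpu:
        cmd += ["--gpu", gpu]
    if config_json:
        cmd += ["--config-json", config_json]
    return cmd


def run_analyzer(log_file, **optional):
    """Run the analyzer and parse its output (assumption: JSON on stdout)."""
    result = subprocess.run(
        build_command(log_file, **optional),
        capture_output=True,
        text=True,
        check=True,  # raise if the analyzer exits with a non-zero status
    )
    return json.loads(result.stdout)
```

Passing the command as a list avoids shell quoting problems when log paths contain spaces.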

Step 4: Review and interpret results

The analyzer prints:

Memory breakdown table: each category (weights, KV pool, CUDA graph, framework, other) with GiB, MiB, percentage, and derivation
Per-rank comparison: nvidia-smi data across all TP ranks
KV pool detail: pool configuration, KV dtype, replication factor, per-token byte calculation
Concurrency estimate: max concurrent requests for different token lengths
Tuning notes: configuration changes that may increase capacity
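To make the concurrency estimate concrete, here is a minimal sketch. It assumes each request occupies its full token length in the KV pool and that the result is the smaller of the token-based limit and `max_running_requests`; the analyzer's actual formula may differ. The function name and the example numbers are illustrative.

```python
def estimate_concurrency(max_total_num_tokens, max_running_requests,
                         token_lengths=(4096, 6144, 8192)):
    """Return one row per request token length."""
    rows = []
    for length in token_lengths:
        token_limit = max_total_num_tokens // length  # requests that fit in the KV pool
        rows.append({
            "token_length": length,
            "token_limit": token_limit,
            "request_limit": max_running_requests,
            "max_concurrent": min(token_limit, max_running_requests),
        })
    return rows


if __name__ == "__main__":
    # Made-up values in the shape of the max_total_num_tokens / max_running_requests log line.
    for row in estimate_concurrency(500000, 100):
        print(row)
```

With these made-up inputs the 4096-token case is capped by the request limit, while the longer lengths are capped by the KV pool.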

When To Use It

After launching an LLM serving instance, to understand how GPU memory is distributed
When comparing different --mem-fraction-static values and their impact on KV pool capacity
When planning deployment capacity: how many concurrent requests can a given GPU configuration support
When investigating OOM issues: identifying which memory category is consuming the most
When evaluating whether fp8 KV cache or EP can improve concurrency

Key Concepts

mem-fraction-static

In SGLang, sets the fraction of GPU memory used for static allocation, meaning model weights plus the KV cache pool. Weights are fixed for a given model and parallel layout, so a higher value mostly enlarges the KV pool and leaves less headroom for CUDA graphs and other runtime buffers.

0.88 (default): aggressive; roughly 12% of GPU memory is left for CUDA graphs and other runtime buffers
0.60: conservative; more free memory left for runtime, but significantly less KV capacity
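A what-if comparison between two values can be sketched as follows. The accounting is an assumption: the KV pool is taken to be the memory still free after weight loading, minus a runtime reserve of total × (1 − mem-fraction-static). The function names and example numbers are hypothetical, and real logs will deviate because of framework overhead.

```python
def kv_pool_gb(total_gb, avail_after_weights_gb, mem_fraction_static):
    """Estimated KV pool size under the accounting assumption stated above."""
    runtime_reserve_gb = total_gb * (1.0 - mem_fraction_static)
    return max(avail_after_weights_gb - runtime_reserve_gb, 0.0)


def compare_fractions(total_gb, avail_after_weights_gb, fractions=(0.88, 0.60)):
    """Map each candidate fraction to its estimated KV pool size in GB."""
    return {
        fraction: round(kv_pool_gb(total_gb, avail_after_weights_gb, fraction), 2)
        for fraction in fractions
    }


if __name__ == "__main__":
    # Made-up card: 140 GB total, 100 GB still free once weights are loaded.
    print(compare_fractions(140.0, 100.0))
```

Dividing the estimated pool by the per-token KV bytes reported in the log then gives a rough token capacity for each setting.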

KV Head Replication

When num_key_value_heads < tp_size, the KV heads cannot be split across all TP ranks, so each head is replicated on tp_size / num_key_value_heads ranks instead. For example, with kv_heads=1 and tp=8, each of the 8 cards stores a full copy of the KV cache, which is 8x the per-card KV memory of a fully split layout.
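The replication factor and a theoretical per-token KV size can be sketched as below. This assumes standard multi-head or grouped-query attention with separate K and V tensors, and a tp_size that is a multiple of the KV head count (or the reverse). It does not apply to SWA-compressed models. The function names are hypothetical; the inputs correspond to the usual config.json fields such as num_hidden_layers and num_key_value_heads.

```python
def kv_replication_factor(num_key_value_heads, tp_size):
    """Number of TP ranks that hold a copy of each KV head."""
    if num_key_value_heads >= tp_size:
        return 1  # heads are split across ranks, no replication
    return tp_size // num_key_value_heads


def kv_bytes_per_token_per_rank(num_layers, num_key_value_heads, head_dim,
                                tp_size, dtype_bytes=2):
    """Theoretical KV bytes one token occupies on one rank (dtype_bytes=2 for bf16/fp16)."""
    heads_per_rank = max(num_key_value_heads // tp_size, 1)
    return 2 * num_layers * heads_per_rank * head_dim * dtype_bytes  # 2x for K and V


if __name__ == "__main__":
    # Made-up model shape: 32 layers, head_dim 128, served with tp=8.
    for kv_heads in (8, 1):
        print(kv_heads,
              kv_replication_factor(kv_heads, 8),
              kv_bytes_per_token_per_rank(32, kv_heads, 128, 8))
```

Both example shapes cost the same per rank. The single-head model stores only one eighth as much KV data in total, but replication stops tensor parallelism from reducing its per-card footprint.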

SWA (Sliding Window Attention) Compression

Models like DeepSeek-V4 use CSA (Compressed Sliding Attention) and HCA (Hierarchical Context Attention) with sliding windows. This reduces per-token KV cache bytes compared to the theoretical full-attention calculation. The bytes_per_full_token reported in the log already accounts for this compression.

Reporting Checklist

Include:

Serving configuration: model, GPU, TP/PP/EP, mem-fraction-static, kv-cache-dtype
Memory breakdown table: category / GiB / MiB / percentage / derivation source
Per-rank nvidia-smi comparison: used and free memory per TP rank
KV pool detail: pool size, bytes_per_full_token, KV dtype, replication factor, theoretical per-token KV calculation (when config.json provided)
Concurrency estimate table: request token length / token-limit / request-limit / max concurrent
Tuning notes based on free memory and configuration

Known Limitations

Limitation	Detail	Workaround
SGLang-specific patterns	Currently only SGLang log patterns are fully supported	vLLM patterns to be added as encountered
SWA compression models	Per-token KV bytes cannot be independently calculated from model config for CSA/HCA attention — the framework's internal SWA window parameters are needed	Use `bytes_per_full_token` from the log directly
DeepGEMM JIT memory	The analyzer categorizes DeepGEMM JIT compilation memory as "other" because it is not explicitly reported in the log	Compare with nvidia-smi total for accurate accounting
PP (Pipeline Parallelism)	Memory decomposition is per-rank; PP configurations may have uneven memory across stages	Specify `--target-rank` for each PP stage
MoE expert buffer	Some frameworks allocate additional buffers for expert routing that are not separately reported	Included in "model weights" or "other" depending on when allocated

References

references/log-patterns.md: log line patterns and their semantics for memory analysis.
references/gpu-specs.json: GPU HBM specifications for h20, h100, h200, and b200 aliases.
scripts/capacity_analyzer.py: the core analysis script.

package.json

"author": "BBuf"

"repository": "BBuf/AI-Infra-Auto-Driven-SKILLS"
