parallel-strategy-analyzer
// Analyze model architecture and hardware constraints to recommend optimal parallel strategy combinations (DP/FSDP/TP/PP/EP/CP) with memory, communication, compute, and pipeline bubble estimation.
| name | parallel-strategy-analyzer |
| description | Analyze model architecture and hardware constraints to recommend optimal parallel strategy combinations (DP/FSDP/TP/PP/EP/CP) with memory, communication, compute, and pipeline bubble estimation. |
Analyze model architecture, hardware topology, and resource constraints to recommend the optimal combination of parallel strategies for distributed training.
# Basic usage
/parallel-strategy-analyzer Analyze strategy for a 70B LLaMA model on 64 Ascend A2 NPUs
# With specific constraints
/parallel-strategy-analyzer 13B model, 8× A100 80GB, sequence length 8192, batch size 512
# MoE model
/parallel-strategy-analyzer Mixtral 8x7B on 32 H100, seq_len 4096
# Optimize existing setup
/parallel-strategy-analyzer Currently using DP=8 for 7B model on 16 GPUs, OOM at seq_len=32768
# Compare specific configs
/parallel-strategy-analyzer Compare DP=4,TP=8,PP=2 vs DP=8,TP=8,PP=1 for LLaMA-70B on 64 A100s
| Parameter | Description | Example |
|---|---|---|
| Model size | Parameter count or model name | 70B, LLaMA-2-70B |
| Device count | Total devices | 64 |
| Device type | Hardware | Ascend A2/A3/950DT, A100 80GB, H100 |

| Parameter | Description | Example |
|---|---|---|
| Sequence length | Training seq len | 4096, 32768, 128K |
| Batch size | Global batch size | 1024 |
| Hidden/Layers/Heads | Architecture details | h=8192, L=80, n_h=64 |
| KV heads | For GQA models | n_kv=8 |
| FFN dim | Intermediate size | d_ff=28672 |
| MoE config | Expert count, top-k | 8 experts, top-2 |
| Framework | PyTorch or MindSpore | PyTorch |
Well-known models (LLaMA, GPT, Mixtral) are auto-filled from built-in tables.
Phase 1: Collect Info → Parse model & hardware params, auto-fill known models
Phase 2: Global Baseline → Single-device memory + FLOPs (no parallelism)
Phase 3: Strategy Search → Enumerate valid (dp,tp,pp,cp,ep) combinations
Phase 4: Cost Analysis → Per-strategy: TP/CP/EP/DP comm + PP bubble
Phase 5: Post-Shard Memory → Per-device memory after sharding
Phase 6: Compute Efficiency → MFU estimate, tokens/s throughput
Phase 7: Scoring & Ranking → Weighted score: MFU + memory + comm + bubble
Phase 8: Output Report → Recommendation + alternatives + code
Key improvement over naive approaches: Phase 2 first establishes the global baseline (total memory, bottleneck type); Phases 4-5 then evaluate the cost of sharding (communication, bubble, actual per-device memory). This avoids recommending strategies that "fit in memory" but incur prohibitive communication overhead.
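The enumeration step in Phase 3 can be sketched as follows. This is a minimal illustration, not the analyzer's own search: the function names, the pruning rules (TP within one node, even head and layer splits, EP carved out of DP) and the `devices_per_node=8` default are all assumptions made for the example.

```python
from itertools import product


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_strategies(num_devices, num_layers, num_heads,
                         num_experts=1, devices_per_node=8):
    """List (dp, tp, pp, cp, ep) tuples that pass simple divisibility checks."""
    candidates = []
    sizes = divisors(num_devices)
    for tp, pp, cp in product(sizes, repeat=3):
        # dp * tp * pp * cp must use every device exactly once.
        if num_devices % (tp * pp * cp) != 0:
            continue
        dp = num_devices // (tp * pp * cp)
        # Assumed rule: TP stays inside one node and splits heads evenly.
        if tp > devices_per_node or num_heads % tp != 0:
            continue
        # Assumed rule: every pipeline stage holds the same number of layers.
        if num_layers % pp != 0:
            continue
        # Assumed convention: EP groups are carved out of the DP dimension.
        for ep in divisors(num_experts):
            if dp % ep == 0:
                candidates.append((dp, tp, pp, cp, ep))
    return candidates


# Dense 70B-class shape (h=8192, L=80, n_h=64) on 64 devices.
strategies = enumerate_strategies(num_devices=64, num_layers=80, num_heads=64)
print(len(strategies), (4, 8, 2, 1, 1) in strategies)
```

Pruning by divisibility first keeps the later cost analysis cheap, since only layouts that can actually be instantiated reach the scoring stage.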
Total memory without parallelism, FLOPs per step, bottleneck analysis (model states vs activations).
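A rough version of this baseline can be computed by hand. The sketch below assumes mixed-precision Adam (2 bytes of BF16 weights, 2 bytes of gradients and 12 bytes of FP32 optimizer state per parameter) and the common 6·N FLOPs-per-token approximation for dense transformers; activations are left out because they depend on the architecture details. The function name and defaults are illustrative, not the analyzer's formulas.

```python
def global_baseline(num_params, tokens_per_step,
                    bytes_param=2, bytes_grad=2, bytes_optim=12):
    """Return (model-state bytes, training FLOPs per step) with no parallelism."""
    # Assumption: weights + gradients + Adam state (FP32 master copy, m, v).
    model_state_bytes = num_params * (bytes_param + bytes_grad + bytes_optim)
    # Assumption: ~6 FLOPs per parameter per token (forward + backward).
    flops_per_step = 6 * num_params * tokens_per_step
    return model_state_bytes, flops_per_step


# 70B parameters, global batch 1024 at sequence length 4096.
state_bytes, flops = global_baseline(70_000_000_000, 1024 * 4096)
print(f"model states: {state_bytes / 1e9:.0f} GB")  # 1120 GB
print(f"FLOPs per step: {flops:.3e}")
```

Under these assumptions the model states alone are far larger than a single 80 GB device, which is the kind of bottleneck classification this section reports.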
Recommended: DP=4, TP=8, PP=2, FSDP=level2
Memory/device: 58GB / 80GB (72%)
Bubble: 5% (1F1B)
Communication overhead: 12%
MFU estimate: ~45%
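The MFU figure in a summary like this can be turned into a rough throughput number. A minimal sketch, assuming the 6·N FLOPs-per-token approximation for dense training and a peak of 312 TFLOP/s per device (A100 dense BF16); the function name and the 70B, 64-device inputs are illustrative.

```python
def tokens_per_second(num_params, num_devices, peak_flops_per_device, mfu):
    """Estimate training throughput implied by a given MFU."""
    # MFU = achieved FLOP/s divided by aggregate peak FLOP/s.
    achieved_flops = mfu * num_devices * peak_flops_per_device
    # Assumption: each trained token costs about 6 * num_params FLOPs.
    return achieved_flops / (6 * num_params)


throughput = tokens_per_second(
    num_params=70e9,
    num_devices=64,              # DP=4 * TP=8 * PP=2
    peak_flops_per_device=312e12,
    mfu=0.45,
)
print(f"{throughput:,.0f} tokens/s")
```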
Concrete init_device_mesh() + fully_shard() / shard_module() calls.
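As an illustration of what such generated code might look like on the PyTorch side, the sketch below turns a (dp, tp, pp) choice into the shape and dimension names passed to `init_device_mesh`. The helper names, the dimension order (pp outermost, tp innermost) and the `"cuda"` default are assumptions for this example; the sharding calls themselves are omitted.

```python
def mesh_spec(dp, tp, pp):
    """Return (mesh_shape, mesh_dim_names), dropping dimensions of size 1."""
    # Assumed order: pp outermost, tp innermost so TP ranks are adjacent.
    dims = [("pp", pp), ("dp", dp), ("tp", tp)]
    kept = [(name, size) for name, size in dims if size > 1]
    if not kept:
        kept = [("dp", 1)]  # single-device fallback
    names = tuple(name for name, _ in kept)
    shape = tuple(size for _, size in kept)
    return shape, names


def build_mesh(dp, tp, pp, device_type="cuda"):
    """Create the DeviceMesh; needs an initialized distributed launch."""
    # Imported lazily so mesh_spec() can be used without PyTorch installed.
    from torch.distributed.device_mesh import init_device_mesh
    shape, names = mesh_spec(dp, tp, pp)
    return init_device_mesh(device_type, shape, mesh_dim_names=names)


print(mesh_spec(dp=4, tp=8, pp=2))  # ((2, 4, 8), ('pp', 'dp', 'tp'))
```

`build_mesh` is meant to be called from a process started by a distributed launcher such as `torchrun`; the resulting sub-meshes can then be handed to the sharding calls.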
Per-device: params, grads, optimizer, activations, total.
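The model-state part of this breakdown can be approximated as below. The sketch assumes TP and PP divide the parameters evenly, uses 2/2/12 bytes per parameter for weights, gradients and Adam state, and exposes hypothetical `shard_*` flags for sharding each component across the DP group. How the analyzer's FSDP levels map onto these flags is defined in its reference files, not here, and activations are excluded.

```python
def per_device_model_states_gb(num_params, dp, tp, pp,
                               shard_params=False,
                               shard_grads=True,
                               shard_optim=True):
    """Estimate per-device model-state memory in GB (activations excluded)."""
    # Assumption: TP and PP split the parameters evenly across devices.
    local_params = num_params / (tp * pp)
    params = 2 * local_params / (dp if shard_params else 1)
    grads = 2 * local_params / (dp if shard_grads else 1)
    optim = 12 * local_params / (dp if shard_optim else 1)
    result = {"params": params, "grads": grads, "optimizer": optim}
    result["total"] = params + grads + optim
    return {name: value / 1e9 for name, value in result.items()}


mem = per_device_model_states_gb(70e9, dp=4, tp=8, pp=2)
for name, gb in mem.items():
    print(f"{name:>9}: {gb:6.2f} GB")
```

For a 70B model with DP=4, TP=8, PP=2 and gradients plus optimizer state sharded over DP, this gives roughly 24 GB of model states per device; the remainder of a per-device figure would come from activations and buffers.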
Per-dimension: volume, ops, exposed time, overlap potential.
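To make the volume and exposed-time terms concrete, here is a small sketch for one collective. It uses the standard ring all-reduce cost, in which each device sends 2(n-1)/n times the payload, and a simple model where a fraction of the transfer is hidden behind compute. The function names, the 100 GB/s bandwidth and the 80% overlap are made-up illustrative inputs.

```python
def ring_allreduce_bytes(payload_bytes, group_size):
    """Bytes each device sends in a ring all-reduce of payload_bytes."""
    if group_size <= 1:
        return 0.0  # nothing to communicate in a group of one
    return 2 * (group_size - 1) / group_size * payload_bytes


def exposed_seconds(volume_bytes, bandwidth_bytes_per_s, overlap_fraction):
    """Communication time left on the critical path after overlap."""
    return volume_bytes / bandwidth_bytes_per_s * (1 - overlap_fraction)


# DP gradient all-reduce: 8.75 GB of BF16 gradients per device, DP=4.
volume = ring_allreduce_bytes(8.75e9, group_size=4)
exposed = exposed_seconds(volume, bandwidth_bytes_per_s=100e9,
                          overlap_fraction=0.8)
print(f"{volume / 1e9:.3f} GB sent, {exposed * 1e3:.2f} ms exposed")
```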
Bubble ratio, schedule recommendation (1F1B vs interleaved).
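The bubble ratio can be estimated with the usual idealized pipeline model. The sketch assumes equal time per stage and per micro-batch: 1F1B idles for (p-1) slots out of (m+p-1), and an interleaved schedule with v virtual stages per device shrinks the idle part by a factor of v. The function name is illustrative.

```python
def bubble_fraction(pp, microbatches, virtual_stages=1):
    """Fraction of step time a pipeline stage spends idle."""
    if pp <= 1:
        return 0.0  # no pipeline, no bubble
    # virtual_stages=1 is plain 1F1B; larger values model interleaving.
    return (pp - 1) / (virtual_stages * microbatches + pp - 1)


print(bubble_fraction(pp=2, microbatches=19))                    # 0.05
print(bubble_fraction(pp=4, microbatches=8))                     # 3/11
print(bubble_fraction(pp=4, microbatches=8, virtual_stages=2))   # 3/19
```

Under this model a 5% bubble at PP=2 corresponds to 19 micro-batches per step, and interleaving helps most when the micro-batch count is small relative to the pipeline depth.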
Activation checkpoint, FSDP level, offload recommendations.
Comparison table with memory, bubble, comm overhead, and MFU.
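One way such a comparison could be ranked is a weighted score over the four reported metrics. The weights, the normalization and the second candidate's numbers below are invented for illustration and are not the analyzer's actual scoring rule; the first candidate reuses the example figures from the recommendation summary.

```python
# Hypothetical weights; the analyzer's real values are not given here.
WEIGHTS = {"mfu": 0.4, "memory": 0.2, "comm": 0.2, "bubble": 0.2}


def score(candidate):
    """Higher is better; configurations that exceed device memory are rejected."""
    if candidate["mem_used_gb"] > candidate["mem_cap_gb"]:
        return float("-inf")
    headroom = 1 - candidate["mem_used_gb"] / candidate["mem_cap_gb"]
    return (WEIGHTS["mfu"] * candidate["mfu"]
            + WEIGHTS["memory"] * headroom
            + WEIGHTS["comm"] * (1 - candidate["comm_overhead"])
            + WEIGHTS["bubble"] * (1 - candidate["bubble"]))


def rank(candidates):
    """Sort candidates from best to worst score."""
    return sorted(candidates, key=score, reverse=True)


recommended = {"name": "DP=4,TP=8,PP=2", "mfu": 0.45, "mem_used_gb": 58,
               "mem_cap_gb": 80, "comm_overhead": 0.12, "bubble": 0.05}
# Made-up alternative that does not fit in memory.
too_big = {"name": "DP=8,TP=8,PP=1", "mfu": 0.48, "mem_used_gb": 95,
           "mem_cap_gb": 80, "comm_overhead": 0.10, "bubble": 0.0}

for c in rank([too_big, recommended]):
    print(c["name"], score(c))
```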
See workflow and reference files for formulas:
- references/known-models.md: model architecture tables (LLaMA, DeepSeek, Qwen3/3.5, etc.)
- references/known-hardware.md: hardware specs per die (Ascend A2/A3/950DT, A100, H100/H200)
- references/memory-estimation.md: per-tensor memory formulas, FSDP levels
- references/strategy-rules.md: decision tree, parallel dimension rules, config templates
- workflows/01-collect-model-info.md: Phase 1, hardware parameter tables
- workflows/02-global-baseline.md: Phase 2, global memory baseline + FLOPs
- workflows/03-strategy-search.md: Phase 3, strategy enumeration + pruning
- workflows/04-cost-analysis.md: Phase 4, TP/CP/EP/DP communication + PP bubble
- workflows/05-scoring-output.md: Phase 5-7, post-sharding memory + scoring + output

The parallel-strategy-analyzer agent has the complete formulas inline.
auto_parallel/fast-tuner with profiling data