pytorch-fsdp2

Stars: 9,996

Forks: 745

Updated: January 30, 2026 at 04:09

Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.

Source: Orchestra-Research/AI-Research-SKILLs

name	pytorch-fsdp2
description	Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["PyTorch","FSDP2","Fully Sharded Data Parallel","Distributed Training","DTensor","Device Mesh","Sharded Checkpointing","Mixed Precision","Offload","Torch Distributed"]
dependencies	["torch"]

Skill: Use PyTorch FSDP2 (`fully_shard`) correctly in a training script

This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.

FSDP2 in PyTorch is exposed primarily via torch.distributed.fsdp.fully_shard and the FSDPModule methods it adds in-place to modules. See: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.

When to use this skill

Use FSDP2 when:

Your model doesn’t fit on one GPU (parameters + gradients + optimizer state).
You want an eager-mode approach built on DTensor-based per-parameter sharding, which is more inspectable and yields simpler sharded state dicts than FSDP1.
You may later compose DP with Tensor Parallel using DeviceMesh.

Avoid (or be careful) if:

You need strictly backwards-compatible checkpoints across PyTorch versions (the DCP docs warn that this is not guaranteed).
You’re forced onto older PyTorch versions without the FSDP2 stack.

Alternatives (when FSDP2 is not the best fit)

DistributedDataParallel (DDP): Use the standard data-parallel wrapper when a full replica of the model, its gradients, and its optimizer state fits on each GPU.
FullyShardedDataParallel (FSDP1): Use the original FSDP wrapper for parameter sharding across data-parallel workers.

Reference: references/pytorch_ddp_notes.md, references/pytorch_fsdp1_api.md.

Contract the agent must follow

Launch with torchrun and set the CUDA device per process (usually via LOCAL_RANK).
Apply fully_shard() bottom-up, i.e., shard submodules (e.g., Transformer blocks) before the root module.
Call model(input), not model.forward(input), so the FSDP2 hooks run (unless you explicitly call unshard() first or register the method with register_fsdp_forward_method).
Create the optimizer after sharding and make sure it is built on the DTensor parameters (post-fully_shard).
Checkpoint using Distributed Checkpoint (DCP) or the distributed-state-dict helpers, not naïve torch.save(model.state_dict()) unless you deliberately gather to full tensors.

(Each of these rules is directly described in the official API docs/tutorial; see references.)

Step-by-step procedure

0) Version & environment sanity

Prefer a recent stable PyTorch release whose documentation covers both FSDP2 (fully_shard) and DCP.
Use torchrun --nproc_per_node <gpus_per_node> ... and ensure RANK, WORLD_SIZE, LOCAL_RANK are visible.

Reference: references/pytorch_fsdp2_tutorial.md (launch commands and setup), references/pytorch_fully_shard_api.md (user contract).
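The environment check above can be sketched as a small helper that fails early when the script was not started by torchrun. The names TorchrunEnv and read_torchrun_env are hypothetical, introduced only for illustration; the three variable names come from the step above.

```python
import os
from typing import Mapping, NamedTuple


class TorchrunEnv(NamedTuple):
    rank: int
    world_size: int
    local_rank: int


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> TorchrunEnv:
    """Parse the variables torchrun exports and fail early with a clear message."""
    missing = [key for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if key not in env]
    if missing:
        raise RuntimeError(
            f"missing {missing}; launch with torchrun --nproc_per_node <gpus_per_node> ..."
        )
    parsed = TorchrunEnv(
        rank=int(env["RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        local_rank=int(env["LOCAL_RANK"]),
    )
    # A rank outside [0, world_size) usually means a hand-rolled launcher got it wrong.
    if not 0 <= parsed.rank < parsed.world_size:
        raise RuntimeError(f"RANK={parsed.rank} is outside [0, {parsed.world_size})")
    return parsed
```

Calling read_torchrun_env() as the first statement of the training script turns a confusing hang or KeyError into an actionable error.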

1) Initialize distributed and set device

Minimal, correct pattern:

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
Optionally create a DeviceMesh to describe the data-parallel group(s)

Reference: references/pytorch_device_mesh_tutorial.md (why DeviceMesh exists & how it manages process groups).

2) Build model on meta device (recommended for very large models)

For big models, initialize on meta, apply sharding, then materialize weights on GPU:

with torch.device("meta"): model = ...
apply fully_shard(...) on submodules, then fully_shard(model)
model.to_empty(device="cuda")
call reset_parameters() on each submodule that defines it (or run your own init routine)

Reference: references/pytorch_fsdp2_tutorial.md (migration guide shows this flow explicitly).
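The meta-device flow can be sketched in two helpers. TinyMLP is a hypothetical stand-in for a large model, and build_on_meta / materialize are illustrative names. In a real run the fully_shard(...) calls go between the two helpers, so that to_empty only allocates each rank's local shards.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):  # hypothetical stand-in for a large model
    def __init__(self, dim: int = 16):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))


def build_on_meta(dim: int = 16) -> nn.Module:
    """Construct the module tree without allocating real storage."""
    with torch.device("meta"):
        return TinyMLP(dim)


def materialize(model: nn.Module, device: str) -> nn.Module:
    """Allocate storage on `device`, then initialize the weights.

    Under FSDP2, call this after the fully_shard(...) calls.
    """
    model.to_empty(device=device)
    for module in model.modules():
        # nn.Module has no generic reset_parameters(); call it where it is defined.
        reset = getattr(module, "reset_parameters", None)
        if callable(reset):
            reset()
    return model
```

Weights created by to_empty are uninitialized memory, so skipping the init pass leaves garbage values rather than zeros.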

3) Apply `fully_shard()` bottom-up (wrapping policy = “apply where needed”)

Do not only call fully_shard on the topmost module.

Recommended sharding pattern for transformer-like models:

iterate modules, if isinstance(m, TransformerBlock): fully_shard(m, ...)
then fully_shard(model, ...)

Why:

fully_shard forms “parameter groups” for collective efficiency and excludes params already grouped by earlier calls. Bottom-up gives better overlap and lower peak memory.

Reference: references/pytorch_fully_shard_api.md (bottom-up requirement and why).

4) Configure `reshard_after_forward` for memory/perf trade-offs

Default behavior:

None means True for non-root modules and False for root modules (good default).

Heuristics:

If you’re memory-bound: keep defaults or force True on many blocks.
If you’re throughput-bound and can afford memory: consider keeping unsharded params longer (root often False).
Advanced: pass an int to reshard to a smaller world size after forward (e.g., intra-node); it must be a non-trivial divisor of the shard mesh dimension size (neither 1 nor the size itself).

Reference: references/pytorch_fully_shard_api.md (full semantics).
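The heuristics above can be written down as a small chooser for the reshard_after_forward argument. The function name and the goal labels ("default", "memory", "throughput", "intra_node") are hypothetical; only the returned values (None, True, False, or an int) are what fully_shard accepts.

```python
from typing import Optional, Union


def choose_reshard_after_forward(
    goal: str,
    shard_world_size: int,
    intra_node_size: Optional[int] = None,
) -> Union[bool, int, None]:
    """Return a value for fully_shard(..., reshard_after_forward=...)."""
    if goal == "default":
        return None  # True for non-root modules, False for the root
    if goal == "memory":
        return True  # free unsharded params after forward, re-gather in backward
    if goal == "throughput":
        return False  # keep unsharded params, skip the backward all-gather
    if goal == "intra_node":
        size = intra_node_size
        # An int must divide the shard dimension and be neither 1 nor the full size.
        if size is None or not 1 < size < shard_world_size or shard_world_size % size:
            raise ValueError(
                f"{size} is not a non-trivial divisor of {shard_world_size}"
            )
        return size
    raise ValueError(f"unknown goal: {goal!r}")
```

For example, with 16 ranks across two 8-GPU nodes, choose_reshard_after_forward("intra_node", 16, 8) returns 8, so the backward all-gather stays within a node.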

5) Mixed precision & offload (optional but common)

FSDP2 uses:

mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...)
offload_policy=CPUOffloadPolicy() if you want CPU offload

Rules of thumb:

Start with BF16 parameters/reductions on H100/A100-class GPUs (if numerically stable for your model).
Keep reduce_dtype aligned with your gradient reduction expectations.
If you use CPU offload, budget for PCIe/NVLink traffic and runtime overhead.

Reference: references/pytorch_fully_shard_api.md (MixedPrecisionPolicy / OffloadPolicy classes).

6) Optimizer, gradient clipping, accumulation

Create the optimizer after sharding so it holds DTensor params.
If you need gradient accumulation (the pattern FSDP1 handled with no_sync()), use the FSDP2 mechanism set_requires_gradient_sync instead.

Gradient clipping:

Use the approach shown in the FSDP2 tutorial (“Gradient Clipping and Optimizer with DTensor”), because parameters/gradients are DTensors.

Reference: references/pytorch_fsdp2_tutorial.md.
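The accumulation and clipping points above might look like the following step function. train_with_accumulation is a hypothetical name, the MSE loss is a placeholder, and the hasattr guard is an addition of this sketch so it also runs on an unsharded module.

```python
import torch
import torch.nn as nn


def train_with_accumulation(model, optimizer, micro_batches, max_norm: float = 1.0) -> float:
    """Run one optimizer step accumulated over `micro_batches`; return the mean loss."""
    total = 0.0
    last = len(micro_batches) - 1
    for i, (inputs, targets) in enumerate(micro_batches):
        # FSDP2 modules expose set_requires_gradient_sync; only the last
        # micro-batch needs to reduce gradients across ranks.
        if hasattr(model, "set_requires_gradient_sync"):
            model.set_requires_gradient_sync(i == last)
        # Call model(...), not model.forward(...), so the FSDP2 hooks run.
        loss = nn.functional.mse_loss(model(inputs), targets) / len(micro_batches)
        loss.backward()
        total += loss.item()
    # Under FSDP2 the parameters and gradients are DTensors; clip_grad_norm_ accepts them.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    optimizer.zero_grad()
    return total
```

Skipping gradient sync on the intermediate micro-batches saves communication but keeps unsharded gradients in memory until the final one, so it costs memory.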

7) Checkpointing: prefer DCP or distributed state dict helpers

Two recommended approaches:

A) Distributed Checkpoint (DCP) — best default

DCP saves/loads from multiple ranks in parallel and supports load-time resharding.
DCP produces multiple files (often at least one per rank) and operates “in place”: the model allocates its storage first, and DCP loads into that storage.

B) Distributed state dict helpers

get_model_state_dict / set_model_state_dict with StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...)
For optimizer: get_optimizer_state_dict / set_optimizer_state_dict

Avoid:

Saving DTensor state dicts with plain torch.save unless you intentionally convert with DTensor.full_tensor() and manage memory carefully.

References:

references/pytorch_dcp_overview.md (DCP behavior and caveats)
references/pytorch_dcp_recipe.md and references/pytorch_dcp_async_recipe.md (end-to-end usage)
references/pytorch_fsdp2_tutorial.md (DTensor vs DCP state-dict flows)
references/pytorch_examples_fsdp2.md (working checkpoint scripts)

Workflow checklists (copy-paste friendly)

Workflow A: Retrofit FSDP2 into an existing training script

Launch with torchrun and initialize the process group.
Set the CUDA device from LOCAL_RANK; create a DeviceMesh if you need multi-dim parallelism.
Build the model (use meta if needed), apply fully_shard bottom-up, then fully_shard(model).
Create the optimizer after sharding so it captures DTensor parameters.
Use model(inputs) so hooks run; use set_requires_gradient_sync for accumulation.
Add DCP save/load via torch.distributed.checkpoint helpers.

Reference: references/pytorch_fsdp2_tutorial.md, references/pytorch_fully_shard_api.md, references/pytorch_device_mesh_tutorial.md, references/pytorch_dcp_recipe.md.

Workflow B: Add DCP save/load (minimal pattern)

Wrap state in Stateful or assemble state via get_state_dict.
Call dcp.save(...) from all ranks to a shared path.
Call dcp.load(...) and restore with set_state_dict.
Validate any resharding assumptions when loading into a different mesh.

Reference: references/pytorch_dcp_recipe.md.

Debug checklist (what the agent should check first)

All ranks on distinct GPUs?
If not, verify torch.cuda.set_device(LOCAL_RANK) and your torchrun flags.
Did you accidentally call forward() directly?
Use model(input) or explicitly unshard() / register forward.
Is fully_shard() applied bottom-up?
If only root is sharded, expect worse memory/perf and possible confusion.
Optimizer created at the right time?
Must be built on DTensor parameters after sharding.
Checkpointing path consistent?
If using DCP, don’t mix with ad-hoc torch.save unless you understand the conversions.
Be mindful of PyTorch-version compatibility warnings for DCP.

Common issues and fixes

Forward hooks not running → Call model(inputs) (or unshard() explicitly) instead of model.forward(...).
Optimizer sees non-DTensor params → Create optimizer after all fully_shard calls.
Only root module sharded → Apply fully_shard bottom-up on submodules before the root.
Memory spikes after forward → Set reshard_after_forward=True for more modules.
Gradient accumulation desync → Use set_requires_gradient_sync instead of FSDP1’s no_sync().

Reference: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.

Minimal reference implementation outline (agent-friendly)

The coding agent should implement a script with these labeled blocks:

init_distributed(): init process group, set device
build_model_meta(): model on meta, apply fully_shard, materialize weights
build_optimizer(): optimizer created after sharding
train_step(): forward/backward/step with model(inputs) and DTensor-aware patterns
checkpoint_save/load(): DCP or distributed state dict helpers

Concrete examples live in references/pytorch_examples_fsdp2.md and the official tutorial reference.

References

references/pytorch_fsdp2_tutorial.md
references/pytorch_fully_shard_api.md
references/pytorch_ddp_notes.md
references/pytorch_fsdp1_api.md
references/pytorch_device_mesh_tutorial.md
references/pytorch_tp_tutorial.md
references/pytorch_dcp_overview.md
references/pytorch_dcp_recipe.md
references/pytorch_dcp_async_recipe.md
references/pytorch_examples_fsdp2.md
references/torchtitan_fsdp_notes.md (optional, production notes)
references/ray_train_fsdp2_example.md (optional, integration example)
