| name | distributed-triage |
| description | Sub-triages issues in the oncall:distributed queue by assigning distributed module labels, routing to sub-oncalls, and marking triaged. Use when an issue has been routed to oncall:distributed and needs second-level triage. |
Distributed Issue Triage Sub-Skill
This sub-skill picks up where the PT-level triage bot leaves off. It processes issues that already have the oncall: distributed label and performs second-level triage: routing to a distributed sub-oncall, classifying by module, and marking triaged.
Contents
Distributed labels reference: See distributed-labels.json for the labels this skill is allowed to apply. ONLY apply labels from this file.
Distributed triage rubric: See distributed-rubric.md for detailed routing guidance, module classification signals, and confidence calibration.
Response templates: See templates.json for distributed-specific comment templates.
MCP Tools Available
Use these GitHub MCP tools for triage:
| Tool | Purpose |
|---|---|
| mcp__github__issue_read | Get issue details, comments, and existing labels |
| mcp__github__issue_write | Apply labels or close issues |
| mcp__github__add_issue_comment | Add comment (only for reproduction requests or mislabel flags) |
| mcp__github__search_issues | Find similar issues for context |
Distributed Triage Steps
0) Already Triaged by Human?
A human has fully classified the issue only when it has BOTH:
- Any module: label listed in distributed-labels.json, AND
- One of the sub-oncall labels: oncall: distributed parallelisms, oncall: distributed infra, or oncall: distributed checkpointing.
If both are present:
- Add the ptd-bot-triaged label
- STOP: a human already classified this issue.
If only one is present (a module label without a sub-oncall, or a sub-oncall without a module label), triage is incomplete; proceed to Step 1. The PT-level triage bot can apply distributed module labels alongside oncall: distributed, but it does not pick the sub-oncall; that is your job.
This step alone should clear a large portion of the backlog.
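The Step 0 check can be sketched as a small predicate. This is an illustrative sketch only: the function name and the `allowed_module_labels` parameter are hypothetical, and the set of module labels is assumed to have been loaded from distributed-labels.json, whose schema is not shown here.

```python
# Sub-oncall labels named in Step 0.
SUB_ONCALL_LABELS = frozenset({
    "oncall: distributed parallelisms",
    "oncall: distributed infra",
    "oncall: distributed checkpointing",
})


def step0_labels_to_add(issue_labels, allowed_module_labels):
    """Return the labels to add when a human already fully classified the issue.

    Returns None when triage is incomplete and Step 1 should run.
    `allowed_module_labels` is a hypothetical input: the module labels
    taken from distributed-labels.json.
    """
    labels = set(issue_labels)
    has_module = bool(labels & set(allowed_module_labels))
    has_sub_oncall = bool(labels & SUB_ONCALL_LABELS)
    if has_module and has_sub_oncall:
        return ["ptd-bot-triaged"]  # then STOP
    return None  # only one (or neither) present: proceed to Step 1
```

An issue carrying only a module label, or only a sub-oncall label, yields `None` and continues to Step 1.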
1) Is This Actually a Distributed Issue?
Read the issue title, description, and comments. Determine whether the issue is actually related to distributed training.
Signs it is NOT a distributed issue:
- Single-GPU issue with no distributed code (e.g., torch.nn on one GPU, CUDA OOM on one device)
- Build/packaging issue (e.g., undefined symbol: ncclAlltoAll at import torch with no distributed code)
- Pure torch.compile issue with no distributed component
- Issue about a domain library (vision, text, audio) that happens to mention "distributed"
If NOT a distributed issue:
- Add triage review + ptd-bot-triaged labels
- Post a comment using the not_distributed template from templates.json
- Do NOT remove oncall: distributed; let the human oncall re-route
- STOP
2) Route to Distributed Sub-Oncall
Each issue carries exactly ONE sub-oncall label. If the issue already has one of the three sub-oncall labels (oncall: distributed parallelisms, oncall: distributed infra, or oncall: distributed checkpointing), keep it as-is. Do NOT add a second sub-oncall, even if your own classification would have picked a different one. Use the existing sub-oncall to decide the next step (continue to Step 3 if it's oncall: distributed parallelisms; otherwise add ptd-bot-triaged and STOP per the rules below).
If no sub-oncall is present, apply exactly one based on the routing rules in distributed-rubric.md:
| Sub-Oncall Label | When to Apply |
|---|---|
| oncall: distributed parallelisms | FSDP, DDP, DTensor, tensor parallel, context parallel, pipeline parallel. This is the default when unsure. |
| oncall: distributed infra | c10d, process groups, collectives, NCCL/Gloo/MPI backends, elastic/torchrun, RPC, stores, distributed tools, DeviceMesh, symmetric memory |
| oncall: distributed checkpointing | Distributed checkpoint save/load, DCP, state_dict utilities, async checkpointing |
Use the routing decision tree and edge cases in distributed-rubric.md Section 1 to determine the correct sub-oncall.
After routing to oncall: distributed infra or oncall: distributed checkpointing:
- Add ptd-bot-triaged
- STOP: the sub-oncall team owns further triage
After routing to oncall: distributed parallelisms:
- Continue to Step 3 for module classification
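The Step 2 control flow (keep an existing sub-oncall, otherwise apply exactly one, then either stop or continue) can be sketched as follows. The function name and the `classified` parameter are hypothetical; the actual classification comes from the routing rules in distributed-rubric.md and is not modeled here.

```python
PARALLELISMS = "oncall: distributed parallelisms"
INFRA = "oncall: distributed infra"
CHECKPOINTING = "oncall: distributed checkpointing"
SUB_ONCALLS = (PARALLELISMS, INFRA, CHECKPOINTING)


def route_sub_oncall(issue_labels, classified=None):
    """Return (labels_to_add, continue_to_step3).

    `classified` is the sub-oncall chosen from the rubric, or None when
    unsure (hypothetical parameter). An existing sub-oncall label always
    wins over `classified`.
    """
    existing = [label for label in SUB_ONCALLS if label in issue_labels]
    to_add = []
    if existing:
        # Keep the existing label as-is; never add a second sub-oncall.
        sub_oncall = existing[0]
    else:
        # Parallelisms is the default when unsure.
        sub_oncall = classified or PARALLELISMS
        if sub_oncall not in SUB_ONCALLS:
            raise ValueError(f"not a sub-oncall label: {sub_oncall!r}")
        to_add.append(sub_oncall)

    continue_to_step3 = sub_oncall == PARALLELISMS
    if not continue_to_step3:
        # Infra and checkpointing teams own further triage.
        to_add.append("ptd-bot-triaged")
    return to_add, continue_to_step3
```

If an issue somehow already carries more than one sub-oncall label, this sketch simply uses the first match in `SUB_ONCALLS` order; the text above does not define that case.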
3) Classify Module
From the issue description, comments, code snippets, and stack traces, classify into one or more distributed modules. Consult the module classification signals in distributed-rubric.md.
Confidence-based actions:
| Confidence | Criteria | Action |
|---|---|---|
| HIGH or MEDIUM | Explicit module mention, obvious API usage, or probable module based on context | Add module: label(s) + ptd-bot-triaged |
| LOW | Cannot determine module: vague description, no code, no stack trace | Add triage review + ptd-bot-triaged |
Rules:
- You can apply multiple module labels when the issue spans modules (e.g., module: fsdp + module: dtensor for FSDP2 issues that hit DTensor bugs).
- When an issue has oncall: pt2 already applied, do NOT remove it. Add distributed module labels alongside it.
- When the module is unclear, add triage review + ptd-bot-triaged; do NOT guess a module label.
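The confidence table above maps to a short function. This is a sketch under assumptions: the function name is hypothetical, and the confidence level and candidate module labels are taken as inputs that were already derived from the rubric.

```python
def step3_labels(confidence, module_labels):
    """Map a confidence level to the labels Step 3 should add.

    `confidence` is "HIGH", "MEDIUM", or "LOW"; `module_labels` is the
    list of module labels identified for the issue (possibly empty).
    """
    if confidence in ("HIGH", "MEDIUM") and module_labels:
        # Multiple module labels are allowed when the issue spans modules.
        return list(module_labels) + ["ptd-bot-triaged"]
    # LOW confidence, or nothing identified: do not guess a module label.
    return ["triage review", "ptd-bot-triaged"]
```

Note that the function only ever returns labels to add, so an existing oncall: pt2 label is left untouched.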
4) Type Labels
If the issue is not a bug report, add the appropriate type label:
- feature: wholly new functionality that does not exist today in any form
- enhancement: improvement to something that already works (e.g., performance optimization, better error messages, adding a native backend for an op that already runs via fallback)
Most distributed issues are bug reports; do not add a type label for bugs. If the issue says the operation "currently works" or "falls back to" a slower path, that is enhancement, not feature. If the enhancement is about performance, also add module: performance.
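The type-label rule can be sketched as below. The function name, the `kind` values, and the `about_performance` flag are hypothetical; deciding which kind an issue is remains a judgment call made from the issue text.

```python
def step4_type_labels(kind, about_performance=False):
    """Return the type labels to add for an issue.

    `kind` is one of "bug", "feature", or "enhancement" (hypothetical
    values for this sketch). Bugs get no type label.
    """
    if kind == "bug":
        return []
    if kind == "feature":
        return ["feature"]
    if kind == "enhancement":
        labels = ["enhancement"]
        if about_performance:
            labels.append("module: performance")
        return labels
    raise ValueError(f"unknown issue kind: {kind!r}")
```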
5) High Priority: REQUIRES HUMAN REVIEW
CRITICAL: If you believe an issue is high priority, you MUST:
- Add triage review label and do NOT add ptd-bot-triaged
Do NOT directly add high priority without human confirmation.
High priority criteria for distributed issues:
- Crash / segfault / illegal memory access in distributed code
- Silent correctness issue (wrong results from collectives, incorrect gradient sync)
- Regression from a prior version (e.g., FSDP worked in 2.x, broken in 2.y)
- Hang affecting multi-node training (NCCL timeout, deadlock in collectives)
- Data corruption during distributed checkpointing
- Internal assert failure in c10d or process group code
- Many users affected or core distributed component impacted
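One way to enforce the high-priority rule is as a final pass over the labels the earlier steps planned to add. This is a sketch, not the skill's actual mechanism: the function name is hypothetical, and keeping the other planned labels (sub-oncall, module) in the high-priority flow is an assumption that the text above does not state either way.

```python
def apply_high_priority_rule(planned_labels, suspected_high_priority):
    """Adjust planned label additions for the Step 5 flow.

    When the issue looks high priority: ensure "triage review" is added,
    drop "ptd-bot-triaged", and never add "high priority" directly.
    Other planned labels are kept (an assumption of this sketch).
    """
    if not suspected_high_priority:
        return list(planned_labels)
    blocked = {"ptd-bot-triaged", "high priority"}
    result = [label for label in planned_labels if label not in blocked]
    if "triage review" not in result:
        result.append("triage review")
    return result
```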
6) Missing Reproduction
If the issue lacks a minimal reproduction script:
- Add needs reproduction + ptd-bot-triaged labels
- Post a comment using the needs_distributed_reproduction template from templates.json
Do NOT request reproduction when:
- The issue already has a code snippet, script, or steps that someone could follow to reproduce
- The issue is a feature request (no repro needed)
- A multi-node script is provided (that counts as reproduction even if you can't run it locally)
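The reproduction check reduces to two conditions. In this sketch the function name, the two boolean inputs, and the shape of the returned dict are hypothetical; `has_repro_material` is meant to be true for any code snippet, script, or followable steps, including a multi-node script that cannot be run locally.

```python
def step6_action(has_repro_material, is_feature_request):
    """Return the Step 6 action, or None when no repro request is needed."""
    if has_repro_material or is_feature_request:
        return None
    return {
        "labels": ["needs reproduction", "ptd-bot-triaged"],
        # Template name as referenced in templates.json.
        "comment_template": "needs_distributed_reproduction",
    }
```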
Constraints
DO NOT:
- Close issues (only the PT-level bot or humans close issues)
- Remove existing labels; only add labels
- Remove oncall: distributed; it stays even if the issue is mislabeled
- Remove oncall: pt2; if already present, keep it
- Remove bot-triaged; it is applied by the parent skill and must stay
- Add labels not in distributed-labels.json
- Add comments to issues except when using the templates in Step 1 (mislabel) or Step 6 (reproduction)
- Assign issues to users
- Add high priority directly; use triage review and let humans decide
DO:
- Be conservative: when in doubt, add triage review for human attention
- Add ptd-bot-triaged whenever the bot has processed the issue, regardless of confidence. Pair with triage review for LOW-confidence or uncertain cases so the cron sweep won't re-pick it. (Exception: the §5 high-priority flow intentionally omits ptd-bot-triaged.)
- Always add a sub-oncall label (Step 2) before module labels (Step 3)
- Read the full issue including comments before classifying
- Check the rubric's "Common Mislabel Traps" section before finalizing
Note: bot-triaged is automatically applied by the parent skill's post-hook after any issue mutation. You do not need to add it manually.
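Several of the constraints above are mechanical and could be checked before any label mutation is sent. The following is a minimal sketch of such a guard; the function name and parameters are hypothetical, and `allowed_labels` is assumed to be the label set read from distributed-labels.json.

```python
def find_label_violations(planned_additions, planned_removals, allowed_labels):
    """Return a list of human-readable constraint violations (empty if none).

    Checks three rules from the Constraints section: never remove labels,
    never add "high priority" directly, and only add allowlisted labels.
    """
    violations = []
    for label in planned_removals:
        violations.append(f"removal not permitted: {label}")
    for label in planned_additions:
        if label == "high priority":
            violations.append("high priority must not be added directly; use triage review")
        elif label not in allowed_labels:
            violations.append(f"label not in distributed-labels.json: {label}")
    return violations
```

A caller would skip the mutation and fall back to triage review whenever the returned list is non-empty.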