distributed-triage
// Sub-triages issues in the oncall:distributed queue by assigning distributed module labels, routing to sub-oncalls, and marking triaged. Use when an issue has been routed to oncall:distributed and needs second-level triage.
| name | distributed-triage |
| description | Sub-triages issues in the oncall:distributed queue by assigning distributed module labels, routing to sub-oncalls, and marking triaged. Use when an issue has been routed to oncall:distributed and needs second-level triage. |
This sub-skill picks up where the PT-level triage bot leaves off. It processes issues that already have the oncall: distributed label and performs second-level triage: routing to a distributed sub-oncall, classifying by module, and marking triaged.
Distributed labels reference: See distributed-labels.json for the labels this skill is allowed to apply. ONLY apply labels from this file.
Distributed triage rubric: See distributed-rubric.md for detailed routing guidance, module classification signals, and confidence calibration.
Response templates: See templates.json for distributed-specific comment templates.
Use these GitHub MCP tools for triage:
| Tool | Purpose |
|---|---|
| `mcp__github__issue_read` | Get issue details, comments, and existing labels |
| `mcp__github__issue_write` | Apply labels or close issues |
| `mcp__github__add_issue_comment` | Add comment (only for reproduction requests or mislabel flags) |
| `mcp__github__search_issues` | Find similar issues for context |
A human has fully classified the issue only when it has BOTH:
- A `module:` label listed in distributed-labels.json, AND
- A sub-oncall label: `oncall: distributed parallelisms`, `oncall: distributed infra`, or `oncall: distributed checkpointing`.

If both are present:

- Add the `ptd-bot-triaged` label

If only one is present (a module label without a sub-oncall, or a sub-oncall without a module label), triage is incomplete: proceed to Step 1. The PT-level triage bot can apply distributed module labels alongside `oncall: distributed`, but it does not pick the sub-oncall; that is your job.
This step alone should clear a large portion of the backlog.
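The "fully classified" check above can be sketched as a small predicate. This is a minimal illustration, not the skill's actual implementation: `is_fully_classified` and its `allowed_module_labels` parameter are hypothetical names, and the sketch assumes the module labels from distributed-labels.json have already been loaded into a set of label strings (the file's schema is not shown in this document).

```python
# The three sub-oncall labels named in this skill.
SUB_ONCALL_LABELS = frozenset({
    "oncall: distributed parallelisms",
    "oncall: distributed infra",
    "oncall: distributed checkpointing",
})


def is_fully_classified(issue_labels, allowed_module_labels):
    """Return True only when the issue has BOTH a known module label
    and one of the sub-oncall labels.

    `allowed_module_labels` is assumed to be a set of label names taken
    from distributed-labels.json (hypothetical loading step not shown).
    """
    labels = set(issue_labels)
    has_module = any(label in allowed_module_labels for label in labels)
    has_sub_oncall = bool(labels & SUB_ONCALL_LABELS)
    return has_module and has_sub_oncall
```

An issue with only a module label, or only a sub-oncall label, returns `False` and continues through the remaining steps.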
Read the issue title, description, and comments. Determine whether the issue is actually related to distributed training.
Signs it is NOT a distributed issue:
- Single-device problems (e.g., `torch.nn` on one GPU, CUDA OOM on one device)
- Build or import failures (e.g., `undefined symbol: ncclAlltoAll` at `import torch` with no distributed code)
- A `torch.compile` issue with no distributed component

If NOT a distributed issue:

- Add the `triage review` + `ptd-bot-triaged` labels
- Comment using the `not_distributed` template from templates.json
- Do NOT remove `oncall: distributed`; let the human oncall re-route

Each issue carries exactly ONE sub-oncall label. If the issue already has one of the three sub-oncall labels (`oncall: distributed parallelisms`, `oncall: distributed infra`, or `oncall: distributed checkpointing`), keep it as-is. Do NOT add a second sub-oncall, even if your own classification would have picked a different one. Use the existing sub-oncall to decide the next step: continue to Step 3 if it is `oncall: distributed parallelisms`; otherwise add `ptd-bot-triaged` and STOP per the rules below.
If no sub-oncall is present, apply exactly one based on the routing rules in distributed-rubric.md:
| Sub-Oncall Label | When to Apply |
|---|---|
| `oncall: distributed parallelisms` | FSDP, DDP, DTensor, tensor parallel, context parallel, pipeline parallel. This is the default when unsure. |
| `oncall: distributed infra` | c10d, process groups, collectives, NCCL/Gloo/MPI backends, elastic/torchrun, RPC, stores, distributed tools, DeviceMesh, symmetric memory |
| `oncall: distributed checkpointing` | Distributed checkpoint save/load, DCP, state_dict utilities, async checkpointing |
Use the routing decision tree and edge cases in distributed-rubric.md Section 1 to determine the correct sub-oncall.
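The "exactly one sub-oncall" rule above can be sketched as follows. This is an illustrative sketch under stated assumptions: `route_sub_oncall` and its `classified` parameter are hypothetical names, and the actual classification (which relies on distributed-rubric.md) is assumed to happen elsewhere and is passed in as a label string or `None`.

```python
SUB_ONCALL_LABELS = (
    "oncall: distributed parallelisms",
    "oncall: distributed infra",
    "oncall: distributed checkpointing",
)
# The routing table names parallelisms as the default when unsure.
DEFAULT_SUB_ONCALL = "oncall: distributed parallelisms"


def route_sub_oncall(issue_labels, classified=None):
    """Return (sub_oncall, labels_to_add, continue_to_module_step).

    `classified` is the sub-oncall the rubric-based classification would
    pick (hypothetical input); it is ignored if a sub-oncall already exists.
    """
    existing = [label for label in issue_labels if label in SUB_ONCALL_LABELS]
    if existing:
        # Keep the existing sub-oncall as-is; never add a second one.
        sub_oncall = existing[0]
        to_add = []
    else:
        sub_oncall = classified if classified in SUB_ONCALL_LABELS else DEFAULT_SUB_ONCALL
        to_add = [sub_oncall]

    if sub_oncall == "oncall: distributed parallelisms":
        # Parallelisms issues still need module classification.
        return sub_oncall, to_add, True
    # Infra and checkpointing issues are marked triaged and processing stops.
    return sub_oncall, to_add + ["ptd-bot-triaged"], False
```

Returning the labels to add rather than applying them keeps the decision separate from the GitHub mutation, which makes the rule easy to check in isolation.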
After routing to `oncall: distributed infra` or `oncall: distributed checkpointing`:

- Add `ptd-bot-triaged` and STOP

After routing to `oncall: distributed parallelisms`:

- Continue to Step 3
From the issue description, comments, code snippets, and stack traces, classify into one or more distributed modules. Consult the module classification signals in distributed-rubric.md.
Confidence-based actions:
| Confidence | Criteria | Action |
|---|---|---|
| HIGH or MEDIUM | Explicit module mention, obvious API usage, or probable module based on context | Add module: label(s) + ptd-bot-triaged |
| LOW | Cannot determine module — vague description, no code, no stack trace | Add triage review + ptd-bot-triaged |
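The confidence table above maps directly onto a small function. A minimal sketch, assuming the confidence level arrives as one of the strings `"HIGH"`, `"MEDIUM"`, or `"LOW"`; the function name `labels_for_module_step` is hypothetical, and treating an unrecognized confidence value or an empty module list as LOW is an assumption of this sketch, not something the table states.

```python
def labels_for_module_step(confidence, module_labels):
    """Return the labels to add after module classification.

    HIGH or MEDIUM with at least one module label: add the module
    label(s) plus ptd-bot-triaged.
    Anything else is handled like LOW (an assumption of this sketch):
    add triage review plus ptd-bot-triaged, and no guessed module label.
    """
    if confidence in ("HIGH", "MEDIUM") and module_labels:
        return list(module_labels) + ["ptd-bot-triaged"]
    return ["triage review", "ptd-bot-triaged"]
```

For example, `labels_for_module_step("HIGH", ["module: fsdp", "module: dtensor"])` yields both module labels followed by `ptd-bot-triaged`.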
Rules:
- Multiple module labels are allowed (e.g., `module: fsdp` + `module: dtensor` for FSDP2 issues that hit DTensor bugs).
- If the issue has `oncall: pt2` already applied, do NOT remove it. Add distributed module labels alongside it.
- When unsure, add `triage review` + `ptd-bot-triaged`; do NOT guess a module label.

If the issue is not a bug report, add the appropriate type label:

- `feature`: wholly new functionality that does not exist today in any form
- `enhancement`: improvement to something that already works (e.g., performance optimization, better error messages, adding a native backend for an op that already runs via fallback)

Most distributed issues are bug reports; do not add a type label for bugs. If the issue says the operation "currently works" or "falls back to" a slower path, that is `enhancement`, not `feature`. If the enhancement is about performance, also add `module: performance`.
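The type-label rule above can be written as a short helper. This is a sketch only: `type_labels` and its `kind` and `about_performance` parameters are hypothetical, and deciding whether an issue is a bug, a feature, or an enhancement is assumed to have been done already.

```python
def type_labels(kind, about_performance=False):
    """Return the type labels for an issue.

    `kind` is assumed to be "bug", "feature", or "enhancement"
    (a hypothetical encoding of the distinction described above).
    """
    if kind == "bug":
        # Bug reports get no type label.
        return []
    if kind == "feature":
        return ["feature"]
    if kind == "enhancement":
        labels = ["enhancement"]
        if about_performance:
            # Performance enhancements also carry the performance module label.
            labels.append("module: performance")
        return labels
    raise ValueError(f"unknown issue kind: {kind!r}")
```

An issue describing an op that "falls back to" a slower path would be passed as `kind="enhancement"` with `about_performance=True`.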
CRITICAL: If you believe an issue is high priority, you MUST:
- Add the `triage review` label and do NOT add `ptd-bot-triaged`

Do NOT directly add `high priority` without human confirmation.
High priority criteria for distributed issues:
If the issue lacks a minimal reproduction script:
- Add the `needs reproduction` + `ptd-bot-triaged` labels
- Comment using the `needs_distributed_reproduction` template from templates.json

Do NOT request reproduction when:
DO NOT:
- Remove `oncall: distributed`; it stays even if the issue is mislabeled
- Remove `oncall: pt2`; if already present, keep it
- Remove `bot-triaged`; it is applied by the parent skill and must stay
- Add `high priority` directly; use `triage review` and let humans decide

DO:

- Use `triage review` for human attention
- Add `ptd-bot-triaged` whenever the bot has processed the issue, regardless of confidence. Pair with `triage review` for LOW-confidence or uncertain cases so the cron sweep won't re-pick it. (Exception: the §5 high-priority flow intentionally omits `ptd-bot-triaged`.)

Note: `bot-triaged` is automatically applied by the parent skill's post-hook after any issue mutation. You do not need to add it manually.
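The label constraints in this skill (protected labels, no direct `high priority`, and only labels from distributed-labels.json) could be enforced with a pre-flight check before any issue mutation. The sketch below is one possible shape for such a check; `check_mutation` is a hypothetical helper, and `allowed_labels` is assumed to be the set of label names the caller has loaded from distributed-labels.json.

```python
# Labels this skill must never remove.
PROTECTED_LABELS = frozenset({"oncall: distributed", "oncall: pt2", "bot-triaged"})
# Labels only a human may add.
HUMAN_ONLY_LABELS = frozenset({"high priority"})


def check_mutation(add, remove, allowed_labels):
    """Return a list of rule violations for a proposed label change.

    An empty list means the change is consistent with the rules above.
    `allowed_labels` is assumed to come from distributed-labels.json.
    """
    errors = []
    for label in remove:
        if label in PROTECTED_LABELS:
            errors.append(f"must not remove {label!r}")
    for label in add:
        if label in HUMAN_ONLY_LABELS:
            errors.append(f"must not add {label!r} directly; use 'triage review'")
        elif label not in allowed_labels:
            errors.append(f"{label!r} is not in the allowed label set")
    return errors
```

Collecting every violation, rather than raising on the first, lets a caller report all problems with a proposed change at once.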