This skill should be used when optimizing AMD GPU kernels on MI300 using the aiter project, including running op tests, benchmarking, iterating on kernel changes, and recording results in the kernel experiment database.

2026-03-23

gpu-architecture-fundamentals

software-developers

This skill should be used when reasoning about GPU architecture fundamentals to guide kernel optimization choices such as memory hierarchy usage, execution model mapping, block sizing, and latency-aware tuning across HIP, Triton, and PyTorch.

2026-03-23

hip-kernel-optimization

software-developers

This skill should be used when writing or tuning HIP kernels on AMD/NVIDIA GPUs, covering memory coalescing, shared-memory tiling, bank conflict avoidance, warp primitives, occupancy, vectorization, async ops, loop unrolling, and profiling.

2026-03-23

kernel-exp-history

software-developers

This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database.

2026-03-23

mi300-cdna3-architecture

software-developers

MI300/CDNA3 architecture guide for HIP/Triton optimization—MFMA variants, dual register files, data formats, sparsity, LDS/GWS, and best practices.

2026-03-23

mi300-hip-programming-insights

software-developers

CDNA3/MI300 HIP programming insights—chiplet/cache model, Infinity Cache, memory coherency, matrix cores, sparsity, and best practices.

2026-03-23

mi300-hip-vs-nvidia

software-developers

MI300 HIP programming differences vs NVIDIA—wavefront vs warp, memory hierarchy, MFMA usage, occupancy, and profiling pitfalls.

2026-03-23

pytorch-kernel-optimization

software-developers

This skill should be used when optimizing PyTorch models and kernels, including efficient tensor operations, torch.compile, custom autograd/CUDA/Triton extensions, mixed precision, memory and data pipeline tuning, model optimization techniques, CUDA graphs, and profiling.

2026-03-23

Showing top 8 of 12 collected skills in this repository.

#002

maxtext-slurm

11 skills · 283 · updated 2026-05-09

29% of creator

skill | occupation | description | updated

model-config-guide

computer-systems-analysts

Create GPU config files to support existing MaxText model definitions on AMD GPU clusters. Use when the user wants to add a model, create a config, support a new model, or asks about model configs, parallelism, batch size, OOM, quantization, or .gpu.yml files.

2026-05-09

pre-commit-audit

software-developers

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, an

2026-05-09

profile-drill

data-scientists-152051

Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.

2026-05-09

batch-sweep

computer-systems-analysts

Four sweep operations: (1) Model perf sweep — find optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config. (2) Node perf sweep — compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes. (3) Node network health sweep — detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem. (4) Model sweep — run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.

2026-05-09

xla-tuning

software-developers

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

2026-05-09

job-log-triage

network-and-computer-systems-administrators

Triage MaxText training jobs from log files — failed, hanging, running, or completed. Use when the user asks why a job failed, wants to diagnose an error, sees a crash, hang, timeout, OOM, NCCL error, heartbeat timeout, wants to understand a job's status, or asks about bad/low/dropping TGS or throughput.

2026-05-09

tsdb-diagnosis

network-and-computer-systems-administrators

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

2026-05-03

performance-analysis

computer-occupations-all-other

Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

2026-05-03

Showing top 8 of 11 collected skills in this repository.

#003

Primus-Turbo

7 skills · 6624 · updated 2026-06-26

18% of creator

skill | occupation | description | updated

optimize-handoff

software-developers

Primus-Turbo handoff to the autonomous kernel-optimize loop — collect the prerequisites (kernel path, focused test/bench commands, scoring metric, execution mode, quick-validation harness) a kernel campaign needs and pass them on. Use when pushing a Primus-Turbo kernel toward the hardware limit, not just spot-checking perf.

2026-06-26

primus-turbo-develop

software-developers

Develop, debug, and validate Primus-Turbo operators and modules on AMD GPUs. Covers the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to add or change a feature end-to-end, accuracy verification (SNR, tolerances, reference implementations), performance benchmarking, the backend dispatch system, and build/test/bench commands. Use for any Primus-Turbo development task (GEMM, Attention, GroupedGEMM, MoE, quantization, normalization, activation) and for accuracy or performance validation.

2026-06-26

kernel-optimize

software-developers

AI-driven operator performance optimization framework. Defines the optimization loop, execution environment selection, knowledge routing, and logging conventions to drive agent-autonomous iteration toward hardware limits.

2026-06-22

develop-feature

software-developers

Primus-Turbo feature development workflow — the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to wire a new operator end-to-end, which layer to touch, and which existing file to copy. Use when adding or changing a Primus-Turbo operator or module on AMD GPUs.

2026-06-22

verify-performance

software-developers

Primus-Turbo performance verification — run single-operator and suite benchmarks, read the latency/TFLOPS metrics, source real-model shapes, and derive a combined training-step metric. Use when measuring latency or throughput of a Primus-Turbo operator.

2026-06-22

verify-accuracy

software-quality-assurance-analysts-and-testers

Primus-Turbo accuracy verification — compare an operator against a higher-precision reference for forward and backward, with the right gate (allclose for bf16/fp16/fp32, SNR for fp8/fp4) and FP8 encoding awareness. Use when validating numerical correctness of a Primus-Turbo operator.

2026-06-08

tool-rocprof

network-and-computer-systems-administrators

ROCm profiling workflow for AMD GPU kernels using rocprofv3 and rocprof-compute. Use when profiling hot kernels, collecting counters, diagnosing memory-vs-compute-vs-stall bottlenecks, reading Perfetto traces, or validating low-precision AMD kernels.

2026-06-08

#004

Primus

5 skills · 10744 · updated 2026-07-08

13% of creator

skill | occupation | description | updated

backend-gap-report

software-developers

Compare a Primus backend against an upstream repository or reference, verify git state, dependencies, directory changes, and integration coupling, then generate comparison reports, dashboard metadata, and a deployable dashboard index. Also owns the shared Primus engineering dashboard under `tools/backend_gap_report/`, which surfaces both backend-gap reports and weekly engineering reports as first-class sections. Use when comparing TorchTitan, Megatron, or other Primus backends with upstream branches, tags, or releases, or when integrating weekly engineering reports into the shared dashboard.

2026-07-08

backend-patch-explorer

software-developers

Inventory and explain the patch (monkey-patch) optimizations Primus layers over upstream training backends such as Megatron-LM, TorchTitan, and MaxText, by reading the current repository code only. Use when the user asks which patches a backend has, wants a customer-facing patch table, asks how a specific patch works (for example deepep or DeepEP), or wants guidance to port a Primus patch into their own upstream framework. Read-only; no training or cluster commands.

2026-06-10

primus-projection

software-developers

Opinionated guide for using Primus Projection to choose parallelism (TP/PP/EP/CP/DP) and pipeline schedules, validate memory fit on target nodes, reason about communication collectives, and explore optimization trade-offs with minimal compute. Use when the user asks how to pick a parallelism strategy or pipeline schedule, whether a model fits in memory, which optimizations such as DeepEP, SyncFree, zero-bubble, FP8, recomputation, or FSDP2 matter most, or how to run primus projection commands. Read-only planning guidance; no large multi-node training runs required.

2026-06-10

slurm-idle-node-check

network-and-computer-systems-administrators

Check available idle nodes in a SLURM cluster. Use when the user wants to find usable idle nodes, verify node health, check docker status on SLURM nodes, check NIC QoS/DCQCN configuration, check RDMA link status, validate GID table, or troubleshoot cluster node availability.

2026-06-10

slurm-training-node-validation

network-and-computer-systems-administrators

Validate SLURM cluster nodes by running actual training jobs in groups. Use when the user wants to test which idle nodes can successfully run training, verify node health through real workloads, or identify broken nodes in a SLURM cluster.

2026-06-10

#005

Magpie

2 skills · 566 · updated 2026-07-08

5.3% of creator

skill | occupation | description | updated

magpie

software-developers

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

2026-07-08

amd-kernel-source-finder

software-developers

Find kernel source code, test files, and test commands for AMD GPU kernels identified in profiler traces. Use when the user wants to locate kernel implementations, find tests for specific kernels, or enrich gap_analysis results with source information. Supports Triton JIT (ROCm), Tensile GEMM (rocBLAS), CK Tile, hipBLASLt, and HIP kernels.

2026-04-30

#006

GEAK

1 skill · 11631 · updated 2026-06-17

2.6% of creator

skill | occupation | description | updated

add-expert-skill-to-geak

software-developers

Contribute a human-authored, e2e-validated optimization recipe (an "expert skill") to GEAK — scaffold, fill, validate by scope, and open a PR.

2026-06-17

Showing 6 of 6 repositories
