This skill should be used when optimizing AMD GPU kernels on MI300 using the aiter project, including running op tests, benchmarking, iterating on kernel changes, and recording results in the kernel experiment database.

2026-03-23

gpu-architecture-fundamentals

Software Developer

This skill should be used when reasoning about GPU architecture fundamentals to guide kernel optimization choices such as memory hierarchy usage, execution model mapping, block sizing, and latency-aware tuning across HIP, Triton, and PyTorch.

2026-03-23

hip-kernel-optimization

Software Developer

This skill should be used when writing or tuning HIP kernels on AMD/NVIDIA GPUs, covering memory coalescing, shared-memory tiling, bank conflict avoidance, warp primitives, occupancy, vectorization, async ops, loop unrolling, and profiling.

2026-03-23

kernel-exp-history

Software Developer

This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database.

2026-03-23

mi300-cdna3-architecture

Software Developer

MI300/CDNA3 architecture guide for HIP/Triton optimization—MFMA variants, dual register files, data formats, sparsity, LDS/GWS, and best practices.

2026-03-23

mi300-hip-programming-insights

Software Developer

CDNA3/MI300 HIP programming insights—chiplet/cache model, Infinity Cache, memory coherency, matrix cores, sparsity, and best practices.

2026-03-23

mi300-hip-vs-nvidia

Software Developer

MI300 HIP programming differences vs NVIDIA—wavefront vs warp, memory hierarchy, MFMA usage, occupancy, and profiling pitfalls.

2026-03-23

pytorch-kernel-optimization

Software Developer

This skill should be used when optimizing PyTorch models and kernels, including efficient tensor operations, torch.compile, custom autograd/CUDA/Triton extensions, mixed precision, memory and data pipeline tuning, model optimization techniques, CUDA graphs, and profiling.

2026-03-23

Currently showing the top 8 of 12 collected skills for this repository.

#002

maxtext-slurm

11 skills · 283 · Updated 2026-05-09

29% of this creator's skills

skill

Occupation category

Description

Updated

model-config-guide

Computer Systems Analyst

Create GPU config files to support existing MaxText model definitions on AMD GPU clusters. Use when the user wants to add a model, create a config, support a new model, or asks about model configs, parallelism, batch size, OOM, quantization, or .gpu.yml files.

2026-05-09

pre-commit-audit

Software Developer

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, an

2026-05-09

profile-drill

Data Scientist

Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.

2026-05-09

batch-sweep

Computer Systems Analyst

Four sweep operations: (1) Model perf sweep — find optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config. (2) Node perf sweep — compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes. (3) Node network health sweep — detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem. (4) Model sweep — run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.

2026-05-09

xla-tuning

Software Developer

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

2026-05-09

job-log-triage

Network and Computer Systems Administrator

Triage MaxText training jobs from log files — failed, hanging, running, or completed. Use when the user asks why a job failed, wants to diagnose an error, sees a crash, hang, timeout, OOM, NCCL error, heartbeat timeout, wants to understand a job's status, or asks about bad/low/dropping TGS or throughput.

2026-05-09

tsdb-diagnosis

Network and Computer Systems Administrator

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

2026-05-03

performance-analysis

Other Computer Occupations

Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

2026-05-03

Currently showing the top 8 of 11 collected skills for this repository.

#003

Primus-Turbo

7 skills · 6624 · Updated 2026-06-26

18% of this creator's skills

skill

Occupation category

Description

Updated

optimize-handoff

Software Developer

Primus-Turbo handoff to the autonomous kernel-optimize loop — collect the prerequisites (kernel path, focused test/bench commands, scoring metric, execution mode, quick-validation harness) a kernel campaign needs and pass them on. Use when pushing a Primus-Turbo kernel toward the hardware limit, not just spot-checking perf.

2026-06-26

primus-turbo-develop

Software Developer

Develop, debug, and validate Primus-Turbo operators and modules on AMD GPUs. Covers the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to add or change a feature end-to-end, accuracy verification (SNR, tolerances, reference implementations), performance benchmarking, the backend dispatch system, and build/test/bench commands. Use for any Primus-Turbo development task (GEMM, Attention, GroupedGEMM, MoE, quantization, normalization, activation) and for accuracy or performance validation.

2026-06-26

kernel-optimize

Software Developer

AI-driven operator performance optimization framework. Defines the optimization loop, execution environment selection, knowledge routing, and logging conventions to drive agent-autonomous iteration toward hardware limits.

2026-06-22

develop-feature

Software Developer

Primus-Turbo feature development workflow — the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to wire a new operator end-to-end, which layer to touch, and which existing file to copy. Use when adding or changing a Primus-Turbo operator or module on AMD GPUs.

2026-06-22

verify-performance

Software Developer

Primus-Turbo performance verification — run single-operator and suite benchmarks, read the latency/TFLOPS metrics, source real-model shapes, and derive a combined training-step metric. Use when measuring latency or throughput of a Primus-Turbo operator.

2026-06-22

verify-accuracy

Software Quality Assurance Analyst and Tester

Primus-Turbo accuracy verification — compare an operator against a higher-precision reference for forward and backward, with the right gate (allclose for bf16/fp16/fp32, SNR for fp8/fp4) and FP8 encoding awareness. Use when validating numerical correctness of a Primus-Turbo operator.

2026-06-08

tool-rocprof

Network and Computer Systems Administrator

ROCm profiling workflow for AMD GPU kernels using rocprofv3 and rocprof-compute. Use when profiling hot kernels, collecting counters, diagnosing memory-vs-compute-vs-stall bottlenecks, reading Perfetto traces, or validating low-precision AMD kernels.

2026-06-08

#004

Primus

5 skills · 10744 · Updated 2026-07-08

13% of this creator's skills

skill

Occupation category

Description

Updated

backend-gap-report

Software Developer

Compare a Primus backend against an upstream repository or reference, verify git state, dependencies, directory changes, and integration coupling, then generate comparison reports, dashboard metadata, and a deployable dashboard index. Also owns the shared Primus engineering dashboard under `tools/backend_gap_report/`, which surfaces both backend-gap reports and weekly engineering reports as first-class sections. Use when comparing TorchTitan, Megatron, or other Primus backends with upstream branches, tags, or releases, or when integrating weekly engineering reports into the shared dashboard.

2026-07-08

backend-patch-explorer

Software Developer

Inventory and explain the patch (monkey-patch) optimizations Primus layers over upstream training backends such as Megatron-LM, TorchTitan, and MaxText, by reading the current repository code only. Use when the user asks which patches a backend has, wants a customer-facing patch table, asks how a specific patch works (for example deepep or DeepEP), or wants guidance to port a Primus patch into their own upstream framework. Read-only; no training or cluster commands.

2026-06-10

primus-projection

Software Developer

Opinionated guide for using Primus Projection to choose parallelism (TP/PP/EP/CP/DP) and pipeline schedules, validate memory fit on target nodes, reason about communication collectives, and explore optimization trade-offs with minimal compute. Use when the user asks how to pick a parallelism strategy or pipeline schedule, whether a model fits in memory, which optimizations such as DeepEP, SyncFree, zero-bubble, FP8, recomputation, or FSDP2 matter most, or how to run primus projection commands. Read-only planning guidance; no large multi-node training runs required.

2026-06-10

slurm-idle-node-check

Network and Computer Systems Administrator

Check available idle nodes in a SLURM cluster. Use when the user wants to find usable idle nodes, verify node health, check docker status on SLURM nodes, check NIC QoS/DCQCN configuration, check RDMA link status, validate GID table, or troubleshoot cluster node availability.

2026-06-10

slurm-training-node-validation

Network and Computer Systems Administrator

Validate SLURM cluster nodes by running actual training jobs in groups. Use when the user wants to test which idle nodes can successfully run training, verify node health through real workloads, or identify broken nodes in a SLURM cluster.

2026-06-10

#005

Magpie

2 skills · 566 · Updated 2026-07-08

5.3% of this creator's skills

skill

Occupation category

Description

Updated

magpie

Software Developer

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

2026-07-08

amd-kernel-source-finder

Software Developer

Find kernel source code, test files, and test commands for AMD GPU kernels identified in profiler traces. Use when the user wants to locate kernel implementations, find tests for specific kernels, or enrich gap_analysis results with source information. Supports Triton JIT (ROCm), Tensile GEMM (rocBLAS), CK Tile, hipBLASLt, and HIP kernels.

2026-04-30

#006

GEAK

1 skill · 11631 · Updated 2026-06-17

2.6% of this creator's skills

skill

Occupation category

Description

Updated

add-expert-skill-to-geak

Software Developer

Contribute a human-authored, e2e-validated optimization recipe (an "expert skill") to GEAK — scaffold, fill, validate by scope, and open a PR.

2026-06-17

Showing 6 of 6 repositories.
