This skill should be used when optimizing AMD GPU kernels on MI300 using the aiter project, including running op tests, benchmarking, iterating on kernel changes, and recording results in the kernel experiment database.

2026-03-23

gpu-architecture-fundamentals

Software developer

This skill should be used when reasoning about GPU architecture fundamentals to guide kernel optimization choices such as memory hierarchy usage, execution model mapping, block sizing, and latency-aware tuning across HIP, Triton, and PyTorch.

2026-03-23

hip-kernel-optimization

Software developer

This skill should be used when writing or tuning HIP kernels on AMD/NVIDIA GPUs, covering memory coalescing, shared-memory tiling, bank conflict avoidance, warp primitives, occupancy, vectorization, async ops, loop unrolling, and profiling.

2026-03-23

kernel-exp-history

Software developer

This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database.

2026-03-23

mi300-cdna3-architecture

Software developer

MI300/CDNA3 architecture guide for HIP/Triton optimization—MFMA variants, dual register files, data formats, sparsity, LDS/GWS, and best practices.

2026-03-23

mi300-hip-programming-insights

Software developer

CDNA3/MI300 HIP programming insights—chiplet/cache model, Infinity Cache, memory coherency, matrix cores, sparsity, and best practices.

2026-03-23

mi300-hip-vs-nvidia

Software developer

MI300 HIP programming differences vs NVIDIA—wavefront vs warp, memory hierarchy, MFMA usage, occupancy, and profiling pitfalls.

2026-03-23

pytorch-kernel-optimization

Software developer

This skill should be used when optimizing PyTorch models and kernels, including efficient tensor operations, torch.compile, custom autograd/CUDA/Triton extensions, mixed precision, memory and data pipeline tuning, model optimization techniques, CUDA graphs, and profiling.

2026-03-23

Showing the top 8 of the 12 skills collected from this repository.

#002

maxtext-slurm

11 skills · 283 · Updated 2026-05-09

29% of this creator's skills

skill

Job category

Description

Updated

model-config-guide

Computer systems analyst

Create GPU config files to support existing MaxText model definitions on AMD GPU clusters. Use when the user wants to add a model, create a config, support a new model, or asks about model configs, parallelism, batch size, OOM, quantization, or .gpu.yml files.

2026-05-09

pre-commit-audit

Software developer

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, an

2026-05-09

profile-drill

Data scientist

Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.

2026-05-09

batch-sweep

Computer systems analyst

Four sweep operations: (1) Model perf sweep — find optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config. (2) Node perf sweep — compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes. (3) Node network health sweep — detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem. (4) Model sweep — run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.

2026-05-09

xla-tuning

Software developer

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

2026-05-09

job-log-triage

Network and computer systems administrator

Triage MaxText training jobs from log files — failed, hanging, running, or completed. Use when the user asks why a job failed, wants to diagnose an error, sees a crash, hang, timeout, OOM, NCCL error, heartbeat timeout, wants to understand a job's status, or asks about bad/low/dropping TGS or throughput.

2026-05-09

tsdb-diagnosis

Network and computer systems administrator

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

2026-05-03

performance-analysis

Other computer-related occupations

Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

2026-05-03

Showing the top 8 of the 11 skills collected from this repository.

#003

Primus-Turbo

7 skills · 6624 · Updated 2026-06-26

18% of this creator's skills

skill

Job category

Description

Updated

optimize-handoff

Software developer

Primus-Turbo handoff to the autonomous kernel-optimize loop — collect the prerequisites (kernel path, focused test/bench commands, scoring metric, execution mode, quick-validation harness) a kernel campaign needs and pass them on. Use when pushing a Primus-Turbo kernel toward the hardware limit, not just spot-checking perf.

2026-06-26

primus-turbo-develop

Software developer

Develop, debug, and validate Primus-Turbo operators and modules on AMD GPUs. Covers the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to add or change a feature end-to-end, accuracy verification (SNR, tolerances, reference implementations), performance benchmarking, the backend dispatch system, and build/test/bench commands. Use for any Primus-Turbo development task (GEMM, Attention, GroupedGEMM, MoE, quantization, normalization, activation) and for accuracy or performance validation.

2026-06-26

kernel-optimize

Software developer

AI-driven operator performance optimization framework. Defines the optimization loop, execution environment selection, knowledge routing, and logging conventions to drive agent-autonomous iteration toward hardware limits.

2026-06-22

develop-feature

Software developer

Primus-Turbo feature development workflow — the layered architecture (ops / kernels-dispatcher / Triton / HIP-CK csrc / modules), how to wire a new operator end-to-end, which layer to touch, and which existing file to copy. Use when adding or changing a Primus-Turbo operator or module on AMD GPUs.

2026-06-22

verify-performance

Software developer

Primus-Turbo performance verification — run single-operator and suite benchmarks, read the latency/TFLOPS metrics, source real-model shapes, and derive a combined training-step metric. Use when measuring latency or throughput of a Primus-Turbo operator.

2026-06-22

verify-accuracy

Software quality assurance analyst and tester

Primus-Turbo accuracy verification — compare an operator against a higher-precision reference for forward and backward, with the right gate (allclose for bf16/fp16/fp32, SNR for fp8/fp4) and FP8 encoding awareness. Use when validating numerical correctness of a Primus-Turbo operator.

2026-06-08

tool-rocprof

Network and computer systems administrator

ROCm profiling workflow for AMD GPU kernels using rocprofv3 and rocprof-compute. Use when profiling hot kernels, collecting counters, diagnosing memory-vs-compute-vs-stall bottlenecks, reading Perfetto traces, or validating low-precision AMD kernels.

2026-06-08

#004

Primus

5 skills · 10744 · Updated 2026-07-08

13% of this creator's skills

skill

Job category

Description

Updated

backend-gap-report

Software developer

Compare a Primus backend against an upstream repository or reference, verify git state, dependencies, directory changes, and integration coupling, then generate comparison reports, dashboard metadata, and a deployable dashboard index. Also owns the shared Primus engineering dashboard under `tools/backend_gap_report/`, which surfaces both backend-gap reports and weekly engineering reports as first-class sections. Use when comparing TorchTitan, Megatron, or other Primus backends with upstream branches, tags, or releases, or when integrating weekly engineering reports into the shared dashboard.

2026-07-08

backend-patch-explorer

Software developer

Inventory and explain the patch (monkey-patch) optimizations Primus layers over upstream training backends such as Megatron-LM, TorchTitan, and MaxText, by reading the current repository code only. Use when the user asks which patches a backend has, wants a customer-facing patch table, asks how a specific patch works (for example deepep or DeepEP), or wants guidance to port a Primus patch into their own upstream framework. Read-only; no training or cluster commands.

2026-06-10

primus-projection

Software developer

Opinionated guide for using Primus Projection to choose parallelism (TP/PP/EP/CP/DP) and pipeline schedules, validate memory fit on target nodes, reason about communication collectives, and explore optimization trade-offs with minimal compute. Use when the user asks how to pick a parallelism strategy or pipeline schedule, whether a model fits in memory, which optimizations such as DeepEP, SyncFree, zero-bubble, FP8, recomputation, or FSDP2 matter most, or how to run primus projection commands. Read-only planning guidance; no large multi-node training runs required.

2026-06-10

slurm-idle-node-check

Network and computer systems administrator

Check available idle nodes in a SLURM cluster. Use when the user wants to find usable idle nodes, verify node health, check docker status on SLURM nodes, check NIC QoS/DCQCN configuration, check RDMA link status, validate GID table, or troubleshoot cluster node availability.

2026-06-10

slurm-training-node-validation

Network and computer systems administrator

Validate SLURM cluster nodes by running actual training jobs in groups. Use when the user wants to test which idle nodes can successfully run training, verify node health through real workloads, or identify broken nodes in a SLURM cluster.

2026-06-10

#005

Magpie

2 skills · 566 · Updated 2026-07-08

5.3% of this creator's skills

skill

Job category

Description

Updated

magpie

Software developer

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

2026-07-08

amd-kernel-source-finder

Software developer

Find kernel source code, test files, and test commands for AMD GPU kernels identified in profiler traces. Use when the user wants to locate kernel implementations, find tests for specific kernels, or enrich gap_analysis results with source information. Supports Triton JIT (ROCm), Tensile GEMM (rocBLAS), CK Tile, hipBLASLt, and HIP kernels.

2026-04-30

#006

GEAK

1 skill · 11631 · Updated 2026-06-17

2.6% of this creator's skills

skill

Job category

Description

Updated

add-expert-skill-to-geak

Software developer

Contribute a human-authored, e2e-validated optimization recipe (an "expert skill") to GEAK — scaffold, fill, validate by scope, and open a PR.

2026-06-17

Showing 6 of 6 repositories

All repositories are shown.