GitHub creator profile

AMD-AGI

Browse 33 collected skills across 5 GitHub repositories, grouped by repository, with approximate occupation coverage.

Collected skills: 33
Repositories: 5
Occupation areas: 2
Updated: 2026-05-28
Occupation coverage
The major occupation groups this creator mainly covers.
Repository browser

Repositories and representative skills

#001
Apex
12 skills · 679 · Updated 2026-03-23
36% of this creator's collected skills
aiter-reflection
Computer programmer

This skill should be used when optimizing AMD GPU kernels on MI300 using the aiter project, including running op tests, benchmarking, iterating on kernel changes, and recording results in the kernel experiment database.

2026-03-23
gpu-architecture-fundamentals
Computer hardware engineer, Electrical engineer

This skill should be used when reasoning about GPU architecture fundamentals to guide kernel optimization choices such as memory hierarchy usage, execution model mapping, block sizing, and latency-aware tuning across HIP, Triton, and PyTorch.

2026-03-23
hip-kernel-optimization
Software developer

This skill should be used when writing or tuning HIP kernels on AMD/NVIDIA GPUs, covering memory coalescing, shared-memory tiling, bank conflict avoidance, warp primitives, occupancy, vectorization, async ops, loop unrolling, and profiling.

2026-03-23
kernel-exp-history
Computer programmer

This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database.

2026-03-23
mi300-cdna3-architecture
Computer hardware engineer, Electrical and electronic engineering technologists and technicians

MI300/CDNA3 architecture guide for HIP/Triton optimization—MFMA variants, dual register files, data formats, sparsity, LDS/GWS, and best practices.

2026-03-23
mi300-hip-programming-insights
Computer programmer

CDNA3/MI300 HIP programming insights—chiplet/cache model, Infinity Cache, memory coherency, matrix cores, sparsity, and best practices.

2026-03-23
mi300-hip-vs-nvidia
Computer programmer

MI300 HIP programming differences vs NVIDIA—wavefront vs warp, memory hierarchy, MFMA usage, occupancy, and profiling pitfalls.

2026-03-23
pytorch-kernel-optimization
Computer and information research scientist

This skill should be used when optimizing PyTorch models and kernels, including efficient tensor operations, torch.compile, custom autograd/CUDA/Triton extensions, mixed precision, memory and data pipeline tuning, model optimization techniques, CUDA graphs, and profiling.

2026-03-23
Showing the top 8 of 12 collected skills in this repository.
#002
maxtext-slurm
11 skills · 283 · Updated 2026-05-09
33% of this creator's collected skills
model-config-guide
Computer systems analyst

Create GPU config files to support existing MaxText model definitions on AMD GPU clusters. Use when the user wants to add a model, create a config, support a new model, or asks about model configs, parallelism, batch size, OOM, quantization, or .gpu.yml files.

2026-05-09
pre-commit-audit
Software developer

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, an

2026-05-09
profile-drill
Data scientist

Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.

2026-05-09
batch-sweep
Computer systems analyst

Four sweep operations: (1) Model perf sweep — find optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config. (2) Node perf sweep — compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes. (3) Node network health sweep — detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem. (4) Model sweep — run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.

2026-05-09
xla-tuning
Software developer

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

2026-05-09
job-log-triage
Network and computer systems administrator

Triage MaxText training jobs from log files — failed, hanging, running, or completed. Use when the user asks why a job failed, wants to diagnose an error, sees a crash, hang, timeout, OOM, NCCL error, heartbeat timeout, wants to understand a job's status, or asks about bad/low/dropping TGS or throughput.

2026-05-09
tsdb-diagnosis
Network and computer systems administrator

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

2026-05-03
performance-analysis
Other computer occupations

Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

2026-05-03
Showing the top 8 of 11 collected skills in this repository.
#003
GEAK
5 skills · 10226 · Updated 2026-05-18
15% of this creator's collected skills
fp8-gemm-tuning-sglang-aiter
Data scientist

Use when optimizing end-to-end SGLang performance through GEMM tuning for FP8 models on AMD HIP/ROCm by replacing the default Triton GEMM backend with a tuned Composable Kernel (CK) path through aiter. This skill is the verified playbook for that entire process, using FP8 block-wise GEMM (gemm_a8w8_blockscale) as the primary worked example: GEMM shape/dispatch logging in SGLang, CK tuning, and AITER_CONFIG_GEMM_A8W8_BLOCKSCALE CSV integration. FP8 blockscale and bpreshuffle should also apply by switching the place where GEMMs are dumped and the CK tool used for tuning.

2026-05-18
triton
Software quality assurance analysts and testers

Use when generating a fixed test harness for a Triton (@triton.jit) GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, Triton-specific entry-point detection, three-tier shape lists, --iterations argparse rule, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17
hip
Software developer

Use when generating a fixed test harness for a HIP / CUDA / CK / HSACO GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, the three HIP build shapes (pybind11, standalone make, raw hipcc), the COMMANDMENT wrapper-script rule, --iterations argparse, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17
flydsl
Software developer

Use when working with FlyDSL kernels (`@flyc.kernel` / `flydsl.compiler`) on AMD GPUs. Covers three complementary workflows: writing new tile-programmed kernels, optimizing existing kernels for performance, and debugging correctness issues (NaN, wrong results, compilation errors, hangs).

2026-05-11
pytorch2flydsl-translation
Software developer

Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents.

2026-04-22
#004
Primus
3 skills · 10037 · Updated 2026-05-08
9.1% of this creator's collected skills
#005
Magpie
2 skills · 526 · Updated 2026-05-28
6.1% of this creator's collected skills
Showing 5 of 5 repositories.