GitHub creator profile

AMD-AGI

Repository-level view of 33 collected skills across 5 GitHub repositories, including approximate occupation coverage.

skills collected: 33
repositories: 5
occupation fields: 2
updated: 2026-05-28
occupation focus
Major fields detected across this creator.
repository explorer

Repositories and representative skills

#001
Apex
12 skills · 679 · updated 2026-03-23
36% of creator
aiter-reflection
Computer programmer

This skill should be used when optimizing AMD GPU kernels on MI300 using the aiter project, including running op tests, benchmarking, iterating on kernel changes, and recording results in the kernel experiment database.

2026-03-23
gpu-architecture-fundamentals
Computer hardware engineer, Electrical engineer

This skill should be used when reasoning about GPU architecture fundamentals to guide kernel optimization choices such as memory hierarchy usage, execution model mapping, block sizing, and latency-aware tuning across HIP, Triton, and PyTorch.

2026-03-23
hip-kernel-optimization
Software developer

This skill should be used when writing or tuning HIP kernels on AMD/NVIDIA GPUs, covering memory coalescing, shared-memory tiling, bank conflict avoidance, warp primitives, occupancy, vectorization, async ops, loop unrolling, and profiling.

2026-03-23
kernel-exp-history
Computer programmer

This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database.

2026-03-23
mi300-cdna3-architecture
Computer hardware engineer, Electrical and electronic engineering technologist and technician

MI300/CDNA3 architecture guide for HIP/Triton optimization—MFMA variants, dual register files, data formats, sparsity, LDS/GWS, and best practices.

2026-03-23
mi300-hip-programming-insights
Computer programmer

CDNA3/MI300 HIP programming insights—chiplet/cache model, Infinity Cache, memory coherency, matrix cores, sparsity, and best practices.

2026-03-23
mi300-hip-vs-nvidia
Computer programmer

MI300 HIP programming differences vs NVIDIA—wavefront vs warp, memory hierarchy, MFMA usage, occupancy, and profiling pitfalls.

2026-03-23
pytorch-kernel-optimization
Computer and information research scientist

This skill should be used when optimizing PyTorch models and kernels, including efficient tensor operations, torch.compile, custom autograd/CUDA/Triton extensions, mixed precision, memory and data pipeline tuning, model optimization techniques, CUDA graphs, and profiling.

2026-03-23
Showing top 8 of 12 collected skills in this repository.
#002
maxtext-slurm
11 skills · 283 · updated 2026-05-09
33% of creator
model-config-guide
Computer systems analyst

Create GPU config files to support existing MaxText model definitions on AMD GPU clusters. Use when the user wants to add a model, create a config, support a new model, or asks about model configs, parallelism, batch size, OOM, quantization, or .gpu.yml files.

2026-05-09
pre-commit-audit
Software developer

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, an

2026-05-09
profile-drill
Data scientist

Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.

2026-05-09
batch-sweep
Computer systems analyst

Four sweep operations: (1) Model perf sweep — find optimal batch size / TGS for a model. Use for: sweep batch size, tune TGS, benchmark throughput, find optimal config. (2) Node perf sweep — compare per-node GPU performance to find outliers. Use for: check nodes, node performance, find slow node, compare nodes. (3) Node network health sweep — detect inter-node network issues via multi-node bisection. Use for: network health, IB issues, RCCL problems, node pair testing, isolate network problem. (4) Model sweep — run all model configs on one or two commits. Use for: regression test, validate commit, test all models, smoke test, CI, compare branches.

2026-05-09
xla-tuning
Software developer

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

2026-05-09
job-log-triage
Network and computer systems administrator

Triage MaxText training jobs from log files — failed, hanging, running, or completed. Use when the user asks why a job failed, wants to diagnose an error, sees a crash, hang, timeout, OOM, NCCL error, heartbeat timeout, wants to understand a job's status, or asks about bad/low/dropping TGS or throughput.

2026-05-09
tsdb-diagnosis
Network and computer systems administrator

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

2026-05-03
performance-analysis
Other computer-related occupations

Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.

2026-05-03
Showing top 8 of 11 collected skills in this repository.
#003
GEAK
5 skills · 10226 · updated 2026-05-18
15% of creator
fp8-gemm-tuning-sglang-aiter
Data scientist

Use when trying to optimize end-to-end SGLang performance with GEMM tuning for FP8 models on AMD HIP/ROCm by replacing the default Triton GEMM backend with a tuned Composable Kernel (CK) path through aiter. This skill is the verified playbook for that entire process, using FP8 block-wise GEMM (gemm_a8w8_blockscale) as the primary worked example: GEMM shape/dispatch logging in SGLang, CK composable-kernel tuning, and AITER_CONFIG_GEMM_A8W8_BLOCKSCALE CSV integration. The same process should also apply to FP8 blockscale and bpreshuffle by switching the place where GEMM shapes are dumped and the CK tool used for tuning.

2026-05-18
triton
Software quality assurance analyst and tester

Use when generating a fixed test harness for a Triton (@triton.jit) GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, Triton-specific entry-point detection, three-tier shape lists, --iterations argparse rule, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17
hip
Software developer

Use when generating a fixed test harness for a HIP / CUDA / CK / HSACO GPU kernel under the v3 GEAK preprocess pipeline. Covers harness CLI contract, the three HIP build shapes (pybind11, standalone make, raw hipcc), the COMMANDMENT wrapper-script rule, --iterations argparse, and the GPU-RNG-pollution pitfall that rocprofv3 punishes.

2026-05-17
flydsl
Software developer

Use when working with FlyDSL kernels (`@flyc.kernel` / `flydsl.compiler`) on AMD GPUs. Covers three complementary workflows: writing new tile-programmed kernels, optimizing existing kernels for performance, and debugging correctness issues (NaN, wrong results, compilation errors, hangs).

2026-05-11
pytorch2flydsl-translation
Software developer

Use when translating PyTorch GPU kernels to FlyDSL. Provides API reference, translation guides, and strategy for mapping PyTorch ops to FlyDSL equivalents.

2026-04-22
#004
Primus
3 skills · 10037 · updated 2026-05-08
9.1% of creator
#005
Magpie
2 skills · 526 · updated 2026-05-28
6.1% of creator
Showing 5 of 5 repositories