GitHub creator profile

ROCm

Repository-level view of 59 collected skills across 10 GitHub repositories, including approximate occupation coverage.

skills collected: 59
repositories: 10
occupation fields: 2
updated: 2026-05-29
repository explorer

Repositories and representative skills

#001
rocm-systems
21 skills · 388253 · updated 2026-05-29
36% of the creator's collected skills
brainstorming
Project Management Specialists

Use before any creative work — designing a new amdsmi_* API, adding a CLI command, building a feature, or modifying behavior. Explores intent, requirements, and design before any code is written.

2026-05-29
dispatching-parallel-agents
Software Developers

Use when facing 2+ independent problems with no shared state — e.g., unrelated test failures in different subsystems, multiple independent bug investigations, parallel research tasks. Dispatch one focused subagent per domain instead of investigating sequentially.

2026-05-29
executing-plans
Software Developers

Use when a written implementation plan exists at docs/dev/plans/ and you need to execute it task-by-task in the current session with verification at each step.

2026-05-29
restructure-commits
Software Developers

Use when finishing an amd-smi development branch — consolidating commits into logical groups with clean messages AND deciding how to integrate the work (merge to develop, push and open PR, keep as-is, or discard). Covers commit restructuring plus the merge/PR/cleanup workflow.

2026-05-29
systematic-debugging
Software Developers

Use when encountering any bug, test failure, build failure, or unexpected behavior in amd-smi — before proposing any fix. Enforces root-cause investigation before symptom patching.

2026-05-29
test-driven-development
Software Developers

Use when implementing any amd-smi feature, bug fix, or behavior change — before writing implementation code. Enforces strict RED-GREEN-REFACTOR: failing test first, watch it fail, minimal code to pass, refactor.

2026-05-29
using-git-worktrees
Software Developers

Use when starting amd-smi feature work, executing an implementation plan, or reviewing a PR that needs an isolated workspace away from the main checkout. Sets up a worktree following the rocm-systems-pr<PR#> convention.

2026-05-29
writing-plans
Software Developers

Use when an approved spec exists and you need a bite-sized, file-level implementation plan before any code is written. Produces a plan ready for executing-plans or subagent dispatch.

2026-05-29
Showing top 8 of 21 collected skills in this repository.
#002
FlyDSL
16 skills · 19256 · updated 2026-05-29
27% of the creator's collected skills
format-code
Software Developers

Format and clean up changed files before committing, matching the project's CI style gate. Formats Python with black + ruff and C/C++ with clang-format using the repository's .clang-format. Use when the user says "format code", "clean up code", "lint", "format before commit", "/format-code", or mentions black, ruff, clang-format, or CI style failures while tidying their working tree.

2026-05-29
capture-kernel-trace
Software Developers

Capture GPU kernel ATT (Advanced Thread Trace) via rocprofv3 on a remote Docker container or locally. Discovers kernel names, configures input.yaml with the target kernel_include_regex, runs rocprofv3 -i input.yaml with FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1, and downloads the latest ui_output_agent_* directory for analysis. Usage: /capture-kernel-trace <test_script.py> [kernel_name_pattern]

2026-05-29
kernel-trace-analysis
Software Developers

Profile GPU kernels using rocprofv3 to collect ATT instruction-level traces, then analyze the trace data using hotspot_analyzer.py to identify top-K stall hotspots (VMEM-load, VMEM-wait, LDS/SMEM-wait, barrier, MFMA stalls) mapped back to source lines, and produce an actionable optimization plan. Usage: /kernel-trace-analysis <cmd>. Can also analyze an existing dispatch dir directly: /kernel-trace-analysis --dir <path>

2026-05-29
lds-optimization
Software Developers

Optimize LDS (Local Data Share / shared memory) access patterns in FlyDSL GPU kernels. Diagnose bank conflicts and high lgkmcnt stalls from ATT trace data, then apply swizzle or padding layouts to eliminate conflicts. Also increase the distance between an LDS write and the subsequent LDS read to hide LDS latency. An LDS read preceded by a write always requires a sync (s_waitcnt lgkmcnt or s_barrier). Use when trace analysis shows ds_read/ds_write/lgkmcnt as a bottleneck. Usage: /lds-optimization

2026-05-29
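
The padding idea in the lds-optimization entry can be illustrated with a toy bank model. This is only a sketch under stated assumptions: 32 banks, one 4-byte word per bank, and a 32x32 row-major tile read column-wise. None of these numbers come from the skill itself, and real hardware scheduling is more involved.

```python
# Toy model, not FlyDSL code. Assumed: 32 banks, 4-byte words, row-major tile.
NUM_BANKS = 32


def banks_for_column_read(row_stride_words: int, col: int, rows: int = 32) -> list[int]:
    """Bank touched by each lane when lane r reads element (r, col)."""
    return [(r * row_stride_words + col) % NUM_BANKS for r in range(rows)]


def max_conflict_degree(banks: list[int]) -> int:
    """Largest number of lanes that land on the same bank."""
    return max(banks.count(b) for b in set(banks))


# Row stride of 32 words: every lane in the column hits the same bank.
unpadded = banks_for_column_read(row_stride_words=32, col=5)
# One word of padding per row (stride 33) spreads the lanes across banks.
padded = banks_for_column_read(row_stride_words=33, col=5)

print(max_conflict_degree(unpadded))  # 32
print(max_conflict_degree(padded))    # 1
```

In this model the extra word per row trades a little LDS capacity for conflict-free column reads; a swizzled layout aims for the same effect without the extra storage.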
prefetch-data-load
Software Developers

Apply prefetch optimization to FlyDSL kernel loops: pre-load the first iteration's data before the loop, issue async loads for the next iteration inside the loop body, and swap buffers at the loop tail via runtime loop-carried values. This overlaps data load latency with compute instructions. Use when a kernel has a loop where buffer_load feeds into MFMA/compute and load latency is exposed. Usage: /prefetch-data-load

2026-05-29
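
The loop shape described in the prefetch-data-load entry (pre-load before the loop, load the next iteration inside the body, swap at the tail) can be sketched in plain Python. `load_tile` and `compute` are hypothetical stand-ins, not FlyDSL APIs, and Python runs them sequentially; on a GPU the next load would be in flight while the current tile is being computed.

```python
def load_tile(data, i):
    """Hypothetical stand-in for an asynchronous buffer load of tile i."""
    return data[i]


def compute(tile):
    """Hypothetical stand-in for the MFMA/compute stage."""
    return sum(tile)


def prefetched_sum(data):
    if not data:
        return 0
    total = 0
    current = load_tile(data, 0)  # prologue: pre-load the first iteration
    for i in range(len(data)):
        # Issue the load for the next iteration before computing on this one.
        nxt = load_tile(data, i + 1) if i + 1 < len(data) else None
        total += compute(current)  # compute on the tile that is already loaded
        current = nxt              # swap buffers at the loop tail
    return total


tiles = [[1, 2], [3, 4], [5, 6]]
print(prefetched_sum(tiles))  # 21
```

The `current`/`nxt` pair plays the role of the loop-carried values the entry mentions: each iteration hands the freshly loaded buffer to the next one.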
flydsl-kernel-authoring
Software Developers

Comprehensive reference for authoring FlyDSL GPU kernels on AMD GPUs. Covers the layout algebra, tiled copy/MMA, buffer ops, loop-carried range loops, SmemAllocator, autotuning, and common patterns. Use when writing, reviewing, or understanding FlyDSL kernel code.

2026-05-27
build-rocm-image
Network and Computer Systems Administrators

Connect to a remote host via SSH and build a Docker image with rocprofv3, aiter, and FlyDSL. Use when user wants to build/rebuild the ROCm development image on a remote host. Usage: /build-rocm-image <hostname>

2026-05-20
oob-detection
Information Security Analysts

Detect out-of-bounds memory accesses in CPU or GPU code using static interval analysis and runtime assertions/printfs. Use when investigating OOB, buffer overrun, invalid memory access, HIP/ROCm illegal address, CUDA illegal memory access, silent tensor corruption, or suspicious buffer_load/store address arithmetic.

2026-05-20
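
As a rough illustration of the static interval analysis named in the oob-detection entry, the sketch below propagates inclusive value ranges through an index expression and compares the result with a buffer size. The `Interval` class, the index formula, and the sizes are all made up for the example; the skill's actual procedure may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: int
    hi: int  # inclusive upper bound

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k: int) -> "Interval":
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))


def may_be_out_of_bounds(index: Interval, size: int) -> bool:
    """True if any value in the interval falls outside [0, size)."""
    return index.lo < 0 or index.hi >= size


# Hypothetical index expression: idx = block_id * 256 + thread_id + 1
block_id = Interval(0, 3)
thread_id = Interval(0, 255)
idx = block_id.scale(256) + thread_id + Interval(1, 1)

print(idx)                                   # Interval(lo=1, hi=1024)
print(may_be_out_of_bounds(idx, size=1024))  # True
```

Here the stray `+ 1` pushes the upper bound to 1024, one past the last valid index of a 1024-element buffer, which is the kind of finding a runtime assertion would then confirm.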
Showing top 8 of 16 collected skills in this repository.
#003
pytorch
6 skills · 25382 · updated 2026-03-12
10% of the creator's collected skills
pr-review
Software Quality Assurance Analysts and Testers

Review PyTorch pull requests for code quality, test coverage, security, and backward compatibility. Use when reviewing PRs, when asked to review code changes, or when the user mentions "review PR", "code review", or "check this PR".

2026-03-12
triaging-issues
Software Developers

Triages GitHub issues by routing to oncall teams, applying labels, and closing questions. Use when processing new PyTorch issues or when asked to triage an issue.

2026-03-10
pt2-bug-basher
Software Developers

Debug PyTorch 2 compiler stack failures including Dynamo graph breaks, Inductor codegen errors, AOTAutograd crashes, and accuracy mismatches. Use when encountering torch.compile errors, BackendCompilerFailed exceptions, recompilation issues, Triton kernel failures, FX graph problems, or when the user mentions debugging PT2, Dynamo, Inductor, or compiled model issues.

2026-03-05
document-public-apis
Software Developers

Document undocumented public APIs in PyTorch by removing functions from coverage_ignore_functions and coverage_ignore_classes in docs/source/conf.py, running Sphinx coverage, and adding the appropriate autodoc directives to the correct .md or .rst doc files. Use when a user asks to remove functions from conf.py ignore lists.

2026-02-24
pyrefly-type-coverage
Software Developers

Migrate a file to use stricter Pyrefly type checking with annotations required for all functions, classes, and attributes.

2026-02-04
metal-kernel
Software Developers

Write Metal/MPS kernels for PyTorch operators. Use when adding MPS device support to operators, implementing Metal shaders, or porting CUDA kernels to Apple Silicon. Covers native_functions.yaml dispatch, host-side operators, and Metal kernel implementation.

2026-01-27
#004
ATOM
5 skills · 9964 · updated 2026-05-23
8.5% of the creator's collected skills
atom-patterns
Software Developers

Coding patterns and architecture index for the ATOM LLM inference engine

2026-05-23
capture-trace
Network and Computer Systems Administrators

Capture a PyTorch profiler / kineto trace from a running ATOM server for a short benchmark window. Use when the user asks for "a trace", "profiler trace", "GPU trace", or "抓 trace" for performance investigation — what kernels ran, what's on the critical path, what's slow. Do NOT use for crashes (use debug-agent-locate-kernel) or numerical bugs (use dump-bisect-debug).

2026-05-23
debug-agent-locate-kernel
Network and Computer Systems Administrators

Identify which GPU kernel is faulting/hanging in ATOM via rocm-debug-agent (for faults/asserts) or rocgdb (for silent livelocks). debug-agent dumps wave registers + faulting PC + (with --save-code-objects) disassembled code object on memory faults / ASSERT_TRAP. rocgdb attaches to a live process and lists in-flight `info dispatches` + HSA `info queues` — works when the kernel isn't faulting but just stuck (e.g. atomic-counter deadlock). Use when: server crashes with "Memory access fault by GPU node-N", server hangs with GPU at 100% but no token output, kernel asserting `s_trap`, or `HIP_LAUNCH_BLOCKING=1` makes a hang vanish. Do NOT use for: numerical bugs (use dump-bisect-debug), compile errors, OOM.

2026-05-23
dump-bisect-debug
Data Scientists

Locate forward numerical bugs by dumping intermediate tensors from a target implementation and a known-good reference, then bisecting layer by layer. Also covers batch-invariance bisect (the same token at any batch position should produce a bitwise-identical output, per DeepSeek V4 paper §3.3). Use when "the output is wrong but I don't know where" — model produces gibberish, degenerates, or picks the wrong token, but code review reveals nothing.

2026-05-23
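
The layer-by-layer bisect in the dump-bisect-debug entry might look roughly like the following. This is a sketch, not the skill's tooling: dumps are modeled as plain lists of floats, the tolerance is arbitrary, and the binary search assumes that once an error appears it persists in every later layer.

```python
def first_divergent_layer(target, reference, tol=1e-3):
    """Return the index of the first layer whose dump differs by more than tol.

    Assumes divergence is monotonic: once a layer is wrong, all later layers
    are wrong too. Returns None when the final layer still matches.
    """
    if not target:
        return None

    def differs(i):
        return any(abs(a - b) > tol for a, b in zip(target[i], reference[i]))

    lo, hi = 0, len(target) - 1
    if not differs(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if differs(mid):
            hi = mid       # first bad layer is at mid or earlier
        else:
            lo = mid + 1   # layers up to mid are still good
    return lo


# Made-up dumps: one list of values per layer.
reference = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0], [8.0, 16.0]]
target = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.5], [8.0, 17.0]]

print(first_divergent_layer(target, reference))  # 2
```

If errors can cancel out in later layers, the monotonic assumption breaks and a linear scan over the layers is the safer choice.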
run-atom-workload
Software Developers

Run any ATOM workload — accuracy eval (GSM8K via lm_eval), performance benchmark, concurrency sweep, offline simple_inference, or fault repro under rocm-debug-agent. Use when the user asks to "test accuracy", "测精度", "跑 GSM8K", "跑 benchmark", "test performance", "run sweep", "repro the fault", "测一下 MTP1 精度", "跑 simple_inference" — anything that drives an ATOM workload. Encodes the canonical flow (stop → start → workload-in-shell-bg → wait_infer_drain → stop) and the model-family env vars. Same pattern works for both server-based workloads (lm_eval / benchmark client) and offline simple_inference. Do NOT use for profiling traces (use capture-trace).

2026-05-23
#005
aiter
2 skills · 450331 · updated 2026-05-27
3.4% of the creator's collected skills
#006
rocm-libraries
2 skills · 354298 · updated 2026-05-28
3.4% of the creator's collected skills
#007
TransformerEngine
2 skills · 6929 · updated 2026-04-24
3.4% of the creator's collected skills
#008
xla
2 skills · 98 · updated 2026-03-25
3.4% of the creator's collected skills
#009
repo-digest
2 skills · 60 · updated 2026-05-05
3.4% of the creator's collected skills
Showing 10 of 10 repositories