build-rocm-image
Connect to a remote host via SSH and build a Docker image with rocprofv3, aiter, and FlyDSL. Use when user wants to build/rebuild the ROCm development image on a remote host. Usage: /build-rocm-image <hostname>
| name | build-rocm-image |
| description | Connect to a remote host via SSH and build a Docker image with rocprofv3, aiter, and FlyDSL. Use when user wants to build/rebuild the ROCm development image on a remote host. Usage: /build-rocm-image <hostname> |
Build a Docker image, based on rocm/vllm-dev:nightly, on a remote host with ROCm GPU access.
| Argument | Required | Description |
|---|---|---|
| <HOST> | Yes | The remote hostname to SSH into and build the image on. Example: hjbog-srdc-39.amd.com |
When this skill is invoked, the argument passed in is the target hostname. Replace all occurrences of <HOST> below with the provided hostname. If no hostname is provided, ask the user for it before proceeding.
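The argument handling described above can be sketched as a small shell function. This is only an illustration: `resolve_host` is a hypothetical name, and the allowed character set (letters, digits, `.`, `_`, `-`) is an assumption, so a `user@host` form would need a looser check.

```shell
# Hypothetical helper (not part of the skill): validate the hostname argument
# before it is substituted for <HOST> in the SSH commands.
resolve_host() {
  local host="${1:-}"
  if [ -z "$host" ]; then
    # No argument: the skill should ask the user instead of guessing.
    echo "error: no hostname given; ask the user for one" >&2
    return 1
  fi
  case "$host" in
    -*|*[!A-Za-z0-9._-]*)
      # Assumed policy: reject option-like or shell-special input.
      echo "error: suspicious hostname: $host" >&2
      return 1 ;;
  esac
  printf '%s\n' "$host"
}
```

A caller could then write `HOST="$(resolve_host "$1")" || exit 1` and use `"$HOST"` wherever `<HOST>` appears.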
Target host: <HOST> (provided as argument)
Base image: rocm/vllm-dev:nightly
ssh -o ConnectTimeout=30 <HOST> "cat > /tmp/Dockerfile.rocm-custom << 'DOCKERFILE'
FROM rocm/vllm-dev:nightly
# Uninstall existing aiter
RUN pip uninstall -y aiter 2>/dev/null; true
# Install build dependencies
RUN pip install ninja cmake pybind11
# Clone and install aiter from main branch
RUN cd /tmp && \
git clone --depth 1 --branch main https://github.com/ROCm/aiter.git && \
cd aiter && \
pip install -e . && \
cd / && rm -rf /tmp/aiter
# Clone and install FlyDSL from main branch
RUN cd /tmp && \
git clone --depth 1 --branch main https://github.com/ROCm/FlyDSL.git && \
cd FlyDSL && \
pip install -e . && \
cd / && rm -rf /tmp/FlyDSL
# Install the rocprof-trace-decoder release that matches the ROCm version.
# Quotes and dollar signs are backslash-escaped here because this Dockerfile
# is embedded in a double-quoted SSH argument.
RUN set -eux; \
ROCM_VERSION=\"\$(sed -E 's/^([0-9]+)\.([0-9]+).*/\1.\2/' /opt/rocm/.info/version)\"; \
RTD_VERSION=\"0.1.5\"; \
echo \"Using rocprof-trace-decoder \${RTD_VERSION} for ROCm \${ROCM_VERSION}\"; \
RTD_INSTALLER=\"rocprof-trace-decoder-manylinux-2.28-\${RTD_VERSION}-Linux.sh\"; \
cd /tmp; \
wget -q \"https://github.com/ROCm/rocprof-trace-decoder/releases/download/\${RTD_VERSION}/\${RTD_INSTALLER}\"; \
chmod +x \"\${RTD_INSTALLER}\"; \
\"./\${RTD_INSTALLER}\" --skip-license --prefix=/tmp/rtd-install; \
find /tmp/rtd-install -name '*.so*' -exec cp -a {} /opt/rocm/lib/ \; ; \
ldconfig; \
rm -rf \"\${RTD_INSTALLER}\" /tmp/rtd-install
# Verify installations
RUN python3 -c 'import aiter; print(\"aiter OK\")' && \
python3 -c 'import flydsl; print(\"FlyDSL OK\")' && \
which rocprofv3 && echo 'rocprofv3 OK' && \
ls /opt/rocm/lib/librocprof*decoder* && echo 'rocprof-trace-decoder OK'
LABEL description=\"ROCm dev image with aiter(main), FlyDSL(main), rocprofv3, and rocprof-trace-decoder\"
DOCKERFILE
"
Build the image with a descriptive tag. Use --network=host to ensure git clone works.
ssh -o ConnectTimeout=30 <HOST> "docker build --network=host -t rocm-dev-custom:main -f /tmp/Dockerfile.rocm-custom /tmp"
Note: Add --progress=plain to the docker build command to see the full build logs.
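As a rough sketch of how the build and the `--progress=plain` note could be combined, the wrapper below runs the same remote build and keeps the complete log in a local file. The function name `build_image` and the default log file name are assumptions made for illustration.

```shell
# Hypothetical wrapper: run the remote build with plain progress output and
# save the full log locally; on failure, show the tail of the log.
build_image() {
  local host="$1" log="${2:-build-rocm-image.log}" status
  if ssh -o ConnectTimeout=30 "$host" \
      "docker build --network=host --progress=plain -t rocm-dev-custom:main -f /tmp/Dockerfile.rocm-custom /tmp" \
      >"$log" 2>&1; then
    echo "build succeeded; full log in $log"
  else
    status=$?   # exit status of the ssh/docker build command
    echo "build failed (exit $status); last 20 log lines:" >&2
    tail -n 20 "$log" >&2
    return "$status"
  fi
}
```

Calling `build_image <HOST>` leaves the log in the current directory, which makes it easier to include the failing step when reporting back to the user.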
ssh -o ConnectTimeout=30 <HOST> "docker run --rm rocm-dev-custom:main bash -c '
echo \"=== aiter ===\"
python3 -c \"import aiter; print(aiter.__version__)\" 2>/dev/null || python3 -c \"import aiter; print(\\\"aiter OK\\\")\"
echo \"=== FlyDSL ===\"
python3 -c \"import flydsl; print(flydsl.__version__)\" 2>/dev/null || python3 -c \"import flydsl; print(\\\"FlyDSL OK\\\")\"
echo \"=== rocprofv3 ===\"
rocprofv3 --version 2>/dev/null || which rocprofv3
echo \"=== ROCm ===\"
cat /opt/rocm/.info/version
'"
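The verification command prints one `=== name ===` header per component, followed by its version or an OK line. One possible way to check that output automatically is sketched below; `check_sections` is a hypothetical helper, and treating a header with no following line as a failure is an assumption about what counts as a broken install.

```shell
# Hypothetical helper: read the verification output on stdin and flag any
# "=== name ===" header that is not followed by at least one non-empty line.
check_sections() {
  awk '
    /^=== .* ===$/ {
      if (name != "" && !seen) { print "empty: " name; bad = 1 }
      name = $2; seen = 0; next
    }
    NF { seen = 1 }
    END {
      if (name != "" && !seen) { print "empty: " name; bad = 1 }
      exit bad
    }
  '
}
```

The helper reads the output of the `docker run` verification command from a pipe and returns a non-zero status if any section is empty.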
ssh -o ConnectTimeout=30 <HOST> "rm -f /tmp/Dockerfile.rocm-custom"
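The Dockerfile in this workflow is created through a heredoc nested inside a double-quoted SSH argument, so every `$` and `"` in it must be backslash-escaped to survive the local shell. A possible alternative, sketched here under the assumption that the Dockerfile is first written to a local file, is to stream it over SSH standard input so that no shell expands its contents. `push_dockerfile` and the default local file name are hypothetical.

```shell
# Hypothetical helper: copy a locally written Dockerfile to the remote host
# by streaming it over SSH stdin; the file content is never parsed by a shell.
push_dockerfile() {
  local host="$1"
  local src="${2:-Dockerfile.rocm-custom}"   # assumed local file name
  [ -r "$src" ] || { echo "error: cannot read $src" >&2; return 1; }
  ssh -o ConnectTimeout=30 "$host" "cat > /tmp/Dockerfile.rocm-custom" < "$src"
}
```

With this approach the Dockerfile can be written exactly as Docker expects it, without the extra layer of backslashes.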
Report to the user:
/opt/rocm/.info/version and add the matching decoder release from ROCm release notes
docker image prune
To start a container from the built image with GPU access:
ssh <HOST> "docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size=64g rocm-dev-custom:main bash"
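Before opening an interactive session, it can be useful to confirm that the container actually sees a GPU. The sketch below reuses the device flags from the command above and assumes that `rocminfo` is available in the image and lists GPU agents with names starting with `gfx`; `gpu_smoke_test` is a hypothetical name.

```shell
# Hypothetical helper: run rocminfo in a throwaway container and print the
# first gfx target it reports; returns non-zero if no gfx name is found.
gpu_smoke_test() {
  local host="$1"
  ssh -o ConnectTimeout=30 "$host" \
    "docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm-dev-custom:main rocminfo" \
    | grep -m1 -o 'gfx[0-9a-f][0-9a-f]*'
}
```

If the function prints nothing and fails, the usual suspects are the `/dev/kfd` and `/dev/dri` device mappings or the `video` group membership.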