Run any Skill in Manus with one click

insight-trace

Stars24

Forks62

UpdatedMay 25, 2026 at 01:57

Generate a MindStudio Insight trace for any `kernel_entry(args)` style kernel in this repo — SPMD mix, AIC-only single-task (e.g. `aic_pv_matmul`), or AIV-only single-task (e.g. `aiv_softmax_prepare`). Use when the user asks to "produce/generate/run an Insight trace", "trace this kernel under msprof op simulator", or troubleshoot Insight trace collection. AICore-only replay path — bypasses AICPU orchestration. For PTOAS-style kernels, use [PTOAS msprof_op_simulator_usage_zh.md](https://github.com/hw-native-sys/PTOAS/blob/main/.claude/skill/msprof_op_sim_insight_skill.md) instead.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

hw-native-sys

hw-native-sys/simpler

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software DevelopersComputer and Mathematical Occupations·SOC 15-1252

SKILL.md

readonly

Insight Trace for `kernel_entry(args)` Kernels

Builds a single mix-arch .so wrapper around a target kernel_entry(args), launches it under msprof op simulator, and exports the per-core trace.json + instr_exe CSVs that MindStudio Insight consumes. The recipe is self-contained and covers all three kernel shapes; it does not assume any pre-existing workspace in outputs/.

When to use

Required entry: extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args).

Three supported kernel shapes (Step 0 picks the right one):

Shape	Example	Wrapper branches	Args layout
SPMD mix	`tests/.../spmd_paged_attention/kernels/mix/paged_attention_parallel.cpp`	both `__DAV_CUBE__` and `__DAV_VEC__` include the kernel	slot 0..15 tensor/scalar, *slot 48/49 = `LocalContext`/`GlobalContext`* (per `intrinsic.h`)
AIC-only task	`tests/st/a2a3/.../paged_attention_unroll/kernels/aic/aic_pv_matmul.cpp`, `aic_qk_matmul.cpp`	only `__DAV_CUBE__` includes the kernel; `__DAV_VEC__` body is empty	positional Tensor pointers + scalars only; no slot 48/49
AIV-only task	`tests/st/a2a3/.../paged_attention_unroll/kernels/aiv/aiv_softmax_prepare.cpp`, `aiv_online_update.cpp`	only `__DAV_VEC__` includes the kernel; `__DAV_CUBE__` body is empty	positional Tensor pointers + scalars only; no slot 48/49

In all three shapes the kernel is launched via <<<HW_BLOCK_NUM, ...>>>. For SPMD mix, HW_BLOCK_NUM is the kernel's true hw block dim (typically 24). For per-task AIC-only / AIV-only, HW_BLOCK_NUM=1 is the default — each hw core would just redo identical work, which doesn't help the trace and only inflates simulated runtime.

Skip this skill if:

The kernel is PTOAS-style — the dedicated msprof flow already works.
The user wants AICPU scheduler / tensormap / ringbuffer state machine in the trace. Those need real hardware msprof --application; the simulator returns 207000 (ACL_ERROR_RT_FEATURE_NOT_SUPPORT) on the AICPU KFC launch. The replay produced here covers AIC/AIV pipeline behavior only.

Step 0 — Classify the kernel shape

Read the kernel source and decide:

Mix or single-arch? If the file path is under kernels/mix/ and uses __DAV_CUBE__/__DAV_VEC__ macros internally, it's SPMD mix. If it lives under kernels/aic/, it's AIC-only. Under kernels/aiv/, AIV-only.
Does kernel_entry read slot 48 or 49? Grep for SPMD_LOCAL_CONTEXT_INDEX, get_block_idx(args), get_sub_block_id(args), or direct args[48] / args[49] reads. Present → SPMD-context layout. Absent → positional layout (typical for AIC-only / AIV-only tasks; the orchestration submits one logical task per dispatch).
Does the kernel reference __DAV_VEC__-only or __DAV_CUBE__-only intrinsics? AIC-only kernels use TMATMUL, TLOAD/TSTORE on TileMat*, PIPE_M, PIPE_FIX — these don't compile under __DAV_VEC__. So an AIC-only kernel must not be #included in the __DAV_VEC__ branch of the wrapper, and vice versa.

Step 1 — Gather inputs

Before writing any code, confirm with the user (or read from the kernel source):

Item	Where to look
Kernel source path	User input
Kernel shape (mix / aic-only / aiv-only)	Step 0 above
Tensor args (count, dtypes, shapes)	`args[0..N]` reads in `kernel_entry`
Scalar args (count, values)	`args[N..]` reads, treat float scalars as `f32_bits()`
`hw_block_dim`	SPMD mix: `<<<HW, ...>>>` in the original launcher (typically 24); AIC-only / AIV-only: default 1
FIFO slot sizes (SPMD mix only)	`PAConfig<...>::_SLOT_SIZE FIFO_DEPTH * hw_blocks` — compute from kernel constants, not from memory
Per-task work knobs	Anything that scales per-block work (`n_blocks`, `context_len`, batch)

Decide a shrunk shape. The simulator runs camodel in serial mode ([TmSim]: Run in serial mode.), so total runtime ≈ Σ (per-block work). Practical targets:

Aim for a few-minute simulated run. As a rule of thumb, total work units (e.g. batch × n_blocks for paged-attention) in the low hundreds finishes in a few minutes; tens of thousands will exceed any reasonable task-submit --max-time. Single-task AIC-only / AIV-only kernels with small n_blocks finish in well under a minute.
For SPMD mix: keep total_logical_blocks ≈ hw_block_dim so every hw block has work and the trace shows all cores active.
Shrink the inner dimension that scales per-block work (e.g. context_len / n_blocks for paged-attention) when the kernel allows.

If the kernel hardcodes the shape via static constexpr (e.g. CASE1_BATCH = 256), copy the source into the workspace and sed -i it smaller. The wrapper #includes the local copy. Sync the host-side kBatch to the same value — both must move together or the kernel walks off-end. For per-task kernels (aic_pv_matmul etc.) the shape comes from the Tensor shapes you build in replay_host.cpp, so just shrink those directly.

Step 2 — Create the workspace

TS=$(date +%Y%m%d_%H%M%S)
WS="$REPO_ROOT/outputs/insight_trace_<case>_${TS}"
mkdir -p "$WS"

Five files go in $WS. Each has a small fixed shape; the only kernel- specific bits are the #include path, HW_BLOCK_NUM, and the host-side shape constants and tensor init_tensor(...) calls. The full templates for each file are inlined in this skill — you do not need to copy from any pre-existing workspace.

The build must be a single mix-arch .so with a single launch_replay. Splitting into separate --cce-aicore-arch=dav-c220-cube and dav-c220-vec libraries does not produce an OPPROF dump under msprof op simulator — see "Common pitfalls" below.

File 1 — `replay_kernel.cpp` (mix-arch wrapper)

Inline the absolute include path to the target .cpp. The wrapper file is compiled twice by bisheng under --cce-aicore-arch=dav-c220 (once per arch); both variants land in the same .so. Only #include the kernel under the arch macro it actually compiles for — including an AIC-only kernel under __DAV_VEC__ (or vice versa) makes bisheng try to compile pipe/tile intrinsics for the wrong arch and fails.

All three variants below share the same top-of-file boilerplate:

#include <stdint.h>

#ifndef AICORE
#define AICORE [aicore]
#endif

Where the per-variant template says <prologue block>, paste these eight lines verbatim before the kernel #include:

#ifndef __CCE_AICORE__
#define __CCE_AICORE__ 220
#endif
#include <cce_aicore_intrinsics.h>
#ifndef PTO_NPU_ARCH_A2A3
#define PTO_NPU_ARCH_A2A3
#endif
#ifndef EVENT_ID7
#define EVENT_ID7 ((event_t)7)
#endif
#ifndef PIPE_FIX
#define PIPE_FIX ((pipe_t)10)
#endif

1a. SPMD mix kernel (e.g. `paged_attention_parallel.cpp`)

#if defined(__DAV_CUBE__) || defined(__DAV_VEC__)
// <prologue block>
#include "<ABSOLUTE PATH TO KERNEL .cpp>"
#endif

extern "C" __global__ AICORE void replay_entry(
    __gm__ int64_t *aic_args, __gm__ int64_t *aiv_args
) {
#if defined(__DAV_CUBE__)
    int32_t hw_idx = get_block_idx();
    kernel_entry(aic_args + static_cast<uint64_t>(hw_idx) * 50);
#endif
#if defined(__DAV_VEC__)
    int32_t lane_idx = static_cast<int32_t>(
        get_block_idx() * get_subblockdim() + get_subblockid());
    kernel_entry(aiv_args + static_cast<uint64_t>(lane_idx) * 50);
#endif
}

50 = MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(32) + 2. Do not write 16/17 here — that was the old broken pattern.

1b. AIC-only single-task kernel (e.g. `aic_pv_matmul.cpp`, `aic_qk_matmul.cpp`)

The kernel only includes under __DAV_CUBE__. The __DAV_VEC__ variant of replay_entry still gets emitted (so the replay_entry_mix_aiv symbol exists for msprof to find) but its body is empty.

#if defined(__DAV_CUBE__)
// <prologue block>
#include "<ABSOLUTE PATH TO AIC KERNEL .cpp>"
#endif

extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args) {
#if defined(__DAV_CUBE__)
    kernel_entry(args);
#endif
    // __DAV_VEC__ variant: intentionally empty — AIC-only kernel.
}

For HW_BLOCK_NUM > 1 (each AIC core reading its own row), use kernel_entry(args + static_cast<uint64_t>(get_block_idx()) * 50) and allocate per-row args host-side. Default HW_BLOCK_NUM=1 shares one args row across all cores.

1c. AIV-only single-task kernel (e.g. `aiv_softmax_prepare.cpp`, `aiv_online_update.cpp`)

Mirror of 1b with arch macros swapped:

#if defined(__DAV_VEC__)
// <prologue block>
#include "<ABSOLUTE PATH TO AIV KERNEL .cpp>"
#endif

extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args) {
#if defined(__DAV_VEC__)
    kernel_entry(args);
#endif
    // __DAV_CUBE__ variant: intentionally empty — AIV-only kernel.
}

For HW_BLOCK_NUM > 1, both AIV lanes (sub_block_id 0/1) of each hw core are dispatched — use kernel_entry(args + (get_block_idx() * get_subblockdim() + get_subblockid()) * 50) to give each lane its own row.

File 2 — `replay_launch.cpp`

The launcher signature must match the wrapper (replay_entry) above. For SPMD mix the wrapper takes two args pointers; for AIC-only / AIV-only it takes one.

2a. SPMD mix

#include <stdint.h>
#ifndef AICORE
#define AICORE [aicore]
#endif

extern "C" __global__ AICORE void replay_entry(
    __gm__ int64_t *aic_args, __gm__ int64_t *aiv_args);

extern "C" void launch_replay(void *aic_args, void *aiv_args, void *stream) {
    replay_entry<<<HW_BLOCK_NUM, nullptr, stream>>>(
        (__gm__ int64_t *)aic_args, (__gm__ int64_t *)aiv_args);
}

2b. AIC-only or AIV-only

#include <stdint.h>
#ifndef AICORE
#define AICORE [aicore]
#endif

extern "C" __global__ AICORE void replay_entry(__gm__ int64_t *args);

extern "C" void launch_replay(void *args, void *stream) {
    replay_entry<<<HW_BLOCK_NUM, nullptr, stream>>>(
        (__gm__ int64_t *)args);
}

Replace HW_BLOCK_NUM with the kernel's hw block dim literal — typically 24 for SPMD mix, 1 for per-task AIC-only / AIV-only kernels (see Step 1).

File 3 — `replay_host.cpp`

ACL host runner. Common responsibilities (all shapes):

aclInit → aclrtSetDevice(getenv("ACL_DEVICE_ID")) → one stream.
aclrtMalloc for every input/output tensor and every FIFO/scratch buffer the kernel reads or writes.
Build the kernel's Tensor structs on host, aclrtMemcpy to a device-side d_tensors array.
Build per-row args[] arrays. Per row: slot 0..N-1 = device pointer to d_tensors[i], scalar slots per kernel.
launch_replay(...) → aclrtSynchronizeStream → free.

The shape-specific differences are in the args layout and whether LocalContext/GlobalContext arrays are built.

3a. SPMD mix args layout

FIFO sizes: hw_blocks * FIFO_DEPTH * SLOT_SIZE using the kernel's MAX_Q_TILE-equivalent constants.
LocalContext[hw_blocks] + GlobalContext[hw_blocks] for AIC, and LocalContext[2*hw_blocks] + GlobalContext[2*hw_blocks] for AIV. block_idx = hw block index for both; sub_block_id = lane index (0/1) for AIV, 0 for AIC.
aic_args[hw_blocks * 50] and aiv_args[2*hw_blocks * 50]. Per row: slot 0..N-1 = tensor pointers, scalar slots per kernel, row[48] = device-side LocalContext for that row, row[49] = device-side GlobalContext.
launch_replay(d_aic_args, d_aiv_args, stream).

Sketch (initialize ACL, allocate device buffers for inputs/outputs and the FIFO rings, build per-hw-block LocalContext/GlobalContext, populate the args rows, then launch and synchronize):

constexpr int kArgsSlots = 50;
constexpr int kHwBlocks = /* kernel's hw block dim, e.g. 24 */;
constexpr int kAivLanes = kHwBlocks * 2;

// d_tensors, d_*_local, d_*_global already aclrtMalloc'd and populated.
auto fill_args = [&](std::vector<int64_t> &args, int rows,
                     uintptr_t local_base, uintptr_t global_base) {
    for (int r = 0; r < rows; ++r) {
        int64_t *row = args.data() + uint64_t(r) * kArgsSlots;
        for (int i = 0; i < kNumTensors; ++i) {
            row[i] = (int64_t)((uintptr_t)d_tensors + i * sizeof(Tensor));
        }
        // row[kNumTensors..] = scalar values (f32_bits() for floats)
        row[48] = (int64_t)(local_base  + r * sizeof(LocalContext));
        row[49] = (int64_t)(global_base + r * sizeof(GlobalContext));
    }
};
std::vector<int64_t> aic_args(kHwBlocks  * kArgsSlots, 0);
std::vector<int64_t> aiv_args(kAivLanes * kArgsSlots, 0);
fill_args(aic_args, kHwBlocks,  (uintptr_t)d_aic_local,  (uintptr_t)d_aic_global);
fill_args(aiv_args, kAivLanes,  (uintptr_t)d_aiv_local,  (uintptr_t)d_aiv_global);
// aclrtMemcpy both into d_aic_args / d_aiv_args, then:
launch_replay(d_aic_args, d_aiv_args, stream);

LocalContext.block_idx = hw block index (r for AIC, r/2 for AIV); GlobalContext.sub_block_id = lane index (0/1 for AIV, 0 for AIC).

3b. AIC-only / AIV-only positional args layout

No FIFO rings (those are runtime-managed; here you trace one task in isolation).
No LocalContext / GlobalContext arrays — the kernel doesn't read slot 48/49.
args[HW_BLOCK_NUM * 50] (50 slots is uniform; using the kernel's exact slot count works too). Per row: slot 0..N-1 = tensor pointers for the kernel's positional args[0..N-1] reads, slot N..M = scalar values (treat float scalars as f32_bits()). Slots 48/49 are zero.
launch_replay(d_args, stream).
Default HW_BLOCK_NUM=1 → only fill row 0 and aclrtMemcpy 50 int64_ts.

Sketch for aic_pv_matmul.cpp (4 tensors + 2 scalars, HW_BLOCK_NUM=1):

constexpr int kArgsSlots = 50;
// pij_buf, value_cache, block_table, oi_new + (n_blocks, bt_offset)
init_tensor(&tensors[0], d_pij,        kPijBytes,  DataType::BFLOAT16, {n_blocks, M, K});
init_tensor(&tensors[1], d_value_cache,kVcBytes,   DataType::BFLOAT16, {kTotalKvBlocks, kBlockSize, kKvHeads, kHeadDim});
init_tensor(&tensors[2], d_block_table,kBtBytes,   DataType::INT32,    {kBatch, kMaxBlocksPerReq});
init_tensor(&tensors[3], d_oi_new,     kOiBytes,   DataType::FLOAT32,  {kQTile, kHeadDim});

std::array<int64_t, kArgsSlots> args{};
args[0] = (int64_t)((uintptr_t)d_tensors + 0 * sizeof(Tensor));
args[1] = (int64_t)((uintptr_t)d_tensors + 1 * sizeof(Tensor));
args[2] = (int64_t)((uintptr_t)d_tensors + 2 * sizeof(Tensor));
args[3] = (int64_t)((uintptr_t)d_tensors + 3 * sizeof(Tensor));
args[4] = (int64_t)kNBlocks;    // n_blocks
args[5] = (int64_t)kBtOffset;   // bt_offset
ACL_CHECK(aclrtMemcpy(d_args, sizeof(args), args.data(), sizeof(args),
                      ACL_MEMCPY_HOST_TO_DEVICE));
launch_replay(d_args, stream);

For aiv_softmax_prepare.cpp etc., follow the same pattern — count the positional args[i] reads in kernel_entry, allocate tensors of matching shape, and pack scalars (remember f32_bits() for floats).

Filling args[] with zeros (aclrtMemset(d_query, ..., 0)) is fine — the trace covers pipeline structure, not numerical correctness. If the kernel has data-dependent early-exit / scale-by-max paths and you want those branches in the trace, fill with random instead. (Note: if a kernel reads block_table to compute address offsets, leaving it zero makes every block point at value_cache[0] — fine for pipeline tracing but identical addresses across blocks.)

File 4 — `CMakeLists.txt`

Single mix-arch shared library: bisheng compiles the wrapper twice under --cce-aicore-arch=dav-c220 (no -cube/-vec suffix) and ships both variants in one .so. Do not split into two --cce-aicore-arch=dav-c220-cube + dav-c220-vec libs — that layout produces an empty OPPROF dump under msprof op simulator.

Generic template (set REPO_ROOT / PTO_ISA_ROOT from the environment; adjust project() name only):

cmake_minimum_required(VERSION 3.16)

set(CMAKE_C_COMPILER bisheng)
set(CMAKE_CXX_COMPILER bisheng)

project(insight_trace_replay)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

if(NOT DEFINED ENV{ASCEND_HOME_PATH})
    message(FATAL_ERROR "ASCEND_HOME_PATH is not set (source CANN set_env.sh first)")
endif()
set(ASCEND_HOME_PATH $ENV{ASCEND_HOME_PATH})
set(SOC_VERSION dav_2201 CACHE STRING "Simulator SoC version")
set(PTO_ISA_ROOT $ENV{PTO_ISA_ROOT} CACHE PATH "PTO ISA root")
set(REPO_ROOT $ENV{REPO_ROOT} CACHE PATH "simpler repo root")

add_compile_options(
    -D_FORTIFY_SOURCE=2 -O2 -std=c++17
    -Wno-macro-redefined -Wno-ignored-attributes
    -fstack-protector-strong -fPIC
)
add_link_options(-s -Wl,-z,relro -Wl,-z,now)

set(CMAKE_CCE_COMPILE_OPTIONS
    -xcce -fenable-matrix --cce-aicore-enable-tl -fPIC
    -Xhost-start -Xhost-end
    "SHELL:-mllvm -cce-aicore-stack-size=0x8000"
    "SHELL:-mllvm -cce-aicore-function-stack-size=0x8000"
    "SHELL:-mllvm -cce-aicore-record-overflow=true"
    "SHELL:-mllvm -cce-aicore-addr-transform"
    "SHELL:-mllvm -cce-aicore-dcci-insert-for-scalar=false"
)
set(CMAKE_CPP_COMPILE_OPTIONS
    -xc++
    "SHELL:-include stdint.h"
    "SHELL:-include stddef.h"
)

set(COMMON_INCLUDES
    ${PTO_ISA_ROOT}/include
    ${PTO_ISA_ROOT}/include/pto
    ${REPO_ROOT}/src/a2a3/runtime/tensormap_and_ringbuffer/runtime
    ${REPO_ROOT}/src/a2a3/runtime/tensormap_and_ringbuffer/common
    ${REPO_ROOT}/src/common/task_interface
    ${REPO_ROOT}/src/a2a3/platform/include
    ${REPO_ROOT}/simpler_setup/incore
    ${ASCEND_HOME_PATH}/pkg_inc
    ${ASCEND_HOME_PATH}/pkg_inc/profiling
    ${ASCEND_HOME_PATH}/pkg_inc/runtime/runtime
    ${ASCEND_HOME_PATH}/include
)

add_library(replay_kernel SHARED replay_kernel.cpp replay_launch.cpp)
target_compile_options(replay_kernel PRIVATE
    ${CMAKE_CCE_COMPILE_OPTIONS}
    --cce-aicore-arch=dav-c220
    -DREGISTER_BASE -std=c++17)
target_include_directories(replay_kernel PRIVATE ${COMMON_INCLUDES})
target_link_options(replay_kernel PRIVATE --cce-fatobj-link)

add_executable(replay_host replay_host.cpp)
target_compile_options(replay_host PRIVATE ${CMAKE_CPP_COMPILE_OPTIONS})
target_include_directories(replay_host PRIVATE ${COMMON_INCLUDES})
target_link_directories(replay_host PUBLIC
    ${ASCEND_HOME_PATH}/lib64
    ${ASCEND_HOME_PATH}/aarch64-linux/simulator/${SOC_VERSION}/lib
)
target_link_libraries(replay_host PRIVATE
    replay_kernel
    runtime_camodel
    stdc++ ascendcl m tiling_api platform c_sec dl nnopbase
)

The COMMON_INCLUDES paths assume the standard simpler-repo layout (src/a2a3/runtime/..., src/common/...). If your kernel pulls in headers from another arch directory, append to COMMON_INCLUDES.

File 5 — `run_collect.sh`

Build + collect + export wrapped in one bash script so task-submit can run it under a single NPU lock. Generic template — set CANN_HOME / PTO_ISA_ROOT / REPO_ROOT to whatever this machine provides; do not hard-code absolute paths into the workspace:

#!/usr/bin/env bash
set -euo pipefail

# Required env (caller sets these — typical values are machine-specific):
#   CANN_HOME       e.g. .../cann-x.y.z-betaN  (must contain set_env.sh)
#   PTO_ISA_ROOT    path to the pto-isa repo checkout
#   REPO_ROOT       path to the simpler repo checkout
: "${CANN_HOME:?CANN_HOME must be set}"
: "${PTO_ISA_ROOT:?PTO_ISA_ROOT must be set}"
: "${REPO_ROOT:?REPO_ROOT must be set}"

WS="${WS:-$(dirname "$(readlink -f "$0")")}"
SOC_VERSION="${SOC_VERSION:-dav_2201}"
DEVICE_ID="${TARGET_DEVICE_ID:-${NPU_LOCKED_DEVICE:-0}}"
BUILD_DIR="$WS/build"
COLLECT_DIR="$WS/msprof_collect"
EXPORT_ROOT="$WS/insight_export"

source "$CANN_HOME/../cann/set_env.sh" 2>/dev/null \
  || source "$CANN_HOME/set_env.sh"
export ASCEND_HOME_PATH="$CANN_HOME"
SIM_LIB_DIR="$CANN_HOME/aarch64-linux/simulator/$SOC_VERSION/lib"
export LD_LIBRARY_PATH="$BUILD_DIR:$SIM_LIB_DIR:$CANN_HOME/lib64:$CANN_HOME/aarch64-linux/devlib:$CANN_HOME/devlib:${LD_LIBRARY_PATH:-}"
export ACL_DEVICE_ID="$DEVICE_ID"
mkdir -p "$BUILD_DIR" "$COLLECT_DIR" "$EXPORT_ROOT"

cmake -G Ninja -S "$WS" -B "$BUILD_DIR" \
    -DSOC_VERSION="$SOC_VERSION" \
    -DPTO_ISA_ROOT="$PTO_ISA_ROOT" \
    -DREPO_ROOT="$REPO_ROOT"
cmake --build "$BUILD_DIR" --target replay_host

msprof op simulator \
  --application="$BUILD_DIR/replay_host" \
  --kernel-name="replay_entry" \
  --launch-count=1 \
  --soc-version="$SOC_VERSION" \
  --timeout=120 \
  --output="$COLLECT_DIR/out" \
  2>&1 | tee "$COLLECT_DIR/msprof_collect.log"

OPPROF_DIR="$(find "$COLLECT_DIR/out" -maxdepth 1 -mindepth 1 -type d -name 'OPPROF_*' | sort | tail -n 1)"
test -n "$OPPROF_DIR"
if [[ -d "$OPPROF_DIR/device0/tmp_dump" ]]; then
  EXPORT_SRC="$OPPROF_DIR/device0/tmp_dump"
else
  EXPORT_SRC="$OPPROF_DIR/dump"
fi

msprof op simulator --export="$EXPORT_SRC" --output="$EXPORT_ROOT" \
  2>&1 | tee "$EXPORT_ROOT/msprof_export.log"

--kernel-name=replay_entry is mandatory under recent CANN releases — without it, the simulator reports success but writes no tmp_dump. msprof auto-matches replay_entry_mix_aic / replay_entry_mix_aiv from this unmangled name. This is the single most common reason for an empty OPPROF dump.

Step 3 — Smoke-build before submitting to the queue

If your dev environment uses an NPU-lock queue (task-submit or similar), compile errors should fail locally first. Source the CANN environment and run cmake:

source "$CANN_HOME/set_env.sh"   # or the project's set_env wrapper
export PTO_ISA_ROOT="..."
export REPO_ROOT="..."
cmake -G Ninja -S "$WS" -B "$WS/build" \
    -DSOC_VERSION=dav_2201 \
    -DPTO_ISA_ROOT="$PTO_ISA_ROOT" \
    -DREPO_ROOT="$REPO_ROOT"
cmake --build "$WS/build" --target replay_host
nm -D "$WS/build/libreplay_kernel.so" | grep -E ' T (replay_entry|launch_replay)$'

Both replay_entry and launch_replay must show as T symbols. Mix-arch device variants live in fatbinary sections and do not appear in nm -D — that is expected.

Step 4 — Run the collect script

If your environment has a queue gate (e.g. task-submit), wrap the collect run with the appropriate --max-time:

task-submit --device auto --max-time 1800 --run "bash $WS/run_collect.sh"

If you have direct simulator access, just run bash $WS/run_collect.sh after exporting the env vars from Step 3.

Choose --max-time based on expected work units (Σ per-block work). A few hundred units finishes in well under 30 minutes; tens of thousands of units may need the maximum allowed time and is usually a sign the shape needs to be shrunk further (Step 1).

The collect log should show:

[INFO] <ProfInit> Start profiling on kernel: replay_entry_mix_aic
... Profiling running finished. All task success.

Step 5 — Verify Insight artifacts

The new (CANN 9.0.0-beta.2) collector already writes the export-shaped layout under OPPROF_*/simulator/; the export step in the script just regularizes paths. Required output:

$WS/insight_export/OPPROF_*/simulator/
├── trace.json                          # top-level timeline
├── visualize_data.bin                  # Insight visualizer payload
├── core{N}.cubecore0/                  # one per AIC hw block
│   ├── trace.json
│   └── core{N}.cubecore0_instr_exe_*.csv
└── core{N}.veccore{0,1}/               # one per AIV lane
    ├── trace.json
    └── core{N}.veccore{0,1}_instr_exe_*.csv

For an SPMD mix launch you should see one cubecore0 + two veccore subdirectories per hw block (so HW_BLOCK_NUM × 3 instr_exe CSVs). For an AIC-only launch with HW_BLOCK_NUM=1, you see one cubecore0/ (no veccore/); for AIV-only, two veccore{0,1}/ (no cubecore0/). Drag the simulator/ directory into MindStudio Insight.

Common pitfalls (in order of how often they bite)

Empty OPPROF dump after a "successful" simulator run. Almost always: --kernel-name was missing. Re-add it and re-run.
Compile error: AIC pipe/tile intrinsic in AIV variant (or vice versa). The wrapper #included an AIC-only kernel under both __DAV_CUBE__ and __DAV_VEC__. Use the 1b/1c templates: include the kernel only under the matching arch, leave the other branch's replay_entry body empty.
<<<>>> launch fails to dispatch AIV (SPMD mix). Wrapper compiled as cube-only — confirm --cce-aicore-arch=dav-c220 (no suffix) and that the wrapper's __DAV_VEC__ branch is present and includes the kernel.
Simulator runs full duration then SIGKILL. Shape too large for --max-time. Shrink the inner dimension first (context_len, n_blocks), only then batch. Partial dumps may still export but trace.json bloats to GB.
Compile error: intrinsic.h not found. The kernel includes intrinsic.h and tensor.h by name. Confirm COMMON_INCLUDES has src/a2a3/runtime/tensormap_and_ringbuffer/{runtime,common} — tensor.h lives in runtime/, intrinsic.h in common/.
Off-by-one in SPMD args layout. Slot 48/49 are positional; if you write 16/17 (right after the kernel's last scalar) the kernel reads garbage and silently misbehaves. Always 48/49. Does not apply to AIC-only / AIV-only positional args (slot 48/49 unused there).
AIC-only/AIV-only kernel reads zeros from a tensor index. Forgot to populate block_table etc. — zeros are valid pointers but collapse all per-block lookups onto entry 0. Acceptable for pipeline tracing; fix only if the trace needs to exercise distinct addresses.
Kernel missed debug_line information warning. Cosmetic. Insight loses source-level call stack but trace.json/CSVs are unaffected. Add -g to the wrapper's compile options if needed.
x86_64-linux/simulator/... not found. The CANN install on this host is aarch64. CMakeLists and LD_LIBRARY_PATH must point at aarch64-linux/simulator/$SOC_VERSION/lib.

insight-trace

More from this repository

More from this repository

Insight Trace for kernel_entry(args) Kernels

When to use

Step 0 — Classify the kernel shape

Step 1 — Gather inputs

Step 2 — Create the workspace

File 1 — replay_kernel.cpp (mix-arch wrapper)

1a. SPMD mix kernel (e.g. paged_attention_parallel.cpp)

1b. AIC-only single-task kernel (e.g. aic_pv_matmul.cpp, aic_qk_matmul.cpp)

1c. AIV-only single-task kernel (e.g. aiv_softmax_prepare.cpp, aiv_online_update.cpp)

File 2 — replay_launch.cpp

2a. SPMD mix

2b. AIC-only or AIV-only

File 3 — replay_host.cpp

3a. SPMD mix args layout

3b. AIC-only / AIV-only positional args layout

File 4 — CMakeLists.txt

File 5 — run_collect.sh

Step 3 — Smoke-build before submitting to the queue

Step 4 — Run the collect script

Step 5 — Verify Insight artifacts

Common pitfalls (in order of how often they bite)

Final checklist

Insight Trace for kernel_entry(args) Kernels

When to use

Step 0 — Classify the kernel shape

Step 1 — Gather inputs

Step 2 — Create the workspace

File 1 — replay_kernel.cpp (mix-arch wrapper)

1a. SPMD mix kernel (e.g. paged_attention_parallel.cpp)

1b. AIC-only single-task kernel (e.g. aic_pv_matmul.cpp, aic_qk_matmul.cpp)

1c. AIV-only single-task kernel (e.g. aiv_softmax_prepare.cpp, aiv_online_update.cpp)

File 2 — replay_launch.cpp

2a. SPMD mix

2b. AIC-only or AIV-only

File 3 — replay_host.cpp

3a. SPMD mix args layout

3b. AIC-only / AIV-only positional args layout

File 4 — CMakeLists.txt

File 5 — run_collect.sh

Step 3 — Smoke-build before submitting to the queue

Step 4 — Run the collect script

Step 5 — Verify Insight artifacts

Common pitfalls (in order of how often they bite)

Final checklist

Insight Trace for `kernel_entry(args)` Kernels

File 1 — `replay_kernel.cpp` (mix-arch wrapper)

1a. SPMD mix kernel (e.g. `paged_attention_parallel.cpp`)

1b. AIC-only single-task kernel (e.g. `aic_pv_matmul.cpp`, `aic_qk_matmul.cpp`)

1c. AIV-only single-task kernel (e.g. `aiv_softmax_prepare.cpp`, `aiv_online_update.cpp`)

File 2 — `replay_launch.cpp`

File 3 — `replay_host.cpp`

File 4 — `CMakeLists.txt`

File 5 — `run_collect.sh`

Insight Trace for `kernel_entry(args)` Kernels

File 1 — `replay_kernel.cpp` (mix-arch wrapper)

1a. SPMD mix kernel (e.g. `paged_attention_parallel.cpp`)

1b. AIC-only single-task kernel (e.g. `aic_pv_matmul.cpp`, `aic_qk_matmul.cpp`)

1c. AIV-only single-task kernel (e.g. `aiv_softmax_prepare.cpp`, `aiv_online_update.cpp`)

File 2 — `replay_launch.cpp`

File 3 — `replay_host.cpp`

File 4 — `CMakeLists.txt`

File 5 — `run_collect.sh`