| name | incore-profiling |
| description | Profile PyPTO kernels in-core with the Ascend msprof op-simulator — cycle-accurate per-kernel traces. Use when the user wants to profile a built case, inspect kernel timing or instruction streams, or generate MindStudio Insight traces. |
In-Core Kernel Profiling (msprof op-simulator)
Run cycle-accurate, single-AI-core profiling of every PTOAS kernel in a PyPTO
build via the Ascend msprof op simulator. For each kernel the tool generates a
standalone testcase, builds it, runs it on the op-simulator, and collects the
Insight trace artifacts.
This is runtime profiling — distinct from the compiler's
report/perf_hints.log, which records compile-time hints. Use that file for
"why is codegen suggesting X"; use this skill for "how does the kernel actually
execute".
When to use
- The user wants per-kernel timing / instruction-level traces of a built case.
- The user asks for MindStudio Insight traces or
trace.json artifacts.
- The user wants to compare kernel execution cost across changes.
Prerequisites
- A built case under
build_output/<case>/ — it must contain generated .cpp
kernels under a ptoas/ directory (top-level, or nested as
next_levels/**/ptoas/; the tool discovers both), each with its sibling
.pto (the testcase generator reads the .pto's make_tensor_view shapes
and strides to size the GM buffers). PyPTO emits both side by side, so this
is automatic.
- A working CANN installation. The tool auto-discovers set_env.sh; pass
--cann-set-env <path> if discovery fails. The installation must be TL-capable
(its bisheng accepts --cce-aicore-enable-tl). ascend-toolkit/latest is often
an older non-TL CANN, so prefer the toolchain used for the device build (the one
ASCEND_HOME_PATH points to). The tool preflights this and fails early with a
clear message otherwise.
- The set_env.sh must be readable/sourceable by the current user; per-kernel
build and collect dirs are created private (mode 0700).
- ptoas-bin installed. If missing, run the /pto-env-setup skill.
No PTOAS source checkout is needed — testcases are generated by the bundled
gen_profiling_case.py (a small .pto-driven generator). Pass --ptoas-root <checkout> only if you want PTOAS's full validation-harness generator instead.
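The discovery rule from the prerequisites (kernel .cpp files under a ptoas/ directory, top-level or nested, each with a sibling .pto) can be sketched as follows. This is an illustration only, not the tool's code: find_kernels is a hypothetical helper, and it assumes the sibling .pto shares the .cpp's file stem.

```python
from pathlib import Path


def find_kernels(build_dir):
    """Return (ready, missing_pto) lists of kernel .cpp paths under ptoas/ dirs.

    Hypothetical helper: assumes the sibling .pto shares the .cpp's stem.
    """
    ready, missing_pto = [], []
    root = Path(build_dir)
    # rglob matches <build>/ptoas/ as well as any nested .../ptoas/ directory,
    # which includes the next_levels/**/ptoas/ layout.
    for ptoas_dir in sorted(p for p in root.rglob("ptoas") if p.is_dir()):
        for cpp in sorted(ptoas_dir.glob("*.cpp")):
            if cpp.with_suffix(".pto").is_file():
                ready.append(cpp)
            else:
                missing_pto.append(cpp)
    return ready, missing_pto
```

Anything that lands in missing_pto would hit the "sibling .pto not found" case described under Troubleshooting.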
Quick start
python .claude/skills/incore-profiling/incore_profile.py \
--build-dir build_output/<case> --target a2a3
- --target a2a3 for Ascend A2/A3 devices, --target a5 for A5. This sets the
compile arch (dav-c220 / dav-c310) and constrains camodel-SoC selection.
- --case <model.py> instead of --build-dir builds the case first, then
profiles it. Arguments after -- are forwarded to the case script.
- --list-funcs --build-dir <dir> previews the kernels without running anything
(needs no toolchain).
- --func <name> profiles a single kernel; repeatable.
CANN, the camodel SoC, and the compile arch are auto-resolved from --target.
Override any of them with --cann-set-env, --soc-version, --aicore-arch.
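When the profiler is driven from a script, the command line can be assembled from the flags above. The sketch below is a minimal example under stated assumptions: profile_argv is a hypothetical helper, it uses only flags named in this document, and the kernel names in the usage line are placeholders.

```python
import sys


def profile_argv(build_dir, target, funcs=()):
    """Build the incore_profile.py command line from the flags described above."""
    if target not in ("a2a3", "a5"):
        raise ValueError(f"unknown target: {target!r}")
    argv = [
        sys.executable,
        ".claude/skills/incore-profiling/incore_profile.py",
        "--build-dir", str(build_dir),
        "--target", target,
    ]
    for name in funcs:
        # --func is repeatable: emit one flag per kernel.
        argv += ["--func", name]
    return argv
```

A call such as subprocess.run(profile_argv("build_output/<case>", "a2a3", ["kernel_a"]), check=True) would then run the profiler from the repo root.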
Output
Each run writes to <build-dir>/kernel_insight_all_funcs_<timestamp>/:
- manifest_export.csv and summary.txt: index and per-kernel status.
- funcs/<kernel>/collect/out/OPPROF_*/simulator/:
  - trace.json and visualize_data.bin: open these in MindStudio Insight.
  - core0.*/: per-core trace.json, *_instr_exe.csv, *_code_exe.csv for
    instruction-level analysis.
A final EXPORTED N/M line reports how many kernels succeeded.
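To collect the top-level Insight traces of a finished run, the directory layout above can be walked with a glob. This is a sketch that assumes only the paths listed in this section; find_traces is a hypothetical helper.

```python
from pathlib import Path


def find_traces(run_dir):
    """Map kernel name -> list of top-level simulator trace.json paths."""
    root = Path(run_dir)
    traces = {}
    pattern = "funcs/*/collect/out/OPPROF_*/simulator/trace.json"
    for trace in sorted(root.glob(pattern)):
        # Path relative to the run dir: funcs/<kernel>/collect/out/...
        kernel = trace.relative_to(root).parts[1]
        traces.setdefault(kernel, []).append(trace)
    return traces
```

The pattern stops at simulator/trace.json, so the per-core copies under core0.*/ are not included.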
Next step: clean the trace
The raw visualize_data.bin is cluttered (sync flags, cache-miss / control-flow
lanes). Turn it into a de-cluttered, Perfetto-viewable per-pipe trace with the
repo tool:
python -m pypto.tools.clean_sim_trace \
<build-dir>/kernel_insight_all_funcs_<ts>/funcs/<kernel>/collect/out/OPPROF_* -o <out>
It writes trace.clean.json (pipeline lanes in dataflow order
MTE2→MTE1→CUBE→VECTOR→FIXPIPE→MTE3, sync flags re-anchored as flow arrows) and
instr_metrics.json (per-instruction pipe / cycles / vector-utilization). The
per-pipe cycle breakdown is the fastest way to spot a degenerate trace
(CUBE=0 cycles — see Caveats). Rename the cleaned trace to
<kernel>.clean.json straight away (see below) so multiple profiled kernels stay
distinguishable when downloaded side by side for Perfetto.
Where to put the cleaned trace
Always write the cleaned, presentable trace under the repo's build_output/
in a descriptive, self-documenting folder — never leave it in /tmp. Use the
name pattern incore_<kernel>_<source-stem>_<timestamp>/ (e.g.
build_output/incore_fa_fused_decode2l_20260611_160643/) so the kernel, the
originating script, and the run are obvious from the directory listing alone.
Recommended layout inside that folder:
build_output/incore_<kernel>_<source>_<ts>/
  <kernel>.clean.json   cleaned per-pipe trace (the deliverable)
  instr_metrics.json    per-instruction pipe / cycles / vector-utilization
  raw_simulator/        visualize_data.bin + per-core *_instr_exe.csv (for re-analysis)
  summary.txt           provenance (source script, wired workload, per-pipe breakdown)
build_output/ is gitignored, so these artifacts stay local. The summary.txt
must record the wired workload (e.g. fa_total, work-table, seq_lens, scalar
args) when real intermediates were patched in — otherwise the numbers are not
reproducible. Pass -o build_output/incore_<kernel>_<source>_<ts> to
clean_sim_trace so the output lands in the run folder directly (no nested
subfolder), then rename its trace.clean.json to <kernel>.clean.json — the
kernel-name prefix keeps multiple profiled kernels distinguishable when several
.clean.json files are downloaded together and opened in Perfetto:
mv build_output/incore_<kernel>_<source>_<ts>/trace.clean.json \
build_output/incore_<kernel>_<source>_<ts>/<kernel>.clean.json
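The naming pattern and the rename step can also be scripted. The sketch below is one possible way to do it: run_folder and rename_clean_trace are hypothetical helpers, and the timestamp layout is inferred from the example folder name (20260611_160643), not taken from the tool.

```python
from datetime import datetime
from pathlib import Path


def run_folder(kernel, source_stem, when=None, root="build_output"):
    """Return <root>/incore_<kernel>_<source-stem>_<timestamp>."""
    when = when or datetime.now()
    # Timestamp layout inferred from the example name: YYYYMMDD_HHMMSS.
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"incore_{kernel}_{source_stem}_{stamp}"


def rename_clean_trace(folder, kernel):
    """Rename trace.clean.json to <kernel>.clean.json inside folder."""
    src = Path(folder) / "trace.clean.json"
    dst = Path(folder) / f"{kernel}.clean.json"
    src.rename(dst)
    return dst
```

Pass the result of run_folder as the -o argument of clean_sim_trace, then call rename_clean_trace on the same folder.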
Troubleshooting
- Build fails with
unknown type name '__biasbuf__' or use of undeclared identifier 'aicore' (often after argument unused during compilation: '--cce-aicore-enable-tl') — the CANN is not TL-capable (commonly
ascend-toolkit/latest, e.g. 8.3.RC1). Re-run with a TL-capable
--cann-set-env <…/cann-8.5.x/set_env.sh>. The tool preflights this and should
now fail early naming the cause.
- Build fails inside
pto/npu/a5/*.hpp — wrong target; pass --target a2a3
for an A2/A3 device.
- ld: cannot find -lruntime_camodel means the auto-selected SoC has no camodel
library. Auto-selection substring-matches 910b and takes the first match,
which can be bare Ascend910B even when the camodel ships only for
Ascend910B1. List the real ones with
find "$ASCEND_HOME_PATH" -name libruntime_camodel.so and pass the variant
matching your device (npu-smi info), e.g. --soc-version Ascend910B1.
- Trace is ~0 cycles /
CUBE=0 / only SCALAR+sync instrs — the kernel is
data-dependent and the auto golden zeroed its control tensor. See Caveats.
- CANN set_env.sh not found: pass --cann-set-env <path>, or set
ASCEND_HOME_PATH / CANN_SET_ENV.
- sibling .pto not found: the kernel .cpp has no .pto next to it (the
generator reads it for buffer sizes). Use a ptoas/ dir that has both, or pass
--ptoas-root <PTOAS source checkout> to fall back to the validation generator.
- Export step reports "no dump file": handled automatically. Newer msprof
emits traces during collect, and the tool skips the redundant export pass.
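For the -lruntime_camodel case, the find command above has a direct Python equivalent, which is handy inside a diagnostic script. This is a sketch: find_camodel_libs is a hypothetical helper, and it only lists paths without interpreting them.

```python
import os
from pathlib import Path


def find_camodel_libs(cann_root=None):
    """List every libruntime_camodel.so under the CANN root."""
    root = cann_root or os.environ.get("ASCEND_HOME_PATH")
    if not root:
        raise RuntimeError("set ASCEND_HOME_PATH or pass cann_root")
    return sorted(Path(root).rglob("libruntime_camodel.so"))
```

As with the find command, read the available SoC variants off the returned paths and pass the one matching your device to --soc-version.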
Caveats
Data-dependent kernels read misleadingly fast. The auto-generated golden.py
zero-fills integer input buffers. If a kernel's loop trip-count, grid-stride
bound, or work-table length is read from an input tensor, the zeroed input
yields 0 iterations: a sub-microsecond, near-empty trace with CUBE=0 cycles
that looks like the kernel is free. This is an artifact of synthetic inputs, not
a fast kernel — the tool flags it with a WARN line and a non-empty message in
summary.txt/manifest_export.csv. The generator already sizes every GM buffer
to its full .pto shape, so you do not need to resize anything — to profile
such a kernel for real, just overwrite the control inputs with a valid set
(non-zero loop bound + dense work table + real lengths) by editing the case's
golden.py (or writing the vN.bin directly) and the scalar tail args in
main.cpp, then rebuild the *_sim target and re-run msprof op simulator.
Only the control tensors must be real — per-instruction cost is data-independent,
so the bulk data tensors can stay random/zero.
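If you write a vN.bin directly, patching it in place keeps the buffer at the full size the generator allocated. The sketch below is illustrative only: patch_int32_prefix is a hypothetical helper, and it assumes the control tensor is little-endian int32 with its values at the start of the file, which you should confirm against the case's golden.py.

```python
import struct
from pathlib import Path


def patch_int32_prefix(path, values):
    """Overwrite the first len(values) int32 slots of a raw buffer in place.

    Assumption: the buffer holds little-endian int32 data. The file size is
    left unchanged so the GM buffer still matches its .pto shape.
    """
    path = Path(path)
    buf = bytearray(path.read_bytes())
    patch = struct.pack(f"<{len(values)}i", *values)
    if len(patch) > len(buf):
        raise ValueError("patch is larger than the existing buffer")
    buf[:len(patch)] = patch
    path.write_bytes(bytes(buf))
    return len(buf)
```

After patching, rebuild the *_sim target and re-run the simulator as described above.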
How it works
For each kernel .cpp the tool: (1) generates a standalone testcase via the
bundled gen_profiling_case.py — it reads the kernel signature from the .cpp
and the GM buffer sizes from the sibling .pto's make_tensor_view shapes, then
emits main.cpp + launch.cpp (<<<1, …>>> single-core) + CMakeLists.txt +
golden.py; mixed cube+vector kernels get a trivial merged __global__
dispatcher (#if __DAV_CUBE__ → <k>_aic, #if __DAV_VEC__ → <k>_aiv). It
then (2) builds the simulator binary with CMake, (3) runs golden.py for input
data, (4) runs msprof op simulator to collect traces, (5) records the
artifacts. Steps run per kernel; a failing kernel does not stop the run unless
--no-keep-going is passed.
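The per-kernel, continue-on-failure control flow can be sketched as a small driver loop. This is not the tool's implementation: run_all and its step callables are hypothetical, and the summary line only mimics the EXPORTED N/M format mentioned under Output.

```python
def run_all(kernels, steps, keep_going=True):
    """Run every step for every kernel; return {kernel: error message or None}."""
    status = {}
    for kernel in kernels:
        status[kernel] = None
        try:
            for step in steps:
                step(kernel)
        except Exception as exc:
            # A failing step skips the remaining steps of this kernel only.
            status[kernel] = f"{type(exc).__name__}: {exc}"
            if not keep_going:
                break
    ok = sum(1 for err in status.values() if err is None)
    print(f"EXPORTED {ok}/{len(kernels)}")
    return status
```

Here steps would be callables for generate, build, golden, collect, and record, each taking the kernel name. With keep_going=False the loop stops at the first failing kernel, and later kernels never appear in the result.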
Future work
--target must currently be passed explicitly. Once the build pipeline records
the backend arch into the build folder, incore_profile.py can auto-detect it
and --build-dir alone will suffice. --target will remain as an override.