incore-profiling

Updated June 15, 2026 at 13:11

Profile PyPTO kernels in-core with the Ascend msprof op-simulator — cycle-accurate per-kernel traces. Use when the user wants to profile a built case, inspect kernel timing or instruction streams, or generate MindStudio Insight traces.

Source: hw-native-sys/pypto

In-Core Kernel Profiling (msprof op-simulator)

Run cycle-accurate, single-AI-core profiling of every PTOAS kernel in a PyPTO build via the Ascend msprof op simulator. For each kernel the tool generates a standalone testcase, builds it, runs it on the op-simulator, and collects the Insight trace artifacts.

This is runtime profiling — distinct from the compiler's report/perf_hints.log, which records compile-time hints. Use that file for "why is codegen suggesting X"; use this skill for "how does the kernel actually execute".

When to use

The user wants per-kernel timing / instruction-level traces of a built case.
The user asks for MindStudio Insight traces or trace.json artifacts.
The user wants to compare kernel execution cost across changes.

Prerequisites

A built case under build_output/<case>/. It must contain generated .cpp kernels under a ptoas/ directory (top-level, or nested as next_levels/**/ptoas/; the tool discovers both), each with a sibling .pto file. The testcase generator reads the .pto's make_tensor_view shapes and strides to size the GM buffers. PyPTO emits both files side by side, so this requirement is normally met automatically.
A working CANN installation. The tool auto-discovers set_env.sh; pass --cann-set-env <path> if discovery fails. The CANN must be TL-capable, meaning its bisheng accepts --cce-aicore-enable-tl. ascend-toolkit/latest is often an older non-TL CANN, so prefer the toolchain used for the device build (the one ASCEND_HOME_PATH points to). The tool preflights this and fails early with a clear message if the check does not pass.
The set_env.sh must be readable/sourceable by the current user; per-kernel build and collect dirs are created private (mode 0700).
ptoas-bin installed. If missing, run the /pto-env-setup skill.

No PTOAS source checkout is needed — testcases are generated by the bundled gen_profiling_case.py (a small .pto-driven generator). Pass --ptoas-root <checkout> only if you want PTOAS's full validation-harness generator instead.
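
Before launching a run, the first prerequisite can be checked by hand. The following is a minimal sketch, not the tool's own discovery code: `find_kernels` is a hypothetical helper, and it assumes kernel .cpp files sit directly inside each ptoas/ directory.

```python
from pathlib import Path


def find_kernels(build_dir):
    """Return (kernels, missing): kernel .cpp files found under any ptoas/
    directory, and the subset that has no sibling .pto next to it."""
    root = Path(build_dir)
    kernels, missing = [], []
    # rglob("ptoas") matches a top-level ptoas/ as well as next_levels/**/ptoas/.
    for ptoas_dir in sorted(p for p in root.rglob("ptoas") if p.is_dir()):
        # Assumption: kernels live directly in ptoas/, not in subfolders.
        for cpp in sorted(ptoas_dir.glob("*.cpp")):
            kernels.append(cpp)
            if not cpp.with_suffix(".pto").is_file():
                missing.append(cpp)
    return kernels, missing
```

Any path returned in `missing` would be expected to trigger the "sibling .pto not found" error described under Troubleshooting.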

Quick start

python .claude/skills/incore-profiling/incore_profile.py \
  --build-dir build_output/<case> --target a2a3

Use --target a2a3 for Ascend A2/A3 devices and --target a5 for A5. The target sets the compile arch (dav-c220 / dav-c310, respectively) and constrains camodel-SoC selection.
--case <model.py> instead of --build-dir builds the case first, then profiles it. Arguments after -- are forwarded to the case script.
--list-funcs --build-dir <dir> previews the kernels without running anything (needs no toolchain).
--func <name> profiles a single kernel; repeatable.

CANN, the camodel SoC, and the compile arch are auto-resolved from --target. Override any of them with --cann-set-env, --soc-version, --aicore-arch.
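
When the profiler is driven from a script rather than typed by hand, the command line can be assembled from the flags listed above. This is a sketch under assumptions: `build_profile_command` is a hypothetical helper, the script path is the one shown in the quick start, and the kernel names passed to `--func` are whatever `--list-funcs` reports.

```python
import sys


def build_profile_command(build_dir, target, funcs=(), list_only=False):
    """Assemble an incore_profile.py argv using only the flags documented above."""
    if target not in ("a2a3", "a5"):
        raise ValueError(f"unsupported target: {target!r}")
    cmd = [
        sys.executable,
        ".claude/skills/incore-profiling/incore_profile.py",
        "--build-dir", str(build_dir),
        "--target", target,
    ]
    if list_only:
        cmd.append("--list-funcs")  # preview kernels without running anything
    for name in funcs:
        cmd += ["--func", name]  # --func is repeatable
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.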

Output

Each run writes to <build-dir>/kernel_insight_all_funcs_<timestamp>/:

manifest_export.csv and summary.txt — index and per-kernel status.
funcs/<kernel>/collect/out/OPPROF_*/simulator/:
- trace.json and visualize_data.bin — open these in MindStudio Insight.
- core0.*/ — per-core trace.json, *_instr_exe.csv, *_code_exe.csv for instruction-level analysis.

A final EXPORTED N/M line reports how many kernels succeeded.
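
The layout above can be walked programmatically to collect the traces to open in MindStudio Insight. A minimal sketch, assuming the `<timestamp>` suffix sorts lexicographically in chronological order (not confirmed by this document); `latest_run_traces` is a hypothetical helper name.

```python
from pathlib import Path


def latest_run_traces(build_dir):
    """Map kernel name -> top-level trace.json paths in the newest run directory."""
    runs = sorted(
        p for p in Path(build_dir).glob("kernel_insight_all_funcs_*") if p.is_dir()
    )
    if not runs:
        return {}
    latest = runs[-1]  # assumption: timestamps sort chronologically as strings
    traces = {}
    pattern = "funcs/*/collect/out/OPPROF_*/simulator/trace.json"
    for trace in sorted(latest.glob(pattern)):
        # Relative path is funcs/<kernel>/collect/..., so parts[1] is the kernel.
        kernel = trace.relative_to(latest).parts[1]
        traces.setdefault(kernel, []).append(trace)
    return traces
```

The pattern stops at the simulator/ level on purpose, so the per-core trace.json files under core0.*/ are not included.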

Next step: clean the trace

The raw visualize_data.bin is cluttered (sync flags, cache-miss / control-flow lanes). Turn it into a de-cluttered, Perfetto-viewable per-pipe trace with the repo tool:

python -m pypto.tools.clean_sim_trace \
  <build-dir>/kernel_insight_all_funcs_<ts>/funcs/<kernel>/collect/out/OPPROF_* -o <out>

It writes trace.clean.json (pipeline lanes in dataflow order MTE2→MTE1→CUBE→VECTOR→FIXPIPE→MTE3, sync flags re-anchored as flow arrows) and instr_metrics.json (per-instruction pipe / cycles / vector-utilization). The per-pipe cycle breakdown is the fastest way to spot a degenerate trace (CUBE=0 cycles — see Caveats). Rename the cleaned trace to <kernel>.clean.json straight away (see below) so multiple profiled kernels stay distinguishable when downloaded side by side for Perfetto.
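
The per-pipe check can be automated once the per-instruction records are loaded. The sketch below is illustrative only: the schema of instr_metrics.json is not specified here, so it assumes each record is a dict with hypothetical `pipe` and `cycles` keys, and both function names are made up for this example.

```python
from collections import defaultdict

# Pipeline lanes in the dataflow order used by trace.clean.json.
PIPES = ["MTE2", "MTE1", "CUBE", "VECTOR", "FIXPIPE", "MTE3"]


def per_pipe_cycles(records):
    """Sum cycles per pipe. Assumes records look like {"pipe": str, "cycles": int}."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["pipe"]] += rec["cycles"]
    # Pipes outside PIPES (for example scalar work) are left out of the result.
    return {pipe: totals.get(pipe, 0) for pipe in PIPES}


def looks_degenerate(breakdown, expects_cube=True):
    """Flag a trace whose CUBE lane recorded no cycles although cube work is expected."""
    return expects_cube and breakdown["CUBE"] == 0
```

Pass `expects_cube=False` for kernels that do no cube work by design, since CUBE=0 is normal there.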

Where to put the cleaned trace

Always write the cleaned, presentable trace under the repo's build_output/ in a descriptive, self-documenting folder — never leave it in /tmp. Use the name pattern incore_<kernel>_<source-stem>_<timestamp>/ (e.g. build_output/incore_fa_fused_decode2l_20260611_160643/) so the kernel, the originating script, and the run are obvious from the directory listing alone. Recommended layout inside that folder:

build_output/incore_<kernel>_<source>_<ts>/
  <kernel>.clean.json    cleaned per-pipe trace (the deliverable)
  instr_metrics.json     per-instruction pipe / cycles / vector-utilization
  raw_simulator/         visualize_data.bin + per-core *_instr_exe.csv  (for re-analysis)
  summary.txt            provenance (source script, wired workload, per-pipe breakdown)

build_output/ is gitignored, so these artifacts stay local. The summary.txt must record the wired workload (e.g. fa_total, work-table, seq_lens, scalar args) when real intermediates were patched in — otherwise the numbers are not reproducible. Pass -o build_output/incore_<kernel>_<source>_<ts> to clean_sim_trace so the output lands in the run folder directly (no nested subfolder), then rename its trace.clean.json to <kernel>.clean.json — the kernel-name prefix keeps multiple profiled kernels distinguishable when several .clean.json files are downloaded together and opened in Perfetto:

mv build_output/incore_<kernel>_<source>_<ts>/trace.clean.json \
   build_output/incore_<kernel>_<source>_<ts>/<kernel>.clean.json
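
The naming convention and the rename step can also be scripted. A minimal sketch: `run_folder` and `name_clean_trace` are hypothetical helpers, and the timestamp format is inferred from the example folder name above.

```python
from datetime import datetime
from pathlib import Path


def run_folder(kernel, source_script, now=None, root="build_output"):
    """Build build_output/incore_<kernel>_<source-stem>_<timestamp>."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(root) / f"incore_{kernel}_{Path(source_script).stem}_{stamp}"


def name_clean_trace(folder, kernel):
    """Rename trace.clean.json to <kernel>.clean.json inside the run folder."""
    src = Path(folder) / "trace.clean.json"
    dst = Path(folder) / f"{kernel}.clean.json"
    src.rename(dst)
    return dst
```

Pass the path from `run_folder` as the `-o` argument to clean_sim_trace, then call `name_clean_trace` on the same folder once the tool has finished.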

Troubleshooting

Build fails with unknown type name '__biasbuf__' or use of undeclared identifier 'aicore' (often after argument unused during compilation: '--cce-aicore-enable-tl') — the CANN is not TL-capable (commonly ascend-toolkit/latest, e.g. 8.3.RC1). Re-run with a TL-capable --cann-set-env <…/cann-8.5.x/set_env.sh>. The tool preflights this and should now fail early naming the cause.
Build fails inside pto/npu/a5/*.hpp — wrong target; pass --target a2a3 for an A2/A3 device.
ld: cannot find -lruntime_camodel — the auto-selected SoC has no camodel library. Auto-selection substring-matches 910b and takes the first match, which can be bare Ascend910B even when the camodel ships only for Ascend910B1. List the real ones with find "$ASCEND_HOME_PATH" -name libruntime_camodel.so and pass the variant matching your device (npu-smi info), e.g. --soc-version Ascend910B1.
Trace is ~0 cycles / CUBE=0 / only SCALAR+sync instrs — the kernel is data-dependent and the auto golden zeroed its control tensor. See Caveats.
CANN set_env.sh not found — pass --cann-set-env <path>, or set ASCEND_HOME_PATH / CANN_SET_ENV.
sibling .pto not found — the kernel .cpp has no .pto next to it (the generator reads it for buffer sizes). Use a ptoas/ dir that has both, or pass --ptoas-root <PTOAS source checkout> to fall back to the validation generator.
Export step reports "no dump file" — handled automatically; newer msprof emits traces during collect, and the tool skips the redundant export pass.
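
For the -lruntime_camodel case above, the `find` command has a direct Python equivalent. This sketch only lists where the library is installed. It does not guess which path component names the SoC variant, because that layout is not described here. `camodel_library_paths` is a hypothetical helper name.

```python
import os
from pathlib import Path


def camodel_library_paths(ascend_home=None):
    """List libruntime_camodel.so locations relative to the CANN root."""
    root = Path(ascend_home or os.environ["ASCEND_HOME_PATH"])
    return sorted(
        lib.relative_to(root).as_posix()
        for lib in root.rglob("libruntime_camodel.so")
    )
```

Read the SoC variant off the printed paths and pass the one matching your device as `--soc-version`.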

Caveats

Data-dependent kernels read misleadingly fast. The auto-generated golden.py zero-fills integer input buffers. If a kernel's loop trip-count, grid-stride bound, or work-table length is read from an input tensor, the zeroed input yields 0 iterations: a sub-microsecond, near-empty trace with CUBE=0 cycles that looks like the kernel is free. This is an artifact of synthetic inputs, not a fast kernel. The tool flags it with a WARN line and a non-empty message in summary.txt/manifest_export.csv.

To profile such a kernel for real, overwrite the control inputs with a valid set (non-zero loop bound + dense work table + real lengths) by editing the case's golden.py (or writing the vN.bin directly) and the scalar tail args in main.cpp, then rebuild the *_sim target and re-run msprof op simulator. The generator already sizes every GM buffer to its full .pto shape, so nothing needs resizing. Only the control tensors must be real: per-instruction cost is data-independent, so the bulk data tensors can stay random/zero.

How it works

For each kernel .cpp the tool: (1) generates a standalone testcase via the bundled gen_profiling_case.py, (2) builds the simulator binary with CMake, (3) runs golden.py to produce input data, (4) runs msprof op simulator to collect traces, and (5) records the artifacts. In step 1 the generator reads the kernel signature from the .cpp and the GM buffer sizes from the sibling .pto's make_tensor_view shapes, then emits main.cpp + launch.cpp (<<<1, …>>> single-core) + CMakeLists.txt + golden.py; mixed cube+vector kernels get a trivial merged __global__ dispatcher (#if __DAV_CUBE__ → <k>_aic, #if __DAV_VEC__ → <k>_aiv). The steps run per kernel, and the tool continues with the next kernel on failure unless --no-keep-going is passed.
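
The per-kernel loop with its keep-going behaviour can be pictured as follows. This is a sketch of the control flow only, not the tool's implementation: `profile_all` and `exported_line` are hypothetical names, each step is any callable taking a kernel name, and the exact wording of the real summary line may differ.

```python
def profile_all(kernels, steps, keep_going=True):
    """Run every step for each kernel; return {kernel: error message or None}."""
    status = {}
    for kernel in kernels:
        status[kernel] = None
        try:
            for step in steps:  # generate, build, golden, collect, record
                step(kernel)
        except Exception as exc:
            status[kernel] = f"{type(exc).__name__}: {exc}"
            if not keep_going:
                break  # mirrors --no-keep-going: stop at the first failure
    return status


def exported_line(status, total):
    """Format a summary in the style of the final EXPORTED N/M line."""
    succeeded = sum(1 for err in status.values() if err is None)
    return f"EXPORTED {succeeded}/{total}"
```

A failing step aborts the remaining steps for that kernel only, so one broken kernel does not block traces for the rest.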

Future work

--target must currently be passed explicitly. Once the build pipeline records the backend arch into the build folder, incore_profile.py can auto-detect it and --build-dir alone will suffice. --target will remain as an override.
