name	sp1-profiling
description	Profile an SP1 zkVM program to find cycle-count hotspots. Use when the user asks to profile, find bottlenecks, see "where cycles go", or analyze performance of an SP1 program. Covers enabling the profiler, running it, and interpreting the Firefox-Profiler-format JSON without needing samply.
allowed-tools	Read, Grep, Glob, Bash, Edit, Write

Profiling SP1 programs

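SP1's profiler captures per-function cycle counts using the program's DWARF debug info and writes a Firefox-Profiler-format JSON. At the default sample rate, each "sample" in the output is one RISC-V cycle, so sample counts equal cycle counts exactly. If TRACE_SAMPLE_RATE=N is set (see Step 2), only 1-in-N cycles are recorded, so sample counts are roughly 1/N of the true cycle counts while the proportions stay comparable.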

Reference: https://docs.succinct.xyz/docs/sp1/optimizing-programs/profiling

When to use this skill

User asks to profile an SP1 program / find hotspots / analyze cycles.
User wants to know where the bulk of total_instruction_count() is spent.
User wants to validate a precompile is being used (e.g. Keccak, secp256k1, BN254).

When not to use it

Just measuring total cycles → call client.execute(...) and read report.total_instruction_count(). Don't enable profiling.
Measuring proving time / GPU throughput → that's a separate benchmark, not profiling.

Step 1 — Enable the `profiling` feature

In the script crate's Cargo.toml:

sp1-sdk = { version = "<your-version>", features = ["profiling", ...] }

The feature is a no-op when TRACE_FILE isn't set, so leave it on permanently — the same binary handles profiling and other runs without recompiling.

Step 2 — Run with `TRACE_FILE` set

TRACE_FILE=profiles/<name>.json cargo run --release -- --execute ...

Notes:

The run must go through --execute (or whatever flag invokes client.execute(...)), not --prove, because the profiler hooks into the executor.
Output files are large (often >100MB for ~10M cycles). Always write to a gitignored directory like profiles/ — never to the repo root or workspace dirs. Confirm the destination directory exists; create it if not.
For programs >100M cycles, set TRACE_SAMPLE_RATE=100 (sample 1-in-N) to keep the file manageable.
Build the program with debug info (default for SP1 programs); without it, frames will be raw addresses instead of demangled Rust names.
A small input is usually enough. The profiler shows proportions, not absolute scaling, so 1–8 iterations of the workload typically gives the same hotspot picture as 1000.
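
The notes above can be wrapped in a small launcher. This is a minimal sketch, not part of SP1: the helper name `prepare_profile_run` and its parameters are hypothetical, and only the `TRACE_FILE` and `TRACE_SAMPLE_RATE` variables and the `cargo run --release -- --execute` command come from the steps above.

```python
import os
from pathlib import Path


def prepare_profile_run(trace_path, sample_rate=None, base_env=None):
    """Create the trace directory and build the command/env for a profiling run.

    Hypothetical helper: only TRACE_FILE and TRACE_SAMPLE_RATE are taken
    from the steps above; everything else is illustrative.
    """
    trace = Path(trace_path)
    # Make sure the destination directory (e.g. profiles/) exists.
    trace.parent.mkdir(parents=True, exist_ok=True)

    env = dict(os.environ if base_env is None else base_env)
    env["TRACE_FILE"] = str(trace)
    if sample_rate is not None:
        # Record 1-in-N cycles to keep traces of very large programs manageable.
        env["TRACE_SAMPLE_RATE"] = str(sample_rate)

    # Same command as above; append the script's own arguments as needed.
    cmd = ["cargo", "run", "--release", "--", "--execute"]
    return cmd, env
```

To launch the run, pass the result to `subprocess.run(cmd, env=env, check=True)` from the script crate's directory.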

Step 3 — Analyze the JSON

The standard tool is samply load <file> (Firefox Profiler UI). For non-interactive analysis (CI, headless box, or quick reporting), parse the JSON directly — it's reasonably simple. Always present both self-time and inclusive-time tables: self-time tells you which leaf functions burn cycles, inclusive-time tells you which high-level operations contain that work.

Schema essentials (from one thread under threads[0]):

samples.data[i][0] → stack id (one row per cycle).
stackTable.data[sid] → [prefix_stack_id, frame_id] (linked list, walk via prefix to get the chain).
frameTable.data[fid][0] → string-table index for the function name.
stringTable[idx] → demangled-ish name (Rust mangling artifacts like [hash]:: are present).
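
To make the linked-list layout concrete, here is a hand-built toy thread that uses only the four fields listed above. All names and values are made up, and real traces carry extra columns per row that this sketch ignores.

```python
# Toy single-thread trace in the layout described above (all values made up).
thread = {
    "stringTable": ["main", "hash_block", "memcpy"],
    "frameTable": {"data": [[0], [1], [2]]},   # frame id -> [string index]
    "stackTable": {"data": [
        [None, 0],   # stack 0: main
        [0, 1],      # stack 1: main -> hash_block
        [1, 2],      # stack 2: main -> hash_block -> memcpy
    ]},
    "samples": {"data": [[2], [2], [1], [0]]},  # one row per recorded cycle
}


def stack_names(thread, sid):
    """Return function names for a stack id, leaf first, root last."""
    names = []
    while sid is not None:
        prefix, frame = thread["stackTable"]["data"][sid]
        string_idx = thread["frameTable"]["data"][frame][0]
        names.append(thread["stringTable"][string_idx])
        sid = prefix  # follow the prefix link toward the root
    return names


first_sample_stack = stack_names(thread, thread["samples"]["data"][0][0])
print(" <- ".join(first_sample_stack))  # memcpy <- hash_block <- main
```

In this toy trace, memcpy has a self count of 2 (two samples end in it), while main has an inclusive count of 4 because it sits at the root of every stack.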

Drop-in Python analyzer:

import json
import sys
from collections import Counter

# Load the trace; the first thread holds the executor's samples.
with open(sys.argv[1]) as fh:
    t = json.load(fh)['threads'][0]
strings, frames, stacks, samples = (
    t['stringTable'], t['frameTable']['data'],
    t['stackTable']['data'], t['samples']['data'],
)

def chain(sid):
    # Walk prefix links from a stack id; returns frame ids, leaf first.
    out = []
    while sid is not None:
        prefix, frame = stacks[sid]
        out.append(frame)
        sid = prefix
    return out

# Self time credits only the leaf frame. Inclusive time credits every
# distinct frame on the stack once per sample, so recursion is not double-counted.
self_t, incl_t, total = Counter(), Counter(), 0
for s in samples:
    sid = s[0]
    if sid is None:
        continue
    total += 1
    c = chain(sid)
    self_t[c[0]] += 1
    for f in set(c):
        incl_t[f] += 1

if total == 0:
    sys.exit("No samples found in trace")

def name(fi):
    # Resolve a frame id to its function name via the string table.
    loc = frames[fi][0]
    return strings[loc] if loc is not None else '?'

print(f"Total cycles: {total}\n=== SELF TIME (top 20) ===")
for f, c in self_t.most_common(20):
    print(f"  {c:>10} ({100*c/total:5.2f}%)  {name(f)[:140]}")
print("\n=== INCLUSIVE TIME (top 20) ===")
for f, c in incl_t.most_common(20):
    print(f"  {c:>10} ({100*c/total:5.2f}%)  {name(f)[:140]}")

Save as analyze.py (or run inline) and invoke with the trace path.

Step 4 — Interpret the output

Read self-time and inclusive-time together:

A function with high inclusive but low self time means its cost is in callees — drill in. Example: a top-level entry point is typically near-100% inclusive and near-0% self.
A function with high self time is the literal bottleneck. Common culprits in zkVM workloads:
- memcpy / memset → struct moves and zero-init; often the call site (one frame up the stack) is the real cost. Look at the types involved and consider passing by reference or reusing buffers.
- syscall_<precompile> (e.g. syscall_keccak_permute, syscall_secp256k1_add) → the precompile is being used. Self-time should be small relative to syscall count; if it's large, suspect a missing patch or a non-precompiled fallback.
- Repeated short-lived object setup (e.g. hasher state, allocators) → many small operations each pay the same setup overhead. Reuse instances (reset()) instead of creating new ones.
Cross-check with report.syscall_counts from client.execute(...). A near-zero count for a syscall you expected to fire means the relevant patched crate isn't being pulled in (verify via cargo tree).
Normalize by the number of iterations in the input. Per-iteration cost is more meaningful than total. Cycles should scale linearly with iteration count; if not, there's amortization (good — note it) or a fixed-cost outlier (investigate).
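
One way to do this normalization is to profile two input sizes and fit a straight line through them. The sketch below assumes the model total = fixed + per_iteration × iterations, which is only an approximation; the function name and the numbers are hypothetical.

```python
def split_fixed_and_per_iteration(run_a, run_b):
    """Estimate fixed and per-iteration cycles from two (iterations, total_cycles) runs.

    Assumes total = fixed + per_iteration * iterations, which is only a model.
    """
    (n_a, cycles_a), (n_b, cycles_b) = run_a, run_b
    if n_a == n_b:
        raise ValueError("runs must use different iteration counts")
    per_iteration = (cycles_b - cycles_a) / (n_b - n_a)
    fixed = cycles_a - per_iteration * n_a
    return fixed, per_iteration


# Made-up numbers: 1 iteration took 1.5M cycles, 8 iterations took 5.0M.
fixed, per_iter = split_fixed_and_per_iteration((1, 1_500_000), (8, 5_000_000))
print(f"fixed ~ {fixed:,.0f} cycles, per iteration ~ {per_iter:,.0f} cycles")
# fixed ~ 1,000,000 cycles, per iteration ~ 500,000 cycles
```

A large fixed term relative to the per-iteration term suggests that setup work dominates small inputs; a third run at another size is a cheap check on whether the linear model holds at all.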

Reporting back to the user

Always include:

Total cycles and per-iteration cost.
Top ~5 self-time entries with percentages.
Top ~5 inclusive-time entries with percentages.
One-paragraph interpretation: where the time goes structurally, and the most actionable optimization target.
The path where the trace was written, so the user can run samply load <file> themselves.

Keep raw stack dumps out of the chat — they're long and low signal.
