원클릭으로 Manus에서 모든 스킬 실행

$pwd:

linux-perf

Name: Linux Perf
Author: intel

// Profile and fix Linux performance problems using `perf`. Workflows: (A) hardware counters -- IPC, cache-miss, branch mispredictions; (B) hotspot profiling -- which functions and source lines consume CPU, with SIMD and accumulator detection; (C) cache-line contention -- false sharing, HITM, `perf c2c`; (D) core-count scaling -- dual-profile comparison, bottleneck categorization; (E) structured hotspot report with annotated source and pattern observations. Resolution strategies: TTAS spinlock, SIMD upconversion, parallel accumulator, structured false-sharing fix, per-CPU stats. Trigger on: perf, profiling, profile, hotspot, hotspots, cache miss, IPC, false sharing, HITM, scaling, core count, thread scaling, bottleneck, slow code, CPU bound, why is this slow, where does time go, does not scale. When in doubt, invoke this skill -- better to use it unnecessarily than to miss a performance opportunity.

Manus에서 실행

$ git log --oneline --stat

stars:4

forks:2

updated:2026년 5월 22일 15:50

파일 탐색기

9 개 파일

SKILL.md

readonly

related-skills.json

같은 저장소

performance-patterns.md

from "intel/intel-performance-skills"

Detect and fix x86/C/C++ performance patterns from source code or profiling output (perf, VTune, flamegraphs). Invoke when the user asks to optimize, review for performance, or write new SIMD/vectorized code — even without profiling data. Trigger on: serial accumulator loops, narrow SIMD (xmm/ymm that could be ymm/zmm), _mm* intrinsics, HITM/cmpxchg clusters, false sharing, missing restrict or vzeroupper, hot symbol inside a system library (.so) that may have a version gap, or any request to write a fast reduction, dot product, or CPU-dispatched function. Patterns: serial accumulator, TTAS spinlock, SIMD upconversion (zipper), false sharing, per-CPU stats, missing vzeroupper, missing restrict, CPU dispatch, library version upgrade, fast CRC32C (crc32c function name trigger, single-accumulator _mm_crc32 loop, table-lookup CRC32C), known algorithms (Cosine Similarity, Hamming Distance, Jaccard Distance), SIMD sort for numeric primitives (float/double/int32_t/uint32_t/int64_t/uint64_t via x86-simd-sort).

2026-05-294

phoronix-test-suite.md

from "intel/intel-performance-skills"

Install, run, parse, and optimize benchmarks from the Phoronix Test Suite (PTS). Use this skill whenever the user mentions "phoronix", "pts/", or "phoronix-test-suite", or asks to run, measure, improve, or optimize a PTS test — e.g., "run pts/mt-dgemm", "optimize pts/compress-zstd", "what score does pts/x265 get". Trigger immediately on any `pts/<testname>` reference, even if the user doesn't explicitly say "phoronix". Also trigger when the user asks to find or edit the source code of a PTS test.

2026-05-194

package.json

"author": "intel"

"repository": "intel/intel-performance-skills"

GitHub 저장소 열기 Creator 저장소 보기

$ install --global

$ download --local

Manus에서 실행

$ useful --forSOC

네트워크·컴퓨터 시스템 관리자컴퓨터 및 수학직15-1244L4

name

linux-perf

description

Profile and fix Linux performance problems using `perf`. Workflows: (A) hardware counters -- IPC, cache-miss, branch mispredictions; (B) hotspot profiling -- which functions and source lines consume CPU, with SIMD and accumulator detection; (C) cache-line contention -- false sharing, HITM, `perf c2c`; (D) core-count scaling -- dual-profile comparison, bottleneck categorization; (E) structured hotspot report with annotated source and pattern observations. Resolution strategies: TTAS spinlock, SIMD upconversion, parallel accumulator, structured false-sharing fix, per-CPU stats. Trigger on: perf, profiling, profile, hotspot, hotspots, cache miss, IPC, false sharing, HITM, scaling, core count, thread scaling, bottleneck, slow code, CPU bound, why is this slow, where does time go, does not scale. When in doubt, invoke this skill -- better to use it unnecessarily than to miss a performance opportunity.

Linux perf profiling skill

Guide the user through profiling with Linux perf — from setup and data collection through reporting and interpretation.

The skill is organized into five parts:

Part 1: Setup — permissions and build flags (always check these first)
Part 2: Flows — end-to-end workflows (A: quick stats, B: hotspot profiling, C: cache contention)
Part 3: Cross-skill integrations and Reference — when to delegate, and quick lookup tables
Part 4: Building blocks — focused data collection primitives that flows and other skills can call directly
Part 5: Resolution strategies — common fix patterns that flows can reference by name

Part 1: setup

Check permissions (always do this first)

cat /proc/sys/kernel/perf_event_paranoid

≤ 1: no sudo needed for most data collection
2 or higher: hardware events (cycles, cache-misses, etc.) require root

To lower the limit (ask the user first):

echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

Sudo protocol: if sudo is needed, present these options and wait for the user's answer before proceeding:

Run with sudo
Lower paranoia level with the command above
Skip perf collection

Ensure debug symbols

Always compile with -g alongside existing optimization flags (e.g. -O2 -g). This adds DWARF debug info so perf report and perf annotate show function names and source lines — it does not weaken optimization.

If given a pre-built binary without -g, offer to recompile. If you cannot, use Building block: resolve address to source (Part 4) for address correlation.

gcc -O2 -g -o <output> <source>

Part 2: flows

Choose a flow based on what the user needs:

Flow	Best for	Time needed
Flow A — `perf stat`	Quick counters: IPC, cache-miss rate, branch-miss rate	Seconds
Flow B — `perf record` + `perf report`	Which functions/lines are hot	Minutes
Flow C — `perf c2c`	Cache-line contention in multi-threaded code	Minutes
Flow D — dual-profile + focused c2c	Core-count scaling problems: workload does not get faster (or gets slower) as threads increase	30+ min
Flow E — hotspot analysis report	Structured formatted report: top functions table + annotated source with % column + pattern observations	Minutes

When in doubt, start with Flow A — it's fast and often answers the question without a full recording.

Flow A: Quick statistics with `perf stat`

Best for workloads like openssl speed ... or any case where you want a fast CPU efficiency answer.

Use Building block: perf stat (Part 4) to collect the counters.

Read references/flow-a.md for the execution steps.

Flow B: Hotspot profiling with `perf record` + `perf report`

Use this when the user wants to know which functions are consuming time, not just aggregate stats.

Read references/flow-b.md for the execution steps.

Flow C: Cache-line contention with `perf c2c`

Use to diagnose false sharing or true sharing of cache lines in multi-threaded code.

Trigger this flow when:

User mentions multithreaded scalability problems
Flow B shows significant non-streaming loads/stores (repeated access to the same small structures, not iterating over large arrays/trees)
lock or mutex appears prominently in function names, variable names, or hotspot call chains
User explicitly mentions cache line contention, false sharing, or HITM events

Threshold: ignore entries below 5% Tot Hitm by default. User may override (e.g. "show me everything above 2%").

Read references/flow-c.md for the execution steps.

Flow D: Core-count scaling analysis

Use when a workload does not scale with core count — throughput plateaus or regresses as threads increase.

Trigger this flow when:

User says "adding more threads doesn't help" or "performance gets worse with more cores"
A scaling sweep (throughput vs core count) shows a flat or declining curve
Flamegraph or profile shows lock-related functions dominating at high core counts

Read references/flow-d.md for the execution steps.

Flow E: Hotspot analysis report

Use when the user wants a structured, formatted deliverable rather than an exploratory profiling session — a document they can read, share, or file.

Trigger this flow when:

User asks for a "hotspot report", "profiling report", or "analysis report"
User wants to know the top functions with annotated source and percentages
User wants a formatted, shareable summary of where time is spent

Read references/flow-e.md for the execution steps.

Part 3: cross-skill integrations

`phoronix-test-suite`: PTS benchmarks

Invoke phoronix-test-suite immediately — the moment the profiling target is identified as a pts/<name> benchmark, before any perf steps begin.

Trigger

Workload is pts/<name> (any Phoronix Test Suite benchmark) → invoke phoronix-test-suite skill before Phase 0 of any flow.

What it handles

The phoronix-test-suite skill owns the full lifecycle for PTS benchmarks: install, source extraction, rebuild (with -g), binary deployment, and result recording. It knows the correct install.sh build flags, source layout, and where to copy the resulting binary — things a manual gcc or make invocation will get wrong.

Do not attempt to install, build, or rebuild a PTS test manually. Always delegate to this skill.

When the trigger fires, say: "I'll invoke the phoronix-test-suite skill to handle install and rebuild before profiling." then invoke it — before running any perf commands.

SIMD optimization → `performance-patterns`

Invoke performance-patterns as early as possible — trigger it the moment you see any of these patterns, even from reading source code or looking at perf stat output. Do not wait for perf annotate.

Early triggers (source code or perf stat — no annotate needed)

Serial accumulator in source (s += a[i] * b[i], sum += x[i], running max/min) → always a SIMD opportunity; invoke performance-patterns immediately
IPC < 2.0 + CPU-bound (low kernel%, low cache-miss rate) → dependency-chain stall; serial accumulator is the most common cause

Later triggers (from Flow B report or annotate output)

Scalar instructions in a hot loop (addsd, mulss, movsd without ymm/zmm)
Narrow SIMD (xmm) on a CPU that supports ymm (AVX2) or zmm (AVX-512)
Horizontal-reduction anti-pattern (shufps/addss/unpckhps after mulps) — see Flow B Phase 4
Extreme cost on first SSE instruction after an AVX function — AVX↔SSE transition penalty; see below

AVX↔SSE transition penalty (missing vzeroupper)

Detect with:

perf stat -e other_assists.avx_to_sse,other_assists.sse_to_avx ./program

A non-zero other_assists.avx_to_sse count confirms the penalty. In perf annotate, it appears as an extreme cycle count on the first SSE instruction following a function that used ymm or zmm0–zmm15 registers.

When detected, invoke performance-patterns with the missing-vzeroupper pattern: read patterns/missing-vzeroupper.md for the fix.

When any trigger fires, say: "I'll now invoke the performance-patterns skill to analyze and optimize the hot loop." then invoke it — before doing further manual assembly analysis yourself.

Part 4: building blocks

Detailed command syntax, flags, and output formats for each primitive. Read references/building-blocks.md when you need to execute one.

Building block	Purpose
`Check CPU capabilities`	Read `/proc/cpuinfo` flags and `nproc` to determine supported ISA levels, vector width tier, and feature extensions; used by Annotate pattern scan and other blocks
`Ensure debug symbols`	Check binary for DWARF info; offer to recompile with `-g` before expensive perf collection
`perf stat`	Collect hardware counters (IPC, cache-miss rate, branch-miss rate)
`perf record`	Sample a workload and write a perf.data file
`perf annotate (assembly view)`	Per-instruction cycle attribution for a named function
`resolve address to source`	Map raw addresses → file:line via addr2line / objdump
`c2c hot cache lines`	Record + report cache-line HITM summary table
`c2c access map for a cache line`	Per-offset accessor table for one cache line
`Top-N functions`	Ranked list of hottest functions from a perf.data recording
`Top-N lines within a function`	Ranked source lines inside one function
`Dual-profile comparison`	Run top-15 at 1 core and at N cores; produce delta table of rank/% changes to identify scaling bottleneck candidates
`Annotate pattern scan`	Scan `perf annotate` output for a function and return a structured table of detected anti-patterns (scalar FP, narrow SIMD, serial accumulator, horizontal reduction, lock CAS, memory load pressure) with suggested resolution strategies
`Branch probability measurement`	Measure per-branch taken-probabilities in hot functions using Intel PMU events; identify near-zero-probability branches as `[[gnu::cold]]` candidates
`GCC static branch probability`	Parse GCC's compile-time profile_estimate dump to obtain static branch-probability estimates; works on any platform; use as a proxy when no workload is available, or compare against perf data to find divergences worth optimizing

Part 5: resolution strategies

The resolution strategies are owned by the performance-patterns skill. When a flow identifies a named pattern, invoke performance-patterns — it will load triggers/from-profile.md to match the signal, then the appropriate patterns/<name>.md for the full fix.

Resolution strategy	`performance-patterns` pattern file
Test-and-Test-and-Set (TTAS)	`patterns/ttas.md`
SIMD vector width upconversion	`patterns/simd-upconversion.md`
Parallel accumulator rewrite	`patterns/parallel-accumulator.md`
Structured false-sharing fix	`patterns/false-sharing.md`
Per-CPU statistics aggregation	`patterns/per-cpu-stats.md`
Missing vzeroupper (AVX↔SSE penalty)	`patterns/missing-vzeroupper.md`

Reference

Useful commands

perf list                    # list all available event types
sudo perf list               # full list including hardware events (requires root)

Always check perf list before using -e <event> — event names vary between kernel versions and hardware. Some perf events require full root (sudo) privileges; the updated paranoia level is not sufficient for these.

Common event names

Event	What it measures
`cycles`	CPU clock cycles (default)
`instructions`	Instructions retired
`cache-misses`	Last-level cache misses
`cache-references`	Last-level cache accesses
`branch-misses`	Branch mispredictions
`branches`	Branch instructions retired
`page-faults`	Page faults (useful for I/O-heavy workloads)
`context-switches`	OS context switches
`cpu-migrations`	Process migrations between CPUs

linux-perf

이 저장소의 다른 Skills

이 저장소의 다른 Skills

Linux perf profiling skill

Part 1: setup

Check permissions (always do this first)

Ensure debug symbols

Part 2: flows

Flow A: Quick statistics with perf stat

Flow B: Hotspot profiling with perf record + perf report

Flow C: Cache-line contention with perf c2c

Flow D: Core-count scaling analysis

Flow E: Hotspot analysis report

Part 3: cross-skill integrations

phoronix-test-suite: PTS benchmarks

Trigger

What it handles

SIMD optimization → performance-patterns

Early triggers (source code or perf stat — no annotate needed)

Later triggers (from Flow B report or annotate output)

AVX↔SSE transition penalty (missing vzeroupper)

Part 4: building blocks

Part 5: resolution strategies

Reference

Useful commands

Common event names

Linux perf profiling skill

Part 1: setup

Check permissions (always do this first)

Ensure debug symbols

Part 2: flows

Flow A: Quick statistics with perf stat

Flow B: Hotspot profiling with perf record + perf report

Flow C: Cache-line contention with perf c2c

Flow D: Core-count scaling analysis

Flow E: Hotspot analysis report

Part 3: cross-skill integrations

phoronix-test-suite: PTS benchmarks

Trigger

What it handles

SIMD optimization → performance-patterns

Early triggers (source code or perf stat — no annotate needed)

Later triggers (from Flow B report or annotate output)

AVX↔SSE transition penalty (missing vzeroupper)

Part 4: building blocks

Part 5: resolution strategies

Reference

Useful commands

Common event names

Flow A: Quick statistics with `perf stat`

Flow B: Hotspot profiling with `perf record` + `perf report`

Flow C: Cache-line contention with `perf c2c`

`phoronix-test-suite`: PTS benchmarks

SIMD optimization → `performance-patterns`

Flow A: Quick statistics with `perf stat`

Flow B: Hotspot profiling with `perf record` + `perf report`

Flow C: Cache-line contention with `perf c2c`

`phoronix-test-suite`: PTS benchmarks

SIMD optimization → `performance-patterns`