kernel-exp-history
| name | kernel-exp-history |
| description | This skill should be used when optimizing kernels in this repo and needing to consult past optimization experiments, or when recording the current optimization iteration back into the kernel experiment database. |
Use the local kernel experiment database to look up prior optimization attempts and record new results after an optimization iteration completes.
- Read `references/kernel_exp_dataclass.py` to understand the database helpers and schema.
- Call `top_experiments(max_results=20)` to get a score-sorted list of high-impact experiments.
- For more detail, use `get_experiment(exp_id)` or `list_experiments()` and filter by `operator_sig`, `dtype_sig`, `env`, or `base_commit`.

Example 1: Find similar kernel optimizations

    # Search for cache kernel optimizations
    from kernel_exp_dataclass import list_experiments
    experiments = list_experiments()
    cache_exps = [e for e in experiments if 'cache' in e.operator_sig.lower()]
    # Sort by score, best first
    cache_exps_sorted = sorted(cache_exps, key=lambda x: x.score, reverse=True)
    print("Top cache kernel optimizations:")
    for exp in cache_exps_sorted[:5]:
        print(f"  {exp.score:.4f}x - {exp.change_summary}")

Example 2: Find best unroll factor

    # Compare different unroll factors (reuses `experiments` from Example 1)
    unroll_exps = [e for e in experiments if 'unroll' in e.change_summary.lower()]
    for exp in unroll_exps:
        factor = 'unknown'
        if 'unroll 4' in exp.detailed_description.lower():
            factor = '4'
        elif 'unroll 8' in exp.detailed_description.lower():
            factor = '8'
        print(f"Unroll {factor}: {exp.score:.4f}x - {exp.operator_sig[:50]}")

Example 3: Learn from failures

    # Find what NOT to do (reuses `experiments` from Example 1)
    failures = [e for e in experiments if e.score < 0.98 or e.is_buggy]
    print("Failed optimizations (learn from these!):")
    for exp in failures:
        print(f"  ❌ {exp.change_summary}")
        print(f"     Why: {exp.detailed_description[:100]}...")

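The field-based filtering mentioned earlier (by `operator_sig`, `dtype_sig`, `env`, or `base_commit`) can be sketched as a small helper. This is a minimal illustration, not the repo's own code: `ExperimentStub` and `filter_experiments` are hypothetical names, and the stub only mirrors the field names this document lists. In practice the objects returned by `list_experiments()` would be passed in instead, assuming they expose the same attributes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ExperimentStub:
    """Hypothetical stand-in with only the fields named in this document."""
    operator_sig: str
    dtype_sig: str
    env: str
    base_commit: str
    score: float
    is_buggy: bool = False


def filter_experiments(
    experiments: Iterable[ExperimentStub],
    operator_substr: Optional[str] = None,
    dtype_sig: Optional[str] = None,
    env: Optional[str] = None,
    base_commit: Optional[str] = None,
) -> List[ExperimentStub]:
    """Keep experiments matching every criterion that is not None, best score first."""
    matches = []
    for exp in experiments:
        # Operator match is a case-insensitive substring test, as in Example 1.
        if operator_substr is not None and operator_substr.lower() not in exp.operator_sig.lower():
            continue
        # The remaining criteria are exact matches (an assumption of this sketch).
        if dtype_sig is not None and exp.dtype_sig != dtype_sig:
            continue
        if env is not None and exp.env != env:
            continue
        if base_commit is not None and exp.base_commit != base_commit:
            continue
        matches.append(exp)
    return sorted(matches, key=lambda e: e.score, reverse=True)
```

Calling `filter_experiments(experiments, operator_substr="cache", dtype_sig="bf16")` would narrow the list to cache-related operators of one dtype; leaving every criterion as `None` returns all experiments sorted by score.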
- Populate a `KernelExperiment`, including:
  - `change_summary`, `detailed_description`, `raw_result`, `score`
  - `operator_sig`, `dtype_sig`, `env`, `base_commit`, `profiling_info`
  - `is_buggy`, `error_message`, `status`
  - `pid` if this iteration builds on a parent experiment (set manually)
- Call `create_experiment()` to append the entry to the database.

`change_summary` (1 line, <80 chars):
`<What> - <Result>` or `<What> - <Why it failed>`

`detailed_description` (multiple paragraphs): Structure:
**Approach**: [What you tried]
- Specific technical details
- Why you thought it would work
**Result**: [What happened]
- Quantitative results
- Qualitative observations
**Why it worked/failed**: [Root cause analysis]
- Technical explanation
- Compare to similar attempts
**Key insight**: [Takeaway for future]
- What this taught you
- How to apply the lesson
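The `change_summary` constraint (one line, under 80 characters) and the four-part `detailed_description` structure above can be checked and assembled mechanically. The sketch below is only an illustration: `check_change_summary` and `build_detailed_description` are hypothetical helpers, not part of `kernel_exp_dataclass.py`, and they produce plain strings that could then be stored on an experiment entry.

```python
from typing import List

SUMMARY_MAX_CHARS = 80  # the document asks for one line, <80 chars


def check_change_summary(summary: str) -> str:
    """Return the stripped summary, or raise ValueError if it breaks the format rule."""
    cleaned = summary.strip()
    if not cleaned:
        raise ValueError("change_summary must not be empty")
    if "\n" in cleaned:
        raise ValueError("change_summary must be a single line")
    if len(cleaned) >= SUMMARY_MAX_CHARS:
        raise ValueError(f"change_summary has {len(cleaned)} chars; keep it under {SUMMARY_MAX_CHARS}")
    return cleaned


def build_detailed_description(
    approach: List[str],
    result: List[str],
    why: List[str],
    insight: List[str],
) -> str:
    """Render the four sections as bold headings followed by bullet lists."""
    sections = [
        ("Approach", approach),
        ("Result", result),
        ("Why it worked/failed", why),
        ("Key insight", insight),
    ]
    blocks = []
    for title, bullets in sections:
        lines = [f"**{title}**:"] + [f"- {bullet}" for bullet in bullets]
        blocks.append("\n".join(lines))
    # Blank line between sections so each renders as its own paragraph.
    return "\n\n".join(blocks)
```

Validating the summary before writing it keeps listings such as the `print` loops in the examples readable, since each entry fits on one terminal line.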
raw_result (structured text):
Iteration N Results - [SUCCESS/REGRESSION/CRASH]:
**Overall**: X.XXXXx speedup = Y.YY% [IMPROVEMENT/REGRESSION]
**Per-kernel breakdown**:
- kernel_1: X.XXXXx (+Y.YY%)
- kernel_2: X.XXXXx (+Y.YY%)
...
**Summary**: X improvements, Y neutral, Z regressions
**Key finding**: [One-line takeaway]
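A `raw_result` string in the layout above can be generated from per-kernel speedups. The following is a rough sketch under stated assumptions: `format_raw_result` is a hypothetical helper, the overall speedup is supplied by the caller rather than derived, the 0.98 regression cutoff mirrors the failure filter in Example 3, and the symmetric 1.02 improvement cutoff is an assumption of this sketch. The CRASH status is not handled here.

```python
from typing import Dict

NEUTRAL_BAND = 0.02  # assumption: speedups within 0.98..1.02 count as neutral


def format_raw_result(
    iteration: int,
    overall: float,
    per_kernel: Dict[str, float],
    key_finding: str,
) -> str:
    """Render speedup ratios (1.0 = no change) in the raw_result text layout."""

    def pct(speedup: float) -> float:
        # Convert a ratio such as 1.05 into a percentage change such as +5.0.
        return (speedup - 1.0) * 100.0

    status = "SUCCESS" if overall >= 1.0 else "REGRESSION"
    direction = "IMPROVEMENT" if overall >= 1.0 else "REGRESSION"

    improved = sum(1 for s in per_kernel.values() if s > 1.0 + NEUTRAL_BAND)
    regressed = sum(1 for s in per_kernel.values() if s < 1.0 - NEUTRAL_BAND)
    neutral = len(per_kernel) - improved - regressed

    lines = [
        f"Iteration {iteration} Results - {status}:",
        f"**Overall**: {overall:.4f}x speedup = {pct(overall):+.2f}% {direction}",
        "**Per-kernel breakdown**:",
    ]
    lines += [f"- {name}: {s:.4f}x ({pct(s):+.2f}%)" for name, s in per_kernel.items()]
    lines.append(f"**Summary**: {improved} improvements, {neutral} neutral, {regressed} regressions")
    lines.append(f"**Key finding**: {key_finding}")
    return "\n".join(lines)
```

Generating the text from numbers keeps the per-kernel lines and the summary counts consistent with each other, which is easy to get wrong when the block is typed by hand.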
profiling_info (even if not profiled):
- Use `update_experiment(exp_id, raw_result=..., score=..., detailed_description=...)` to revise an entry after it has been created.

Critical: Record ALL iterations, especially failures!
Why record failures?
Failure categories to track:
- `is_buggy=True`: crashes, correctness errors

Example from cache kernel optimization:
Call `top_experiments()` first; fall back to full queries only when additional details are needed.