| name | cpu-optimization-arm |
| description | ARM CPU architecture performance optimization techniques, NEON SIMD vectorization, numerical stability, and debugging strategies |
| category | method |
| version | 1.0.0 |
| metadata | {"backend":"cpu","dsl":"cpp","architecture":"aarch64","optimization_techniques":"NEON, SIMD, cache optimization, loop unrolling, ARM-specific"} |
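NEON (Advanced SIMD) is ARM's SIMD instruction set. Its registers are 128 bits wide, so one instruction processes 4 float32 or 2 float64 values.

Recommended approach: let the compiler auto-vectorize, enabled through compile options:

```python
from torch.utils.cpp_extension import load_inline

# Add ARM vectorization options to load_inline.
# cpp_source is the C++ source string, defined elsewhere.
op_module = load_inline(
    name="custom_op",
    cpp_sources=cpp_source,
    extra_cflags=[
        "-O3",               # highest optimization level
        "-mcpu=native",      # optimize for the current ARM CPU
        "-ftree-vectorize",  # enable auto-vectorization
        "-ffast-math",       # fast-math optimizations (optional)
    ],
    verbose=True,
)
```

Note: on ARM, use `-mcpu=native` rather than `-march=native`.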
Simple approach (unoptimized):

```cpp
#include <torch/extension.h>

// Assumes float32 inputs of identical shape.
torch::Tensor elementwise_add(torch::Tensor a, torch::Tensor b) {
    if (!a.is_contiguous()) a = a.contiguous();
    if (!b.is_contiguous()) b = b.contiguous();
    torch::Tensor output = torch::zeros_like(a);
    auto a_ptr = a.data_ptr<float>();
    auto b_ptr = b.data_ptr<float>();
    auto out_ptr = output.data_ptr<float>();
    int64_t numel = a.numel();
    // Plain loop
    for (int64_t i = 0; i < numel; ++i) {
        out_ptr[i] = a_ptr[i] + b_ptr[i];
    }
    return output;
}
```
Optimized approach (loop unrolling, which makes NEON vectorization easier):

```cpp
torch::Tensor elementwise_add_optimized(torch::Tensor a, torch::Tensor b) {
    if (!a.is_contiguous()) a = a.contiguous();
    if (!b.is_contiguous()) b = b.contiguous();
    torch::Tensor output = torch::zeros_like(a);
    auto a_ptr = a.data_ptr<float>();
    auto b_ptr = b.data_ptr<float>();
    auto out_ptr = output.data_ptr<float>();
    int64_t numel = a.numel();
    // Unroll 4x (matches the number of float32 lanes NEON handles at once)
    int64_t i = 0;
    int64_t step = 4;
    for (; i + step <= numel; i += step) {
        out_ptr[i] = a_ptr[i] + b_ptr[i];
        out_ptr[i + 1] = a_ptr[i + 1] + b_ptr[i + 1];
        out_ptr[i + 2] = a_ptr[i + 2] + b_ptr[i + 2];
        out_ptr[i + 3] = a_ptr[i + 3] + b_ptr[i + 3];
    }
    // Handle the remaining elements
    for (; i < numel; ++i) {
        out_ptr[i] = a_ptr[i] + b_ptr[i];
    }
    return output;
}
```

Effect: after unrolling, the compiler can more easily recognize the pattern and emit NEON vector instructions, which can improve performance by 2-4x.

Key difference: ARM NEON processes 4 float32 values in parallel, whereas x64 AVX processes 8.
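To make the "4 lanes" figure concrete, the following is a minimal sketch of what a vectorized add looks like when written by hand with NEON intrinsics. It is an illustration, not the document's recommended path (which is compiler auto-vectorization). The helper name `add_f32` and the `SKETCH_HAS_NEON` macro are hypothetical; `vld1q_f32`, `vaddq_f32` and `vst1q_f32` come from `<arm_neon.h>`. On non-ARM builds the sketch falls back to the scalar loop.

```cpp
#include <cstdint>

#if defined(__ARM_NEON) || defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_f32(const float* a, const float* b, float* out, int64_t n) {
    int64_t i = 0;
#if SKETCH_HAS_NEON
    // One 128-bit NEON register holds 4 float32 lanes.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // lane-wise add, store 4 floats
    }
#endif
    // Scalar tail (and the entire loop on builds without NEON).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Each loop iteration in the NEON path does the work of the four unrolled scalar statements in `elementwise_add_optimized`.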
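ARM characteristic: NEON instructions typically take several cycles to complete; if the next instruction reads the result register of the previous one, the pipeline stalls.

Simple approach (with a data dependency):

```cpp
float sum_with_dependency(const float* data, int64_t size) {
    float sum = 0.0f;
    for (int64_t i = 0; i < size; ++i) {
        sum += data[i];  // each iteration depends on the previous value of sum
    }
    return sum;
}
```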
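Optimized approach (dependency removed):

```cpp
float sum_no_dependency(const float* data, int64_t size) {
    // Use 4 independent accumulators to break the data dependency
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    int64_t i = 0;
    for (; i + 4 <= size; i += 4) {
        sum0 += data[i];      // independent accumulator
        sum1 += data[i + 1];  // no dependency on sum0
        sum2 += data[i + 2];  // can execute in parallel
        sum3 += data[i + 3];
    }
    // Combine the partial sums
    float sum = sum0 + sum1 + sum2 + sum3;
    // Handle the remaining elements
    for (; i < size; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Key optimization: multiple accumulators avoid the loop-carried dependency, which lets the NEON pipeline execute the additions in parallel.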
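The same pattern applies to other reductions. As a hedged sketch, here is a dot product with four independent accumulators; the function name `dot_multi_acc` is hypothetical. Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum.

```cpp
#include <cstdint>

// Hypothetical helper: dot product of x and y using 4 independent accumulators.
float dot_multi_acc(const float* x, const float* y, int64_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int64_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // No statement reads an accumulator written by another statement
        // in the same iteration, so the four chains can overlap.
        acc0 += x[i] * y[i];
        acc1 += x[i + 1] * y[i + 1];
        acc2 += x[i + 2] * y[i + 2];
        acc3 += x[i + 3] * y[i + 3];
    }
    // Combine the partial sums pairwise.
    float acc = (acc0 + acc1) + (acc2 + acc3);
    // Scalar tail for the remaining 0-3 elements.
    for (; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}
```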
Standard pattern (adapted to NEON):

```cpp
torch::Tensor sum_reduction_optimized(torch::Tensor x) {
    if (!x.is_contiguous()) x = x.contiguous();
    torch::ScalarType dtype = x.scalar_type();
    // Compute in float32 unless the input is already float32 or float64
    bool need_convert = (dtype != torch::kFloat32 && dtype != torch::kFloat64);
    torch::Tensor input = need_convert ? x.to(torch::kFloat32) : x;
    torch::Tensor output;
    if (input.scalar_type() == torch::kFloat32) {
        auto x_ptr = input.data_ptr<float>();
        int64_t numel = input.numel();
        // 4 accumulators (matches the NEON float32 width)
        float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
        int64_t i = 0;
        for (; i + 4 <= numel; i += 4) {
            sum0 += x_ptr[i];
            sum1 += x_ptr[i + 1];
            sum2 += x_ptr[i + 2];
            sum3 += x_ptr[i + 3];
        }
        float result = sum0 + sum1 + sum2 + sum3;
        // Handle the remainder
        for (; i < numel; ++i) {
            result += x_ptr[i];
        }
        output = torch::tensor({result}, torch::kFloat32);
    } else if (input.scalar_type() == torch::kFloat64) {
        auto x_ptr = input.data_ptr<double>();
        int64_t numel = input.numel();
        // 2 accumulators (NEON holds 2 doubles per register)
        double sum0 = 0.0, sum1 = 0.0;
        int64_t i = 0;
        for (; i + 2 <= numel; i += 2) {
            sum0 += x_ptr[i];
            sum1 += x_ptr[i + 1];
        }
        double result = sum0 + sum1;
        for (; i < numel; ++i) {
            result += x_ptr[i];
        }
        output = torch::tensor({result}, torch::kFloat64);
    }
    // Restore the original dtype
    if (need_convert) output = output.to(dtype);
    return output;
}
```
Typical ARM architecture (e.g. Apple M1):

Principle: process large data in blocks to improve cache reuse.

```cpp
#include <algorithm>  // std::min

// Blocked matrix multiplication (tiled for the ARM cache hierarchy).
// Assumes 2-D float32 inputs A[M, K] and B[K, N].
torch::Tensor matmul_blocked(torch::Tensor A, torch::Tensor B) {
    if (!A.is_contiguous()) A = A.contiguous();
    if (!B.is_contiguous()) B = B.contiguous();
    int64_t M = A.size(0);
    int64_t K = A.size(1);
    int64_t N = B.size(1);
    torch::Tensor C = torch::zeros({M, N}, A.options());
    auto a_ptr = A.data_ptr<float>();
    auto b_ptr = B.data_ptr<float>();
    auto c_ptr = C.data_ptr<float>();
    // Block size: chosen to fit the L1 cache (typically 32-64)
    const int64_t BLOCK_SIZE = 32;
    for (int64_t i = 0; i < M; i += BLOCK_SIZE) {
        for (int64_t j = 0; j < N; j += BLOCK_SIZE) {
            for (int64_t k = 0; k < K; k += BLOCK_SIZE) {
                int64_t i_max = std::min(i + BLOCK_SIZE, M);
                int64_t j_max = std::min(j + BLOCK_SIZE, N);
                int64_t k_max = std::min(k + BLOCK_SIZE, K);
                // Compute within the block
                for (int64_t ii = i; ii < i_max; ++ii) {
                    for (int64_t jj = j; jj < j_max; ++jj) {
                        float sum = 0.0f;
                        for (int64_t kk = k; kk < k_max; ++kk) {
                            sum += a_ptr[ii * K + kk] * b_ptr[kk * N + jj];
                        }
                        // Accumulate this K-block's partial product into C
                        c_ptr[ii * N + jj] += sum;
                    }
                }
            }
        }
    }
    return C;
}
```
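One way to reason about the block size is a simple footprint estimate: the inner loops touch one tile each of A, B and C, so three `BLOCK_SIZE x BLOCK_SIZE` float tiles should fit in L1. With `BLOCK_SIZE = 32` that is 3 x 32 x 32 x 4 = 12 KiB, and with 64 it is 48 KiB. The sketch below encodes that heuristic; the helper name `pick_block_size` and the three-tile rule are assumptions made for illustration, and the actual L1 size has to be looked up for the target CPU.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical heuristic: largest tile edge (a multiple of 4, minimum 4)
// such that three square float32 tiles fit in l1_bytes.
inline int64_t pick_block_size(std::size_t l1_bytes) {
    int64_t block = 4;
    for (;;) {
        const int64_t next = block + 4;
        const std::size_t footprint =
            static_cast<std::size_t>(3 * next * next) * sizeof(float);
        if (footprint > l1_bytes) break;  // next tile size would overflow L1
        block = next;
    }
    return block;
}
```

For example, under this rule a 32 KiB budget gives a tile edge of 52 and a 64 KiB budget gives 72; in practice the value is usually rounded down to a power of two and confirmed by measurement.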
```cpp
#include <algorithm>  // std::max
#include <cmath>      // std::exp

// Numerically stable softmax over all elements of a contiguous float32 tensor.
torch::Tensor softmax_stable(torch::Tensor x) {
    if (!x.is_contiguous()) x = x.contiguous();
    torch::Tensor output = torch::zeros_like(x);
    int64_t numel = x.numel();
    if (numel == 0) return output;  // avoid reading x_ptr[0] on an empty tensor
    auto x_ptr = x.data_ptr<float>();
    auto out_ptr = output.data_ptr<float>();
    // Find the maximum (prevents exp overflow)
    float max_val = x_ptr[0];
    for (int64_t j = 1; j < numel; ++j) {
        max_val = std::max(max_val, x_ptr[j]);
    }
    // Subtract the maximum, then compute exp (using 4 accumulators)
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    int64_t i = 0;
    for (; i + 4 <= numel; i += 4) {
        float exp0 = std::exp(x_ptr[i] - max_val);
        float exp1 = std::exp(x_ptr[i + 1] - max_val);
        float exp2 = std::exp(x_ptr[i + 2] - max_val);
        float exp3 = std::exp(x_ptr[i + 3] - max_val);
        out_ptr[i] = exp0;
        out_ptr[i + 1] = exp1;
        out_ptr[i + 2] = exp2;
        out_ptr[i + 3] = exp3;
        sum0 += exp0;
        sum1 += exp1;
        sum2 += exp2;
        sum3 += exp3;
    }
    float sum = sum0 + sum1 + sum2 + sum3;
    // Handle the remainder
    for (; i < numel; ++i) {
        float exp_val = std::exp(x_ptr[i] - max_val);
        out_ptr[i] = exp_val;
        sum += exp_val;
    }
    // Normalize
    for (int64_t j = 0; j < numel; ++j) {
        out_ptr[j] /= sum;
    }
    return output;
}
```
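The reason for subtracting the maximum can be seen with a small torch-free comparison. This is a sketch under stated assumptions: `softmax_naive` and `softmax_shifted` are hypothetical names operating on `std::vector<float>`, not part of the code above. For inputs around 1000, `std::exp` overflows float32 to infinity, so the naive version divides infinity by infinity and produces NaN, while the shifted version only ever exponentiates values less than or equal to 0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive softmax: exponentiates the raw inputs (overflows for large values).
std::vector<float> softmax_naive(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i]);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}

// Shifted softmax: subtracts the maximum first, so every exponent is <= 0.
std::vector<float> softmax_shifted(const std::vector<float>& x) {
    std::vector<float> out(x.size());
    if (x.empty()) return out;
    const float max_val = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - max_val);
        sum += out[i];
    }
    for (float& v : out) v /= sum;
    return out;
}
```

Both functions compute the same mathematical quantity, since the common factor exp(-max) cancels between numerator and denominator.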
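```cpp
// Kahan (compensated) summation: tracks the rounding error of each addition.
float kahan_sum(const float* data, int64_t size) {
    float sum = 0.0f;
    float c = 0.0f;  // compensation term (accumulated rounding error)
    for (int64_t i = 0; i < size; ++i) {
        float y = data[i] - c;  // apply the correction from the previous step
        float t = sum + y;      // low-order bits of y may be lost here
        c = (t - sum) - y;      // recover what was lost
        sum = t;
    }
    return sum;
}
```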
```cpp
torch::Tensor relu_optimized_arm(torch::Tensor x) {
    // 1. Ensure contiguity
    if (!x.is_contiguous()) x = x.contiguous();
    // 2. Type check and conversion
    torch::ScalarType dtype = x.scalar_type();
    bool need_convert = (dtype != torch::kFloat32 && dtype != torch::kFloat64);
    torch::Tensor input = need_convert ? x.to(torch::kFloat32) : x;
    // 3. Create the output
    torch::Tensor output = torch::zeros_like(input);
    // 4. Optimized compute logic (adapted to ARM NEON)
    if (input.scalar_type() == torch::kFloat32) {
        auto x_ptr = input.data_ptr<float>();
        auto out_ptr = output.data_ptr<float>();
        int64_t numel = input.numel();
        // Unroll 4x (matches the NEON float32 width)
        int64_t i = 0;
        for (; i + 4 <= numel; i += 4) {
            out_ptr[i] = std::max(0.0f, x_ptr[i]);
            out_ptr[i + 1] = std::max(0.0f, x_ptr[i + 1]);
            out_ptr[i + 2] = std::max(0.0f, x_ptr[i + 2]);
            out_ptr[i + 3] = std::max(0.0f, x_ptr[i + 3]);
        }
        // Handle the remaining elements
        for (; i < numel; ++i) {
            out_ptr[i] = std::max(0.0f, x_ptr[i]);
        }
    } else if (input.scalar_type() == torch::kFloat64) {
        auto x_ptr = input.data_ptr<double>();
        auto out_ptr = output.data_ptr<double>();
        int64_t numel = input.numel();
        // Unroll 2x (NEON holds 2 doubles per register)
        int64_t i = 0;
        for (; i + 2 <= numel; i += 2) {
            out_ptr[i] = std::max(0.0, x_ptr[i]);
            out_ptr[i + 1] = std::max(0.0, x_ptr[i + 1]);
        }
        for (; i < numel; ++i) {
            out_ptr[i] = std::max(0.0, x_ptr[i]);
        }
    }
    // 5. Restore the original dtype
    if (need_convert) output = output.to(dtype);
    return output;
}
```
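The float32 and float64 branches above differ only in element type and unroll factor, so one possible refactoring is a single templated kernel. This is a sketch, not the document's prescribed structure: `relu_kernel` and `kNeonBytes` are hypothetical names, and the 16-byte register size is the 128-bit NEON width used throughout.

```cpp
#include <algorithm>
#include <cstdint>

// Assumption: 128-bit (16-byte) NEON registers.
constexpr int64_t kNeonBytes = 16;

// Hypothetical kernel: out[i] = max(0, x[i]), unrolled by the number of
// elements of type T that fit in one NEON register.
template <typename T>
void relu_kernel(const T* x, T* out, int64_t n) {
    constexpr int64_t kLanes =
        kNeonBytes / static_cast<int64_t>(sizeof(T));  // 4 for float, 2 for double
    int64_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        // Fixed trip count, so the compiler can fully unroll this inner loop.
        for (int64_t lane = 0; lane < kLanes; ++lane) {
            out[i + lane] = std::max(T(0), x[i + lane]);
        }
    }
    // Scalar tail for the remaining elements.
    for (; i < n; ++i) {
        out[i] = std::max(T(0), x[i]);
    }
}
```

The dtype dispatch then reduces to calling `relu_kernel<float>` or `relu_kernel<double>` on the corresponding data pointers.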
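Apple M-series chips use a unified memory architecture in which the CPU and GPU share memory:

Apple Silicon has performance cores (P-cores) and efficiency cores (E-cores):

- `-mcpu=native`: automatic optimization
- `-O3` optimization enabled?
- `-mcpu=native` used (not `-march`)?

```python
extra_cflags = [
    "-O3",               # highest optimization level
    "-mcpu=native",      # target the current ARM CPU (note: mcpu, not march)
    "-ftree-vectorize",  # auto-vectorization
    "-ffast-math",       # fast math (optional, trades away some precision)
]
```

Key difference: ARM uses `-mcpu` rather than `-march`.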
| Feature | ARM (NEON) | x64 (AVX) |
|---|---|---|
| SIMD width | 128 bits | 256 bits (AVX2), 512 bits (AVX-512) |
| Float32 lanes | 4 | 8 (AVX2), 16 (AVX-512) |
| Float64 lanes | 2 | 4 (AVX2), 8 (AVX-512) |
| Loop unroll factor (float32) | 4x | 8x |
| Loop unroll factor (float64) | 2x | 4x |
| Accumulators (recommended) | 4 | 8 |
| Compile option | -mcpu=native | -march=native |
| Sensitivity to data dependencies | High (needs special attention) | Medium |
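
The lane counts in the table follow directly from register width divided by element size. As a sketch of how code shared between ARM and x64 might pick the unroll factor at compile time, the snippet below keys off the compiler's predefined `__AVX512F__` and `__AVX2__` macros and otherwise assumes 128 bits (the NEON width). The names `kSimdBits` and `simd_lanes` are hypothetical.

```cpp
#include <cstddef>

#if defined(__AVX512F__)
constexpr std::size_t kSimdBits = 512;  // AVX-512
#elif defined(__AVX2__)
constexpr std::size_t kSimdBits = 256;  // AVX2
#else
constexpr std::size_t kSimdBits = 128;  // NEON on aarch64; assumed default elsewhere
#endif

// Number of elements of type T that fit in one SIMD register.
template <typename T>
constexpr std::size_t simd_lanes() {
    return kSimdBits / (8 * sizeof(T));
}
```

On an aarch64 build this yields 4 lanes for `float` and 2 for `double`, matching the ARM column.
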
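| Pitfall | Explanation | Recommendation |
|---|---|---|
| Copying x64 optimizations verbatim | ARM and x64 have different SIMD parallelism | Unroll float32 loops 4x (not 8x) |
| Ignoring data dependencies | ARM NEON instruction latency is high, so dependencies cost more | Use multiple accumulators to break dependencies |
| Using -march | ARM should use -mcpu | Use -mcpu=native |
| Over-unrolling | Unrolling beyond the NEON width brings no benefit | At most 4x for float32 |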
- `-O3 -mcpu=native -ftree-vectorize`
- `-mcpu=native` rather than `-march=native`