A code-generation agent that converts design plans into executable code.
To install the skill, run this command in Claude Code:
npx skills add https://github.com/mindspore-ai/akg --skill coder-agent
| Field | Value |
|---|---|
| name | coder-agent |
| description | A code-generation agent that converts design plans into executable code |
| category | agent |
| version | 2.0.0 |
| license | MIT |
| structure | {"child_skills":["cuda-basics","triton-syntax"],"default_children":["triton-syntax"]} |
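Coder Agent is the core code-generation component of AKG Agents. It is responsible for converting design plans into executable code for the following backends: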
| Backend | Architecture | DSL priority |
|---|---|---|
| NVIDIA GPU | CUDA | CUDA > Triton > OpenCL |
| AMD GPU | ROCm | OpenCL > HIP |
| Intel GPU | OneAPI | SYCL > OpenCL |
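The priority column can be read as an ordered fallback list. A minimal sketch of that lookup follows, assuming the priorities in the table; the names `DSL_PRIORITY` and `pick_dsl` are hypothetical and not part of the Coder Agent API.

```python
# Priority lists copied from the table above (highest priority first).
# DSL_PRIORITY and pick_dsl are hypothetical names used for illustration only.
DSL_PRIORITY = {
    "nvidia": ["cuda", "triton", "opencl"],
    "amd": ["opencl", "hip"],
    "intel": ["sycl", "opencl"],
}


def pick_dsl(backend, available):
    """Return the highest-priority DSL for `backend` that is in `available`."""
    for dsl in DSL_PRIORITY[backend]:
        if dsl in available:
            return dsl
    raise ValueError(f"no supported DSL available for backend {backend!r}")


# CUDA is not installed here, so the next entry in the NVIDIA list is chosen.
print(pick_dsl("nvidia", {"triton", "opencl"}))
```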
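Input: algorithm design + target backend + performance requirements
↓
Step 1: Load the relevant skills (e.g. cuda-basics, triton-syntax)
↓
Step 2: Generate the initial code skeleton
↓
Step 3: Fill in the compute logic
↓
Step 4: Apply optimization techniques
↓
Step 5: Add error handling
↓
Output: compilable, high-performance code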
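// Tile edge length (assumed value); launch with TILE_SIZE x TILE_SIZE thread blocks.
#define TILE_SIZE 16

// Tiled matrix multiplication: C[M, N] = A[M, K] * B[K, N], all row-major.
__global__ void matmul_kernel(
    const float* A,
    const float* B,
    float* C,
    int M, int N, int K
) {
    // Shared-memory tiles of A and B
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    // Output element handled by this thread
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk the K dimension one tile at a time
    for (int tile = 0; tile < (K + TILE_SIZE - 1) / TILE_SIZE; ++tile) {
        // Load one element of each tile into shared memory, zero-padding out-of-range entries
        if (row < M && (tile * TILE_SIZE + threadIdx.x) < K)
            As[threadIdx.y][threadIdx.x] = A[row * K + tile * TILE_SIZE + threadIdx.x];
        else
            As[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < N && (tile * TILE_SIZE + threadIdx.y) < K)
            Bs[threadIdx.y][threadIdx.x] = B[(tile * TILE_SIZE + threadIdx.y) * N + col];
        else
            Bs[threadIdx.y][threadIdx.x] = 0.0f;

        // Wait until both tiles are fully loaded
        __syncthreads();

        // Accumulate the partial dot product for this tile
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }

        // Wait before the tiles are overwritten in the next iteration
        __syncthreads();
    }

    // Write the result back, skipping threads outside the matrix
    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}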
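To check a kernel like the one above against a reference, it can help to mirror its tiling on the host. The following is a minimal pure-Python sketch, assuming row-major nested lists and a small tile size chosen for readability; `tiled_matmul` is a hypothetical helper, not generated by Coder Agent. It stages zero-padded tiles the same way the bounds-checked shared-memory loads do.

```python
TILE_SIZE = 4  # small tile for illustration; the CUDA kernel would use a larger one


def tiled_matmul(A, B, tile=TILE_SIZE):
    """Multiply A[M][K] by B[K][N] tile by tile, zero-padding edge tiles."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for r0 in range(0, M, tile):          # one "block" per output tile
        for c0 in range(0, N, tile):
            for k0 in range(0, K, tile):  # walk the K dimension tile by tile
                # Stage both tiles; out-of-range entries become 0.0
                As = [[A[r0 + i][k0 + k] if r0 + i < M and k0 + k < K else 0.0
                       for k in range(tile)] for i in range(tile)]
                Bs = [[B[k0 + k][c0 + j] if k0 + k < K and c0 + j < N else 0.0
                       for j in range(tile)] for k in range(tile)]
                # Each (i, j) pair plays the role of one thread in the block
                for i in range(tile):
                    for j in range(tile):
                        if r0 + i < M and c0 + j < N:
                            C[r0 + i][c0 + j] += sum(
                                As[i][k] * Bs[k][j] for k in range(tile))
    return C
```

Because the padded entries are zero, they add nothing to the partial sums, which is why the kernel can run full-size inner loops even on edge tiles.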
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(
    A_ptr, B_ptr, C_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr
):
    # Program IDs: one program per (BLOCK_M, BLOCK_N) output tile
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Row, column, and K offsets covered by this program
    offs_am = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_bn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first tiles of A and B
    a_ptrs = A_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = B_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn

    # float32 accumulator
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    # Walk the K dimension one block at a time
    for k in range(0, K, BLOCK_K):
        # Mask rows/columns beyond M/N as well as the K tail, so edge tiles never read out of bounds
        a = tl.load(a_ptrs, mask=(offs_am[:, None] < M) & (offs_k[None, :] < K - k), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] < K - k) & (offs_bn[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Write the tile back, masking elements outside C
    c_ptrs = C_ptr + offs_am[:, None] * stride_cm + offs_bn[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_am[:, None] < M) & (offs_bn[None, :] < N))
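The kernel reads `tl.program_id(0)` and `tl.program_id(1)`, so it presumably expects a 2D launch grid with one program per output tile. The source does not show the launch code, so the sketch below is an assumption: `cdiv`, `launch_grid`, and `k_mask` are hypothetical pure-Python helpers that reproduce the grid arithmetic and the K-tail mask (`offs_k < K - k`) without needing a GPU.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks of size b needed to cover a."""
    return (a + b - 1) // b


def launch_grid(M, N, BLOCK_M, BLOCK_N):
    """Assumed 2D grid: axis 0 covers rows of C, axis 1 covers columns."""
    return (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))


def k_mask(k, K, BLOCK_K):
    """Which of the BLOCK_K lanes are in range at loop offset k (offs_k < K - k)."""
    return [lane < K - k for lane in range(BLOCK_K)]


# A 1000 x 512 output with 64 x 64 tiles needs a 16 x 8 grid.
print(launch_grid(1000, 512, 64, 64))
# With K = 100 and BLOCK_K = 32, the last iteration (k = 96) keeps only 4 lanes.
print(sum(k_mask(96, 100, 32)))
```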
Coder Agent uses RAG (Retrieval-Augmented Generation) to learn from previously successful cases.
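# `coder` (a Coder Agent instance) and `design` (the design input) are assumed to be created elsewhere.
# Enable verbose logging
coder.set_debug_mode(True)

# Generate commented code
code = coder.generate(
    design=design,
    add_comments=True,
    add_assertions=True
)

# Profile the generated code
profile = coder.profile_code(code)
print(f"Expected performance: {profile['gflops']} GFLOPS")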
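The source does not say how `profile_code` arrives at its `gflops` figure. For a matrix multiplication, the conventional definition counts one multiply and one add per inner-product term, which can be sketched as follows; `matmul_gflops` is a hypothetical helper and the timing value is made up for illustration.

```python
def matmul_gflops(M, N, K, seconds):
    """GFLOPS of C[M, N] = A[M, K] @ B[K, N] given the measured run time."""
    flops = 2 * M * N * K  # one multiply + one add per (i, j, k) term
    return flops / seconds / 1e9


# Illustrative numbers only: a 1024^3 matmul that took 1 ms.
print(f"{matmul_gflops(1024, 1024, 1024, 1e-3):.1f} GFLOPS")
```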
coder_agent:
  dsl: triton              # cuda, triton, opencl
  backend: nvidia          # nvidia, amd, intel
  optimization_level: 2    # 0-3
  strategy: iterative      # conservative, iterative, aggressive
  enable_rag: true
  max_retries: 3
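The inline comments in the configuration list the allowed values for each field. One way to enforce them before handing the settings to the agent is sketched below, assuming the configuration has already been parsed into keyword arguments; `CoderAgentConfig` is a hypothetical class, not part of the documented API.

```python
from dataclasses import dataclass

# Allowed values taken from the comments in the configuration above.
ALLOWED_DSL = {"cuda", "triton", "opencl"}
ALLOWED_BACKEND = {"nvidia", "amd", "intel"}
ALLOWED_STRATEGY = {"conservative", "iterative", "aggressive"}


@dataclass
class CoderAgentConfig:
    """Hypothetical typed view of the coder_agent settings."""
    dsl: str = "triton"
    backend: str = "nvidia"
    optimization_level: int = 2
    strategy: str = "iterative"
    enable_rag: bool = True
    max_retries: int = 3

    def __post_init__(self):
        if self.dsl not in ALLOWED_DSL:
            raise ValueError(f"unknown dsl: {self.dsl!r}")
        if self.backend not in ALLOWED_BACKEND:
            raise ValueError(f"unknown backend: {self.backend!r}")
        if not 0 <= self.optimization_level <= 3:
            raise ValueError("optimization_level must be in 0-3")
        if self.strategy not in ALLOWED_STRATEGY:
            raise ValueError(f"unknown strategy: {self.strategy!r}")


config = CoderAgentConfig(dsl="cuda", optimization_level=3)
print(config)
```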
| DSL | Compile time | Runtime performance | Portability |
|---|---|---|---|
| CUDA | Medium | Excellent | Low (NVIDIA only) |
| Triton | Fast | Excellent | Medium (GPU-generic) |
| OpenCL | Slow | Good | High (cross-platform) |
Choose an appropriate DSL:
Optimization strategy:
Code reuse:
Testing and validation:
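Matrix multiplication: large-K optimization (K >> M, N) for A[M, K] @ B[K, N] = C[M, N]. For cases where M and N are small but K is very large (e.g. M = N = 256, K = 131072), Split-K partitions the K dimension so it can be processed in parallel, and a workspace-plus-reduce step replaces global synchronization, yielding a significant performance improvement.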
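The Split-K idea can be illustrated on the host: each K slice produces an independent partial result in a workspace, and a separate reduce pass sums the slices. This is a minimal pure-Python sketch of the data flow only, not the skill's kernel; `split_k_matmul` and its `splits` parameter are hypothetical names.

```python
def split_k_matmul(A, B, splits):
    """Compute A[M][K] @ B[K][N] by splitting K into `splits` independent slices."""
    M, K, N = len(A), len(B), len(B[0])
    chunk = (K + splits - 1) // splits  # K elements per slice (last may be shorter)

    # Stage 1: each slice writes its own partial C into the workspace.
    # The slices share no state, so on a device they could run in parallel.
    workspace = []
    for s in range(splits):
        k_lo, k_hi = s * chunk, min((s + 1) * chunk, K)
        partial = [[sum(A[i][k] * B[k][j] for k in range(k_lo, k_hi))
                    for j in range(N)] for i in range(M)]
        workspace.append(partial)

    # Stage 2: a separate reduce pass sums the partial results,
    # standing in for a global synchronization between slices.
    return [[sum(workspace[s][i][j] for s in range(splits))
             for j in range(N)] for i in range(M)]
```

With small M and N there are few output tiles to distribute, so splitting K is what exposes enough independent work; the cost is the extra workspace of `splits * M * N` elements and the reduce pass.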
Triton Ascend hard API restrictions and forbidden syntax. MUST-follow rules that apply to every kernel: forbidden control flow (return/break/continue/lambda/while), tensor slice/index restrictions, scalar conversion rules, BLOCK_SIZE upper bound. Violating any of these produces a compile or runtime error on Ascend.
General performance-optimization strategies for Triton Ascend: BLOCK_SIZE selection (1024-2048 for elementwise, must be <65536), grid configuration (use VEC_CORE_NUM / CUBE_CORE_NUM, 2D/3D grid for matmul / conv / reduce, 1D grid + inner loop for elementwise / pointwise), 256B alignment for memory transfers, autotune block-size patterns, fp16 / fp32 precision conversion. Bind via keywords like matmul, elementwise, reduce, block_size, grid, autotune, alignment, fp16, fp32, tile, interleaved-loop, cube-core, vec-core.
Generates optimized operators through the adaptive_search or evolve search-based workflows. Runs in the background in silent mode, with progress monitored by polling.
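An optimization guide for reduction operators and for composite operators that contain a reduction sub-step (such as normalization). Typical operators include: sum, mean, max, min, prod, argmax, argmin, cumsum, cumprod, softmax, logsoftmax, layernorm, rmsnorm, groupnorm, instancenorm, batchnorm, l1norm, l2norm, frobeniusnorm, var, std, average_pooling, sum_pooling, and so on. Especially important: when the reduction dimension is not the last one (e.g. reducing dim=1 of shape=[B,F,D1,D2]), multi-dimensional indexing and two-stage reduction must be handled correctly. Includes an explanation of PyTorch normalized_shape multi-axis normalization semantics. Not applicable to pure elementwise operations or matrix multiplication. If the operator is a loss function (elementwise computation followed by a global reduction), choose the elementwise-reduce-fused guide instead.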
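The non-last-dimension case is mostly an indexing problem. The sketch below sums dim=1 of a contiguous [B, F, D1, D2] tensor stored as a flat list, using partial sums over chunks of F followed by a final combine. It is a plain-Python illustration of the index arithmetic, assuming row-major layout; `sum_dim1` and `chunk` are hypothetical names, and this is not Triton Ascend code.

```python
def sum_dim1(flat, shape, chunk=4):
    """Sum over dim=1 of a row-major [B, F, D1, D2] tensor stored in `flat`."""
    B, F, D1, D2 = shape
    inner = D1 * D2  # elements per (b, f) slice; also the stride of the F axis
    out = [0.0] * (B * inner)
    for b in range(B):
        for i in range(inner):  # i flattens the (d1, d2) position
            # Stage 1: partial sums over chunks of the reduced axis F
            partials = [
                sum(flat[(b * F + f) * inner + i]
                    for f in range(f0, min(f0 + chunk, F)))
                for f0 in range(0, F, chunk)
            ]
            # Stage 2: combine the partial sums into the final value
            out[b * inner + i] = sum(partials)
    return out  # row-major [B, D1, D2]


values = [float(v) for v in range(2 * 5 * 2 * 3)]
print(sum_dim1(values, (2, 5, 2, 3))[:3])
```

The key point is that consecutive elements along F are `inner` apart in memory, not adjacent, so the reduced axis has to be addressed through its stride rather than by reading a contiguous run.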
CPU C++ operators: core concepts, standard structural patterns, KernelBench code conventions, and the inline extension approach.