Triton Ascend hard API restrictions and forbidden syntax. MUST-follow rules that apply to every kernel: forbidden control flow (return/break/continue/lambda/while), tensor slice/index restrictions, scalar conversion rules, BLOCK_SIZE upper bound. Violating any of these produces a compile or runtime error on Ascend.

2026-04-19

triton-ascend-optimization

software-developers

General performance-optimization strategies for Triton Ascend: BLOCK_SIZE selection (1024-2048 for elementwise, must be <65536), grid configuration (use VEC_CORE_NUM / CUBE_CORE_NUM, 2D/3D grid for matmul / conv / reduce, 1D grid + inner loop for elementwise / pointwise), 256B alignment for memory transfers, autotune block-size patterns, fp16 / fp32 precision conversion. Bind via keywords like matmul, elementwise, reduce, block_size, grid, autotune, alignment, fp16, fp32, tile, interleaved-loop, cube-core, vec-core.

2026-04-19

search-workflow

software-developers

Generate optimized operators through the search-based adaptive_search or evolve workflows. Runs in the background in silent mode, with progress monitored by polling.

2026-04-16

triton-ascend-reduce

software-developers

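Optimization guide for reduction (reduce) operators and for composite operators that contain a reduction sub-step (such as normalization). Typical operators include: sum, mean, max, min, prod, argmax, argmin, cumsum, cumprod, softmax, logsoftmax, layernorm, rmsnorm, groupnorm, instancenorm, batchnorm, l1norm, l2norm, frobeniusnorm, var, std, average_pooling, sum_pooling, and others. Especially important: when the reduction dimension is not the last dimension (e.g. reducing over dim=1 with shape=[B,F,D1,D2]), multi-dimensional indexing and two-stage reduction must be handled correctly. Includes an explanation of PyTorch normalized_shape multi-axis normalization semantics. Not applicable to pure elementwise operations or matrix multiplication. If the operator is a loss function (elementwise computation followed by a global reduction), choose the elementwise-reduce-fused guide instead.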

2026-04-16

cpu-basics

software-developers

Core concepts of CPU C++ operators, standard structural patterns, KernelBench code conventions, and inline extension methods.

2026-04-13

cpu-optimization-arm

software-developers

Performance-optimization techniques for the ARM CPU architecture, NEON SIMD vectorization, numerical stability, and debugging strategies.

2026-04-13

cpu-optimization-x64

software-developers

Performance-optimization techniques for the x64 CPU architecture, SIMD/AVX vectorization, numerical stability, and debugging strategies.

2026-04-13

Showing top 8 of 117 collected skills in this repository.

#002

mindspore-lite

9 skills · 51 · updated 2026-07-02

6.8% of creator

skill

occupation

description

updated

onnx-model-conversion-and-deployment

software-developers

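End-to-end workflow for offline conversion (ONNX → MindIR) and inference deployment on the Ascend backend for MindSpore Lite cloud-side inference. Covers conversion strategies for fixed shapes, bucketed dynamic shapes, and fully dynamic shapes, as well as MindIR inference validation and deployment considerations.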

2026-07-02

open-source-model-migration

software-developers

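Adapt an open-source model to the MindSpore Lite deployment pipeline: export ONNX split by network structure, validate inference with ONNX Runtime, convert ONNX→MindIR, implement MindSpore Lite inference, and deliver documentation and an FAQ. Invoke when the user wants to migrate an open-source model to MSLite deployment.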

2026-07-02

performance-optimization

software-developers

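Comprehensive guide to performance optimization of MindSpore Lite (Ascend) models. Invoke when doing baselining/profiling, rewriting with fused operators, zero-copy inference, PTQ int8 quantization, accuracy alignment, or archiving. This document is an overview and index; detailed strategies are in references/.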

2026-07-02

lite-build

software-developers

Build configuration, CMake options, cross-compilation and packaging. Use when building MindSpore Lite, configuring CMake, cross-compiling for ARM/iOS/MCU, packaging release archives, or troubleshooting build errors.

2026-07-02

lite-converter

software-developers

Model conversion pipeline, parser development, optimization passes and quantization. Use when converting models to .ms, writing parser code, implementing optimizer passes, or configuring quantization.

2026-07-02

lite-debug-test

software-quality-assurance-analysts-and-testers

Debugging, unit testing, benchmarking and performance analysis. Use when running gtest, benchmark tools, profiling latency or accuracy, diagnosing operator precision issues, delegate fallback, or memory leaks.

2026-07-02

lite-device-side-infer

software-developers

Device-side inference with LiteRT, NNACL and hardware delegates. Use for mobile/IoT inference, Android/iOS integration, NPU/GPU/CoreML delegates, Micro codegen for MCU, on-device training, or C/C++/Java/Python API usage with .ms models.

2026-07-02

lite-kernel-dev

software-developers

Operator and kernel development, NNACL, delegates, custom kernel registration. Use when adding operators, implementing NNACL kernels, writing delegate adapters (NPU/CoreML/Ascend), registering custom kernels, or modifying operator schema.

2026-07-02

Showing top 8 of 9 collected skills in this repository.

#003

hyper-parallel

7 skills · 53 · updated 2026-06-29

5.3% of creator

skill

occupation

description

updated

platform-dev

software-developers

HyperParallel platform abstraction layer development. Use when adding new platform APIs, implementing cross-platform features (FSDP/HSDP/Pipeline/Activation Checkpoint), creating DTensorBase extensions, or modifying collective operations. Covers both PyTorch and MindSpore backends.

2026-06-29

autogit

software-developers

GitCode fork workflow automation. Use this skill whenever the user wants to commit code, push, create or append to a Pull Request, view PR status, squash commits, regenerate a PR description, or run lint checks against a GitCode `origin` (fork) + `upstream` repository. Supports both Chinese and English natural-language triggers (e.g. "帮我提交", "create PR", "看下 PR 状态") and slash-command shortcuts (`/commit`, `/create-pr`, etc.). The full trigger → subcommand mapping lives in the "When to Activate" section.

2026-06-22

dist-op-analysis

software-developers

Distributed operator analysis. Analyzes operator interfaces provided by the user and outputs a standardized implementation plan. Requires human confirmation before development begins.

2026-06-21

dist-op-dev

software-developers

Distributed operator development. Reads a confirmed plan file, implements operator code and tests, and runs until all executable tests pass. Goal mode — no step-by-step confirmation needed.

2026-06-21

code-review

software-quality-assurance-analysts-and-testers

Review HyperParallel code changes for distributed correctness, stream synchronization, memory safety, cross-platform consistency, and code quality. Use when reviewing PRs, code changes, or when the user mentions "review", "code review", or "check this".

2026-06-02

gate-doctor

software-quality-assurance-analysts-and-testers

Drive a red MindSpore-family GitCode PR gate to green end-to-end — read state, post /check-pr → /retest, classify every failure (root-cause fix in production code for PR-INDUCED, /retest only for random flakes, REVERT-BEFORE-MERGE temp comment-out for confirmed unrelated sticky flakes), then loop until both pr-check-pass AND ci-pipeline-passed labels appear. Emits one structured final report on exit. Trigger on: 触发门禁 / 重跑门禁 / 门禁失败 / PR 流水线挂了 / 修一下流水线 / autofix 门禁 / 把 PR 修绿 / 一直跑到通过 / 看下 #N 现在情况 / ut/st/Ascend 用例失败 / Smoke_Ascend 挂了 / 门禁随机挂 / gate is failing / CI is flaky / random retest pass / /retest / /check-pr, or any prompt naming a MindSpore GitCode PR alongside a request to investigate or fix CI. Default PR repo is mindspore/hyper-parallel; pass full URL or `mindspore/mindspore#N` for cross-repo.

2026-06-02

parallel-strategy-analyzer

software-developers

Analyze model architecture and hardware constraints to recommend optimal parallel strategy combinations (DP/FSDP/TP/PP/EP/CP) with memory, communication, compute, and pipeline bubble estimation.

2026-06-02

Showing 3 of 3 repositories