Use this skill when adding native LightX2V support for a new model or task: understand an upstream inference repo, map it onto LightX2V runner/model/weight/infer/scheduler/input-encoder/VAE conventions, convert or load weights, add configs and Wan-style scripts without hard-coded paths, implement CFG/KV cache/block-model offload/SP/CFG-parallel without batch dimensions, and verify output parity with the upstream pipeline.

2026-06-15

sycl-esimd-to-python-wheel

software-developers

Full pipeline for turning a SYCL/ESIMD GPU kernel into a Python-importable wheel package on Windows with Intel oneAPI 2025.x and conda. Covers every layer of the stack: ESIMD kernel (.cpp/.h) → Windows DLL (icpx) → PyTorch C++ extension (.pyd, CMake) → Python package → wheel (.whl, scikit-build-core). Use this skill whenever the user is working on Intel Arc GPU (Xe2 / BMG / PTL-H) SYCL or ESIMD kernels and wants to expose them to Python, package them as a wheel, set up a build script, debug build failures, or understand how the DLL + .pyd + wheel layers fit together. Also use it when they hit Windows-specific build issues like setvars.bat failing, cmake.exe producing no output, or ur_api.h not found.

2026-04-17

esimd-lsc-2d-gather-scatter

software-developers

LSC 2D block load/store, 1D block load/store, and gather/scatter operations in Intel ESIMD. Use this skill when working with lsc_load_2d, lsc_store_2d, lsc_prefetch_2d, config_2d_mem_access, block_load, block_store, gather, or scatter in ESIMD kernels. Covers 2D surface descriptors, transposed VNNI loads, tile size constraints, cache hints, and common pitfalls like the rvalue bit_cast_view bug and half transpose limitation.

2026-04-17

esimd-lsc-slm

software-developers

LSC Shared Local Memory (SLM) operations in Intel ESIMD. Use this skill when working with slm_init, slm_block_load, slm_block_store, lsc_slm_gather, lsc_slm_scatter, SLM layout design, barrier synchronization, named barriers, cooperative SLM loading, or any kernel that uses workgroup shared memory on Intel GPUs. Covers SLM size limits, bank conflicts, the lsc_slm_scatter transpose trick, and common pitfalls like forgetting slm_init or conditional barriers causing GPU hangs.

2026-04-17

intel-esimd-base

software-developers

Foundational Intel ESIMD GPU programming skill. Use this skill proactively whenever the user is writing, optimizing, or debugging any SYCL/ESIMD kernel for Intel GPUs — including Intel Arc, Iris Xe, or Data Center GPU Max. Covers kernel design, memory access patterns (block_load, gather, SLM), data types, vectorization, workgroup patterns, hardware characteristics, performance analysis, and troubleshooting. Trigger this even when the user does not explicitly say "ESIMD" — invoke it for any Intel GPU kernel development, performance bottleneck questions, or SYCL optimization tasks targeting Intel hardware.

2026-04-17

intel-esimd-fuse

software-developers

Expert guidance for implementing fused multi-operation kernels on Intel GPUs using ESIMD. Use this skill whenever the user needs to fuse multiple operations into a single kernel pass to minimize memory traffic, such as softmax + top-K + normalize, or any pipeline that chains reduction, selection, and normalization in one kernel. Also trigger for ESIMD softmax implementation, vectorized exp on simd<float,N> for a full row, detail::sum vs reduce pitfall (reduce silently returns 0), fused attention block selection with probability normalization, or any kernel that computes softmax probabilities and immediately selects the top-K entries. The main example is the fused softmax+topk+normalize V2 variant achieving 43.2 GB/s (43% bandwidth utilization) for seq_len=32K, N=128, K=8.

2026-04-17
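
For readers unfamiliar with the fused softmax + top-K + normalize operation named in the description above, the following is a minimal pure-Python sketch of its reference semantics only. It is not the ESIMD kernel and not the skill's V2 variant; the function name `softmax_topk_normalize` and the sample row are hypothetical, chosen for illustration.

```python
import math
from typing import List, Tuple


def softmax_topk_normalize(row: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Softmax over one row, keep the k largest probabilities, renormalize them to sum to 1.

    Hypothetical reference helper; a fused GPU kernel would do all of this in one pass.
    """
    if not 0 < k <= len(row):
        raise ValueError("k must be between 1 and len(row)")

    # Numerically stable softmax: subtract the row maximum before exponentiating.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Select the indices of the k largest probabilities (ties keep original order).
    top_idx = sorted(range(len(row)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize the selected probabilities so they sum to 1.
    top_sum = sum(probs[i] for i in top_idx)
    top_probs = [probs[i] / top_sum for i in top_idx]
    return top_idx, top_probs


indices, weights = softmax_topk_normalize([1.0, 3.0, 2.0, 0.5], k=2)
```

Because the final renormalization divides by the sum of the selected entries, the full-row softmax denominator cancels out of the result. A fused implementation can therefore, in principle, normalize only the selected exponentials.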

intel-gpu-hw-info

software-developers

Definitive reference for Intel GPU hardware specifications across architectures. Covers Xe2 (Lunar Lake/LNL, Battlemage/BMG) and Xe3 (Panther Lake/PTL, Panther Lake-H/PTLH) GPU hardware: XE core counts, memory bandwidth, XMX/DPAS compute, GRF sizes, SLM limits, thread counts, EU layout, L3 cache, TDP. Use whenever the user asks about Intel GPU specs, hardware comparison, architecture differences, roofline parameters, or thread/memory limits. Trigger for questions like "how many XE cores", "what is BMG bandwidth", "PTL vs BMG", "Xe2 specs", "LNL GPU", etc.

2026-04-17

intel-gpu-kernel-opt

software-developers

General Intel GPU kernel optimization methodology. Use this skill when profiling or optimizing any ESIMD or SYCL kernel on Intel GPUs, performing roofline analysis, diagnosing bottlenecks (register spill, SLM bank conflicts, barrier overhead, memory coalescing), comparing Xe2 vs Xe3 hardware, or planning an optimization workflow. Covers VTune and GTPin profiling, key metrics (TFLOPS, GB/s, peak %), hardware comparison (Xe2: LNL/BMG vs Xe3: PTL/PTLH), and optimization patterns (prefetch, load/compute separation, loop unrolling, SIMD width selection). Xe2 is the architecture for Lunar Lake (LNL) and Battlemage (BMG); Xe3 is the architecture for Panther Lake (PTL) and Panther Lake-H (PTLH). Trigger for any Intel GPU performance question.

2026-04-17
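
The roofline analysis mentioned in the description above can be sketched in a few lines. This is a generic illustration of the roofline model, not code from the skill. The peak and bandwidth figures are placeholder values, not specifications of any Intel GPU, and the function names are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops: float, bytes_moved: float) -> float:
    """Roofline bound: min(compute peak, arithmetic intensity * memory bandwidth)."""
    intensity = flops / bytes_moved  # FLOP per byte
    return min(peak_gflops, intensity * bandwidth_gbs)


def limiting_resource(peak_gflops: float, bandwidth_gbs: float,
                      flops: float, bytes_moved: float) -> str:
    """Classify a kernel by comparing its intensity to the ridge point."""
    ridge = peak_gflops / bandwidth_gbs  # intensity at which the two roofs meet
    return "memory" if flops / bytes_moved < ridge else "compute"


# Placeholder machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK, BW = 1000.0, 100.0

# Placeholder kernel: 2 FLOP for every 8 bytes moved (intensity 0.25 FLOP/byte).
bound = attainable_gflops(PEAK, BW, flops=2.0, bytes_moved=8.0)
kind = limiting_resource(PEAK, BW, flops=2.0, bytes_moved=8.0)
```

With these placeholder numbers the kernel sits far below the ridge point of 10 FLOP/byte, so it is memory-bound and reducing memory traffic matters more than adding arithmetic throughput.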

Showing top 8 of 16 collected skills in this repository.

#002

LightLLM

13 skills | 4.2k | 344 | updated 2026-06-15

43% of creator

skill

occupation

description

updated

lightllm-profiler-control

software-developers

LightLLM profiler usage guide. Use when you need to start or stop LightLLM's torch_profiler / nvtx profiling, especially to look up how to use --enable_profiling, /profiler_start, and /profiler_stop.

2026-06-15

test-model-common

software-developers

Common override guidance for all skills/test_model sub-skills. Applies to LightLLM model accuracy/speed tests that use lm_eval or lmms_eval, especially local-completions GSM8K runs.

2026-06-11

test-model-qwen3-5-0-8b-pd-nixl

software-developers

LightLLM Qwen3.5-0.8B PD disaggregation over NIXL gsm8k: pd_master on 8089, prefill on 8001, decode on 8002. Supports TP1 and TP2 runs by setting TP / PREFILL_CUDA_DEVICES / DECODE_CUDA_DEVICES. Qwen3.5 has linear-attention state transfer; use --pd_kv_page_size 2048 and --pd_kv_page_num 16. lm_eval hits pd_master URL. Requires UCX/RDMA env, nvidia_peermem check, curl warmup before lm_eval, registration wait in pd_master.log, and summary.txt. Includes optional repeated-prompt decode cache probe for linear-att page-boundary behavior.

2026-06-10
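
The "curl warmup before lm_eval" step in the description above could look roughly like the sketch below, written in Python rather than curl. Only port 8089 for pd_master comes from the description. The `/v1/completions` path (the OpenAI-style route that lm_eval's local-completions backend targets), the payload fields, the host, and the model name are all assumptions for illustration and may not match what the skill actually sends.

```python
import json
import urllib.request


def build_warmup_request(host: str = "127.0.0.1", port: int = 8089,
                         model: str = "placeholder-model") -> urllib.request.Request:
    """Build (but do not send) one small completion request aimed at pd_master.

    Assumed, not confirmed by the source: the endpoint path and payload shape.
    """
    payload = {"model": model, "prompt": "1 + 1 =", "max_tokens": 8}
    return urllib.request.Request(
        url=f"http://{host}:{port}/v1/completions",  # assumed OpenAI-style route
        data=json.dumps(payload).encode("utf-8"),    # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


request = build_warmup_request()

# To send it once prefill and decode have registered with pd_master:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the sketch runnable without a live server; the commented lines show where the actual call would go.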

test-model-qwen3-8b-pd-nixl

software-developers

LightLLM Qwen3-8b PD disaggregation gsm8k: pd_master on 8089, prefill on 8001, decode on 8002, tp 2 each. Assign four GPUs via nvidia-smi then export PREFILL_CUDA_DEVICES / DECODE_CUDA_DEVICES (no fixed card IDs; no complex shell automation). UCX_NET_DEVICES and TLS for RDMA per cluster. lm_eval hits pd_master URL. HOST vs PD_MASTER_IP when co-located. Before lm_eval, must POST one completion via curl to pd_master for warmup verification. Requires LOG_DIR, MODEL_DIR, proxy cleared, no_proxy, summary.txt. Same-GPU model_infer + pd_*_trans need NVIDIA MPS for best KV copy perf; record MPS on/off in summary. Run check_nvidia_peermem.sh in this skill dir; record in summary.txt. Use for PD separation tests with either the default NIXL transport or NCCL transport.

2026-06-10

test-model-deepseekv32-ep

software-developers

Runs LightLLM DeepSeek-V3.2 EP MoE gsm8k: api_server with --tp 8 --dp 8 --enable_ep_moe, tool_call_parser deepseekv32, reasoning_parser deepseek-v3, graph_max_batch_size 32, mem_fraction 0.8, LOADWORKER 14, port 8000 aligned with lm_eval base_url. Requires a dedicated log directory, api_server and eval logs, summary.txt consolidated report. lm_eval uses tokenizer_backend=null (server-side tokenization) because local transformers does not recognize model_type deepseek_v32. Distinct from R1 MTP/Base flows. Use for V3.2 EP MoE gsm8k accuracy on LightLLM.

2026-06-05

test-model-deepseekr1-mtp-tp

software-developers

DeepSeek-R1 MTP-TP test: LightLLM api_server with MTP (EAGLE) draft, tensor parallel only (--tp 8, no --dp, no EP MoE), plus GSM8K lm_eval on localhost. Distinct from the MTP-EP-TPDP skill which uses --tp 8 --dp 8 and EP MoE. Requires a dedicated log directory, summary.txt, tokenizer aligned with MODEL_DIR. Use for TP-only MTP gsm8k accuracy runs.

2026-05-22

test-model-deepseekr1-base-tp

software-developers

Runs LightLLM DeepSeek-R1 baseline TP gsm8k: single api_server with --tp 8 and --batch_max_tokens only, no MTP draft, no --dp, no EP MoE (distinct from deepseekr1-mtp-tp which adds MTP). GSM8K lm_eval on localhost port 8089. Requires a dedicated log directory, api_server and eval logs under that tree, summary.txt as consolidated report, tokenizer aligned with MODEL_DIR. Use for baseline R1 tensor-parallel accuracy runs without MTP/EP.

2026-05-13

test-model-deepseekr1-mtp-ep

software-developers

Runs LightLLM DeepSeek-R1 EP MoE + MTP (EAGLE) server variants and GSM8K lm_eval against localhost. Requires each full run to use a dedicated log directory: persist every api_server process log under that tree (per-variant subdirectories recommended), write the consolidated summary to summary.txt in that same log directory, and keep artifacts separated from other test runs. Use when running DeepSeek-R1 MTP EP accuracy workflows or when the user asks to run these four server configurations one-by-one with logged results.

2026-05-13

Showing top 8 of 13 collected skills in this repository.

#003

LightX2V-Skills

1 skill | 0 | 0 | updated 2026-06-23

3.3% of creator

skill

occupation

description

updated

lightx2v-ai-video-generation

software-developers

Run LightX2V AI generation tasks from the command line via OpenAPI. Commands: lightx2v login, models, run, query, list, cancel, resume, delete, result, completion. Tasks: t2i, t2v, i2v, s2v, flf2v, t2av, i2av, vsr, animate, i2i. Use for: scriptable video/image generation, CI pipelines, agent automation, batch jobs. Triggers: lightx2v cli, lightx2v run, api key generation, text to video cli, text to image cli, download generated video, poll task status

2026-06-23

Showing 3 of 3 repositories
