veomni-profile

Name: Veomni Profile
Author: ByteDance-Seed

Use this skill for performance profiling and optimization. Two modes: (1) Analyze existing profile files (Chrome traces, memory snapshots): write scripts to parse and summarize metrics per user requirements. (2) Generate profiles during development: configure ProfileConfig, run training, collect traces, analyze bottlenecks, and suggest optimizations. Trigger: 'profile', 'performance', 'slow', 'MFU', 'throughput', 'bottleneck', 'memory usage', 'trace', 'optimize training speed'.

stars: 1,957

forks: 197

updated: 2026-04-13 03:56

SKILL.md

readonly

related-skills.json

Same repository

veomni-migrate-transformers-v5.md

from "ByteDance-Seed/VeOmni"

Use this skill when adding or refreshing a patchgen-generated modeling file for a VeOmni model under veomni/models/transformers/<model>/generated/ — GPU-only or GPU+NPU, dense or MoE, text-only / VLM / Omni-thinker+talker. Covers: creating <model>_{gpu,npu}_patch_gen_config.py, using patchgen decorators (replace_class/override_method/replace_function/modify_init/add_post_import_block/drop_import_names), reusing sibling-model patches via name_map, handling MoE weight-loading (CheckpointTensorConverter + fused gate_up_proj layout), multimodal/VLM forward with Ulysses SP, excluding speech/vocoder subtrees in Omni models (talker/token2wav/DiT/BigVGAN), wiring __init__.py for the patchgen-generated classes, running codegen, and adding test cases. Trigger: 'port <model> to patchgen', 'add patchgen for <model>', 'transformers v5 migration', 'add NPU patchgen'. Do NOT edit files under generated/ manually — always regenerate via patchgen.

2026-05-30 · 2.0k

veomni-debug.md

from "ByteDance-Seed/VeOmni"

Use this skill for ANY bug, error, crash, wrong output, loss divergence, gradient explosion, test failure, CUDA error, distributed training hang, checkpoint load failure, or unexpected behavior. Covers both quick fixes (clear root cause) and complex debugging (unclear cause). Trigger: 'fix bug', 'fix error', 'broken', 'crash', 'doesn't work', 'fails with', 'loss NaN', 'training hangs', 'FSDP error', 'OOM'.

2026-05-29 · 2.0k

veomni-new-model.md

from "ByteDance-Seed/VeOmni"

Use this skill when adding support for a new model to VeOmni. Covers the full lifecycle: analyzing the HuggingFace model, creating model patches, defining parallel plans, writing configs, integrating with the trainer, and testing. Trigger: 'add model', 'support new model', 'integrate <model_name>', 'new model support'.

2026-05-29 · 2.0k

veomni-develop.md

from "ByteDance-Seed/VeOmni"

VeOmni-specific checklist for feature development and refactoring. Covers impact analysis across modalities, trainer hierarchy, data pipeline, and distributed code. Use before implementing any non-trivial change. For model-specific or ops-specific work, use veomni-new-model or veomni-new-op instead. Trigger: 'add feature', 'implement', 'refactor', 'reorganize', 'new capability'.

2026-05-19 · 2.0k

veomni-new-op.md

from "ByteDance-Seed/VeOmni"

Use this skill when adding a new optimized kernel or operator to veomni/ops/. Covers the full lifecycle: understanding VeOmni's ops architecture (KERNEL_REGISTRY + OpSlot dispatch, with a thin function-pointer shim for a few legacy global ops), implementing the kernel, registering it, adding tests, and documenting it. Trigger: 'add op', 'new kernel', 'add attention variant', 'new fused op', 'add triton kernel', 'optimize operator'.

2026-05-19 · 2.0k

veomni-uv-update.md

from "ByteDance-Seed/VeOmni"

Use this skill when updating dependencies managed by uv: bumping a package version, upgrading the uv tool itself, updating torch/CUDA stack, switching transformers version, or regenerating the lockfile. Trigger: 'update dependency', 'bump version', 'upgrade uv', 'update torch', 'update lockfile', 'uv sync fails'.

2026-05-19 · 2.0k

package.json

"author": "ByteDance-Seed"

"repository": "ByteDance-Seed/VeOmni"

$ useful --forSOC

Software Developers · Computer and Mathematical Occupations · 15-1252 · L4

| Component | Location | Purpose |
| --- | --- | --- |
| ProfileConfig | veomni/arguments/arguments_types.py | Config fields: enable, start_step, end_step, trace_dir, profile_memory, with_stack, etc. |
| create_profiler() | veomni/utils/helper.py | Builds torch.profiler.profile (CUDA) or torch_npu.profiler (NPU) with schedule |
| ProfileTraceCallback | veomni/trainer/callbacks/trace_callback.py | Integrates profiler into the training loop via BaseTrainer |
| VeomniFlopsCounter | veomni/utils/count_flops.py | Analytical FLOPs/MFU computation per model family |
| EnvironMeter | veomni/utils/helper.py | Step-level throughput metrics (tokens/s, FLOPs, MFU) |
| merge_chrome_trace.py | scripts/profile/merge_chrome_trace.py | Merge multi-rank Chrome traces for unified viewing |
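For orientation on the MFU figures that VeomniFlopsCounter and EnvironMeter report: MFU is the achieved FLOPs per second divided by the hardware's peak FLOPs per second. The sketch below is not VeOmni's implementation. It assumes the common rough estimate of about 6 FLOPs per parameter per token for dense transformer training, whereas the table describes a per-model-family analytical count. The function names and the example numbers are hypothetical.

```python
def approx_train_flops(num_params: float, num_tokens: float) -> float:
    """Rough dense-transformer estimate (assumption): ~6 FLOPs per parameter
    per token, covering the forward and backward passes."""
    return 6.0 * num_params * num_tokens


def mfu(num_params: float, tokens_per_step: float, step_time_s: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs utilization: achieved FLOPs/s over aggregate peak FLOPs/s."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    achieved_flops_per_s = approx_train_flops(num_params, tokens_per_step) / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical numbers purely for illustration; look up the real peak
# throughput of your accelerator and dtype before drawing conclusions.
example = mfu(num_params=1e9, tokens_per_step=1e6, step_time_s=1.0,
              num_devices=8, peak_flops_per_device=1e15)
print(f"MFU: {example:.1%}")
```

A value computed this way is only as good as the FLOPs estimate; for MoE or multimodal models the 6-FLOPs-per-parameter shortcut can be well off, which is presumably why the repository keeps per-family formulas.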

    import gzip
    import json
    from collections import defaultdict


    def load_chrome_trace(path):
        # Traces may be gzip-compressed; pick the opener by file extension.
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rt') as f:
            return json.load(f)


    def analyze_kernel_time(trace):
        """Group events by kernel name, sum durations."""
        kernel_times = defaultdict(float)
        for event in trace.get('traceEvents', []):
            if event.get('cat') == 'kernel':
                kernel_times[event['name']] += event.get('dur', 0)
        # Longest-running kernels first.
        return sorted(kernel_times.items(), key=lambda x: -x[1])
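To show what the kernel-time grouping above produces, here is a small self-contained variant run on a synthetic trace. It is a sketch under stated assumptions: the helper name top_kernels and the synthetic events are hypothetical, durations are treated as microseconds (the unit used by the Chrome trace event format), and the category is matched case-insensitively because the exact category string may differ between profiler versions.

```python
from collections import defaultdict


def top_kernels(trace, top_k=3):
    """Return (name, total_ms, share_of_kernel_time) for the top_k kernels."""
    totals_us = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Case-insensitive match is a defensive assumption, not a documented rule.
        if str(event.get("cat", "")).lower() == "kernel":
            totals_us[event["name"]] += event.get("dur", 0)
    grand_total = sum(totals_us.values())
    ranked = sorted(totals_us.items(), key=lambda kv: -kv[1])[:top_k]
    return [
        (name, us / 1000.0, us / grand_total if grand_total else 0.0)
        for name, us in ranked
    ]


# Synthetic events for illustration only; real traces carry many more fields.
synthetic = {
    "traceEvents": [
        {"cat": "kernel", "name": "gemm", "dur": 600},
        {"cat": "kernel", "name": "flash_attn_fwd", "dur": 300},
        {"cat": "kernel", "name": "gemm", "dur": 100},
        {"cat": "cpu_op", "name": "aten::mm", "dur": 50},  # ignored: not a kernel
    ]
}

for name, total_ms, share in top_kernels(synthetic):
    print(f"{name:20s} {total_ms:8.3f} ms {share:6.1%}")
```

Reporting a share alongside the absolute time makes it easier to compare traces captured with different step counts.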

    train:
      profile:
        enable: true
        start_step: 5          # skip warmup steps
        end_step: 10           # capture 5 steps
        trace_dir: ./profile_output
        record_shapes: true
        profile_memory: true   # enable memory snapshot (CUDA only)
        with_stack: true       # capture Python call stacks
        with_modules: true     # annotate with nn.Module names
        rank0_only: true       # profile only rank 0 to reduce overhead
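The comments in the config suggest that start_step: 5 and end_step: 10 capture five steps, which reads as a half-open window [start_step, end_step). The sketch below spells out that interpretation together with the rank0_only switch. It is an assumption inferred from the comments, not a description of how create_profiler() or ProfileTraceCallback decide; both function names are hypothetical.

```python
def profiled_steps(start_step: int, end_step: int) -> range:
    """Steps covered by the window, assuming half-open [start_step, end_step)."""
    if start_step < 0 or end_step <= start_step:
        raise ValueError("expected 0 <= start_step < end_step")
    return range(start_step, end_step)


def should_profile(step: int, start_step: int, end_step: int,
                   rank: int = 0, rank0_only: bool = True) -> bool:
    """True if this rank would record a trace at this step (hypothetical helper)."""
    if rank0_only and rank != 0:
        return False
    return step in profiled_steps(start_step, end_step)


window = profiled_steps(5, 10)
print(f"captured steps: {list(window)} ({len(window)} total)")
```

Checking the window this way before a long run helps avoid an empty or oversized trace; longer windows produce larger trace files.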

    source .venv/bin/activate

    # Single GPU
    python tasks/train_text.py --config configs/text/<model>.yaml

    # Multi-GPU (profile will capture per-rank traces)
    torchrun --nproc_per_node=8 tasks/train_text.py --config configs/text/<model>.yaml
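When a multi-GPU run produces one trace per rank, the traces need to be combined before they can be viewed side by side. The repository ships scripts/profile/merge_chrome_trace.py for this; its internals are not shown here, so the following is only a minimal sketch of the general idea, with a hypothetical function name. It gives every (rank, pid) pair a fresh process id so ranks do not collide, and labels each with a process_name metadata event. It does not attempt to align clocks across ranks.

```python
def merge_rank_traces(traces_by_rank):
    """Combine per-rank Chrome traces into one trace dict (illustrative sketch).

    traces_by_rank maps rank -> parsed trace dict with a 'traceEvents' list.
    """
    merged = []
    pid_map = {}  # (rank, original pid) -> new unique pid
    for rank in sorted(traces_by_rank):
        for event in traces_by_rank[rank].get("traceEvents", []):
            key = (rank, event.get("pid", 0))
            if key not in pid_map:
                pid_map[key] = len(pid_map)
                # Metadata event so the viewer shows a rank-aware process label.
                merged.append({
                    "name": "process_name",
                    "ph": "M",
                    "pid": pid_map[key],
                    "args": {"name": f"rank {rank} / pid {key[1]}"},
                })
            if event.get("ph") == "M" and event.get("name") == "process_name":
                continue  # superseded by the rank-aware label above
            merged.append({**event, "pid": pid_map[key]})
    return {"traceEvents": merged}


# Two tiny synthetic per-rank traces that share the same original pid.
rank_traces = {
    0: {"traceEvents": [{"cat": "kernel", "name": "gemm", "ph": "X", "pid": 100, "ts": 0, "dur": 5}]},
    1: {"traceEvents": [{"cat": "kernel", "name": "gemm", "ph": "X", "pid": 100, "ts": 0, "dur": 7}]},
}
combined = merge_rank_traces(rank_traces)
print(len(combined["traceEvents"]), "events after merge")
```

Remapping process ids rather than editing event names keeps the per-kernel analysis unchanged on the merged file.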

| Bottleneck | Typical solutions |
| --- | --- |
| Attention kernels dominate | Switch to FlashAttention 3/4 (veomni/ops/flash_attn/), check FA is actually active |
| NCCL communication > 30% | Increase compute/comm overlap, adjust FSDP reshard policy, try async SP |
| Memory OOM / high peak | Enable activation checkpointing, reduce micro-batch size, check for memory leaks |
| Data loading stalls | Increase num_workers, enable prefetch, check I/O throughput |
| Low MFU (< 40%) | Check dtype (bf16 vs fp32), verify tensor cores are used, check for host-device syncs |
| Uneven per-rank time | Check MoE load balancing, verify data distribution across ranks |
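Two of the bottlenecks in the table can be screened for directly from per-kernel totals. The sketch below buckets kernels by name and flags communication above the 30% threshold from the table. Everything else is an assumption made for illustration: the substring heuristics ("nccl", "attn", "attention"), the 50% cutoff for "attention dominates", and the function names. Summed kernel time also ignores overlap between streams, so a communication share computed this way may overstate its impact on wall-clock step time.

```python
def kernel_shares(kernel_times):
    """Bucket (name, duration) pairs and return each bucket's share of the total."""
    buckets = {"communication": 0.0, "attention": 0.0, "other": 0.0}
    for name, duration in kernel_times:
        lowered = name.lower()
        if "nccl" in lowered:  # heuristic: NCCL collective kernels
            buckets["communication"] += duration
        elif "attn" in lowered or "attention" in lowered:  # heuristic
            buckets["attention"] += duration
        else:
            buckets["other"] += duration
    total = sum(buckets.values())
    return {key: (value / total if total else 0.0) for key, value in buckets.items()}


def flag_bottlenecks(shares, comm_threshold=0.30, attn_threshold=0.50):
    """Return the bucket names whose share exceeds its threshold.

    comm_threshold follows the table; attn_threshold is an arbitrary assumption.
    """
    flags = []
    if shares["communication"] > comm_threshold:
        flags.append("communication")
    if shares["attention"] > attn_threshold:
        flags.append("attention")
    return flags


# Synthetic totals in the (name, duration) shape returned by a per-kernel summary.
sample = [("ncclDevKernel_AllGather", 400.0), ("flash_attn_fwd", 300.0), ("gemm", 300.0)]
shares = kernel_shares(sample)
print(shares, flag_bottlenecks(shares))
```

The input shape matches a list of (kernel name, total duration) pairs, so the output of a per-kernel grouping pass can be fed in without conversion.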


veomni-profile

VeOmni Profiling Infrastructure

Mode 1: Analyze Existing Profile Files

Steps

Mode 2: Generate Profiles During Development

Step 1: Configure Profiling

Step 2: Run Training

Step 3: Collect and Analyze

Step 4: Optimize

Step 5: Validate

NPU (Ascend) Profiling
