angelslim
Tencent AngelSlim: accessible, comprehensive, and efficient toolkit for large model compression. Quantization (FP8/INT4/NVFP4/1.25-bit), pruning, speculative decoding (Eagle3), and diffusion model compression.
| name | angelslim |
| description | Tencent AngelSlim — accessible, comprehensive, and efficient toolkit for large model compression. Quantization (FP8/INT4/NVFP4/1.25-bit), pruning, speculative decoding (Eagle3), and diffusion model compression. |
| tags | ["angelslim","model-compression","quantization","pruning","speculative-decoding","tencent","zorai"] |
AngelSlim integrates mainstream compression algorithms into a unified framework with one-click access. It supports FP8/INT8/INT4/NVFP4/1.25-bit quantization, pruning, Eagle3 speculative decoding, and diffusion model compression for LLMs, VLMs, and audio models.
uv pip install angelslim
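import angelslim as slim
# `model` is assumed to be a model object that has already been loaded
# FP8 static quantization (scales fixed ahead of time rather than computed per input)
model = slim.quantize(model, dtype="fp8_static", qconfig="default")
# INT4 GPTQ, calibrated on the wikitext2 dataset
model = slim.quantize(model, dtype="int4_gptq", dataset="wikitext2")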
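To make the INT4 case more concrete, here is a minimal pure-Python sketch of symmetric, group-wise round-to-nearest quantization. It is not AngelSlim's implementation, and it is plain round-to-nearest rather than GPTQ or AWQ, which additionally use calibration data to reduce the rounding error. The function names, the group size of 4, and the sample weights are all hypothetical.

```python
def quantize_int4_group(weights, group_size=4):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto 7, the largest positive INT4 code.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dequantize_int4_group(codes, scales, group_size=4):
    """Recover approximate float weights from integer codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weights: two groups with very different magnitudes.
weights = [0.7, -0.26, 0.12, 0.0, 2.8, -1.3, 0.5, -2.8]
codes, scales = quantize_int4_group(weights)
restored = dequantize_int4_group(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scales, max_error)
```

Using one scale per group keeps the small-magnitude group from being flattened by the large-magnitude one, which is the usual motivation for group-wise schemes in 4-bit weight quantization.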
| Method | Precision | Best For |
|---|---|---|
| FP8-Static/Dynamic | 8-bit | General LLM deployment |
| INT4 GPTQ/AWQ/GPTAQ | 4-bit | Memory-constrained serving |
| NVFP4 | 4-bit (NVIDIA) | Blackwell GPUs |
| Sherry | 1.25-bit | Extreme compression |
| STQ1_0 | 1.25-bit | On-device deployment |
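As a rough guide to what the precision column means for memory, the sketch below estimates weight storage from bits per weight. It is back-of-the-envelope arithmetic only: it ignores scales, zero points, and any layers left unquantized, so real checkpoints will be somewhat larger. The helper name and the 7-billion-parameter model size are hypothetical.

```python
def estimate_weight_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB, ignoring scales and unquantized layers."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 7B-parameter model at the precisions listed in the table.
precisions = [
    ("16-bit baseline", 16),
    ("FP8", 8),
    ("INT4 / NVFP4", 4),
    ("1.25-bit", 1.25),
]
for label, bits in precisions:
    print(f"{label:>16}: {estimate_weight_gib(7e9, bits):.2f} GiB")
```

Under these assumptions, each halving of the bit width halves the weight footprint, and 1.25-bit storage is 12.8 times smaller than a 16-bit baseline.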
# Train Eagle3 draft model
slim.eagle3.train(model, draft_model_config)
# Inference with Eagle3
output = model.generate_with_eagle3(input_ids, max_new_tokens=256)
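The general idea behind speculative decoding can be sketched with toy models: a cheap draft model proposes several tokens, and the target model verifies them, keeping the longest matching prefix. The code below is a greedy, simplified illustration of that draft-and-verify loop. It is not Eagle3's architecture or AngelSlim's implementation, it omits the extra token a real implementation gains when every draft is accepted, and all function names and the toy models are hypothetical.

```python
def greedy_generate(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    verify_rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks every proposed position. A real system scores all
        #    k positions in one batched forward pass, counted here as one round.
        verify_rounds += 1
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            tokens.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token, drop the rest
    return tokens[:len(prompt) + max_new_tokens], verify_rounds


# Toy models: the target counts upward modulo 10; the draft errs after a 5.
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: 0 if seq[-1] == 5 else (seq[-1] + 1) % 10

baseline = greedy_generate(target_next, [0], 8)
speculative, rounds = speculative_generate(target_next, draft_next, [0], 8)
print(baseline == speculative, rounds)
```

Because every kept token is the one the target itself would have chosen, the output matches plain greedy decoding; the saving comes from needing fewer verification rounds than generated tokens when the draft is usually right.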