Run any Skill in Manus with one click

awq-quantization

Stars9,996

Forks745

UpdatedNovember 25, 2025 at 21:59

Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

Orchestra-Research

Orchestra-Research/AI-Research-SKILLs

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Computer and Information Research ScientistsComputer and Mathematical Occupations·SOC 15-1221

Software DevelopersSOC 15-1252

File Explorer

3 files

SKILL.md

readonly

More from this repository

same repository

model-merging

Orchestra-Research/AI-Research-SKILLs

Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.

2026-06-1610.0k

ara-compiler

Orchestra-Research/AI-Research-SKILLs

Compiles any research input — PDF papers, GitHub repositories, experiment logs, code directories, or raw notes — into a complete Agent-Native Research Artifact (ARA) with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph, and grounded evidence. Use when ingesting a paper or codebase into a structured, machine-executable knowledge package, building an ARA from scratch, or converting research outputs into a falsifiable, agent-traversable form.

2026-04-2810.0k

ara-research-manager

Orchestra-Research/AI-Research-SKILLs

Records research provenance as a post-task epilogue, scanning conversation history at the end of a coding or research session to extract decisions, experiments, dead ends, claims, heuristics, and pivots, and writing them into the ara/ directory with user-vs-AI provenance tags. Use as a session epilogue — never during execution — to maintain a faithful, auditable trace of how a research project actually evolved.

2026-04-2810.0k

ara-rigor-reviewer

Orchestra-Research/AI-Research-SKILLs

Performs ARA Seal Level 2 semantic epistemic review on Agent-Native Research Artifacts, scoring six dimensions (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and producing a constructive, severity-ranked report with a Strong Accept-to-Reject recommendation. Use after Level 1 structural validation passes, when an ARA needs an objective epistemic critique before publication or release.

2026-04-2810.0k

ml-paper-writing

Orchestra-Research/AI-Research-SKILLs

Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. For systems venues (OSDI, NSDI, ASPLOS, SOSP), use systems-paper-writing instead.

2026-04-1010.0k

presenting-conference-talks

Orchestra-Research/AI-Research-SKILLs

Generates conference presentation slides (Beamer LaTeX PDF and editable PPTX) from a compiled paper with speaker notes and talk script. Use when preparing oral talks, spotlight presentations, or invited talks for ML and systems conferences.

2026-04-1010.0k

name	awq-quantization
description	Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Optimization","AWQ","Quantization","4-Bit","Activation-Aware","Memory Optimization","Fast Inference","vLLM Integration","Marlin Kernels"]
dependencies	["autoawq","transformers>=4.45.0","torch>=2.0.0"]

AWQ (Activation-aware Weight Quantization)

4-bit quantization that preserves salient weights based on activation patterns, achieving 3x speedup with minimal accuracy loss.

When to use AWQ

Use AWQ when:

Need 4-bit quantization with <5% accuracy loss
Deploying instruction-tuned or chat models (AWQ generalizes better)
Want ~2.5-3x inference speedup over FP16
Using vLLM for production serving
Have Ampere+ GPUs (A100, H100, RTX 40xx) for Marlin kernel support

Use GPTQ instead when:

Need maximum ecosystem compatibility (more tools support GPTQ)
Working with ExLlamaV2 backend specifically
Have older GPUs without Marlin support

Use bitsandbytes instead when:

Need zero calibration overhead (quantize on-the-fly)
Want to fine-tune with QLoRA
Prefer simpler integration

Quick start

Installation

# Default (Triton kernels)
pip install autoawq

# With optimized CUDA kernels + Flash Attention
pip install autoawq[kernels]

# Intel CPU/XPU optimization
pip install autoawq[cpu]

Requirements: Python 3.8+, CUDA 11.8+, Compute Capability 7.5+

Load pre-quantized model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"

model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True  # Enable fused attention for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate
inputs = tokenizer("Explain quantum computing", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quantize your own model

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,      # Use zero-point quantization
    "q_group_size": 128,     # Group size (128 recommended)
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM"        # GEMM for batch, GEMV for single-token
}

# Quantize (uses pileval dataset by default)
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized("mistral-7b-awq")
tokenizer.save_pretrained("mistral-7b-awq")

Timing: ~10-15 min for 7B, ~1 hour for 70B models.

AWQ vs GPTQ vs bitsandbytes

Feature	AWQ	GPTQ	bitsandbytes
Speedup (4-bit)	~2.5-3x	~2x	~1.5x
Accuracy loss	<5%	~5-10%	~5-15%
Calibration	Minimal (128-1K tokens)	More extensive	None
Overfitting risk	Low	Higher	N/A
Best for	Production inference	GPU inference	Easy integration
vLLM support	Native	Yes	Limited

Key insight: AWQ assumes not all weights are equally important. It protects ~1% of salient weights identified by activation patterns, reducing quantization error without mixed-precision overhead.

Kernel backends

GEMM (default, batch inference)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"  # Best for batch sizes > 1
}

GEMV (single-token generation)

quant_config = {
    "version": "GEMV"  # 20% faster for batch_size=1
}

Limitation: Only batch size 1, not good for large context.

Marlin (Ampere+ GPUs)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    version="marlin"  # 2x faster on A100/H100
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-AWQ",
    quantization_config=config
)

Requirements: Compute Capability 8.0+ (A100, H100, RTX 40xx)

ExLlamaV2 (AMD compatible)

config = AwqConfig(
    bits=4,
    version="exllama"  # Faster prefill, AMD GPU support
)

HuggingFace Transformers integration

Direct loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ")

Fused modules (recommended)

from transformers import AwqConfig, AutoModelForCausalLM

config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,  # Max sequence length for fusing
    do_fuse=True           # Enable fused attention/MLP
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=config
)

Note: Fused modules cannot combine with FlashAttention2.

vLLM integration

from vllm import LLM, SamplingParams

# vLLM auto-detects AWQ models
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling = SamplingParams(temperature=0.7, max_tokens=200)
outputs = llm.generate(["Explain AI"], sampling)

Performance benchmarks

Memory reduction

Model	FP16	AWQ 4-bit	Reduction
Mistral 7B	14 GB	5.5 GB	2.5x
Llama 2-13B	26 GB	10 GB	2.6x
Llama 2-70B	140 GB	35 GB	4x

Inference speed (RTX 4090)

Model	Prefill (tok/s)	Decode (tok/s)	Memory
Mistral 7B GEMM	3,897	114	5.55 GB
TinyLlama 1B GEMV	5,179	431	2.10 GB
Llama 2-13B GEMM	2,279	74	10.28 GB

Accuracy (perplexity)

Model	FP16	AWQ 4-bit	Degradation
Llama 3 8B	8.20	8.48	+3.4%
Mistral 7B	5.25	5.42	+3.2%
Qwen2 72B	4.85	4.95	+2.1%

Custom calibration data

# Use custom dataset for domain-specific models
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="wikitext",       # Or custom list of strings
    max_calib_samples=256,       # More samples = better accuracy
    max_calib_seq_len=512        # Sequence length
)

# Or provide your own samples
calib_samples = [
    "Your domain-specific text here...",
    "More examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)

Multi-GPU deployment

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-AWQ",
    device_map="auto",  # Auto-split across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)

Supported models

35+ architectures including:

Llama family: Llama 2/3, Code Llama, Mistral, Mixtral
Qwen: Qwen, Qwen2, Qwen2.5-VL
Others: Falcon, MPT, Phi, Yi, DeepSeek, Gemma
Multimodal: LLaVA, LLaVA-Next, Qwen2-VL

Common issues

CUDA OOM during quantization:

# Reduce batch size
model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=64)

Slow inference:

# Enable fused layers
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)

AMD GPU support:

# Use ExLlama backend
config = AwqConfig(bits=4, version="exllama")

Deprecation notice

AutoAWQ is officially deprecated. For new projects, consider:

vLLM llm-compressor: https://github.com/vllm-project/llm-compressor
MLX-LM: For Mac devices with Apple Silicon

Existing quantized models remain usable.

References

Paper: AWQ: Activation-aware Weight Quantization (arXiv:2306.00978) - MLSys 2024 Best Paper
GitHub: https://github.com/casper-hansen/AutoAWQ
MIT Han Lab: https://github.com/mit-han-lab/llm-awq
Models: https://huggingface.co/models?library=awq