| name | hqq-quantization |
| description | Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| tags | ["Quantization","HQQ","Optimization","Memory Efficiency","Inference","Model Compression"] |
| dependencies | ["hqq>=0.2.0","torch>=2.0.0"] |
HQQ - Half-Quadratic Quantization
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
When to use HQQ
Use HQQ when:
- Quantizing models without calibration data (no dataset needed)
- Needing fast quantization (minutes rather than hours compared with GPTQ/AWQ)
- Deploying with vLLM or HuggingFace Transformers
- Fine-tuning quantized models with LoRA/PEFT
- Experimenting with extreme quantization (2-bit, 1-bit)
Key advantages:
- No calibration: Quantize any model instantly without sample data
- Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
- Flexible precision: 8/4/3/2/1-bit with configurable group sizes
- Framework integration: Native HuggingFace and vLLM support
- PEFT compatible: Fine-tune quantized models with LoRA
Use alternatives instead:
- AWQ: when you need calibration-based accuracy for production serving
- GPTQ: when maximum accuracy matters and calibration data is available
- bitsandbytes: for simple 8-bit/4-bit loading without custom backends
- llama.cpp/GGUF: for CPU inference or Apple Silicon deployment
Quick start
Installation
pip install hqq
pip install hqq[torch]
pip install hqq[torchao]
pip install hqq[bitblas]
pip install hqq[marlin]
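A quick post-install sanity check (a minimal sketch; the optimized backends require a CUDA GPU):
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Confirms the core API imports and that a GPU is visible for the optimized backends
print(torch.__version__, torch.cuda.is_available())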
Basic quantization
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch
import torch.nn as nn

# 4-bit quantization with per-group scales/zeros over groups of 64 weights
config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Wrap a regular nn.Linear; HQQ quantizes the weights on construction
linear = nn.Linear(4096, 4096)
hqq_linear = HQQLinear(linear, config)

# Drop-in replacement for the original layer (fp16 on CUDA by default)
input_tensor = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
output = hqq_linear(input_tensor)
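To see the effect on memory, compare the fp16 storage of the original 4096x4096 weight with the packed 4-bit tensor; a rough sketch (exact savings depend on group size, and the scale/zero metadata is not counted here):
# fp16 storage of a 4096x4096 weight: 4096 * 4096 * 2 bytes ≈ 33.6 MB
orig_mb = 4096 * 4096 * 2 / 1e6
# Packed 4-bit weights are roughly a quarter of that
quant_mb = hqq_linear.W_q.numel() * hqq_linear.W_q.element_size() / 1e6
print(f"fp16 weights: {orig_mb:.1f} MB, packed 4-bit weights: {quant_mb:.1f} MB")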
Quantize full model with HuggingFace
from transformers import AutoModelForCausalLM, HqqConfig
quantization_config = HqqConfig(
nbits=4,
group_size=64,
axis=1
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quantization_config,
device_map="auto"
)
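A quick way to confirm the compression after loading is the standard get_memory_footprint method on transformers models:
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")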
Core concepts
Quantization configuration
HQQ uses BaseQuantizeConfig to define quantization parameters:
from hqq.core.quantize import BaseQuantizeConfig

# Standard 4-bit config: good quality/size tradeoff for most models
config_4bit = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    axis=1
)

# Aggressive 2-bit config: smaller groups help preserve accuracy
config_2bit = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    axis=1
)

# Per-layer configs: keep attention and down_proj at 4-bit, compress gate/up projections to 2-bit
layer_configs = {
    "self_attn.q_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.k_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "self_attn.v_proj": BaseQuantizeConfig(nbits=4, group_size=64),
    "mlp.gate_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.up_proj": BaseQuantizeConfig(nbits=2, group_size=32),
    "mlp.down_proj": BaseQuantizeConfig(nbits=4, group_size=64),
}
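One way to apply such a per-layer mapping in a single call is HQQ's model-level API; a minimal sketch assuming the AutoHQQHFModel.quantize_model entry point and a Llama model already loaded in fp16:
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.float16
)
# Each linear layer picks up the config matching its name tag
AutoHQQHFModel.quantize_model(
    model,
    quant_config=layer_configs,
    compute_dtype=torch.float16,
    device="cuda"
)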
HQQLinear layer
The core quantized layer that replaces nn.Linear:
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear
import torch

config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
linear = torch.nn.Linear(4096, 4096)
hqq_layer = HQQLinear(linear, config)

# Packed quantized weights and per-group quantization metadata
W_q = hqq_layer.W_q
scale = hqq_layer.meta["scale"]
zero = hqq_layer.meta["zero"]

# Reconstruct an fp16 approximation of the original weight matrix
W_dequant = hqq_layer.dequantize()
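Dequantization makes it easy to measure reconstruction error; the sketch below snapshots the original weights first, since HQQ may free them when it quantizes the layer (an assumption about the default behavior):
import torch
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(4096, 4096)
W_ref = linear.weight.data.clone()  # snapshot before quantization
hqq_layer = HQQLinear(linear, BaseQuantizeConfig(nbits=4, group_size=64, axis=1))

# Mean absolute error of the 4-bit reconstruction; expect it to grow at 3/2-bit
W_dequant = hqq_layer.dequantize().reshape(W_ref.shape).float().cpu()
print("mean abs error:", (W_ref - W_dequant).abs().mean().item())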
Backends
HQQ supports multiple inference backends for different hardware:
from hqq.core.quantize import HQQLinear, HQQBackend

# Native backends are selected globally through the HQQBackend enum
HQQLinear.set_backend(HQQBackend.PYTORCH)          # pure PyTorch, maximum compatibility
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # torch.compile-accelerated
HQQLinear.set_backend(HQQBackend.ATEN)             # fused CUDA kernels

# Optimized external kernels (torchao_int4, gemlite, bitblas, marlin) are
# applied to an already-quantized model via the patching utility
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4")
prepare_for_inference(model, backend="marlin")
Backend selection guide:
| Backend | Best For | Requirements |
|---|---|---|
| pytorch | Compatibility | Any GPU |
| pytorch_compile | Moderate speedup | torch>=2.0 |
| aten | Good balance | CUDA GPU |
| torchao_int4 | 4-bit inference | torchao installed |
| marlin | Maximum 4-bit speed | Ampere+ GPU |
| bitblas | Flexible bit-widths | bitblas installed |
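As a rough rule of thumb, the backend can be chosen from the GPU's compute capability; the helper below is only a sketch (names follow the table above, and each optimized backend still needs its optional package installed):
import torch

def pick_backend(nbits: int) -> str:
    # Rough heuristic only; adjust to what is actually installed on the machine
    if not torch.cuda.is_available():
        return "pytorch"
    major, _ = torch.cuda.get_device_capability()
    if nbits == 4 and major >= 8:   # Ampere or newer
        return "marlin"
    if nbits == 4:
        return "torchao_int4"
    return "bitblas"                # flexible bit-widths for 3/2-bit

print(pick_backend(4))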
HuggingFace integration
Load pre-quantized models
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
Quantize and save
from transformers import AutoModelForCausalLM, HqqConfig
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
model.save_pretrained("./llama-8b-hqq-4bit")
model.push_to_hub("my-org/Llama-3.1-8B-HQQ-4bit")
Mixed precision quantization
from transformers import AutoModelForCausalLM, HqqConfig

# Per-layer settings go in dynamic_config: keys are layer-name tags
# (e.g. "self_attn.q_proj"), values are HQQ parameter dicts
q4 = {"nbits": 4, "group_size": 64}
q2 = {"nbits": 2, "group_size": 32}
config = HqqConfig(
    dynamic_config={
        "self_attn.q_proj": q4,
        "self_attn.k_proj": q4,
        "self_attn.v_proj": q4,
        "self_attn.o_proj": q4,
        "mlp.gate_proj": q2,
        "mlp.up_proj": q2,
        "mlp.down_proj": q4,
    }
)
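The mixed-precision config is then passed to from_pretrained exactly like a uniform one:
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)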
vLLM integration
Serve HQQ models with vLLM
from vllm import LLM, SamplingParams
llm = LLM(
model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
quantization="hqq",
dtype="float16"
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["What is machine learning?"], sampling_params)
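Each element of outputs is a RequestOutput; the generated text can be read off directly:
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)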
Quantize with HQQ, then serve with vLLM
from transformers import AutoModelForCausalLM, HqqConfig
from vllm import LLM

# Quantize with a custom HQQ config in transformers, save the checkpoint,
# then point vLLM at the saved directory
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
model.save_pretrained("./llama-8b-hqq-4bit")

llm = LLM(model="./llama-8b-hqq-4bit", quantization="hqq", dtype="float16")
PEFT/LoRA fine-tuning
Fine-tune quantized models
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model
quant_config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=quant_config,
device_map="auto"
)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
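Only the LoRA adapters are trainable; the quantized base weights stay frozen. PEFT reports the split:
# Prints the count of trainable adapter parameters vs. total parameters
model.print_trainable_parameters()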
QLoRA-style training
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir="./hqq-lora-output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
fp16=True,
logging_steps=10,
save_strategy="epoch"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator
)
trainer.train()
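After training, save only the adapter; it can be re-attached to a freshly quantized base model later (paths here are illustrative):
model.save_pretrained("./hqq-lora-adapter")  # adapter weights only

# Later: rebuild the quantized base model and attach the adapter
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./hqq-lora-adapter")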
Quantization workflows
Workflow 1: Quick model compression
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
model.save_pretrained("./llama-8b-hqq")
tokenizer.save_pretrained("./llama-8b-hqq")
Workflow 2: Optimize for inference speed
import time
import torch
from hqq.utils.patching import prepare_for_inference
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Patch the quantized layers to use the Marlin 4-bit kernels, then compile
prepare_for_inference(model, backend="marlin")
model = torch.compile(model)

# Simple throughput check
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
start = time.time()
for _ in range(10):
    model.generate(**inputs, max_new_tokens=100)
print(f"Avg time: {(time.time() - start) / 10:.2f}s")
Best practices
- Start with 4-bit: Best quality/size tradeoff for most models
- Use group_size=64: Good balance; smaller for extreme quantization
- Choose backend wisely: Marlin for 4-bit Ampere+, TorchAO for flexibility
- Verify quality: Always test generation quality after quantization (see the perplexity sketch after this list)
- Mixed precision: Keep attention at higher precision, compress MLP more
- PEFT training: Use LoRA r=16-32 for good fine-tuning results
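A rough way to verify quality is a single-text perplexity check on the quantized model; a sketch, not a benchmark (model is the quantized model loaded above, and the sample text is arbitrary):
import torch
from transformers import AutoTokenizer

def quick_ppl(model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
sample = "Machine learning is the study of algorithms that improve through experience."
print("quantized perplexity:", quick_ppl(model, tokenizer, sample))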
Common issues
Out of memory during quantization:
from transformers import AutoModelForCausalLM, HqqConfig

# device_map="sequential" fills GPUs one at a time instead of spreading
# layers evenly, which lowers peak memory while quantizing
config = HqqConfig(nbits=4, group_size=64)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"
)
Slow inference:
import torch
from hqq.utils.patching import prepare_for_inference

# Switch the quantized layers to an optimized 4-bit kernel and compile the forward pass
prepare_for_inference(model, backend="marlin")
model = torch.compile(model, mode="reduce-overhead")
Poor quality at 2-bit:
from hqq.core.quantize import BaseQuantizeConfig

# Smaller groups give each scale/zero pair fewer weights to cover,
# which recovers accuracy at the cost of slightly more metadata
config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    axis=1
)