| name | llama-cpp |
| description | Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers CLI, Python bindings, OpenAI-compatible server, and Ollama/LM Studio integration. Use for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization. |
| version | 2.0.0 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["llama-cpp-python>=0.2.0"] |
| metadata | {"hermes":{"tags":["llama.cpp","GGUF","Quantization","CPU Inference","Apple Silicon","Edge Deployment","Non-NVIDIA","AMD GPUs","Intel GPUs","Embedded","Model Compression"]}} |
llama.cpp + GGUF
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
When to use
Use llama.cpp + GGUF when:
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
- Need flexible quantization (2–8 bit with K-quants)
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Want a single binary deploy without Docker/Python
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
Use alternatives instead:
- vLLM — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- TensorRT-LLM — Production NVIDIA (A100/H100), maximum speed
- AWQ/GPTQ — Calibrated quantization for NVIDIA-only deployments
- bitsandbytes — Simple HuggingFace transformers integration
- HQQ — Fast calibration-free quantization
Quick start
Install
brew install llama.cpp                 # macOS, prebuilt via Homebrew
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make                                   # CPU-only build
make GGML_METAL=1                      # Apple Silicon (Metal)
make GGML_CUDA=1                       # NVIDIA (CUDA)
make GGML_HIPBLAS=1                    # AMD (ROCm)
pip install llama-cpp-python           # Python bindings
Download a pre-quantized GGUF
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
Or convert a HuggingFace model to GGUF
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b
python convert_hf_to_gguf.py ./llama-3.1-8b \
--outfile llama-3.1-8b-f16.gguf \
--outtype f16
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
Run inference
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256
./llama-cli -m model.Q4_K_M.gguf --interactive
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
Serve an OpenAI-compatible API
./llama-server \
-m model.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35 \
-c 4096 \
--parallel 4 \
--cont-batching
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
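The same request can be built from Python with only the standard library. The helper below is an illustrative sketch (not part of llama.cpp), assuming the server above is listening on localhost:8080:

```python
import json
import urllib.request

def build_chat_request(base_url, messages, temperature=0.7, max_tokens=100):
    """Build an OpenAI-style chat completion request for llama-server."""
    body = json.dumps({
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080",
                         [{"role": "user", "content": "Hello!"}])
# Send it once the server is up:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```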
Quantization formats (GGUF)
K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
Variant suffixes — _S (Small, faster, lower quality), _M (Medium, balanced), _L (Large, better quality).
Legacy formats (Q4_0/Q4_1/Q5_0/Q5_1) still exist, but prefer K-quants for their better quality-to-size ratio.
IQ quantization — ultra-low-bit with importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require an importance matrix (--imatrix) for usable quality.
Task-specific defaults:
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S
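The sizes in the table follow roughly from bits-per-weight. A back-of-the-envelope estimator (a sketch only — real GGUF files add metadata and keep some tensors such as embeddings at higher precision, so actual files run somewhat larger):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough quantized model size: parameters x bits / 8, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# 7B model at Q4_K_M (~4.5 bits/weight)
print(f"{gguf_size_gb(7e9, 4.5):.1f} GB")  # ~3.9 GB before metadata/overhead
```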
Conversion workflows
Basic: HF → GGUF → quantized
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50
With importance matrix (imatrix) — better low-bit quality
An imatrix typically yields a 10–20% perplexity improvement at Q4, and is essential at Q3 and below.
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
EOF
./llama-imatrix -m model-f16.gguf \
-f calibration.txt \
--chunk 512 \
-o model.imatrix \
-ngl 35
./llama-quantize --imatrix model.imatrix \
model-f16.gguf model-q4_k_m.gguf Q4_K_M
Multi-quant batch
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
Quality testing (perplexity)
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
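llama-perplexity reports the exponential of the average per-token negative log-likelihood. The relationship is easy to check with a didactic sketch (not the tool's implementation):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning every token probability 1/8 has perplexity 8:
print(perplexity([math.log(1 / 8)] * 4))  # ≈ 8
```

Lower is better; comparing the quantized model's perplexity against the FP16 baseline on the same text is the standard quality check.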
Python bindings (llama-cpp-python)
Basic generation
from llama_cpp import Llama
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
n_threads=8,
)
output = llm(
"What is machine learning?",
max_tokens=256,
temperature=0.7,
stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
Chat completion + streaming
llm = Llama(
model_path="./model-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=35,
chat_format="llama-3",
)
response = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"},
],
max_tokens=256,
temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
print(chunk["choices"][0]["text"], end="", flush=True)
Embeddings
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
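Embeddings come back as plain float lists, so similarity search needs nothing beyond a cosine function (stdlib sketch; `query_vec`/`doc_vecs` below are placeholder names):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Rank documents against a query embedding:
# scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```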
Hardware acceleration
Apple Silicon (Metal)
make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello"
llm = Llama(
model_path="model.gguf",
n_gpu_layers=99,
n_threads=1,
)
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
NVIDIA (CUDA)
make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
AMD (ROCm)
make clean && make GGML_HIPBLAS=1
./llama-cli -m model.gguf -ngl 999
CPU
./llama-cli -m model.gguf -t 8 -p "Hello"
make GGML_OPENBLAS=1
llm = Llama(
model_path="model.gguf",
n_gpu_layers=0,
n_threads=8,
n_batch=512,
)
Performance benchmarks
CPU (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed |
|---|---|---|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |
GPU offloading on RTX 4090
| Layers GPU | Speed | VRAM |
|---|---|---|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
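A simple way to pick -ngl is to divide the VRAM budget by an estimated per-layer cost (the table above implies roughly 0.3–0.35 GB per layer for a 7B Q4_K_M). The helper below is a hypothetical heuristic, not a llama.cpp API, and its constants are assumptions to tune per model:

```python
def pick_ngl(vram_gb, total_layers, gb_per_layer=0.34, reserve_gb=1.0):
    """Offload as many layers as fit, reserving VRAM for KV cache/overhead."""
    usable = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable / gb_per_layer))

print(pick_ngl(12.0, 35))  # 32 — start here, adjust on OOM
```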
Supported models
- LLaMA family: Llama 2 (7B/13B/70B), Llama 3/3.1 (8B/70B/405B), Code Llama
- Mistral family: Mistral 7B, Mixtral 8x7B/8x22B
- Other: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
Find GGUF models: https://huggingface.co/models?library=gguf
Ecosystem integrations
Ollama
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
LM Studio
- Place the GGUF file in ~/.cache/lm-studio/models/
- Open LM Studio and select the model
- Configure context length and GPU offload, then start inference
text-generation-webui
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
OpenAI client → llama-server
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local-model",
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=256,
)
print(response.choices[0].message.content)
Best practices
- Use K-quants — Q4_K_M is the recommended default
- Use imatrix for Q4 and below (calibration improves quality substantially)
- Offload as many layers as VRAM allows — start high, reduce by 5 on OOM
- Thread count — match physical cores, not logical
- Batch size — increase n_batch (e.g. 512) for faster prompt processing
- Context — start at 4096, grow only as needed (memory scales with ctx)
- Flash Attention — add --flash-attn if your build supports it
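Context memory is dominated by the KV cache, which the standard formula estimates as 2 tensors x layers x context x KV heads x head dim x bytes per element. A quick sketch, using Llama-2-7B's architecture (32 layers, 32 KV heads, head dim 128) and an f16 cache:

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elt=2):
    """KV cache size: K and V tensors per layer, per context position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

gb = kv_cache_bytes(32, 4096, 32, 128) / 1e9
print(f"{gb:.1f} GB")  # ~2.1 GB at 4096 context, f16
```

Doubling the context doubles this figure, which is why growing -c only as needed (or quantizing the KV cache) matters on tight VRAM budgets.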
Common issues (quick fixes)
Model loads slowly — memory-mapped loading (mmap) is enabled by default; make sure --no-mmap isn't set and keep the model on fast local storage rather than a network mount.
Out of memory (GPU) — reduce -ngl, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)
Garbage output — wrong chat_format, temperature too high, or model file corrupted. Test with temperature=0.1 and verify FP16 baseline works.
Connection refused (server) — bind to --host 0.0.0.0, check lsof -i :8080.
See references/troubleshooting.md for the full playbook.
References
- advanced-usage.md — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- quantization.md — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- server.md — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- optimization.md — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- troubleshooting.md — install/convert/quantize/inference/server issues, Apple Silicon, debugging
Resources