| name | llama-cpp |
| description | llama.cpp local GGUF inference + HF Hub model discovery. |
| version | 2.1.2 |
| author | Orchestra Research |
| license | MIT |
| dependencies | ["llama-cpp-python>=0.2.0"] |
| platforms | ["linux","macos","windows"] |
| metadata | {"hermes":{"tags":["llama.cpp","GGUF","Quantization","Hugging Face Hub","CPU Inference","Apple Silicon","Edge Deployment","AMD GPUs","Intel GPUs","NVIDIA","URL-first"]}} |
Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.
- Build a llama-server or llama-cli command from the Hub
- List .gguf files and sizes for a repo

Prefer URL workflows before asking for hf, Python, or custom scripts.

- Start from https://huggingface.co/models?apps=llama.cpp&sort=trending
- Add search=<term> for a model family
- Add num_parameters=min:0,max:24B or similar when the user has size constraints
- Open https://huggingface.co/<repo>?local-app=llama.cpp
- Copy the llama-server or llama-cli command shown there
- Otherwise, fetch the ?local-app=llama.cpp URL as page text or HTML and extract the section under Hardware compatibility:
- Keep repo-specific quant labels such as UD-Q4_K_M or IQ4_NL_XL
- Query https://huggingface.co/api/models/<repo>/tree/main?recursive=true
- Keep entries where type is file and path ends with .gguf
- Use path and size as the source of truth for filenames and byte sizes
- Treat mmproj-*.gguf projector files and BF16/ shard files separately
- Use https://huggingface.co/<repo>/tree/main only as a human fallback
- By quant label: llama-server -hf <repo>:<QUANT>
- By exact file: llama-server --hf-repo <repo> --hf-file <filename.gguf>

# macOS / Linux (simplest)
brew install llama.cpp
winget install llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
Use this when the tree API shows custom file naming or the exact HF snippet is missing.
llama-server \
  --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
  --hf-file Phi-3-mini-4k-instruct-q4.gguf \
  -c 4096
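The two command shapes above can also be assembled programmatically. The following is a minimal sketch, not part of llama.cpp: `llama_server_command` is a hypothetical helper that only uses the flags already shown in this document (`-hf`, `--hf-repo`, `--hf-file`, `-c`).

```python
import shlex


def llama_server_command(repo, quant=None, filename=None, ctx=None):
    """Return a llama-server argv list for a Hub repo (hypothetical helper).

    Uses the `-hf <repo>:<QUANT>` form when a quant label is given,
    otherwise the `--hf-repo` / `--hf-file` form with an exact filename.
    """
    if quant and filename:
        raise ValueError("pass either quant or filename, not both")
    if quant:
        argv = ["llama-server", "-hf", f"{repo}:{quant}"]
    elif filename:
        argv = ["llama-server", "--hf-repo", repo, "--hf-file", filename]
    else:
        raise ValueError("need a quant label or an exact filename")
    if ctx is not None:
        argv += ["-c", str(ctx)]  # context size, as in the example above
    return argv


cmd = llama_server_command(
    "microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    ctx=4096,
)
print(shlex.join(cmd))
```

Returning an argv list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting; `shlex.join` is only used for display.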
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'
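The same request can be sent from Python with only the standard library. This is a sketch under two assumptions: the server is listening on `localhost:8080` as in the curl call, and the reply follows the OpenAI-style shape with `choices[0].message.content`. `build_chat_request` and `chat` are illustrative names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080"):
    """Build the POST request that mirrors the curl call above."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8080"):
    """Send the request and return the assistant text.

    Assumes an OpenAI-style response body; requires a running llama-server.
    """
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `print(chat("Write a limerick about Python exceptions"))` would print the generated reply.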
pip install llama-cpp-python (CUDA: CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir; Metal: CMAKE_ARGS="-DGGML_METAL=on" ...).
from llama_cpp import Llama
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # 0 for CPU, 99 to offload everything
    n_threads=8,
)
out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # or "chatml", "mistral", etc.
)
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
# Streaming: each chunk carries the next piece of generated text
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
You can also load a GGUF straight from the Hub:
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_gpu_layers=35,
)
Use the Hub page first, generic heuristics second.
- Default to Q4_K_M.
- Move up to Q5_K_M or Q6_K if memory allows.
- Drop to Q3_K_M, IQ variants, or Q2 variants only if the user explicitly prioritizes fit over quality.
- List mmproj-*.gguf separately. The projector is not the main model file.
- If the Hub page shows UD-Q4_K_M, report UD-Q4_K_M.

When the user asks what GGUFs exist, return:
Ignore unless requested:
Use the tree API for this step:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true

For a repo like unsloth/Qwen3.6-35B-A3B-GGUF, the local-app page can show quant chips such as UD-Q4_K_M, UD-Q5_K_M, UD-Q6_K, and Q8_0, while the tree API exposes exact file paths such as Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and Qwen3.6-35B-A3B-Q8_0.gguf with byte sizes. Use the tree API to turn a quant label into an exact filename.
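The filtering step can be sketched in Python once the tree API response has been parsed into a list of dicts. The helper names `list_ggufs` and `files_for_quant` are hypothetical, the byte sizes in the sample are made up for illustration, and the label match assumes filenames end in `-<label>.gguf`, which a given repo may not follow.

```python
def list_ggufs(tree_entries, include_extras=False):
    """Return (path, size) pairs for GGUF files in a parsed tree API response.

    Projector files (mmproj-*.gguf) and BF16/ shards are skipped unless
    include_extras is True.
    """
    ggufs = []
    for entry in tree_entries:
        path = entry.get("path", "")
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue
        name = path.rsplit("/", 1)[-1]
        is_extra = name.startswith("mmproj-") or path.startswith("BF16/")
        if is_extra and not include_extras:
            continue
        ggufs.append((path, entry.get("size")))
    return ggufs


def files_for_quant(tree_entries, label):
    """Return GGUF paths whose name ends with -<label>.gguf.

    Assumption: the repo names files <model>-<label>.gguf. A short label
    such as Q4_K_M also matches UD-Q4_K_M files, so every match is
    returned and the caller decides.
    """
    suffix = f"-{label}.gguf"
    return [path for path, _ in list_ggufs(tree_entries) if path.endswith(suffix)]


# Sample entries shaped like the tree API output; sizes are illustrative only.
sample = [
    {"type": "directory", "path": "BF16"},
    {"type": "file", "path": "README.md", "size": 1024},
    {"type": "file", "path": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf", "size": 111},
    {"type": "file", "path": "Qwen3.6-35B-A3B-Q8_0.gguf", "size": 222},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 33},
]
print(list_ggufs(sample))
print(files_for_quant(sample, "UD-Q4_K_M"))
```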
Use these URL shapes directly:
https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
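For scripted use, the URL shapes above can be generated with the standard library. This is a small sketch; the function names are illustrative, and `safe=":,"` is passed so that the `min:0,max:24B` filter keeps the same literal form as the URLs listed above.

```python
from urllib.parse import urlencode

HUB = "https://huggingface.co"


def search_url(term=None, max_params=None):
    """Build a Hub search URL restricted to llama.cpp-compatible models."""
    params = {}
    if term:
        params["search"] = term
    params["apps"] = "llama.cpp"
    if max_params:
        # e.g. max_params="24B" -> num_parameters=min:0,max:24B
        params["num_parameters"] = f"min:0,max:{max_params}"
    params["sort"] = "trending"
    return f"{HUB}/models?" + urlencode(params, safe=":,")


def local_app_url(repo):
    """URL of the repo page with the llama.cpp local-app view."""
    return f"{HUB}/{repo}?local-app=llama.cpp"


def tree_api_url(repo):
    """URL of the recursive file listing for the repo's main branch."""
    return f"{HUB}/api/models/{repo}/tree/main?recursive=true"


print(search_url("qwen", max_params="24B"))
print(local_app_url("bartowski/Llama-3.2-3B-Instruct-GGUF"))
```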
When answering discovery requests, prefer a compact structured result like:
Repo: <repo>
Recommended quant from HF: <label> (<size>)
llama-server: <command>
Other GGUFs:
- <filename> - <size>
- <filename> - <size>
Source URLs:
- <local-app URL>
- <tree API URL>
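If the result is produced by a script, the template above maps onto a short formatter. This is a sketch with a hypothetical function name and argument layout; sizes are passed in as already-formatted strings.

```python
def format_discovery(repo, label, size, command, others, sources):
    """Render a discovery result in the compact layout shown above.

    others  -- list of (filename, size) pairs for the remaining GGUFs
    sources -- list of URLs that were consulted
    """
    lines = [
        f"Repo: {repo}",
        f"Recommended quant from HF: {label} ({size})",
        f"llama-server: {command}",
        "Other GGUFs:",
    ]
    lines += [f"- {name} - {fsize}" for name, fsize in others]
    lines.append("Source URLs:")
    lines += [f"- {url}" for url in sources]
    return "\n".join(lines)


report = format_discovery(
    repo="<repo>",
    label="<label>",
    size="<size>",
    command="llama-server -hf <repo>:<label>",
    others=[("<filename>", "<size>")],
    sources=["<local-app URL>", "<tree API URL>"],
)
print(report)
```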