name

qwen-mtp-gguf

description

Complete agent-ready workflow for Qwen-family MTP or nextn GGUF conversion and release. Use when Codex or another coding agent needs to inspect a user's machine, estimate disk/RAM requirements from Hugging Face model sizes and requested quant formats, bootstrap llama.cpp and Python dependencies, extract MTP heads from a matching official/base Qwen model, merge them into a fine-tuned or target safetensors model, run local HF/GGUF smoke tests with Qwen chat formatting, quantize to GGUF, and optionally upload or resume Hugging Face releases.

Qwen MTP GGUF

Operating Principle

Run this as a staged release pipeline, not a blind conversion:

Identify the target model and matching MTP source model.
Preflight the environment, disk, RAM, tools, token access, and config compatibility.
Ask the user to choose upload strategy when it affects disk use.
Prepare the target HF directory and inject only missing MTP/nextn tensors.
Convert to a temporary F16 GGUF, smoke-test locally, then quantize.
Upload with resume support, or keep local artifacts if the user chooses local-only.

MTP/GGUF conversion and llama.cpp quantization do not require a GPU. GPU acceleration can make later inference tests faster, but the default smoke test should use CPU mode (-ngl 0) so it works on common machines.

Required User Inputs

Ask for missing items only when they cannot be inferred safely:

Target model: HF repo ID or local HF model directory.
Matching MTP source model: the official/base Qwen-family repo with the same architecture, size, tokenizer, and MTP layout.
Output mode: local-only, stream upload, or batch upload.
Output repo ID when uploading.
Quantization formats. Default full matrix: q2_k,q3_k_s,q3_k_m,q3_k_l,iq4_xs,q4_k_s,q4_k_m,q5_k_s,q5_k_m,q6_k,q8_0,bf16.

If the target and MTP source configs disagree on core architecture fields, stop and ask before continuing.

Upload Strategy Choice

When the user has not specified a strategy, explain the tradeoff briefly and ask:

stream: quantize one GGUF, upload it, then delete it. Lowest peak disk use and best default for large models.
batch: quantize everything first, then upload. Useful when network is unstable but requires much more disk.
local-only: prepare all GGUF files locally without uploading.

Use stream when the user asks for a full large-model release and does not care about keeping local copies.

Environment Bootstrap

Use scripts/bootstrap_qwen_mtp_env.sh when llama.cpp or Python dependencies are missing.

bash scripts/bootstrap_qwen_mtp_env.sh --prefix ./qwen-mtp-env --backend cpu
source ./qwen-mtp-env/.venv/bin/activate

Backend options are cpu, cuda, metal, and vulkan. Prefer cpu unless the user explicitly wants accelerated smoke tests or already has a configured GPU toolchain.

Preflight

Always run preflight before downloading large model weights:

python3 scripts/qwen_mtp_gguf_pipeline.py \
  --source-repo owner/target-qwen-finetune \
  --mtp-source-repo owner/matching-base-qwen-with-mtp \
  --output-repo owner/target-qwen-mtp-gguf \
  --work-root ./mtp-gguf-work \
  --llama-cpp ./qwen-mtp-env/llama.cpp \
  --token-env HF_TOKEN \
  --upload-strategy stream \
  --preflight-only

Review preflight_report.md before running. It reports:

OS, CPU cores, total RAM, detected GPU hint, Python version.
Required commands and packages.
llama.cpp converter, quantizer, and llama-cli status.
Target model size from HF file metadata or local files.
Minimal MTP shard download size.
Estimated F16/BF16 and quantized GGUF sizes.
Required free disk with a safety factor.
Recommended RAM and config mismatch warnings.

Full Pipeline

python3 scripts/qwen_mtp_gguf_pipeline.py \
  --source-repo owner/target-qwen-finetune \
  --mtp-source-repo owner/matching-base-qwen-with-mtp \
  --output-repo owner/target-qwen-mtp-gguf \
  --work-root ./mtp-gguf-work \
  --llama-cpp ./qwen-mtp-env/llama.cpp \
  --filename-prefix target-qwen-MTP \
  --token-env HF_TOKEN \
  --upload-strategy stream \
  --private \
  --smoke-test-before-upload \
  --cleanup-after-upload

The pipeline:

Downloads or copies the target HF model.
Detects existing MTP/nextn tensors and validates index mappings.
Downloads only the MTP source shards listed in the source index.
Saves mtp_heads.safetensors, updates model.safetensors.index.json, and validates all new keys.
Converts F16 for quantization and BF16 directly from the prepared HF directory.
Runs GGUF smoke tests before upload when enabled. Prefer the model's embedded chat template when llama.cpp supports it; use ChatML only as a minimal fallback for load/generation sanity checks.
Lists remote HF files before upload and skips completed GGUFs on resume.
Deletes local GGUF files only after confirmed upload when cleanup is enabled.

Smoke Tests

Use the GGUF smoke test for release validation:

python3 scripts/qwen_gguf_smoke_test.py \
  --model ./mtp-gguf-work/target-qwen-MTP-GGUF/target-qwen-MTP-Q8_0.gguf \
  --llama-cli ./qwen-mtp-env/llama.cpp/build/bin/llama-cli \
  --prompt "State the capital of France in one short sentence." \
  --gpu-layers 0

Use the HF smoke test only when the machine can load the HF model:

python3 scripts/qwen_hf_smoke_test.py \
  --model ./mtp-gguf-work/target-qwen-MTP-HF \
  --prompt "Write one concise sentence about MTP inference."

For Qwen-family chat formatting, use apply_chat_template(..., add_generation_prompt=True, tokenize=False) on the HF side. For GGUF, prefer the chat template embedded by llama.cpp conversion or copied from tokenizer_config.json; use raw ChatML only as a fallback smoke test, not as the quality/reasoning benchmark template.

Agent Compatibility

This skill is Codex-native, but the workflow is intentionally agent-agnostic:

Claude Code, Codex, OpenCode, Qwen Code, Hermes, and similar agents can run the same scripts from a shell.
The agent should call --preflight-only, summarize blockers, ask for the upload strategy if needed, then run the full command.
Keep user-specific tokens, paths, private repo names, and logs out of public PRs and model cards.

References

Read references/environment-and-sizing.md before changing preflight logic or resource thresholds.
Read references/technical-flow.md before changing extraction, injection, conversion, or upload behavior.
Read references/agent-integration.md when packaging this for another agent framework.
Read references/troubleshooting.md when conversion, tensor lookup, disk, RAM, or upload steps fail.