| name | model-merging |
| description | Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies. |
| version | 1.0.0 |
| author | Orchestra Research |
| license | MIT |
| tags | ["Emerging Techniques","Model Merging","Mergekit","SLERP","TIES","DARE","Task Arithmetic","Model Fusion","No Retraining","Multi-Capability","Arcee AI"] |
| dependencies | ["mergekit","transformers","torch"] |
Model Merging: Combining Pre-trained Models
When to Use This Skill
Use Model Merging when you need to:
- Combine capabilities from multiple fine-tuned models without retraining
- Create specialized models by blending domain-specific expertise (math + coding + chat)
- Improve performance beyond single models (often +5-10% on benchmarks)
- Reduce training costs - no GPU required; merges can run on CPU (a GPU only speeds them up)
- Experiment rapidly - create new model variants in minutes, not days
- Preserve multiple skills - merge without catastrophic forgetting
Success Stories: Marcoro14-7B-slerp (best on the Open LLM Leaderboard in early January 2024); many top HuggingFace models use merging
Tools: mergekit (Arcee AI), LazyMergekit, Model Soup
Installation
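# Option 1: install from source
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .
# Option 2: install the released package from PyPI
pip install mergekit
# Libraries for loading and testing the merged model
pip install transformers torch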
Quick Start
Simple Linear Merge
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16
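# Run the merge (--cuda is optional; omit it to merge on CPU)
mergekit-yaml config.yml ./merged-model --cuda
# Sanity check: confirm the merged checkpoint loads
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('./merged-model')"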
SLERP Merge (Best for 2 Models)
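merge_method: slerp
# SLERP expects one of the two models to be named as base_model
base_model: mistralai/Mistral-7B-v0.1
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5
dtype: bfloat16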
Core Concepts
1. Merge Methods
Linear (Model Soup)
- Simple weighted average of parameters
- Fast, works well for similar models
- Can merge 2+ models, with weights that sum to 1 (w1 + w2 + ... = 1)
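The weighted average above can be sketched in a few lines. This is a toy illustration, not mergekit's implementation: `linear_merge` is a hypothetical helper, and plain Python lists stand in for real weight tensors.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize so the weights sum to 1
    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        merged[name] = [
            sum(w * t[i] for w, t in zip(norm, tensors))
            for i in range(len(tensors[0]))
        ]
    return merged


# Toy "checkpoints": one parameter tensor each (hypothetical values)
model_a = {"layer.weight": [1.0, 2.0, 3.0]}
model_b = {"layer.weight": [3.0, 2.0, 1.0]}

merged = linear_merge([model_a, model_b], weights=[0.5, 0.5])
print(merged)  # {'layer.weight': [2.0, 2.0, 2.0]}
```

Normalizing inside the helper means callers can pass unnormalized weights such as `[2, 1, 1]` and still get a proper average.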
SLERP (Spherical Linear Interpolation)
- Interpolates along sphere in weight space
- Preserves magnitude of weight vectors
- Best for merging 2 models
- Smoother than linear
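merged = (sin((1-t)*θ) / sin(θ)) * model1 + (sin(t*θ) / sin(θ)) * model2
where θ is the angle between the two weight vectors and t is in [0, 1] (t = 0 returns model1, t = 1 returns model2)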
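The SLERP formula can be checked numerically with a small sketch. It assumes each tensor is flattened to a list of floats and adds a fallback to linear interpolation when the vectors are nearly parallel (where sin(θ) approaches 0); the function name and `eps` threshold are illustrative choices, not mergekit's API.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherically interpolate between two flat weight vectors."""
    linear = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        return linear  # a zero vector has no direction
    # theta is the angle between the two vectors
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        return linear  # nearly parallel vectors: SLERP degenerates to linear
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


v0 = [1.0, 0.0]
v1 = [0.0, 1.0]
mid = slerp(0.5, v0, v1)
linear_mid = [0.5 * a + 0.5 * b for a, b in zip(v0, v1)]

print(math.hypot(*mid))         # close to 1.0: the norm is preserved
print(math.hypot(*linear_mid))  # about 0.707: linear averaging shrinks the norm
```

The two printed norms show the practical difference: between unit-length vectors, SLERP stays on the unit sphere while the linear midpoint falls inside it.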
Task Arithmetic
- Extract "task vectors" (fine-tuned - base)
- Combine task vectors, add to base
- Good for merging multiple specialized models (merged = base + α₁·tv₁ + α₂·tv₂)
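As a rough sketch of the task-vector arithmetic, with flat lists standing in for weight tensors (the helper names and the toy values are assumptions made for illustration):

```python
def task_vector(finetuned, base):
    """Difference between a fine-tuned model and the base it started from."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic(base, finetuned_models, alphas):
    """merged = base + sum(alpha_i * task_vector_i)."""
    merged = list(base)
    for model, alpha in zip(finetuned_models, alphas):
        tv = task_vector(model, base)
        merged = [m + alpha * d for m, d in zip(merged, tv)]
    return merged


base = [1.0, 1.0, 1.0]
math_model = [2.0, 1.0, 1.0]  # fine-tuning changed only the first parameter
code_model = [1.0, 1.0, 3.0]  # fine-tuning changed only the third parameter

merged = task_arithmetic(base, [math_model, code_model], alphas=[1.0, 0.5])
print(merged)  # [2.0, 1.0, 2.0]
```

Because each α scales a difference from the base rather than the full weights, the coefficients do not have to sum to 1.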
TIES-Merging
- Task arithmetic + sparsification
- Resolves sign conflicts in parameters
- Best for merging many task-specific models
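A simplified sketch of the three TIES steps (trim by magnitude, elect a sign per parameter, average only the values that agree with it) might look like the following. It works on flat lists, and the function names, the `scale` argument, and the toy values are illustrative assumptions rather than mergekit's code.

```python
def trim(tv, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(len(tv) * density))
    ranked = sorted(range(len(tv)), key=lambda i: abs(tv[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(tv)]


def ties_merge(base, finetuned_models, density=0.5, scale=1.0):
    """Trim each task vector, elect a sign per parameter, average the agreeing values."""
    trimmed = [
        trim([f - b for f, b in zip(model, base)], density)
        for model in finetuned_models
    ]
    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        sign = 1.0 if sum(column) >= 0 else -1.0        # elected sign
        agreeing = [v for v in column if v * sign > 0]  # drop conflicts and zeros
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.9, -0.1, 0.4, 0.0]
model_b = [-0.5, 0.8, 0.6, 0.05]

merged = ties_merge(base, [model_a, model_b], density=0.75)
print(merged)  # approximately [0.9, 0.8, 0.5, 0.0]
```

In the first two positions the models disagree in sign, so only the value matching the elected sign survives; in the third they agree and are averaged; the fourth is trimmed away entirely.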
DARE (Drop And REscale)
- Randomly drops fine-tuned parameters
- Rescales remaining parameters
- Reduces redundancy, maintains performance
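The drop-and-rescale step can be sketched as follows, assuming a flat task vector and a seeded `random.Random` for reproducibility (the `dare` helper and the toy vector are hypothetical). Dividing the survivors by `density` keeps the expected value of each entry unchanged, which is why performance tends to hold up despite the sparsity.

```python
import random


def dare(tv, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [v / density if rng.random() < density else 0.0 for v in tv]


rng = random.Random(0)
tv = [0.1] * 10_000  # toy task vector (fine-tuned minus base)
sparse = dare(tv, density=0.5, rng=rng)

kept = sum(1 for v in sparse if v != 0.0)
mean_before = sum(tv) / len(tv)
mean_after = sum(sparse) / len(sparse)

print(kept / len(tv))           # roughly 0.5 of the entries survive
print(mean_before, mean_after)  # the two means stay close
```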
2. Configuration Structure
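merge_method: <method>       # linear, slerp, task_arithmetic, ties, dare_ties, passthrough
base_model: <path>           # required by some methods
models:
  - model: <path/to/model1>
    parameters:
      weight: <float>
      density: <float>       # TIES/DARE only
  - model: <path/to/model2>
    parameters:
      weight: <float>
parameters:                  # method-specific parameters
dtype: <dtype>               # bfloat16, float16, or float32
slices:                      # optional: layer-wise merging
tokenizer:                   # optional: tokenizer configuration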
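Before launching a long merge, it can help to sanity-check the config once it has been parsed into a dict. The checker below is a hypothetical helper, not part of mergekit, and it encodes only a few rules stated in this document: the task-vector methods need a `base_model`, a config needs `models` or `slices`, and `density` is a fraction.

```python
TASK_VECTOR_METHODS = {"task_arithmetic", "ties", "dare_ties"}


def check_config(config):
    """Return a list of human-readable problems found in a merge config dict."""
    problems = []
    method = config.get("merge_method")
    if not method:
        problems.append("merge_method is missing")
    if method in TASK_VECTOR_METHODS and "base_model" not in config:
        problems.append(f"{method} needs a base_model to compute task vectors")
    if "models" not in config and "slices" not in config:
        problems.append("config needs either models or slices")
    for entry in config.get("models", []):
        density = entry.get("parameters", {}).get("density")
        if density is not None and not 0.0 < density <= 1.0:
            problems.append(f"{entry['model']}: density {density} is outside (0, 1]")
    return problems


# Deliberately broken config: no base_model, and one density above 1
config = {
    "merge_method": "ties",
    "models": [
        {"model": "model_a", "parameters": {"density": 0.5, "weight": 1.0}},
        {"model": "model_b", "parameters": {"density": 1.5, "weight": 1.0}},
    ],
    "dtype": "bfloat16",
}

problems = check_config(config)
for problem in problems:
    print(problem)
```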
Merge Methods Guide
Linear Merge
Best for: Simple model combinations, equal weighting
merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16
SLERP Merge
Best for: Two models, smooth interpolation
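merge_method: slerp
# SLERP expects one of the two models to be named as base_model
base_model: mistralai/Mistral-7B-v0.1
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5
dtype: bfloat16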
Layer-specific SLERP:
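merge_method: slerp
# SLERP expects one of the two models to be named as base_model
base_model: model_a
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn   # attention weights stay closer to model_a
      value: 0.3
    - filter: mlp         # MLP weights lean toward model_b
      value: 0.7
    - value: 0.5          # default for all other tensors
dtype: bfloat16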
Task Arithmetic
Best for: Combining specialized skills
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B
    parameters:
      weight: 0.2
dtype: bfloat16
TIES-Merging
Best for: Many models, resolving conflicts
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16
DARE Merge
Best for: Reducing redundancy
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true
dtype: bfloat16
Advanced Patterns
Layer-wise Merging
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]
dtype: bfloat16
MoE from Merged Models
# MoE configs are built with the separate mergekit-moe command, which uses gate_mode instead of merge_method
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16
Tokenizer Merging
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model
tokenizer:
  source: "union"
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"
Best Practices
1. Model Compatibility
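# Good: same architecture (both are Mistral-7B based)
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",
]
# Bad: different architectures (Llama vs. Mistral)
models = [
    "meta-llama/Llama-2-7b-hf",
    "mistralai/Mistral-7B-v0.1",
]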
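A quick pre-merge check can compare the architecture fields of each model's config. The sketch below works on plain dicts so it runs offline; the helper name, the chosen keys, and the example values are illustrative assumptions, and in practice the dicts would come from each model's `config.json`.

```python
SHAPE_KEYS = ("model_type", "hidden_size", "num_hidden_layers", "num_attention_heads")


def config_mismatches(config_a, config_b, keys=SHAPE_KEYS):
    """Return {key: (value_a, value_b)} for every key on which the configs differ."""
    return {
        key: (config_a.get(key), config_b.get(key))
        for key in keys
        if config_a.get(key) != config_b.get(key)
    }


# Illustrative values only; read the real ones from each model's config.json
base = {
    "model_type": "mistral",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
}
finetune = dict(base)                         # a fine-tune keeps the architecture
other_arch = dict(base, model_type="llama")   # same sizes, different architecture

print(config_mismatches(base, finetune))    # {}
print(config_mismatches(base, other_arch))  # {'model_type': ('mistral', 'llama')}
```

Matching shapes are necessary but not sufficient: two models with identical dimensions can still come from unrelated pre-training runs, in which case averaging their weights is unlikely to work.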
2. Weight Selection
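# Weights sum to 1.0: the usual choice for linear merges
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4
# Weights that do not sum to 1.0: acceptable for task-vector methods (task_arithmetic, TIES, DARE), where each weight scales a task vector
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8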
Unsupervised Coefficient Tuning (no labeled data needed)
Instead of manual search, use generation consistency: merge with several candidate coefficients, generate responses on a small unlabeled subset, and pick the coefficient whose outputs are most similar to those of its neighbors. Consistent outputs signal a stable, well-performing merge region (AdaMMS, arXiv:2503.23733).
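# merge_with_coefficient, generate_responses, and generation_consistency are helpers you supply
# (see references/coefficient-tuning.md); eval_prompts is a small unlabeled prompt set.
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
merged_paths, responses = {}, {}
for alpha in candidates:
    merged_paths[alpha] = merge_with_coefficient(alpha, model_a, model_b)
    responses[alpha] = generate_responses(merged_paths[alpha], eval_prompts)
# Pick the coefficient whose outputs agree most with those of neighboring candidates
best_alpha = max(candidates, key=lambda a: generation_consistency(a, responses))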
See references/coefficient-tuning.md for the full algorithm, similarity metrics, multi-coefficient search, and end-to-end pipeline.
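One possible shape for the `generation_consistency` scorer is sketched below. It is an assumption-laden stand-in: the character-level `difflib` ratio is a placeholder similarity metric (the referenced method may use a different one), "neighbors" means the adjacent candidates in sorted order, and the responses are made up for the demo.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a placeholder metric."""
    return SequenceMatcher(None, a, b).ratio()


def generation_consistency(alpha, responses):
    """Mean similarity between alpha's outputs and those of adjacent candidates."""
    ordered = sorted(responses)
    idx = ordered.index(alpha)
    neighbors = [ordered[j] for j in (idx - 1, idx + 1) if 0 <= j < len(ordered)]
    scores = [
        similarity(own, other)
        for n in neighbors
        for own, other in zip(responses[alpha], responses[n])
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up outputs for one prompt at each candidate coefficient
responses = {
    0.3: ["Answer: 12"],
    0.4: ["Answer: 42"],
    0.5: ["Answer: 42"],
    0.6: ["Answer: 42"],
    0.7: ["I am not sure"],
}

best_alpha = max(responses, key=lambda a: generation_consistency(a, responses))
print(best_alpha)  # 0.5
```

Here 0.5 wins because both of its neighbors produce the same output, while every other candidate borders at least one coefficient that answers differently.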
3. Method Selection
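# Two models, smooth interpolation
merge_method = "slerp"
# Simple combinations of 2+ similar models
merge_method = "linear"
# Many task-specific models with conflicting parameter updates
merge_method = "ties"
# Reducing redundancy in fine-tuned parameters
merge_method = "dare_ties"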
4. Density Tuning (TIES/DARE)
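# density is the fraction of each task vector's parameters that is retained
parameters:
  density: 0.8   # retain 80%: light sparsification
parameters:
  density: 0.5   # retain 50%: the value used in the TIES and DARE examples above
parameters:
  density: 0.9   # retain 90%: minimal sparsification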
5. Layer-specific Merging
Keep the base model's first and last layers (often best left untouched) and change only the middle layers by giving each slice its own layer_range; see the Layer-wise Merging pattern above, which uses merge_method: passthrough with slices.
Evaluation & Testing
Benchmark Merged Models
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the merged checkpoint written by mergekit
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
# One smoke-test prompt per capability the merge is meant to retain
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}
for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
Common Benchmarks
- Open LLM Leaderboard: General capabilities
- MT-Bench: Multi-turn conversation
- MMLU: Multitask accuracy
- HumanEval: Code generation
- GSM8K: Math reasoning
Production Deployment
Save and Upload
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
Quantize Merged Model
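# GGUF via llama.cpp (convert.py is the older script name; newer llama.cpp releases provide convert_hf_to_gguf.py instead)
python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf
# GPTQ (quantize_gptq.py stands in for your own GPTQ quantization script)
python quantize_gptq.py ./merged-model --bits 4 --group_size 128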
Common Pitfalls
- Mismatched architectures: only merge models that share the same architecture (e.g., don't mix Llama and Mistral).
- Over-weighting one model (e.g., 0.95 / 0.05): keep weights balanced, typically in the 0.3–0.7 range.
- Skipping evaluation: always benchmark a merged model before deploying (see the Evaluation & Testing section above).
Resources
See Also
references/methods.md - Deep dive into merge algorithms
references/examples.md - Real-world merge configurations
references/evaluation.md - Benchmarking and testing strategies
references/coefficient-tuning.md - Unsupervised coefficient search via generation consistency (AdaMMS, arXiv:2503.23733)