kailash-align
Kailash Align — MANDATORY for LLM fine-tuning/alignment/serving. 12 methods (DPO, RLHF, LoRA, GRPO, SFT), GGUF, Ollama/vLLM. Raw TRL/PEFT BLOCKED.
LLM fine-tuning and alignment framework built on TRL (Transformer Reinforcement Learning). 12 supported methods, LoRA adapter management, evaluation-before-serving, and deployment to Ollama/vLLM.
Python-only for v1 — GPU required for training.
```bash
pip install kailash-align           # Core (torch, transformers, trl>=1.0, peft)
pip install kailash-align[rlhf]     # + QLoRA (bitsandbytes)
pip install kailash-align[eval]     # + benchmarks (lm-eval)
pip install kailash-align[serve]    # + GGUF/Ollama (llama-cpp-python, gguf)
pip install kailash-align[online]   # + fast generation (vllm, CUDA only)
pip install kailash-align[full]     # Everything
```
| Category | Methods | Data Format | Reward Needed |
|---|---|---|---|
| offline | sft, dpo, cpo | text / prompt+chosen+rejected | No |
| unpaired | kto, bco | prompt+completion+label | No |
| monolithic | orpo | prompt+chosen+rejected | No |
| online | grpo, rloo, ppo, online_dpo, xpo, nash_md | prompt only | Yes (except online_dpo) |
Special combo: sft_then_dpo — two-stage SFT then DPO with adapter chaining.
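To make the data formats concrete, here are illustrative records for each category. The field names follow TRL's common dataset conventions and should be treated as assumptions rather than the library's verified schema.

```python
# Illustrative rows per category; field names assumed from TRL conventions.
sft_row = {"text": "User: What is DPO?\nAssistant: Direct Preference Optimization is..."}  # offline: sft
dpo_row = {"prompt": "Explain DPO.", "chosen": "DPO trains on...", "rejected": "No idea."}  # offline: dpo, cpo; monolithic: orpo
kto_row = {"prompt": "Explain KTO.", "completion": "KTO uses binary labels...", "label": True}  # unpaired: kto, bco
grpo_row = {"prompt": "Compute 12 * 7."}  # online methods: prompt only
```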
```python
import asyncio

from kailash_align import AlignmentConfig, AlignmentPipeline

config = AlignmentConfig(
    method="dpo",
    base_model_id="meta-llama/Llama-3.1-8B",
)
pipeline = AlignmentPipeline(config=config)

# train() is a coroutine, so drive it with an event loop;
# preference_dataset holds prompt/chosen/rejected records
result = asyncio.run(pipeline.train(dataset=preference_dataset, adapter_name="my-dpo-adapter"))
# result.adapter_id, result.metrics, result.training_time
```
```
AlignmentConfig
      │
      ▼
AlignmentPipeline.train()
      │
      ▼
AlignmentEvaluator.evaluate()  ← MANDATORY before serving
      │
      ▼
AlignmentServing.deploy()      ← Ollama / vLLM / GGUF export
      │
      ▼
KaizenModelBridge.load()       ← Connect to Kaizen agents
```
Eval-before-serve is mandatory. No model reaches production without passing evaluation against the base model.
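A sketch of that gate in code, assuming the evaluator and serving classes accept the config and adapter id as shown. The class and method names come from the diagram above; the import path, argument names, and the `passed` field are illustrative assumptions.

```python
# Illustrative only: constructor/method arguments and report fields are assumed.
from kailash_align import AlignmentEvaluator, AlignmentServing

async def promote(config, result):
    evaluator = AlignmentEvaluator(config=config)
    # Mandatory gate: benchmark the adapter against the base model
    report = await evaluator.evaluate(adapter_id=result.adapter_id)

    if report.passed:  # hypothetical field on the evaluation report
        serving = AlignmentServing(config=config)
        await serving.deploy(adapter_id=result.adapter_id, target="ollama")
```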
| Engine | Purpose |
|---|---|
| AlignmentPipeline | Training orchestration via MethodRegistry |
| AdapterRegistry | LoRA adapter versioning + stage transitions |
| AlignmentEvaluator | lm-eval-harness benchmarking |
| AlignmentServing | GGUF export + Ollama + vLLM deployment |
| KaizenModelBridge | Connect fine-tuned models to Kaizen Delegate |
| OnPremModelCache | Air-gapped model preparation |
| Scenario | Recommended Method | Why |
|---|---|---|
| First fine-tune, have instruction data | sft | Simplest, most stable |
| Have preference pairs (chosen/rejected) | dpo | No reward model needed, stable |
| Want reasoning/math improvement | grpo | Group-relative optimization |
| Limited preference data, have labels | kto | Works with binary labels, not pairs |
| Want SFT + preference in one run | orpo | Combined objective, single training |
| Need maximum control, have reward model | ppo | Classic RLHF, most flexible |
| Two-stage refinement | sft_then_dpo | SFT baseline then preference alignment |
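For example, the "limited preference data, have labels" row maps to a config like this; dataset rows follow the unpaired format from the method table.

```python
from kailash_align import AlignmentConfig

# KTO trains on prompt + completion + binary label, so no preference pairs needed
config = AlignmentConfig(
    method="kto",
    base_model_id="meta-llama/Llama-3.1-8B",
)
```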
```
AlignmentConfig --> AlignmentPipeline --> MethodRegistry --> TRL Trainer
                                              │
                                       _lazy_import()
                                              │
                        SFTTrainer / DPOTrainer / GRPOTrainer / ...
```
All TRL trainers are lazy-imported via _lazy_import() — no heavy torch imports at module level.
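A minimal sketch of the technique, not the library's actual `_lazy_import()`:

```python
# General pattern: resolve the trainer class only on first use, so importing
# the package does not pull in torch/trl at module-import time.
import importlib

def _lazy_import(module_name: str, attr_name: str):
    def resolve():
        module = importlib.import_module(module_name)  # imported on demand
        return getattr(module, attr_name)
    return resolve

get_dpo_trainer = _lazy_import("trl", "DPOTrainer")
# torch/trl load only when a trainer is actually requested:
# DPOTrainer = get_dpo_trainer()
```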
Security and validation hardening:

- `trust_remote_code=False` on all model/tokenizer loading (no arbitrary code execution from model repos)
- Metric values are checked with `math.isfinite()`
- Subprocesses are never invoked with `shell=True`
- Registry bounds: `max_adapters=10,000`, `max_versions_per_adapter=1,000`

All config classes are `@dataclass(frozen=True)` with `__post_init__` validation:
```python
from kailash_align import AlignmentConfig
from kailash_align.configs import GRPOConfig

config = AlignmentConfig(
    method="grpo",
    base_model_id="meta-llama/Llama-3.1-8B",
    grpo=GRPOConfig(num_generations=4, kl_coef=0.001),
    reward_funcs=["accuracy"],
    # bf16 and fp16 are mutually exclusive (validated in __post_init__)
    bf16=True,
)
```
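The validation pattern itself looks roughly like this (an illustrative reconstruction, not one of the library's real classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable after construction
class PrecisionConfig:   # hypothetical stand-in for the real config classes
    bf16: bool = False
    fp16: bool = False

    def __post_init__(self) -> None:
        # Same rule the example above notes: bf16 and fp16 are exclusive
        if self.bf16 and self.fp16:
            raise ValueError("bf16 and fp16 are mutually exclusive")
```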
Set AlignmentConfig.loss_type to use DPO variants without new trainer code:
`ipo`, `simpo`, `robust`, `bco_pair`, `sppo_hard`, `aot`, `aot_pair`, `nca_pair`
```python
config = AlignmentConfig(
    method="dpo",
    base_model_id="meta-llama/Llama-3.1-8B",
    loss_type="simpo",  # SimPO variant of DPO
)
```