| name | slime-user |
| description | Guide for using SLIME (LLM post-training framework for RL Scaling). Use when working with SLIME for reinforcement learning training of language models, including setup, configuration, training execution, multi-turn interactions, custom reward models, tool calling scenarios, or troubleshooting SLIME workflows. Covers GRPO, GSPO, PPO, Reinforce++, multi-agent RL, VLM training, FSDP/Megatron backends, SGLang integration, dynamic sampling, and custom generation functions. |
SLIME User Guide
SLIME is an LLM post-training framework for RL Scaling developed by THUDM. It supports various RL algorithms (GRPO, GSPO, PPO, Reinforce++), multiple training backends (Megatron, FSDP), and advanced features like multi-turn interactions, tool calling, and dynamic sampling.
Quick Start Workflow
For First-Time Users
- Environment Setup
  - Use Docker: docker pull slimerl/slime:latest
  - Or build from source: see docs/en/get_started/quick_start.md
  - Hardware: supports H100/H200 and B200 series
- Download Model and Data
    hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
    hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
- Convert Weights (Megatron backend only)
    source scripts/models/qwen3-4B.sh
    PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
        ${MODEL_ARGS[@]} \
        --hf-checkpoint /root/Qwen3-4B \
        --save /root/Qwen3-4B_torch_dist
- Run Training
    bash scripts/run-qwen3-4B.sh
For Experienced Users
When the user needs specific functionality:
- Multi-turn/tool calling: Read the Search-R1 section of references/examples_reference.md
- Custom reward models: See the custom RM pattern in the examples reference
- FSDP instead of Megatron: Use --train-backend fsdp and skip weight conversion
- Large-scale training: See the multi-node examples (GLM-4.5, DeepSeek-R1)
- Source code exploration: Check references/source_code_reference.md
Documentation Navigation
SLIME has extensive documentation. Use this guide to find what you need quickly.
Essential Documentation (Read These First)
- Quick Start Guide: docs/en/get_started/quick_start.md - Setup and first training run
- Usage Guide: docs/en/get_started/usage.md - Comprehensive parameter reference
- Example Docs: docs/en/examples/qwen3-4B.md or docs/en/examples/glm4-9B.md
For detailed navigation of all documentation, see references/doc_navigation.md.
Common Tasks → Documentation Mapping
| Task | Documentation |
|---|---|
| First-time setup | docs/en/get_started/quick_start.md |
| Understanding parameters | docs/en/get_started/usage.md |
| Basic training (8 GPUs) | docs/en/examples/qwen3-4B.md |
| Multi-turn tool use | examples/search-r1/ |
| Custom generation logic | docs/en/get_started/customization.md |
| Multi-node training | docs/en/examples/glm4.5-355B-A32B.md |
| FSDP backend | docs/en/get_started/usage.md (FSDP section) |
| VLM training | examples/geo3k_vlm/ |
| Troubleshooting | docs/en/get_started/qa.md |
Core Concepts
Training Loop
SLIME uses a "Rollout → Train" loop:
- Rollout: Generate responses using SGLang inference
- Reward: Compute rewards using reward model
- Train: Update model weights using Megatron/FSDP
- Repeat for --num-rollout iterations
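The loop above can be pictured with a small, framework-free sketch. The functions `rollout`, `compute_reward`, and `train_step` are hypothetical stand-ins rather than SLIME APIs; they only illustrate the order of operations and how `--num-rollout` bounds the outer loop.

```python
import random

def rollout(prompts, n_samples_per_prompt):
    # Hypothetical stand-in for SGLang generation: one fake response per sample.
    return [(p, f"response-{i}") for p in prompts for i in range(n_samples_per_prompt)]

def compute_reward(sample, rng):
    # Hypothetical stand-in for a reward model: a random 0/1 score.
    return float(rng.random() > 0.5)

def train_step(batch):
    # Hypothetical stand-in for one Megatron/FSDP optimizer step.
    return len(batch)

def run(num_rollout, prompts, n_samples_per_prompt, global_batch_size, seed=0):
    rng = random.Random(seed)
    optimizer_steps = 0
    for _ in range(num_rollout):
        samples = rollout(prompts, n_samples_per_prompt)        # Rollout
        rewards = [compute_reward(s, rng) for s in samples]     # Reward
        scored = list(zip(samples, rewards))
        for start in range(0, len(scored), global_batch_size):  # Train
            train_step(scored[start:start + global_batch_size])
            optimizer_steps += 1
    return optimizer_steps

steps = run(num_rollout=3, prompts=["a", "b", "c", "d"],
            n_samples_per_prompt=2, global_batch_size=4)
print(steps)  # 6: two optimizer steps per rollout, three rollouts
```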
Key Constraint
rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout
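As a worked example, the values used later in this guide (32 prompts, 8 samples each) produce 256 samples per rollout, so `global-batch-size × num-steps-per-rollout` must also equal 256. The helper below is a hypothetical pre-flight check, not part of SLIME; it simply evaluates both sides of the constraint.

```python
def check_batch_constraint(rollout_batch_size, n_samples_per_prompt,
                           global_batch_size, num_steps_per_rollout):
    """Return (ok, generated, consumed) for the batch-size constraint."""
    generated = rollout_batch_size * n_samples_per_prompt   # samples produced per rollout
    consumed = global_batch_size * num_steps_per_rollout    # samples used by training
    return generated == consumed, generated, consumed

print(check_batch_constraint(32, 8, 256, 1))  # (True, 256, 256)
print(check_batch_constraint(32, 8, 128, 1))  # (False, 256, 128)
```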
Resource Allocation Modes
Colocated (training and inference share GPUs):
--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
--sglang-mem-fraction-static 0.7
Disaggregated (separate GPUs for training/inference):
--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4
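A quick way to sanity-check a cluster request against the two modes is to count GPUs. The sketch below assumes, as the descriptions above suggest, that colocated mode reuses the training GPUs for inference while disaggregated mode needs the rollout GPUs in addition; `total_gpus` is a hypothetical helper, not a SLIME function.

```python
def total_gpus(actor_num_nodes, actor_num_gpus_per_node,
               rollout_num_gpus=0, colocate=False):
    """Estimate the GPUs a job needs under the stated assumptions."""
    train_gpus = actor_num_nodes * actor_num_gpus_per_node
    if colocate:
        # Training and inference share the same devices.
        return train_gpus
    # Inference engines run on their own devices.
    return train_gpus + rollout_num_gpus

print(total_gpus(1, 8, colocate=True))       # 8
print(total_gpus(1, 4, rollout_num_gpus=4))  # 8
```

Both example configurations above fit on a single 8-GPU node.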
Parameter Quick Reference
Essential Parameters
Model Loading:
--hf-checkpoint: HuggingFace model path (for SGLang and FSDP)
--ref-load: Megatron reference model checkpoint
--load: Megatron actor checkpoint (resume training)
--save: Save path for checkpoints
Data:
--prompt-data: JSONL dataset path
--input-key: Field name for prompts (default: "prompt")
--label-key: Field name for labels (default: "label")
--metadata-key: Field name for metadata (default: "metadata")
--apply-chat-template: Apply tokenizer chat template
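To make the data parameters concrete, the following sketch writes and reads a tiny JSONL file using the default key names listed above. The records are invented for illustration, and storing metadata as a JSON-encoded string follows the multi-turn example later in this guide; check the dataset format your own pipeline expects.

```python
import json
import os
import tempfile

# Toy records using the default keys; the values are invented for illustration.
records = [
    {"prompt": "What is 2 + 2?", "label": "4",
     "metadata": json.dumps({"session_id": "123"})},
    {"prompt": "What is 3 * 5?", "label": "15",
     "metadata": json.dumps({"session_id": "124"})},
]

def write_jsonl(path, rows):
    # JSONL: exactly one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "toy_prompts.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(len(loaded), loaded[0]["prompt"])
```

The resulting file path would then be passed to `--prompt-data`.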
Rollout:
--rollout-batch-size: Prompts per rollout
--n-samples-per-prompt: Responses per prompt
--rollout-max-response-len: Max response length
--rollout-temperature: Sampling temperature
Training:
--num-rollout: Total training iterations
--num-steps-per-rollout: Optimizer steps per rollout (default: 1)
--global-batch-size: Samples per optimizer step
--advantage-estimator: RL algorithm (grpo, gspo, ppo, reinforce_plus_plus)
Reward Model:
--rm-type: Built-in RM type (e.g., "deepscaler")
--custom-rm-path: Custom RM function path
Backends:
--train-backend: Training backend (megatron or fsdp)
--rollout-num-gpus-per-engine: GPUs per SGLang engine (like tp_size)
For complete parameter reference, see docs/en/get_started/usage.md.
Common Workflows
1. Standard Single-Turn Training
Use example scripts as templates:
scripts/run-qwen3-4B.sh: Basic 8xH100 setup
scripts/run-glm4-9B.sh: With dynamic sampling
Key sections in script:
source scripts/models/qwen3-4B.sh
CKPT_ARGS=(--hf-checkpoint /root/Qwen3-4B ...)
ROLLOUT_ARGS=(
--rollout-batch-size 32
--n-samples-per-prompt 8
--rm-type deepscaler
)
GRPO_ARGS=(--advantage-estimator grpo ...)
ray job submit ... -- python3 train.py \
${MODEL_ARGS[@]} ${CKPT_ARGS[@]} ${ROLLOUT_ARGS[@]} ...
2. Multi-Turn Tool Calling
For multi-turn scenarios (like Search-R1):
- Prepare Data with metadata:
    {
      "question": "User query",
      "final_answer": "Expected answer",
      "metadata": "{\"session_id\": \"123\", \"tool_code\": \"...\"}"
    }
- Implement Custom Generation Function:
    async def generate(args, sample: Sample, sampling_params) -> Sample:
        # Pseudocode: call_sglang, execute_tool, max_turns, and the *_tokens
        # variables are placeholders for your own logic.
        for turn in range(max_turns):
            # Model-generated tokens contribute to the loss (mask 1)
            model_output = await call_sglang(...)
            sample.loss_mask += [1] * len(model_tokens)
            # Tool-output tokens are excluded from the loss (mask 0)
            tool_output = await execute_tool(...)
            sample.loss_mask += [0] * len(tool_tokens)
            if action == "answer":
                break
        # response_tokens holds all model and tool tokens, in order
        sample.tokens = prompt_tokens + response_tokens
        sample.response_length = len(response_tokens)
        return sample
- Configure Custom Functions:
    --custom-generate-function-path my_module.generate \
    --custom-rm-path my_module.reward_func \
    --metadata-key metadata
See examples/search-r1/ for complete example.
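The loss-mask bookkeeping in the generation function can be checked in isolation. The sketch below uses `ToySample`, a hypothetical stand-in with only the three fields touched here (it is not slime's `Sample`), and plain integer lists in place of real token IDs.

```python
from dataclasses import dataclass, field

@dataclass
class ToySample:
    # Hypothetical stand-in for slime's Sample; only the fields used below.
    tokens: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)
    response_length: int = 0

def assemble(prompt_tokens, turns):
    """turns: list of (model_tokens, tool_tokens) pairs; tool_tokens may be empty."""
    sample = ToySample()
    response_tokens = []
    for model_tokens, tool_tokens in turns:
        response_tokens += model_tokens
        sample.loss_mask += [1] * len(model_tokens)  # trained on
        response_tokens += tool_tokens
        sample.loss_mask += [0] * len(tool_tokens)   # masked out
    sample.tokens = prompt_tokens + response_tokens
    sample.response_length = len(response_tokens)
    return sample

s = assemble([101, 102], [([1, 2, 3], [9, 9]), ([4, 5], [])])
print(s.loss_mask)        # [1, 1, 1, 0, 0, 1, 1]
print(s.response_length)  # 7
```

The mask covers only the response, so its length matches `response_length` rather than the full token sequence.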
3. Dynamic Sampling (DAPO-style)
Filter low-quality samples during generation:
ROLLOUT_ARGS+=(
    --over-sampling-batch-size 64
    --rollout-batch-size 32
    --dynamic-sampling-filter-path
    slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
)
How it works:
- Samples 64 prompts (over-sampling)
- Filters groups based on reward diversity
- Keeps only 32 prompts × 8 samples that pass filter
- Automatically resamples if too many filtered out
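The filtering idea can be sketched without the framework. The code below is an illustration only, not slime's implementation: it assumes the filter keeps a prompt group when its rewards are not all identical (as the name `check_reward_nonzero_std` suggests) and stops once the target number of groups is reached.

```python
import statistics

def reward_has_nonzero_std(rewards):
    # Keep a group only if its rewards are not all identical.
    return statistics.pstdev(rewards) > 0

def select_groups(groups, target):
    """groups: list of reward lists, one per prompt. Return indices of kept groups."""
    kept = []
    for idx, rewards in enumerate(groups):
        if reward_has_nonzero_std(rewards):
            kept.append(idx)
        if len(kept) == target:
            break
    return kept

groups = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 1, 1]]
print(select_groups(groups, target=2))  # [1, 3]
```

Groups where every sample receives the same reward yield zero advantage under group-normalized estimators such as GRPO, which is why dropping them is a common choice.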
4. FSDP Backend (No Weight Conversion)
--train-backend fsdp \
--hf-checkpoint /root/Qwen3-4B \
--gradient-checkpointing \
--context-parallel-size 2
Benefits:
- No HF → Megatron weight conversion needed
- Directly load HuggingFace checkpoints
- Simpler setup for supported models
See examples/geo3k_vlm/ and docs/en/get_started/usage.md FSDP section.
5. Multi-Node Training
- Start Ray cluster:
    # On the head node
    ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8
    # On each worker node
    ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
- Submit job:
    ray job submit --address="http://127.0.0.1:8265" \
        --runtime-env-json='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/"}}' \
        -- python3 train.py \
        --actor-num-nodes 8 \
        --actor-num-gpus-per-node 8 \
        ...
See docs/en/examples/glm4.5-355B-A32B.md for large-scale example.
Customization Guide
Custom Reward Model
Implement an async function:
from slime.utils.types import Sample

async def my_reward_func(args, sample: Sample, **kwargs) -> float:
    prompt = sample.prompt
    response = sample.response
    label = sample.label
    # compute_score is your own scoring logic
    reward = compute_score(response, label)
    return float(reward)
Use with: --custom-rm-path module.path.my_reward_func
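The reward function leaves `compute_score` undefined, so here is one possible shape for it. This is a sketch under assumptions: the scoring rule (exact match on the last non-empty line of the response) is invented for illustration, and `SimpleNamespace` stands in for a real `Sample` so the example runs without slime installed.

```python
import asyncio
from types import SimpleNamespace

def compute_score(response, label):
    # Hypothetical rule: 1.0 if the last non-empty line of the response equals the label.
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return 1.0 if lines and lines[-1] == str(label).strip() else 0.0

async def my_reward_func(args, sample, **kwargs):
    return float(compute_score(sample.response, sample.label))

# SimpleNamespace mimics the three attributes the reward function reads.
good = SimpleNamespace(prompt="What is 2 + 2?", response="Let me think.\n4", label="4")
bad = SimpleNamespace(prompt="What is 2 + 2?", response="Let me think.\n5", label="4")
print(asyncio.run(my_reward_func(None, good)))  # 1.0
print(asyncio.run(my_reward_func(None, bad)))   # 0.0
```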
Custom Generation Function
Implement an async function:
from slime.utils.http_utils import post
from slime.utils.processing_utils import load_tokenizer
from slime.utils.types import Sample

async def my_generate(args, sample: Sample, sampling_params) -> Sample:
    tokenizer = load_tokenizer(args.hf_checkpoint, trust_remote_code=True)
    # Send the prompt to the SGLang router
    output = await post(
        f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate",
        {"text": sample.prompt, "sampling_params": sampling_params},
    )
    prompt_tokens = tokenizer(sample.prompt, add_special_tokens=False)["input_ids"]
    response_tokens = tokenizer(output["text"], add_special_tokens=False)["input_ids"]
    # Fill in the fields that training reads
    sample.tokens = prompt_tokens + response_tokens
    sample.response_length = len(response_tokens)
    sample.response = output["text"]
    sample.truncated = output["meta_info"]["finish_reason"]["type"] == "length"
    return sample
Use with: --custom-generate-function-path module.path.my_generate
Custom Dynamic Filter
Implement a filter function:
def my_filter(args, samples: list[Sample], **kwargs) -> bool:
    # Keep the group only if every sample scores above 0.5
    return all(sample.reward > 0.5 for sample in samples)
Use with: --dynamic-sampling-filter-path module.path.my_filter
Examples Reference
For detailed examples and patterns, see references/examples_reference.md.
Quick finder:
- Basic math training: scripts/run-qwen3-4B.sh
- Multi-turn tool use: examples/search-r1/
- Vision-language RL: examples/geo3k_vlm/
- Large-scale MoE: docs/en/examples/glm4.5-355B-A32B.md
- Custom generation: examples/search-r1/search_r1_logic.py
- FSDP backend: examples/geo3k_vlm/
Source Code Reference
For source code exploration, see references/source_code_reference.md.
Key files:
- Arguments: slime/utils/arguments.py
- Rollout: slime/rollout/sglang_rollout.py
- Sample type: slime/utils/types.py
- Reward models: slime/rollout/rm_hub/
- Conversion tools: tools/convert_hf_to_torch_dist.py
Troubleshooting
Common Issues
OOM during colocated training:
- Reduce --sglang-mem-fraction-static (try 0.7 or 0.6)
- Reduce --max-tokens-per-gpu
- Enable gradient checkpointing: --recompute-granularity full
Mismatched batch sizes:
- Ensure: rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout
Weight conversion errors:
- Check model config matches exactly (e.g., --rotary-base)
- Use FSDP backend to skip conversion: --train-backend fsdp
Multi-node communication issues:
- Set environment variables: GLOO_SOCKET_IFNAME, NCCL_SOCKET_IFNAME
- See the multi-node section of docs/en/get_started/quick_start.md
SGLang concurrency issues:
- Limit concurrency: --sglang-server-concurrency 160
- Increase CUDA graphs: --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)
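The shell substitution `$(seq 16 8 256)` in the last flag expands to every multiple of 8 from 16 through 256 inclusive. The following sketch reproduces the full list of batch sizes in Python so the expansion is easy to inspect; the function name is made up for this example.

```python
def cuda_graph_batch_sizes():
    # Mirrors `1 2 4 8 $(seq 16 8 256)`; seq includes its last value.
    return [1, 2, 4, 8] + list(range(16, 256 + 1, 8))

sizes = cuda_graph_batch_sizes()
print(len(sizes), sizes[:6], sizes[-1])  # 35 [1, 2, 4, 8, 16, 24] 256
```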
For more troubleshooting, see docs/en/get_started/qa.md.