slime-user

Guide for using SLIME (LLM post-training framework for RL Scaling). Use when working with SLIME for reinforcement learning training of language models, including setup, configuration, training execution, multi-turn interactions, custom reward models, tool calling scenarios, or troubleshooting SLIME workflows. Covers GRPO, GSPO, PPO, Reinforce++, multi-agent RL, VLM training, FSDP/Megatron backends, SGLang integration, dynamic sampling, and custom generation functions.

Overview

Install command

npx skills add https://github.com/yzlnew/infra-skills --skill slime-user

Copy and paste this command into Claude Code to install the skill

Source

yzlnew/infra-skills

Stars: 128

Forks: 9

Updated: January 14, 2026 at 05:46

SLIME User Guide

SLIME is an LLM post-training framework for RL Scaling developed by THUDM. It supports various RL algorithms (GRPO, GSPO, PPO, Reinforce++), multiple training backends (Megatron, FSDP), and advanced features like multi-turn interactions, tool calling, and dynamic sampling.

Quick Start Workflow

For First-Time Users

Environment Setup
- Use Docker: docker pull slimerl/slime:latest
- Or build from source: See docs/en/get_started/quick_start.md
- Hardware: Supports H100/H200, B200 series

Download Model and Data

hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k

Convert Weights (Megatron backend only)

source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-4B \
    --save /root/Qwen3-4B_torch_dist

Run Training
```
bash scripts/run-qwen3-4B.sh
```

For Experienced Users

When the user needs specific functionality:

- Multi-turn/tool calling: Read the Search-R1 section of references/examples_reference.md
- Custom reward models: See the custom RM pattern in references/examples_reference.md
- FSDP instead of Megatron: Use --train-backend fsdp and skip weight conversion
- Large-scale training: See the multi-node examples (GLM-4.5, DeepSeek-R1)
- Source code exploration: Check references/source_code_reference.md

Documentation Navigation

SLIME has extensive documentation. Use this guide to find what you need quickly.

Essential Documentation (Read These First)

- Quick Start Guide: docs/en/get_started/quick_start.md - Setup and first training run
- Usage Guide: docs/en/get_started/usage.md - Comprehensive parameter reference
- Example Docs: docs/en/examples/qwen3-4B.md or docs/en/examples/glm4-9B.md

For detailed navigation of all documentation, see references/doc_navigation.md.

Common Tasks → Documentation Mapping

| Task | Documentation |
| --- | --- |
| First-time setup | `docs/en/get_started/quick_start.md` |
| Understanding parameters | `docs/en/get_started/usage.md` |
| Basic training (8 GPUs) | `docs/en/examples/qwen3-4B.md` |
| Multi-turn tool use | `examples/search-r1/` |
| Custom generation logic | `docs/en/get_started/customization.md` |
| Multi-node training | `docs/en/examples/glm4.5-355B-A32B.md` |
| FSDP backend | `docs/en/get_started/usage.md` (FSDP section) |
| VLM training | `examples/geo3k_vlm/` |
| Troubleshooting | `docs/en/get_started/qa.md` |

Core Concepts

Training Loop

SLIME uses a "Rollout → Train" loop:

1. Rollout: Generate responses using SGLang inference
2. Reward: Compute rewards using the reward model
3. Train: Update model weights using Megatron/FSDP
4. Repeat for --num-rollout iterations
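The control flow of the loop above can be sketched in a few lines. This is only an illustration of the ordering of the three phases: `run_loop` and the `toy_*` functions are hypothetical stand-ins, not SLIME APIs, and no model is actually trained.

```python
def run_loop(num_rollout, generate, score, train_step):
    """Drive `num_rollout` iterations of rollout -> reward -> train."""
    history = []
    for rollout_id in range(num_rollout):
        samples = generate(rollout_id)           # Rollout: produce responses
        rewards = [score(s) for s in samples]    # Reward: score each response
        loss = train_step(samples, rewards)      # Train: update the policy
        history.append((rollout_id, len(samples), loss))
    return history


# Toy stand-ins (hypothetical) so the sketch runs on its own.
def toy_generate(rollout_id):
    return [f"response-{rollout_id}-{i}" for i in range(4)]


def toy_score(sample):
    return 1.0 if sample.endswith("0") else 0.0


def toy_train_step(samples, rewards):
    # Pretend the loss is the negative mean reward.
    return -sum(rewards) / len(rewards)


history = run_loop(3, toy_generate, toy_score, toy_train_step)
print(history)
```

In the real framework, the three callables correspond to SGLang generation, the reward model, and the Megatron/FSDP training step.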

Key Constraint

rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout
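A quick way to sanity-check this constraint before launching a job is to compute both sides directly. The helper names below are hypothetical and not part of SLIME; the sketch only encodes the equality stated above.

```python
def check_batch_constraint(rollout_batch_size, n_samples_per_prompt,
                           global_batch_size, num_steps_per_rollout=1):
    """Return True when samples generated per rollout equal samples consumed by training."""
    generated = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    return generated == consumed


def derive_global_batch_size(rollout_batch_size, n_samples_per_prompt,
                             num_steps_per_rollout=1):
    """Solve the constraint for global-batch-size, or raise if it has no integer solution."""
    generated = rollout_batch_size * n_samples_per_prompt
    if generated % num_steps_per_rollout != 0:
        raise ValueError(
            f"{generated} samples cannot be split evenly into "
            f"{num_steps_per_rollout} optimizer steps"
        )
    return generated // num_steps_per_rollout


# 32 prompts x 8 samples = 256 samples per rollout.
print(check_batch_constraint(32, 8, 256, 1))   # one step of 256 samples
print(check_batch_constraint(32, 8, 128, 2))   # two steps of 128 samples
print(derive_global_batch_size(32, 8, num_steps_per_rollout=2))
```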

Resource Allocation Modes

Colocated (training and inference share GPUs):

--actor-num-nodes 1 \
--actor-num-gpus-per-node 8 \
--colocate \
--sglang-mem-fraction-static 0.7

Disaggregated (separate GPUs for training/inference):

--actor-num-nodes 1 \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4
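To compare the GPU footprint of the two modes, the following sketch adds up the GPUs each configuration needs. It assumes, as the descriptions above suggest, that colocated mode reuses the training GPUs for inference while disaggregated mode adds `--rollout-num-gpus` on top. `total_gpus` is a hypothetical helper.

```python
def total_gpus(actor_num_nodes, actor_num_gpus_per_node,
               rollout_num_gpus=0, colocate=False):
    """Estimate how many GPUs a job occupies under each allocation mode."""
    training_gpus = actor_num_nodes * actor_num_gpus_per_node
    if colocate:
        # Inference shares the training GPUs, so nothing is added.
        return training_gpus
    # Inference engines get their own GPUs in addition to the training ones.
    return training_gpus + rollout_num_gpus


colocated = total_gpus(1, 8, colocate=True)
disaggregated = total_gpus(1, 4, rollout_num_gpus=4)
print(colocated, disaggregated)
```

Both example configurations above fit on a single 8-GPU node; they differ in whether the two workloads time-share the devices or split them.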

Parameter Quick Reference

Essential Parameters

Model Loading:

- `--hf-checkpoint`: HuggingFace model path (for SGLang and FSDP)
- `--ref-load`: Megatron reference model checkpoint
- `--load`: Megatron actor checkpoint (resume training)
- `--save`: Save path for checkpoints

Data:

- `--prompt-data`: JSONL dataset path
- `--input-key`: Field name for prompts (default: "prompt")
- `--label-key`: Field name for labels (default: "label")
- `--metadata-key`: Field name for metadata (default: "metadata")
- `--apply-chat-template`: Apply tokenizer chat template
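Since the dataset is plain JSONL, it can be worth checking that every record carries the configured prompt and label fields before starting a run. The sketch below is a standalone check with a hypothetical `validate_jsonl` helper; the key names default to the values listed above, and the two records are made-up examples.

```python
import json
import os
import tempfile


def validate_jsonl(path, input_key="prompt", label_key="label"):
    """Count records and raise if any line lacks the prompt or label field."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in (input_key, label_key) if k not in record]
            if missing:
                raise KeyError(f"line {line_no}: missing fields {missing}")
            count += 1
    return count


# Write a tiny made-up dataset and validate it.
records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": "What is 3 * 5?", "label": "15", "metadata": {"source": "toy"}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    n_records = validate_jsonl(path)

print(n_records)
```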

Rollout:

- `--rollout-batch-size`: Prompts per rollout
- `--n-samples-per-prompt`: Responses per prompt
- `--rollout-max-response-len`: Max response length
- `--rollout-temperature`: Sampling temperature

Training:

- `--num-rollout`: Total training iterations
- `--num-steps-per-rollout`: Optimizer steps per rollout (default: 1)
- `--global-batch-size`: Samples per optimizer step
- `--advantage-estimator`: RL algorithm (grpo, gspo, ppo, reinforce_plus_plus)

Reward Model:

- `--rm-type`: Built-in RM type (e.g., "deepscaler")
- `--custom-rm-path`: Custom RM function path

Backends:

- `--train-backend`: Training backend (megatron or fsdp)
- `--rollout-num-gpus-per-engine`: GPUs per SGLang engine (like tp_size)

For complete parameter reference, see docs/en/get_started/usage.md.

Common Workflows

1. Standard Single-Turn Training

Use example scripts as templates:

scripts/run-qwen3-4B.sh: Basic 8xH100 setup
scripts/run-glm4-9B.sh: With dynamic sampling

Key sections in script:

# Load model config
source scripts/models/qwen3-4B.sh

# Configure checkpoints
CKPT_ARGS=(--hf-checkpoint /root/Qwen3-4B ...)

# Configure rollout
ROLLOUT_ARGS=(
  --rollout-batch-size 32
  --n-samples-per-prompt 8
  --rm-type deepscaler
)

# Configure algorithm
GRPO_ARGS=(--advantage-estimator grpo ...)

# Run training
ray job submit ... -- python3 train.py \
  ${MODEL_ARGS[@]} ${CKPT_ARGS[@]} ${ROLLOUT_ARGS[@]} ...

2. Multi-Turn Tool Calling

For multi-turn scenarios (like Search-R1):

Prepare Data with metadata:

{
  "question": "User query",
  "final_answer": "Expected answer",
  "metadata": "{\"session_id\": \"123\", \"tool_code\": \"...\"}"
}
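In the record above, `metadata` is a JSON-encoded string rather than a nested object. The sketch below shows one way to produce and read back such a field without hand-escaping quotes. `make_record` and `read_metadata` are hypothetical helpers, and whether SLIME also accepts a nested object here is not stated in this guide.

```python
import json


def make_record(question, final_answer, metadata):
    """Build one dataset row, storing metadata as a JSON-encoded string."""
    return {
        "question": question,
        "final_answer": final_answer,
        "metadata": json.dumps(metadata),
    }


def read_metadata(record):
    """Decode the string-encoded metadata back into a dict."""
    return json.loads(record["metadata"])


record = make_record(
    "User query",
    "Expected answer",
    {"session_id": "123", "tool_code": "..."},
)
line = json.dumps(record)          # one line of the JSONL file
restored = read_metadata(json.loads(line))
print(line)
print(restored["session_id"])
```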

Implement Custom Generation Function:

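from slime.utils.types import Sample

MAX_TURNS = 4  # illustrative limit; choose what fits your task

# call_sglang, execute_tool, tokenize, and is_final_answer are placeholders
# for your own helpers; they are not SLIME functions.
async def generate(args, sample: Sample, sampling_params) -> Sample:
    prompt_tokens = tokenize(sample.prompt)
    response_tokens, loss_mask = [], []

    for turn in range(MAX_TURNS):
        # Generate action
        model_output = await call_sglang(...)
        model_tokens = tokenize(model_output)
        response_tokens += model_tokens
        loss_mask += [1] * len(model_tokens)  # Train on actions

        # Stop once the model gives its final answer
        if is_final_answer(model_output):
            break

        # Execute tool
        tool_output = await execute_tool(...)
        tool_tokens = tokenize(tool_output)
        response_tokens += tool_tokens
        loss_mask += [0] * len(tool_tokens)  # Mask tool outputs

    sample.tokens = prompt_tokens + response_tokens
    sample.response_length = len(response_tokens)
    sample.loss_mask = loss_mask  # one entry per response token
    return sample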
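The loss-mask bookkeeping in the function above can be checked in isolation. The sketch below uses made-up token IDs and a hypothetical `build_response` helper to show that model-generated segments get mask value 1, tool outputs get 0, and the mask stays aligned with the response tokens.

```python
def build_response(segments):
    """Concatenate turn segments and build the matching loss mask.

    segments: list of (token_ids, from_model) pairs in conversation order.
    """
    response_tokens, loss_mask = [], []
    for token_ids, from_model in segments:
        response_tokens.extend(token_ids)
        # 1 = train on this token (model action), 0 = ignore (tool output)
        loss_mask.extend([1 if from_model else 0] * len(token_ids))
    return response_tokens, loss_mask


segments = [
    ([11, 12, 13], True),       # model issues a tool call
    ([21, 22], False),          # tool result fed back to the model
    ([31, 32, 33, 34], True),   # model writes the final answer
]
tokens, mask = build_response(segments)
print(tokens)
print(mask)
```

Keeping tokens and mask in one helper makes it harder for the two lists to drift out of sync as turns are appended.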

Configure Custom Functions:

--custom-generate-function-path my_module.generate \
--custom-rm-path my_module.reward_func \
--metadata-key metadata
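The function paths above are dotted strings of the form `module.function`. The sketch below shows how such a string can be turned into a callable with the standard library. It assumes the path is split at its last dot, which is a common convention; `resolve_dotted_path` is a hypothetical helper, so check SLIME's source for the loader it actually uses.

```python
import importlib


def resolve_dotted_path(path):
    """Import `pkg.module` and return attribute `name` for a path `pkg.module.name`."""
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Standard-library targets stand in for my_module.generate here.
dumps = resolve_dotted_path("json.dumps")
sqrt = resolve_dotted_path("math.sqrt")
print(dumps({"a": 1}), sqrt(9.0))
```

Under this convention the module must be importable from the training process, for example through `PYTHONPATH`.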

See examples/search-r1/ for complete example.

3. Dynamic Sampling (DAPO-style)

Filter low-quality samples during generation:

ROLLOUT_ARGS+=(
  --over-sampling-batch-size 64 \
  --rollout-batch-size 32 \
  --dynamic-sampling-filter-path \
    slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
)

How it works:

1. Samples 64 prompts (over-sampling)
2. Filters prompt groups based on reward diversity
3. Keeps only the 32 prompts × 8 samples that pass the filter
4. Automatically resamples if too many groups are filtered out
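The filtering step can be illustrated with a toy version. The sketch below keeps a prompt group only when its rewards are not all identical, and stops once the target number of groups is reached. It is not SLIME's `check_reward_nonzero_std`; the helper names and reward values are made up, and the automatic resampling step is left out.

```python
import statistics


def has_reward_spread(rewards):
    """True when the group's rewards differ, so the group carries a learning signal."""
    return statistics.pstdev(rewards) > 0.0


def select_groups(reward_groups, target):
    """Keep the first `target` prompt groups whose rewards are not all equal."""
    kept = []
    for prompt_id, rewards in reward_groups:
        if has_reward_spread(rewards):
            kept.append(prompt_id)
        if len(kept) == target:
            break
    return kept


# Five over-sampled prompts, four responses each (made-up rewards).
groups = [
    ("p0", [1, 1, 1, 1]),   # all correct: no spread, dropped
    ("p1", [1, 0, 1, 0]),   # mixed: kept
    ("p2", [0, 0, 0, 0]),   # all wrong: no spread, dropped
    ("p3", [0, 1, 0, 0]),   # mixed: kept
    ("p4", [1, 1, 0, 1]),   # mixed, but the target is already met
]
kept = select_groups(groups, target=2)
print(kept)
```

Groups with identical rewards are dropped because a group-relative advantage computed within them would be zero for every sample.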

4. FSDP Backend (No Weight Conversion)

--train-backend fsdp \
--hf-checkpoint /root/Qwen3-4B \
--gradient-checkpointing \
--context-parallel-size 2

Benefits:

- No HF → Megatron weight conversion needed
- Directly load HuggingFace checkpoints
- Simpler setup for supported models

See examples/geo3k_vlm/ and docs/en/get_started/usage.md FSDP section.

5. Multi-Node Training

Start Ray cluster:

# Head node
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8

# Worker nodes
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8

Submit job:

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/"}}' \
  -- python3 train.py \
  --actor-num-nodes 8 \
  --actor-num-gpus-per-node 8 \
  ...
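Hand-writing the `--runtime-env-json` value is a common source of shell quoting mistakes. As a sketch, the command can be assembled in Python so that the JSON and the quoting are generated rather than typed. `build_submit_command` is a hypothetical helper, and the training arguments shown are only the two from the example above.

```python
import json
import shlex


def build_submit_command(env_vars, train_args, address="http://127.0.0.1:8265"):
    """Return a shell-safe `ray job submit` command line as a single string."""
    runtime_env = json.dumps({"env_vars": env_vars})
    cmd = [
        "ray", "job", "submit",
        f"--address={address}",
        f"--runtime-env-json={runtime_env}",
        "--", "python3", "train.py",
        *train_args,
    ]
    return shlex.join(cmd)  # quotes the JSON argument for the shell


command = build_submit_command(
    {"PYTHONPATH": "/root/Megatron-LM/"},
    ["--actor-num-nodes", "8", "--actor-num-gpus-per-node", "8"],
)
print(command)
```

Extra environment variables can be added to the `env_vars` dict without touching the quoting.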

See docs/en/examples/glm4.5-355B-A32B.md for large-scale example.

Customization Guide

Custom Reward Model

Implement async function:

from slime.utils.types import Sample

async def my_reward_func(args, sample: Sample, **kwargs) -> float:
    # Access sample fields
    prompt = sample.prompt
    response = sample.response
    label = sample.label

    # Compute reward; compute_score is a placeholder for your own scoring logic
    reward = compute_score(response, label)
    return float(reward)

Use with: --custom-rm-path module.path.my_reward_func
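As one concrete instance of the reward function pattern above, the sketch below scores a response by exact match against the label after whitespace and case normalization. `ToySample` is a hypothetical stand-in carrying only the three fields used here, so the example runs without SLIME installed; a real reward function would receive SLIME's `Sample`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToySample:
    """Stand-in for slime's Sample with just the fields this example reads."""
    prompt: str
    response: str
    label: str


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


async def exact_match_reward(args, sample, **kwargs) -> float:
    """Return 1.0 when the normalized response equals the normalized label, else 0.0."""
    return 1.0 if normalize(sample.response) == normalize(sample.label) else 0.0


reward_hit = asyncio.run(exact_match_reward(None, ToySample("2 + 2 = ?", " 4 ", "4")))
reward_miss = asyncio.run(exact_match_reward(None, ToySample("2 + 2 = ?", "5", "4")))
print(reward_hit, reward_miss)
```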

Custom Generation Function

Implement async function:

from slime.utils.http_utils import post
from slime.utils.processing_utils import load_tokenizer
from slime.utils.types import Sample

async def my_generate(args, sample: Sample, sampling_params) -> Sample:
    # Load tokenizer
    tokenizer = load_tokenizer(args.hf_checkpoint, trust_remote_code=True)

    # Generate response (call SGLang API or custom logic)
    output = await post(
        f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate",
        {"text": sample.prompt, "sampling_params": sampling_params},
    )

    # Set sample fields
    prompt_tokens = tokenizer(sample.prompt, add_special_tokens=False)["input_ids"]
    response_tokens = tokenizer(output["text"], add_special_tokens=False)["input_ids"]

    sample.tokens = prompt_tokens + response_tokens
    sample.response_length = len(response_tokens)
    sample.response = output["text"]
    # Mark the sample as truncated when generation stopped at the length limit
    sample.truncated = output["meta_info"]["finish_reason"]["type"] == "length"

    return sample

Use with: --custom-generate-function-path module.path.my_generate

Custom Dynamic Filter

Implement filter function:

from slime.utils.types import Sample

def my_filter(args, samples: list[Sample], **kwargs) -> bool:
    # Return True to keep the whole group of samples, False to discard it
    return all(sample.reward > 0.5 for sample in samples)

Use with: --dynamic-sampling-filter-path module.path.my_filter

Examples Reference

For detailed examples and patterns, see references/examples_reference.md.

Quick finder:

- Basic math training: scripts/run-qwen3-4B.sh
- Multi-turn tool use: examples/search-r1/
- Vision-language RL: examples/geo3k_vlm/
- Large-scale MoE: docs/en/examples/glm4.5-355B-A32B.md
- Custom generation: examples/search-r1/search_r1_logic.py
- FSDP backend: examples/geo3k_vlm/

Source Code Reference

For source code exploration, see references/source_code_reference.md.

Key files:

- Arguments: slime/utils/arguments.py
- Rollout: slime/rollout/sglang_rollout.py
- Sample type: slime/utils/types.py
- Reward models: slime/rollout/rm_hub/
- Conversion tools: tools/convert_hf_to_torch_dist.py

Troubleshooting

Common Issues

OOM during colocated training:

- Reduce `--sglang-mem-fraction-static` (try 0.7 or 0.6)
- Reduce `--max-tokens-per-gpu`
- Enable activation recomputation: `--recompute-granularity full` on the Megatron backend (Megatron-LM requires a recompute method with full granularity, e.g. `--recompute-method uniform --recompute-num-layers 1`), or `--gradient-checkpointing` on the FSDP backend

Mismatched batch sizes:

- Ensure: rollout-batch-size × n-samples-per-prompt = global-batch-size × num-steps-per-rollout

Weight conversion errors:

- Check that the model config matches the checkpoint exactly (e.g., `--rotary-base`)
- Use the FSDP backend to skip conversion: `--train-backend fsdp`

Multi-node communication issues:

- Set the environment variables `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`
- See the multi-node section of docs/en/get_started/quick_start.md

SGLang concurrency issues:

- Limit concurrency: `--sglang-server-concurrency 160`
- Increase CUDA graphs: `--sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 256)`
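The `$(seq 16 8 256)` in the last flag is shell command substitution: it expands to every multiple of 8 from 16 through 256. The sketch below reproduces that expansion in Python so the resulting list of batch sizes is easy to inspect. `cuda_graph_batch_sizes` is a hypothetical helper.

```python
def cuda_graph_batch_sizes(small=(1, 2, 4, 8), start=16, step=8, stop=256):
    """Mirror `1 2 4 8 $(seq 16 8 256)`: small sizes, then start..stop inclusive by step."""
    return list(small) + list(range(start, stop + 1, step))


sizes = cuda_graph_batch_sizes()
flag = "--sglang-cuda-graph-bs " + " ".join(str(s) for s in sizes)
print(len(sizes), sizes[:8], sizes[-1])
print(flag)
```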

For more troubleshooting, see docs/en/get_started/qa.md.

Additional Resources

Reference Files

- Doc Navigation: references/doc_navigation.md - Find documentation quickly
- Examples Reference: references/examples_reference.md - Example scripts and patterns
- Source Code Reference: references/source_code_reference.md - Code structure and key functions

External Links

- GitHub Repository: https://github.com/THUDM/slime
- Docker Image: slimerl/slime:latest
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- SGLang: https://github.com/sgl-project/sglang
