一键在 Manus 中运行任何 Skill

verl-rl-training

星标9,996

分支745

更新时间2026年1月29日 01:35

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

安装

用 Codex 或 Claude 帮你安装复制这段 Prompt，粘贴到 Codex、Claude 或其他助手里，让它检查 Skill 页面并帮你完成安装。

在 Manus 中运行

来源

Orchestra-Research

Orchestra-Research/AI-Research-SKILLs

打开 GitHub 仓库查看创作者相关仓库

下载

在 Manus 中运行

相关职业SOC

基于 SOC 职业分类

软件开发工程师计算机与数学类职业·SOC 15-1252

文件资源管理器

3 个文件

SKILL.md

readonly

同仓库更多 Skills

同仓库

model-merging

Orchestra-Research/AI-Research-SKILLs

Merge multiple fine-tuned models using mergekit to combine capabilities without retraining. Use when creating specialized models by blending domain-specific expertise (math + coding + chat), improving performance beyond single models, or experimenting rapidly with model variants. Covers SLERP, TIES-Merging, DARE, Task Arithmetic, linear merging, and production deployment strategies.

2026-06-1610.0k

ara-compiler

Orchestra-Research/AI-Research-SKILLs

Compiles any research input — PDF papers, GitHub repositories, experiment logs, code directories, or raw notes — into a complete Agent-Native Research Artifact (ARA) with cognitive layer (claims, concepts, heuristics), physical layer (configs, code stubs), exploration graph, and grounded evidence. Use when ingesting a paper or codebase into a structured, machine-executable knowledge package, building an ARA from scratch, or converting research outputs into a falsifiable, agent-traversable form.

2026-04-2810.0k

ara-research-manager

Orchestra-Research/AI-Research-SKILLs

Records research provenance as a post-task epilogue, scanning conversation history at the end of a coding or research session to extract decisions, experiments, dead ends, claims, heuristics, and pivots, and writing them into the ara/ directory with user-vs-AI provenance tags. Use as a session epilogue — never during execution — to maintain a faithful, auditable trace of how a research project actually evolved.

2026-04-2810.0k

ara-rigor-reviewer

Orchestra-Research/AI-Research-SKILLs

Performs ARA Seal Level 2 semantic epistemic review on Agent-Native Research Artifacts, scoring six dimensions (evidence relevance, falsifiability, scope calibration, argument coherence, exploration integrity, methodological rigor) and producing a constructive, severity-ranked report with a Strong Accept-to-Reject recommendation. Use after Level 1 structural validation passes, when an ARA needs an objective epistemic critique before publication or release.

2026-04-2810.0k

ml-paper-writing

Orchestra-Research/AI-Research-SKILLs

Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. For systems venues (OSDI, NSDI, ASPLOS, SOSP), use systems-paper-writing instead.

2026-04-1010.0k

presenting-conference-talks

Orchestra-Research/AI-Research-SKILLs

Generates conference presentation slides (Beamer LaTeX PDF and editable PPTX) from a compiled paper with speaker notes and talk script. Use when preparing oral talks, spotlight presentations, or invited talks for ML and systems conferences.

2026-04-1010.0k

name	verl-rl-training
description	Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.
version	1.0.0
author	Orchestra Research
license	MIT
tags	["Reinforcement Learning","RLHF","GRPO","PPO","Post-Training","Distributed Training"]
dependencies	["verl>=0.3.0","torch>=2.0.0","ray>=2.41.0","vllm>=0.8.2","transformers>=4.40.0"]

verl: Volcano Engine Reinforcement Learning for LLMs

verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models like Doubao-1.5-pro achieving O1-level performance on math benchmarks.

When to Use verl

Choose verl when you need:

Production-ready RL training at scale (tested up to 671B parameters)
Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
Multi-turn rollout with tool calling for agentic workflows
Vision-language model RL training

Consider alternatives when:

You need Megatron-native training → use slime or miles
You want PyTorch-native abstractions with Monarch → use torchforge
You only need simple SFT/DPO → use TRL or Axolotl

Key Features

Training backends: FSDP, FSDP2, Megatron-LM
Rollout engines: vLLM, SGLang, HuggingFace Transformers
Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools

Installation

# Option 1: pip install
pip install verl[vllm]  # or verl[sglang] for SGLang backend

# Option 2: Docker (recommended for production)
docker pull verlai/verl:vllm011.latest

# Option 3: From source
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]

Quick Start: GRPO Training

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8

Core Architecture

verl uses a HybridFlow programming model separating control flow from computation:

┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                         │
│ - Orchestrates: rollout → reward → train → sync        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                   │
│ ├── ActorRolloutRefWorker (policy + generation)        │
│ ├── CriticWorker (value estimation, PPO only)          │
│ └── RewardManager (model-based or rule-based rewards)  │
└─────────────────────────────────────────────────────────┘

Workflow 1: Math Reasoning with GRPO

Use this workflow for training reasoning models on math tasks like GSM8K or MATH.

Prerequisites Checklist

GPU cluster with 8+ GPUs (H100 recommended)
Dataset in parquet format with prompt and reward_model columns
Base model from HuggingFace Hub

Step 1: Prepare Dataset

import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"}
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")

Step 2: Define Reward Function

# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract answer from response
        match = re.search(r'\\boxed{([^}]+)}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards

Step 3: Create Training Config

# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8  # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100

Step 4: Launch Training

python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b

Step 5: Monitor and Validate

Check WandB/TensorBoard for loss curves
Verify reward is increasing over steps
Run evaluation on held-out test set

Workflow 2: PPO with Critic Model

Use this workflow when you need value-based advantage estimation (GAE).

Key Differences from GRPO

Requires separate critic model
Uses Generalized Advantage Estimation (GAE)
Better for tasks with dense rewards

Configuration

algorithm:
  adv_estimator: gae  # Use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct  # Can be same or different from actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2  # PPO clipping

Launch with Critic

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8

Workflow 3: Large-Scale Training with Megatron

Use this workflow for models >70B parameters or when you need expert parallelism.

Prerequisites

Install Megatron-LM bridge: pip install mbridge
Convert model to Megatron format
Multi-node cluster with NVLink/InfiniBand

Configuration for 70B+ Models

actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8

Launch Multi-Node

# On head node
ray start --head --port=6379

# On worker nodes
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8

Configuration Reference

Algorithm Selection

Algorithm	`adv_estimator`	Use Case
GRPO	`grpo`	Critic-free, math/reasoning
PPO/GAE	`gae`	Dense rewards, value estimation
REINFORCE++	`reinforce_plus_plus`	Variance reduction
RLOO	`rloo`	Leave-one-out baseline
ReMax	`remax`	Maximum reward baseline
OPO	`opo`	Optimal policy optimization

Key Parameters

# Rollout parameters
actor_rollout_ref.rollout.n: 8              # Samples per prompt
actor_rollout_ref.rollout.temperature: 0.7  # Sampling temperature
actor_rollout_ref.rollout.top_p: 0.95       # Nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6            # Learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2     # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1            # For adaptive KL control

Common Issues and Solutions

Issue: OOM During Rollout

Symptoms: CUDA out of memory during generation phase

Solutions:

# Reduce batch size
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true

Issue: Training Instability

Symptoms: Loss spikes, reward collapse

Solutions:

# Reduce learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0

Issue: Slow Weight Sync

Symptoms: Long pauses between rollout and training

Solutions:

# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true

Issue: vLLM Version Mismatch

Symptoms: Import errors or generation failures

Solution: Use compatible versions:

pip install vllm>=0.8.5,<=0.12.0
# Avoid vLLM 0.7.x (known bugs)

Advanced Topics

Multi-Turn Tool Calling

See references/multi-turn.md for agentic workflows with tool use.

Vision-Language Models

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true

LoRA Training

actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]

Resources

Documentation: https://verl.readthedocs.io/
Paper: https://arxiv.org/abs/2409.19256
GitHub: https://github.com/volcengine/verl
Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
Community: Slack at verl-project