| name | precise-sde-consistent-rl-flow-matching |
| description | Precise — SDE-consistent stochastic sampling for RL post-training of flow-matching models with clean-latent posterior mean freezing. |
Precise: SDE-Consistent RL Flow Matching
Overview
Stochastic sampler design for RL post-training of flow-matching models. Precise decomposes the sampler into two design choices: how much exploration noise to inject, and how faithfully the discretization follows the underlying SDE. It uses "clean-latent posterior mean freezing" to keep sampling SDE-consistent at small step counts.
Core Methodology
Problem
- Flow-matching models use deterministic ODE; RL needs stochastic policy
- Standard reverse-time SDE introduces excess noise at small step counts
- Exploration and denoising stability trade off: more injected noise widens exploration but makes denoising less stable
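To make the ODE/SDE distinction concrete, the following is a sketch under one assumed convention that this note does not state: the rectified-flow interpolant $x_t = (1-t)\,x_0 + t\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, sampled from $t = 1$ down to $t = 0$. Under that assumption, $\mathbb{E}[\epsilon \mid x_t] = x_t + (1-t)\,v(x_t, t)$, so the score is $-\big(x_t + (1-t)\,v\big)/t$, and the deterministic ODE has a family of stochastic counterparts with the same marginals for any noise level $\sigma_t \ge 0$. The symbols $\sigma_t$, $h$ and $z$ are illustrative names, not notation from the source.

```latex
\begin{aligned}
\text{ODE:}\quad & \mathrm{d}x_t = v_\theta(x_t, t)\,\mathrm{d}t \\
\text{SDE:}\quad & \mathrm{d}x_t = \Big[\, v_\theta(x_t, t)
    + \frac{\sigma_t^2}{2t}\big(x_t + (1-t)\,v_\theta(x_t, t)\big) \Big]\mathrm{d}t
    + \sigma_t\,\mathrm{d}\bar{w}_t \\
\text{Euler--Maruyama } (t \to t-h):\quad & x_{t-h} = x_t
    - h\Big[\, v_\theta + \frac{\sigma_t^2}{2t}\big(x_t + (1-t)\,v_\theta\big) \Big]
    + \sigma_t\sqrt{h}\;z, \qquad z \sim \mathcal{N}(0, I)
\end{aligned}
```

The marginals agree only in continuous time. With few steps, $h$ is large and the Euler-Maruyama update no longer reproduces them, which is the excess-noise problem listed above.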
Solution: Precise Framework
- Sampler Decomposition: Exploration amount + discretization faithfulness
- SDE Schedule Design: Balance exploration vs. denoising stability
- Clean-Latent Posterior Mean Freezing: Freeze posterior mean during sampling
- SDE-Consistency: Ensure the discretization matches the continuous SDE even at small step counts
Key Insight
At small step counts, standard samplers add excess discretization noise. Precise freezes the clean-latent posterior mean to keep the denoising trajectory SDE-consistent.
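The effect can be checked on a toy problem. This is an illustration, not the paper's experiment, and it makes several assumptions: the interpolant is x_t = (1 - t) x0 + t eps, the data is a point mass at 0 (so the exact velocity is v = x / t and the target variance at time t is t^2), the noise level `sigma` is constant, and "freezing" is read as holding the clean-latent estimate x0_hat fixed over a step and re-noising around it. The function names are hypothetical. Because everything is Gaussian, the variance can be propagated in closed form instead of by Monte Carlo.

```python
import math


def time_grid(num_steps, t_end):
    """Uniform grid from t = 1 (pure noise) down to t_end."""
    return [1.0 + (t_end - 1.0) * k / num_steps for k in range(num_steps + 1)]


def euler_maruyama_variance(num_steps, sigma, t_end=0.2):
    """Variance of x at t_end after Euler-Maruyama steps on the reverse SDE.

    With x0 = 0 the drift is x/t + sigma^2 * x / (2 t^2), so one step of
    size h multiplies x by `gain` and adds noise of variance sigma^2 * h.
    """
    var = 1.0  # x_1 ~ N(0, 1)
    ts = time_grid(num_steps, t_end)
    for t, s in zip(ts[:-1], ts[1:]):
        h = t - s
        gain = 1.0 - h * (1.0 / t + sigma**2 / (2.0 * t**2))
        var = gain**2 * var + sigma**2 * h
    return var


def frozen_mean_variance(num_steps, sigma, t_end=0.2):
    """Variance at t_end when each step re-noises around a frozen x0_hat.

    Assumed update: x_s = (1 - s) x0_hat + sqrt(s^2 - eta^2) eps_hat + eta z.
    Here x0_hat = 0 and eps_hat = x_t / t.
    """
    var = 1.0
    ts = time_grid(num_steps, t_end)
    for t, s in zip(ts[:-1], ts[1:]):
        eta = min(sigma * math.sqrt(t - s), s)  # injected std, capped at s
        eps_var = var / t**2                    # variance of eps_hat
        var = (s**2 - eta**2) * eps_var + eta**2
    return var


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, euler_maruyama_variance(n, 0.5), frozen_mean_variance(n, 0.5))
```

With 4 steps and `sigma = 0.5`, the Euler-Maruyama variance works out to roughly 0.07 against a target of 0.04, and the gap shrinks as the step count grows. The frozen-mean update hits the target at every step count, but only because the clean-latent estimate is exact in this toy; with a learned model the estimate itself carries error.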
Implementation Steps
- Derive the SDE from the ODE and choose a noise schedule that sets the exploration amount
- Identify clean-latent posterior mean in flow-matching architecture
- Freeze posterior mean during sampling process
- Tune exploration amount based on reward landscape
- Use small step counts for RL efficiency
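The steps above can be sketched as a single sampler update. This is one plausible reading, not the paper's confirmed algorithm, and it rests on assumptions: the interpolant x_t = (1 - t) x0 + t eps, a velocity `v` already computed by the model at `(x_t, t)`, and "freezing" interpreted as holding the clean-latent estimate `x0_hat` fixed across the step. The names `frozen_mean_step`, `gaussian_log_prob` and the exploration parameter `eta` are hypothetical. The log-probability helper is included because policy-gradient style objectives generally need the likelihood of each sampled transition.

```python
import numpy as np


def frozen_mean_step(x_t, v, t, s, eta, rng):
    """One stochastic step from time t to s (s < t), assumed form.

    x0_hat is computed once and held fixed for the whole step; the next
    state is re-noised around it so its conditional std at time s is s.
    """
    x0_hat = x_t - t * v            # clean-latent posterior mean estimate
    eps_hat = x_t + (1.0 - t) * v   # matching noise estimate
    eta = min(eta, s)               # injected std cannot exceed marginal std
    mean = (1.0 - s) * x0_hat + np.sqrt(s**2 - eta**2) * eps_hat
    x_s = mean + eta * rng.standard_normal(x_t.shape)
    return x_s, mean, eta


def gaussian_log_prob(x_s, mean, eta):
    """Log-density of N(mean, eta^2 I) at x_s, summed over all elements.

    Requires eta > 0; a deterministic step has no density.
    """
    d = x_s.size
    sq = np.sum((x_s - mean) ** 2)
    return -0.5 * sq / eta**2 - d * np.log(eta) - 0.5 * d * np.log(2.0 * np.pi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 4))
    v = rng.standard_normal((2, 4))  # stand-in for a model's velocity output
    x_next, mean, eta = frozen_mean_step(x, v, t=0.8, s=0.6, eta=0.3, rng=rng)
    print(gaussian_log_prob(x_next, mean, eta))
```

In this form `eta` alone sets the exploration amount, while the mean term handles consistency with the marginals. Setting `eta = 0` recovers the plain Euler ODE step `x_t + (s - t) * v`, which makes a convenient sanity check for an implementation.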
Applications
- RL post-training for image generation models
- Alignment to human-preference reward models (PickScore, HPSv2.1)
- Flow-matching models with reward optimization
- Text-to-image generation RL
Pitfalls
- Don't: Use the standard reverse-time SDE directly at small step counts
- Check: Posterior mean freezing is implemented correctly
- Monitor: Training time against prior methods (a reduction is expected)
Related Skills
- [[som-score-based-meanflow-policy-optimization]] — MeanFlow one-step policy
- [[daca-grpo-denoising-credit-assignment]] — Denoising-aware GRPO
Activation Keywords
Precise, SDE-consistent sampling, flow-matching RL, stochastic sampler design, posterior mean freezing, diffusion RL post-training, clean-latent freezing, small-step sampling
Source
arXiv:2605.23522 — Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models