| name | oppo-token-credit-assignment |
| description | Oracle-Prompted Policy Optimization (OPPO) for token-level credit assignment in LLM reasoning via Bayesian value recursion |
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
arXiv: 2605.21851 | Submitted 21 May 2026
Core Concept
GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. OPPO (Oracle-Prompted Policy Optimization) solves this via a closed-form Bayesian update that yields token-level advantages without a learned value network or additional rollouts.
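The trajectory-level broadcast described above can be sketched as follows. This is a minimal NumPy illustration of the standard group-normalized GRPO advantage, not code from the paper; the function name `grpo_token_advantages`, the `eps` constant, and the toy rewards are hypothetical.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Group-normalized trajectory advantage, copied to every token.

    rewards: one scalar reward per sampled response in the group.
    lengths: number of tokens in each response.
    """
    rewards = np.asarray(rewards, dtype=float)
    # One scalar per response: reward standardized within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token of response i receives the same scalar adv[i],
    # regardless of whether that token was pivotal or uninformative.
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group of four responses with binary (verifiable) rewards.
token_adv = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 1.0], lengths=[5, 3, 4, 2])
```

Each array in `token_adv` is constant along the token axis, which is the dilution that a token-level method aims to remove.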
Key Methodology
Core Observation
The oracle signal used by distillation-style methods for local (per-token) discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating this signal along a trajectory yields a running estimate of success probability at every position.
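A minimal sketch of such an accumulation, assuming the oracle-prompted scorer's token log-probability stands in for the success-conditioned likelihood log P(y_t | prefix, success). Under that assumption Bayes' rule gives V_t = V_{t-1} · P(y_t | prefix, success) / P(y_t | prefix). The function name, the prior `V_0`, the clipping step, and the toy log-probabilities are illustrative and not taken from the paper, whose exact recursion may differ.

```python
import numpy as np

def running_success_probability(logp_oracle, logp_policy, prior):
    """Running estimate V_t = P(success | first t tokens) via Bayes' rule.

    logp_oracle[t]: log-prob of token t under the oracle-prompted scorer
                    (assumed proxy for log P(y_t | prefix, success)).
    logp_policy[t]: log-prob of token t under the unprompted policy.
    prior:          V_0, the success probability before any token is emitted.
    """
    signal = np.asarray(logp_oracle, dtype=float) - np.asarray(logp_policy, dtype=float)
    # Bayes: V_t = V_{t-1} * P(y_t | success) / P(y_t), so log V accumulates the signal.
    log_v = np.log(prior) + np.cumsum(signal)
    # Clip at probability 1, since an approximate oracle can overshoot.
    return np.exp(np.minimum(log_v, 0.0))

# Toy three-token response: token 2 is favoured by the oracle, token 3 is not.
values = running_success_probability(
    logp_oracle=[-1.0, -0.2, -2.0],
    logp_policy=[-1.0, -0.9, -1.5],
    prior=0.4,
)
```

In this toy run the estimate stays at the prior after the first token (oracle and policy agree), rises after the second, and falls after the third.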
Token-Level Advantage
- A first-order analysis factorizes the advantage into two terms:
- Per-token discrimination signal (the quantity used by distillation methods)
- State weight that concentrates credit on genuinely pivotal tokens
- The resulting estimator carries a directional variance-reduction guarantee
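One way such a factorization can arise is sketched below under assumptions that this summary does not confirm: V_t denotes the running success probability after token y_t, δ_t is the log-ratio between the oracle-conditioned and unconditioned token likelihoods, and the token advantage is taken to be the increment V_t − V_{t−1}. The paper's exact definitions and state weight may differ.

```latex
\begin{align}
\delta_t &= \log \frac{P(y_t \mid y_{<t}, \text{success})}{P(y_t \mid y_{<t})},
&
V_t &= V_{t-1}\, e^{\delta_t}
\\
A_t &:= V_t - V_{t-1} = V_{t-1}\left(e^{\delta_t} - 1\right)
&
&\approx \underbrace{V_{t-1}}_{\text{state weight}} \cdot \underbrace{\delta_t}_{\text{discrimination signal}}
\quad \text{for small } |\delta_t|
\end{align}
```

If the update is instead written in log-odds, with the signal defined as a success-versus-failure log-likelihood ratio, the same first-order expansion gives a weight of V_{t−1}(1 − V_{t−1}), which peaks at states where the outcome is most uncertain.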
Two Estimators
- Self-oracle: Reuses the student model as the scorer; recovers on-policy distillation reward as a strict special case
- Teacher-oracle: Delegates scoring to a stronger frozen model
Cost
- One extra forward pass
- No learned critic/value network
- No additional rollouts
Performance Gains
- Outperforms GRPO, DAPO, and SDPO on 7 math/science/code reasoning benchmarks
- Gains widen monotonically with response length
- Significant gains on AMC'23 and AIME'24
Application Scenarios
- LLM reasoning post-training
- Any RLVR scenario where per-token credit assignment matters
- Long-chain reasoning problems where signal dilution is severe
Activation Keywords
- OPPO, Oracle-Prompted Policy Optimization
- token-level credit assignment
- Bayesian value recursion
- GRPO token advantage dilution
- critic-free RL