| name | delta-discriminative-token-credit-assignment |
| description | DelTA (Discriminative Token Credit Assignment) methodology for Reinforcement Learning from Verifiable Rewards (RLVR). Introduces a discriminator view of RLVR updates showing policy-gradient implicitly acts as a linear discriminator over token-gradient vectors. Proposes token-level coefficient estimation to amplify discriminative directions and downweight shared patterns (e.g. formatting tokens). Outperforms baselines by 3.26 pts on Qwen3-8B and 2.62 pts on Qwen3-14B across math benchmarks. Use when: improving token-level credit assignment in RLVR/GRPO, LLM reasoning RL post-training, reducing influence of formatting tokens in policy gradients. Activation: DelTA, discriminative token credit assignment, token-level RLVR, RLVR discriminator, token coefficient estimation, side-specific token gradients. |
DelTA: Discriminative Token Credit Assignment for RLVR
Paper: arXiv:2605.21467 | Submitted: 20 May 2026
Authors: Kaiyi Zhang, Wei Wu, Yankai Lin
Core Problem
Reinforcement Learning from Verifiable Rewards (RLVR) improves LLM reasoning but has a fundamental limitation: response-level rewards mask token-level credit assignment. Under standard sequence-level RLVR, the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors, determining which token probabilities increase and which decrease. This discriminator is built from positive- and negative-side centroids, each formed by advantage-weighted averaging of token-gradient vectors. The centroids can be dominated by shared high-frequency patterns (e.g., formatting tokens, punctuation), which dilutes the sparse yet discriminative directions that distinguish high-reward from low-reward responses.
Key Insights
1. Discriminator View of RLVR Updates
The policy gradient update direction can be decomposed into:
- Positive-side centroid: Advantage-weighted average of token gradients from high-reward responses
- Negative-side centroid: Advantage-weighted average of token gradients from low-reward responses
- Effective update direction: The difference between these two centroids, which acts as a linear separator over token gradients
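The decomposition above can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the 2-D gradient vectors and the function names (`weighted_centroid`, `update_direction`, `discriminator_score`) are hypothetical, and it assumes each side is normalized separately by its own total weight. A token's score is its dot product with the update direction. To first order, a positive score means a small step along that direction raises the token's log-probability, and a negative score lowers it.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_centroid(grads, weights):
    """Weighted average of token-gradient vectors (weights: |advantage| per token)."""
    total = sum(weights)
    dim = len(grads[0])
    return [sum(w * g[k] for g, w in zip(grads, weights)) / total for k in range(dim)]


def update_direction(pos_grads, pos_weights, neg_grads, neg_weights):
    """Positive-side centroid minus negative-side centroid."""
    c_pos = weighted_centroid(pos_grads, pos_weights)
    c_neg = weighted_centroid(neg_grads, neg_weights)
    return [p - n for p, n in zip(c_pos, c_neg)]


def discriminator_score(token_grad, direction):
    """Linear discriminator: sign tells whether the token is pushed up or down."""
    return dot(token_grad, direction)


# Toy token gradients (hypothetical values, two tokens per side).
pos_grads = [[1.0, 0.0], [1.0, 0.2]]
neg_grads = [[0.0, 1.0], [0.2, 1.0]]
direction = update_direction(pos_grads, [1.0, 1.0], neg_grads, [1.0, 1.0])

print(discriminator_score([1.0, 0.0], direction))  # positive-side token
print(discriminator_score([0.0, 1.0], direction))  # negative-side token
```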
2. The Dilution Problem
When most tokens are shared (formatting, structure tokens), the centroids become dominated by these shared patterns, reducing the contrast between positive and negative sides. This means rare but informative tokens (the actual reasoning steps that differ between correct and incorrect answers) get diluted.
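A toy calculation makes the dilution concrete. The sketch below is illustrative only and assumes equal advantage weights, one side-specific token per side, and `n_shared` identical shared tokens on both sides. All vectors and names are hypothetical. Under these assumptions the centroid difference has norm sqrt(2) / (n_shared + 1), so the contrast shrinks as shared tokens accumulate, while the cosine between the two centroids approaches 1.

```python
import math

# Hypothetical 3-D token gradients: one shared pattern, one direction per side.
SHARED = [0.0, 0.0, 1.0]
POS_ONLY = [1.0, 0.0, 0.0]
NEG_ONLY = [0.0, 1.0, 0.0]


def centroid(grads):
    """Unweighted mean of token-gradient vectors (equal advantage weights assumed)."""
    n = len(grads)
    return [sum(g[k] for g in grads) / n for k in range(len(grads[0]))]


def side_centroids(n_shared):
    """Centroids when each side holds n_shared shared tokens plus one specific token."""
    pos = centroid([SHARED] * n_shared + [POS_ONLY])
    neg = centroid([SHARED] * n_shared + [NEG_ONLY])
    return pos, neg


def contrast_norm(n_shared):
    """Euclidean norm of the centroid difference (the update direction)."""
    pos, neg = side_centroids(n_shared)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(pos, neg)))


def centroid_cosine(n_shared):
    """Cosine similarity between the positive and negative centroids."""
    pos, neg = side_centroids(n_shared)
    inner = sum(p * q for p, q in zip(pos, neg))
    norm_pos = math.sqrt(sum(p * p for p in pos))
    norm_neg = math.sqrt(sum(q * q for q in neg))
    return inner / (norm_pos * norm_neg)


for n_shared in (0, 1, 9, 99):
    print(n_shared, round(contrast_norm(n_shared), 4), round(centroid_cosine(n_shared), 4))
```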
3. DelTA Solution
DelTA estimates token-level coefficients to:
- Amplify token-gradient directions specific to one side (positive or negative)
- Downweight shared or weakly discriminative directions
- Reweight the self-normalized RLVR surrogate loss
This makes the effective side-wise centroids more contrastive, reshaping the RLVR update direction toward truly discriminative tokens.
Algorithm Details
Token Coefficient Estimation
For each token position t in the response:
coefficient_t = f(token_gradient_t, positive_centroid, negative_centroid)
Where f measures how aligned a token's gradient direction is with the difference between positive and negative centroids. Tokens pointing strongly toward one side get higher coefficients; shared-direction tokens get lower coefficients.
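The exact form of f is not given here, so the following is only a stand-in that matches the description above, not the paper's estimator. It assumes f is a small floor plus the absolute cosine between the token gradient and the centroid difference. The name `token_coefficient`, the `floor` parameter, and the toy vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def token_coefficient(token_grad, pos_centroid, neg_centroid, floor=0.1):
    """Hypothetical f: floor + |cos(token_grad, pos_centroid - neg_centroid)|.

    Tokens aligned with either side of the centroid difference score high;
    tokens orthogonal to it (shared patterns) fall back to the floor.
    """
    diff = [p - n for p, n in zip(pos_centroid, neg_centroid)]
    denom = math.sqrt(dot(token_grad, token_grad)) * math.sqrt(dot(diff, diff))
    if denom == 0.0:
        return floor  # degenerate case: zero gradient or identical centroids
    return floor + abs(dot(token_grad, diff)) / denom


# Toy centroids that share a common third component (hypothetical values).
pos_centroid = [1.0, 0.0, 1.0]
neg_centroid = [0.0, 1.0, 1.0]

shared_coef = token_coefficient([0.0, 0.0, 1.0], pos_centroid, neg_centroid)
pos_coef = token_coefficient([1.0, 0.0, 0.0], pos_centroid, neg_centroid)
neg_coef = token_coefficient([0.0, 1.0, 0.0], pos_centroid, neg_centroid)
print(shared_coef, pos_coef, neg_coef)
```

In this toy setup the shared-direction token receives only the floor value, while the two side-specific tokens receive equal, larger coefficients.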
Modified RLVR Objective
The reweighted surrogate loss:
L_DelTA = Σ_t coefficient_t * L_RLVR(t)
Where L_RLVR(t) is the self-normalized RLVR loss at token t.
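The reweighting itself is a coefficient-weighted sum, sketched below with hypothetical per-token loss values. The function name `delta_loss` is illustrative. The per-token losses are taken as given inputs, so the sketch makes no claim about how L_RLVR(t) is computed. In an autograd framework the coefficients would presumably be treated as constants, which is an assumption rather than something stated above.

```python
def delta_loss(coefficients, token_losses):
    """Compute sum_t coefficient_t * L_RLVR(t) for one response."""
    if len(coefficients) != len(token_losses):
        raise ValueError("coefficients and token_losses must have the same length")
    return sum(c * loss for c, loss in zip(coefficients, token_losses))


# Hypothetical per-token losses: two formatting tokens, then one reasoning token.
token_losses = [0.5, 0.5, 2.0]

uniform = delta_loss([1.0, 1.0, 1.0], token_losses)     # plain unweighted sum
reweighted = delta_loss([0.1, 0.1, 1.0], token_losses)  # formatting tokens downweighted

# Share of the total loss carried by the reasoning token, before and after.
print(token_losses[2] / uniform, token_losses[2] / reweighted)
```

With all coefficients equal to 1 the sum reduces to the unweighted loss. Lowering the coefficients on the two formatting tokens increases the share of the loss carried by the reasoning token.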
Experimental Results
| Benchmark | Qwen3-8B Base | Qwen3-14B Base |
|---|---|---|
| Average improvement | +3.26 pts | +2.62 pts |
| GSM8K | ✓ | ✓ |
| MATH | ✓ | ✓ |
| Code generation | ✓ (transfer) | ✓ (transfer) |
| Out-of-domain | ✓ (transfer) | ✓ (transfer) |
- Additional validation on code generation tasks and different backbone architectures
- Strong out-of-domain generalization demonstrated
Relationship to Existing Skills
- [[advantage-collapse-grpo-avspo]] - Addresses different GRPO limitation (advantage collapse vs. token credit dilution)
- [[daca-grpo-denoising-credit-assignment]] - Also improves GRPO credit assignment, but for diffusion models
- [[gcpo-cooperative-policy-optimization]] - Cooperative alternative to GRPO with different mechanism
- [[gaussian-grpo]] - Statistical improvement to GRPO
Implementation Notes
- Compatible with existing GRPO/PPO training pipelines
- Requires access to per-token gradient vectors (modify backward pass)
- Self-normalized loss structure avoids additional hyperparameter tuning
- No significant computational overhead beyond the per-token gradient computation itself
- Implementation available in public repository (see paper)
Use Cases
- LLM Reasoning RL Post-Training: Token-level credit improves math/code reasoning
- RLVR Training Refinement: Drop-in replacement for standard sequence-level RLVR
- Formatting-Robust Training: Reduces sensitivity to response format variations
Activation Keywords
DelTA, discriminative token credit assignment, token-level RLVR, RLVR discriminator, token coefficient estimation, side-specific token gradients, token-gradient centroid, advantage-weighted token gradients, self-normalized RLVR surrogate, formatting token dilution