name	selfplay-data-gating-collapse
description	Analysis of data gating vs reward grounding in self-play RL for LLMs, revealing the Grounded Proposer Paradox and two-stage phase transitions

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

arXiv: 2605.22217 | Submitted 21 May 2026

Core Concept

Self-play RL trains language models on tasks they generate themselves, co-evolving a proposer and a solver without human labels. Recent systems report strong reasoning gains, yet collapse and instability are widely observed. This paper argues that collapse is governed by data-level gating rather than reward design, and it identifies a counterintuitive Grounded Proposer Paradox.

Key Findings

Two Distinct Levers

Data-level gate: Decides which proposer-generated tasks enter the training pool
Reward signal: Updates the policy on tasks already admitted
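The separation between the two levers can be sketched as two independent functions. This is a minimal illustration, not the paper's implementation: the names Task, data_gate, reward_signal, and has_verified_answer are hypothetical, and the admission rule (keep only tasks with a checkable answer) is an assumed stand-in for whatever strict gate a real system uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: Optional[str]  # None when no checkable answer is available


def data_gate(proposed: List[Task], admit: Callable[[Task], bool]) -> List[Task]:
    """Lever 1: decide which proposed tasks enter the training pool."""
    return [task for task in proposed if admit(task)]


def reward_signal(task: Task, solver_output: str) -> float:
    """Lever 2: score the solver on a task that was already admitted."""
    return 1.0 if solver_output == task.answer else 0.0


def has_verified_answer(task: Task) -> bool:
    # Hypothetical strict admission rule, assumed for illustration only.
    return task.answer is not None


proposed = [Task("2 + 2 = ?", "4"), Task("Is this well posed?", None)]
pool = data_gate(proposed, has_verified_answer)        # the gate acts first
rewards = [reward_signal(task, "4") for task in pool]  # the reward sees only admitted tasks
```

The reward function never sees a rejected task, which is why changing the reward cannot compensate for what the gate lets through.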

Asymmetry (Key Result)

A strict gate is sufficient for stability under every reward variant tested
No reward variant is sufficient once the gate is removed

Grounded Proposer Paradox

A proposer with ground-truth access drives collapse faster than an ungrounded one when paired with a self-consistency solver. The reason is that ground-truth access concentrates training on clean tasks, which form the fastest path to a spurious self-consistent attractor.
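To see why a self-consistency solver admits a spurious attractor, it helps to look at a reward that depends only on agreement among samples. The sketch below assumes self-consistency means majority-vote agreement, which the summary does not spell out; the function name self_consistency_rewards is hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def self_consistency_rewards(samples: Sequence[str]) -> List[float]:
    """Reward each sampled answer by whether it matches the majority answer.

    No ground truth is consulted, so a unanimous wrong answer earns full reward.
    """
    if not samples:
        return []
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return [1.0 if sample == majority_answer else 0.0 for sample in samples]


# Suppose the true answer is "8" but every sample says "7":
unanimous_but_wrong = self_consistency_rewards(["7", "7", "7"])
# A dissenting (possibly correct) sample is the one that gets penalized:
mixed = self_consistency_rewards(["7", "8", "7"])
```

Under this reward, a policy that answers consistently is reinforced whether or not it is correct, which is the kind of self-consistent fixed point the paradox refers to.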

Phase Transition

Replacing the binary gate with a continuous strictness parameter ε reveals:

Training-side metrics decouple from validation at low ε (training metrics improve but validation does not)
Validation accuracy holds until ε is much higher
A genuine two-stage phase transition exists in self-play dynamics
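One way to make the gate continuous is sketched below. The parameterization is an assumption, since the summary does not define ε precisely: here epsilon is read as the probability that a task failing the strict check is admitted anyway, so epsilon = 0 recovers the strict binary gate and epsilon = 1 removes the gate. The names soft_gate and passes_strict_check are hypothetical.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def soft_gate(
    tasks: Sequence[T],
    passes_strict_check: Callable[[T], bool],
    epsilon: float,
    rng: random.Random,
) -> List[T]:
    """Admit every task that passes the strict check; leak the rest with probability epsilon."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    admitted = []
    for task in tasks:
        # rng.random() is in [0, 1), so epsilon = 0 never leaks and epsilon = 1 always leaks.
        if passes_strict_check(task) or rng.random() < epsilon:
            admitted.append(task)
    return admitted


def is_even(n: int) -> bool:
    # Toy stand-in for a strict admission check.
    return n % 2 == 0


tasks = list(range(10))
strict_pool = soft_gate(tasks, is_even, 0.0, random.Random(0))
open_pool = soft_gate(tasks, is_even, 1.0, random.Random(0))
```

Sweeping epsilon between these two endpoints while logging training and validation metrics separately is the kind of experiment in which a two-stage transition would become visible.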

Implementation Points

Data filtering (gating) is more critical than reward calibration for self-play stability
Ground-truth access for proposers is not always beneficial and can accelerate collapse
Binary gating should be replaced with continuous gating (ε parameter) for finer control
Monitor both training and validation metrics, which decouple at low ε while validation accuracy still holds
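The monitoring advice above can be sketched as a simple check over logged metrics. This is an illustrative heuristic, not a method from the paper: the function name decoupled_steps, the window size, and the tolerance are all assumptions, and the metric values in the example are made up.

```python
from typing import List, Sequence


def decoupled_steps(
    train: Sequence[float],
    val: Sequence[float],
    window: int = 3,
    tol: float = 0.0,
) -> List[int]:
    """Return step indices where the training metric rose over the last
    `window` steps while the validation metric did not."""
    if len(train) != len(val):
        raise ValueError("train and val must have the same length")
    flagged = []
    for i in range(window, len(train)):
        train_gain = train[i] - train[i - window]
        val_gain = val[i] - val[i - window]
        if train_gain > tol and val_gain <= tol:
            flagged.append(i)
    return flagged


# Made-up curves: training keeps improving, validation stalls and then slips.
train_metric = [0.50, 0.55, 0.60, 0.66, 0.72, 0.80]
val_metric = [0.40, 0.42, 0.44, 0.44, 0.43, 0.41]
flags = decoupled_steps(train_metric, val_metric, window=2)
```

A non-empty result is a prompt to inspect the task pool and the gate, not proof of collapse on its own.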

Application Scenarios

Self-play RL for LLM reasoning improvement
Multi-agent co-evolution training systems
Any system where a proposer and solver co-evolve without human labels

Activation Keywords

self-play RL collapse
data gating vs reward grounding
Grounded Proposer Paradox
self-play phase transition
proposer-solver co-evolution
