agpo-adaptive-group-policy-optimization

AGPO (Adaptive Group Policy Optimization) methodology — a critic-free refinement of GRPO that uses group-level statistics to adaptively control update magnitude and exploration. Uses a shared probe-derived statistical state to drive adaptive clipping (based on reward dispersion, skewness, probe entropy, policy entropy, KL drift) and bidirectional adaptive temperature sampling. Outperforms PPO/GRPO on 9 math/STEM benchmarks with Qwen2.5-14B. Use when: improving GRPO training stability, reducing hyperparameter tuning burden in LLM RL post-training, adaptive exploration for reasoning models. Activation: AGPO, adaptive group policy optimization, adaptive clipping GRPO, bidirectional adaptive temperature, critic-free RLVR, statistical feedback control.

Install command

npx skills add https://github.com/hiyenwong/ai_collection --skill agpo-adaptive-group-policy-optimization

Copy and paste this command into Claude Code to install the skill

Source

hiyenwong/ai_collection

Stars: 1

Forks: 0

Updated: June 4, 2026 at 02:00

SKILL.md

name

agpo-adaptive-group-policy-optimization

description

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Paper: arXiv:2605.20722 | Submitted: 20 May 2026 | Authors: Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Core Problem

PPO and GRPO, as used in LLM reasoning training, rely on a fixed clipping threshold and a fixed decoding temperature, which makes training brittle and demands extensive hyperparameter tuning. As the reward distribution shifts during training (for example, response quality changes as the model improves), those fixed settings become suboptimal.

Key Innovations

1. Shared Probe-Driven Statistical State

AGPO uses a shared probe (a lightweight model head) that provides group-level statistical signals:

Reward dispersion and skewness: How spread out and asymmetric rewards are within a group
Probe vote entropy: Uncertainty of the probe's evaluations
Policy entropy: How diverse the model's token probabilities are
Step-wise KL drift: How much the policy changes per step

These statistics drive two adaptive controllers.
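The signals listed above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the use of population statistics, and the treatment of probe votes as discrete labels are all assumptions made here for clarity.

```python
import math
from collections import Counter

def reward_dispersion_and_skewness(rewards):
    """Population std and skewness of the rewards within one sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        return 0.0, 0.0  # degenerate group: every reward is identical
    skew = sum(((r - mean) / std) ** 3 for r in rewards) / n
    return std, skew

def vote_entropy(votes):
    """Shannon entropy (nats) of the empirical distribution over discrete votes."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in Counter(votes).values())

def mean_token_entropy(token_dists):
    """Average Shannon entropy (nats) over per-token probability distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_dists]
    return sum(ents) / len(ents)

def kl_drift(old_dist, new_dist):
    """KL(old || new) for two categorical distributions on the same support.

    Assumes new_dist is strictly positive wherever old_dist is positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(old_dist, new_dist) if p > 0.0)

# Hypothetical group of 8 binary (verifiable) rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
std, skew = reward_dispersion_and_skewness(group_rewards)
```

With mostly-failed binary rewards like the hypothetical group above, the skewness comes out positive, which is the kind of asymmetry the controllers are described as reacting to.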

2. Adaptive Clipping

Instead of a fixed ε clip parameter (standard in PPO/GRPO), AGPO sets the trust-region size dynamically:

epsilon_t = g(reward_dispersion, skewness, probe_entropy, policy_entropy, KL_drift)

When uncertainty is high → wider clipping (allow more exploration)
When confidence is high → tighter clipping (conservative learning)
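This summary does not give the form of g, so the following is only a sketch under stated assumptions: a weighted sum of the statistics squashed through a sigmoid and mapped into a bounded range. The weights, their signs, the bounds `eps_min` and `eps_max`, and the dictionary keys are all hypothetical. The clipped surrogate itself is the standard PPO/GRPO per-token form with the adaptive value substituted for the fixed ε.

```python
import math

# Hypothetical weights; the negative sign on KL drift is an assumption
# (large drift tightens the clip), not something stated in the source.
HYPOTHETICAL_WEIGHTS = {
    "reward_std": 1.0,
    "abs_skew": 0.5,
    "probe_entropy": 1.0,
    "policy_entropy": 0.5,
    "kl_drift": -2.0,
}

def adaptive_epsilon(stats, weights=HYPOTHETICAL_WEIGHTS, eps_min=0.1, eps_max=0.3):
    """Map group statistics to a clip range in (eps_min, eps_max).

    Illustrative stand-in for g: higher combined uncertainty gives a wider clip.
    """
    score = sum(weights[name] * stats[name] for name in weights)
    gate = 1.0 / (1.0 + math.exp(-score))  # squashes the score into (0, 1)
    return eps_min + (eps_max - eps_min) * gate

def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO/GRPO clipped objective for one token, with a supplied eps."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Two hypothetical statistical states.
uncertain = {"reward_std": 0.5, "abs_skew": 0.5, "probe_entropy": 0.69,
             "policy_entropy": 1.2, "kl_drift": 0.01}
confident = {"reward_std": 0.1, "abs_skew": 0.0, "probe_entropy": 0.05,
             "policy_entropy": 0.3, "kl_drift": 0.01}
eps_uncertain = adaptive_epsilon(uncertain)
eps_confident = adaptive_epsilon(confident)
```

Bounding the output keeps the trust region from collapsing to zero or growing without limit, whatever the statistics do; the actual bounds used by AGPO are not given here.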

3. Bidirectional Adaptive Temperature Sampling

Instead of a fixed decoding temperature, AGPO uses bidirectional adjustment:

temperature_t = base_temperature + delta * centered_uncertainty

Heats (increases temperature) when uncertainty is above a running baseline → more exploration
Cools (decreases temperature) when uncertainty is below baseline → more exploitation
Centered relative to a running baseline for stability
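The update rule above can be sketched as a small stateful controller. The exponential moving average used for the running baseline, the `momentum` value, and the clamping bounds `t_min` and `t_max` are assumptions for illustration; the source only specifies that the uncertainty is centered on a running baseline.

```python
class BidirectionalTemperature:
    """temperature_t = base_temperature + delta * (uncertainty - running baseline)."""

    def __init__(self, base_temperature=1.0, delta=0.5, momentum=0.9,
                 t_min=0.5, t_max=1.5):
        self.base_temperature = base_temperature
        self.delta = delta
        self.momentum = momentum  # EMA factor for the baseline (assumed)
        self.t_min = t_min        # safety bounds (assumed)
        self.t_max = t_max
        self.baseline = None

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty  # first observation seeds the baseline
        centered = uncertainty - self.baseline
        temperature = self.base_temperature + self.delta * centered
        temperature = max(self.t_min, min(self.t_max, temperature))
        # Update the baseline only after it has been used for centering.
        self.baseline = (self.momentum * self.baseline
                         + (1.0 - self.momentum) * uncertainty)
        return temperature

controller = BidirectionalTemperature()
t_start = controller.step(1.0)  # uncertainty equals the baseline: no change
t_hot = controller.step(1.4)    # above the baseline: temperature rises
t_cold = controller.step(0.5)   # below the baseline: temperature falls
```

Because the adjustment is centered, a persistently high uncertainty level is gradually absorbed into the baseline, so the temperature drifts back toward `base_temperature` instead of staying elevated.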

Experimental Results

Model	Benchmark	AGPO	PPO	GRPO
Qwen2.5-14B	GSM8K	67.3%	-	-
Qwen2.5-14B	MATH	40.5%	-	-
Llama-3-8B	Math avg	✓	-	-
Gemma-2-9B	Math avg	✓	-	-

Gains transfer across multiple backbone architectures
Ablations confirm that adaptive clipping and adaptive temperature are complementary
Public implementation available

Relationship to Existing Skills

[[advantage-collapse-grpo-avspo]] - Addresses GRPO advantage collapse via a different mechanism (virtual samples rather than statistical adaptation)
[[gcpo-cooperative-policy-optimization]] - Cooperative GRPO variant addressing exploration collapse
[[d2evo-dual-difficulty-self-evolution]] - Difficulty-aware sample selection (complementary to AGPO)
[[delta-discriminative-token-credit-assignment]] - Token-level credit (complementary: AGPO controls group-level dynamics)
[[gaussian-grpo]] - GRPO improvement via Gaussian modeling

Implementation Notes

Critic-free: No need for a separate critic network (unlike PPO)
Shared probe: Lightweight head, minimal overhead
Running statistics: Maintain running baseline for temperature centering
Complementary to other GRPO improvements: Can be combined with DelTA, D²Evo, etc.
Public implementation: see paper for repository link
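To make the critic-free point concrete, the sketch below shows the group-relative advantage that GRPO-style methods use in place of a learned value function: each reward is standardized against the other samples drawn for the same prompt. This is the standard GRPO baseline, not an AGPO-specific detail; the function name and the `eps` stabilizer are choices made here.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group; no critic network is involved."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every sample receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

A group whose rewards are all equal yields zero advantages and therefore no gradient signal, which is one reason group-level statistics such as reward dispersion are informative for the controllers described earlier.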

Use Cases

LLM Reasoning RL Post-Training: Drop-in replacement for GRPO/PPO
Hyperparameter-Sensitive Training: Reduces tuning burden for clipping and temperature
Training With Shifting Distributions: Adapts to changing reward landscape as model improves

Activation Keywords

AGPO, adaptive group policy optimization, adaptive clipping GRPO, bidirectional adaptive temperature, critic-free RLVR, statistical feedback control, probe-derived statistics, trust-region adaptation, dual statistical feedback, adaptive exploration LLM RL, Qwen2.5 RL training, group-level statistics, reward dispersion clipping, policy entropy adaptation

