rl-reward

// Build RL reward signals using the OpenJudge framework. Covers choosing between pointwise and pairwise reward strategies based on RL algorithm, task type, and cost; aggregating multi-dimensional pointwise scores into a scalar reward; pairwise tournament reward for GRPO on subjective tasks (net win rate across group rollouts); generating preference pairs for DPO/RLAIF; and normalizing scores for training stability. Use when building reward models, scoring rollouts for GRPO/REINFORCE, generating preference data for DPO, or doing Best-of-N selection.
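A minimal sketch of three of the ideas named in the description: aggregating multi-dimensional pointwise scores into a scalar, a pairwise tournament reward computed as net win rate across a group of rollouts, and normalizing rewards within the group. It is written in plain Python and does not use the OpenJudge API; the function names (`aggregate_pointwise`, `net_win_rate`, `normalize`), the weighted-average aggregation, and the `prefer` callback are all assumptions made for illustration. In practice `prefer` would presumably wrap a judge call rather than a local comparison.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, Dict, List, Sequence


def aggregate_pointwise(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-dimension scores into one scalar via a weighted average.

    Assumption: a weighted average is only one possible aggregation rule.
    """
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


def net_win_rate(
    rollouts: Sequence[str],
    prefer: Callable[[str, str], int],
) -> List[float]:
    """Round-robin tournament over one group of rollouts.

    `prefer(a, b)` is a hypothetical judge: 1 if a wins, -1 if b wins, 0 on a tie.
    Each rollout's reward is (wins - losses) / (n - 1), which lies in [-1, 1].
    """
    n = len(rollouts)
    if n < 2:
        return [0.0] * n
    net = [0] * n
    for i, j in combinations(range(n), 2):
        outcome = prefer(rollouts[i], rollouts[j])
        net[i] += outcome
        net[j] -= outcome
    return [value / (n - 1) for value in net]


def normalize(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Z-score rewards within a group so their scale is comparable across groups."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def longer_wins(a: str, b: str) -> int:
    """Toy stand-in for a judge: the longer string wins."""
    return (len(a) > len(b)) - (len(a) < len(b))
```

For example, `net_win_rate(["a", "bbb", "cc"], longer_wins)` gives each rollout a reward based on how many group members it beats minus how many beat it, and `normalize` can then be applied to that list before it is used as a training signal. A round-robin needs n(n-1)/2 judge calls per group, which is the cost side of the pointwise-versus-pairwise trade-off the description mentions.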

$ git log --oneline --stat
stars: 619
forks: 52
updated: March 16, 2026 at 04:23
File explorer
3 files
SKILL.md
readonly