run-rlhf-code-experiment
// Plan, run, and report a small RLHF Book code experiment.
| name | run-rlhf-code-experiment |
| description | Plan, run, and report a small RLHF Book code experiment. |
| allowed-tools | Bash(uv:*), Bash(git:*), Read, Edit |
Use this skill when the user wants to run, adapt, compare, or document an experiment from `code/`.
- `code/policy_gradients/README.md`.
- `code/reward_models/README.md`.
- `code/direct_alignment/README.md`.
- `code/rejection_sampling/README.md`.
- `cd code/`.
- Run `cd code/ && uv sync` only when needed.
- Use `uv run python`, never bare `python`.
- … `--samples` and `--epochs`.
- … `--max_samples` or copy a YAML with a smaller sample count.
- … `data.size` before changing algorithm logic.
- … `max_train_samples`, `max_test_samples`, or `num_completions_per_prompt` in a copied YAML.
- … (`[1 background task] [1 monitor]`). Keep checking the monitor periodically for loss, metrics, W&B URLs, OOMs, dataset download errors, and stalled output.
- Set `WANDB_MODE=disabled` or use the module's no-W&B flag when available.

Report enough detail for another reader to reproduce the result:

- `avg_correctness`, `avg_format`, `avg_binary`, `loss`, and whether sampled groups contain reward contrast.
- `accuracy`, `margins`, `chosen_rewards`, `rejected_rewards`, and sample generations. IPO loss scale is not directly comparable to DPO loss scale.

If the run exposes a new setup requirement, failure mode, or useful workflow shortcut, update the relevant README, `code/CLAUDE.md`, or this skill before finishing.
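The run guidance above (launch through `uv run python`, shrink the run with flags, disable W&B) can be sketched as a small helper. This is only an illustration: `smoke_test_command` and the script path `path/to/train.py` are hypothetical names, and which module accepts `--samples` and `--epochs` is an assumption to check against the relevant README.

```python
import os
import shlex


def smoke_test_command(script, overrides):
    """Build a `uv run python` command and an environment with W&B disabled.

    `script` and the override flags are placeholders chosen for illustration;
    substitute the entry point and flags documented in the module's README.
    """
    # Always launch through uv, never bare python.
    cmd = ["uv", "run", "python", script]
    for flag, value in overrides.items():
        cmd += [flag, str(value)]
    # Copy the current environment and switch W&B logging off for the run.
    env = dict(os.environ, WANDB_MODE="disabled")
    return cmd, env


# Hypothetical small run: the script path and flag values are assumptions.
cmd, env = smoke_test_command("path/to/train.py", {"--samples": 64, "--epochs": 1})
print(shlex.join(cmd))
```

To actually launch the run, the pair could be passed to `subprocess.run(cmd, env=env, cwd="code")`, which keeps the working directory inside `code/` as the skill expects.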