원클릭으로 Manus에서 모든 스킬 실행

$pwd:

train-rl

Name: Train Rl
Author: OpenPipe

// RL training reference for the ART framework. Use when the user asks to create, write, or help with an RL training script, reinforcement learning, GRPO, reward functions, RULER scoring, rollout functions, or anything related to RL fine-tuning.

Manus에서 실행

$ git log --oneline --stat

stars:9,870

forks:876

updated:2026년 3월 27일 20:16

SKILL.md

readonly

related-skills.json

같은 저장소

fix-art-issues.md

from "OpenPipe/ART"

Fix a GitHub issue on OpenPipe/ART and open a PR. Use when the user asks to fix, solve, or work on an ART issue, or references a GitHub issue URL containing "OpenPipe/ART". Triggers: "fix ART issue", "solve this issue" with an OpenPipe/ART URL, "work on ART #N".

2026-02-289.9k

train-sft.md

from "OpenPipe/ART"

SFT training reference for the ART framework. Use when the user asks to create, write, or help with an SFT training script, fine-tune a model, train from a JSONL dataset, do distillation, or anything related to supervised fine-tuning.

2026-02-189.9k

package.json

"author": "OpenPipe"

"repository": "OpenPipe/ART"

GitHub 저장소 열기 Creator 저장소 보기

$ install --global

$ download --local

Manus에서 실행

$ useful --forSOC

데이터 과학자컴퓨터 및 수학직15-2051L4

소프트웨어 개발자L4

name	train-rl
description	RL training reference for the ART framework. Use when the user asks to create, write, or help with an RL training script, reinforcement learning, GRPO, reward functions, RULER scoring, rollout functions, or anything related to RL fine-tuning.

ART RL Workflow

Use this skill when the user wants an RL script or help adapting an existing ART agent for RL.

Keep the process simple:

Inspect the existing environment or agent first.
Ask one question at a time until the blocking decisions are resolved.
Generate a runnable script that reuses the real rollout/tool logic instead of approximating it.

This skill is an interactive wizard. Do not write the script immediately.

Rules:

Ask one question at a time.
Wait for the user's answer before asking the next question.
Use the repo to make recommendations, but still ask the user to confirm every required choice.
Do not skip required questions just because the code suggests a likely answer.
Do not generate the final script until all required questions below have been answered.

Blocking Decisions

You must resolve these before writing the final script:

Is the environment replayable?
Is this multi-turn RL or single-turn training on static examples?
Is reward programmatic, RULER, or custom?
Is the backend ServerlessBackend or LocalBackend?

Required Questions

You must collect answers for all of these, one at a time, before generating code:

Replayability confirmation
What behavior the agent should optimize
Where training scenarios come from
If multi-turn: how turns work and when the episode ends
Reward choice
If RULER: judge model
Training split or source
Validation split or source
Base model
Project name
Run name
Backend
Hyperparameters: use defaults or customize
Iteration mode: fixed-dataset epoch loop or manual/open-ended loop

If the repo already makes an answer likely, present that as a recommendation and ask the user to confirm or correct it. That still counts as a question and still requires a user response.

1. Replayability

Inspect the repo before asking.

Recommend multi-turn RL when episodes can be recreated and rolled out repeatedly from the same initial state.
Recommend single-turn static training when the task depends on live humans, mutable production systems, or other unreproducible state.

If replayability is clear, say so and ask for confirmation. Example:

This looks replayable because each episode starts from fixed local state and the tools only read from it, so I recommend multi-turn RL. Please confirm there is no hidden live dependency.

If it is not clear, ask whether the task has a replayable environment or only logged/static scenarios.

2. Task Shape

Ask only for the details needed to implement the rollout, but do not skip the required task questions.

Always gather:

What the agent is supposed to accomplish.
Where training scenarios come from.

For multi-turn tasks, also gather:

What the observation is on each turn.
What actions or tools the agent can take.
When the episode ends.

When adapting an existing agent:

Preserve its prompt and tool behavior by default.
Reuse its real tool execution path and message schema.
Prefer storing structured terminal outputs such as final_answer directly on the trajectory when useful.
If the existing environment already has a natural typed terminal object or evaluation artifact, keep it in the trajectory or structured logs instead of reducing everything to free-form text.

3. Reward Choice

Start from RULER as the default.

Use this rule:

Choose programmatic reward only when correctness is robustly checkable with code.
Choose RULER for open-text answers or tool-use behavior where exact matching is brittle.
Choose custom only when the task genuinely mixes multiple reward sources.

Explain RULER briefly once:

RULER is an LLM judge that compares trajectories within a group and scores which ones are better.

If the user chooses programmatic reward:

Put the score in trajectory.reward.
Keep extra signals in trajectory.metrics.
Do not invent weak heuristic rewards.
If the environment already exposes robust auxiliary signals such as correctness, source overlap, pass/fail, completion rate, or tool-error rate, keep logging them even when they are not the main reward.

If the user chooses RULER:

Prefer ruler_score_group(...) with the default rubric.
Recommend OPENAI_API_KEY validation at startup.
Recommend openai/gpt-5.4 as the default judge model.
Ask which judge model to use before generating the script.

4. Data Splits and Validation

For fixed datasets:

Prefer a held-out validation split.
Prefer capped periodic validation during training.
Do not run a full held-out pass at step 0 unless the user asks for it.

For validation:

Log validation groups at a concrete training step.
Make sure the metric used for checkpoint cleanup is actually logged.
With the default await model.delete_checkpoints(), validation must produce val/reward.

5. Base Parameters

Ask for, explicitly and separately:

Base model
Project name
Run name
Backend

Do not present a single "recommended starting point" model by default. Offer all allowed base models:

OpenPipe/Qwen3-14B-Instruct
Qwen/Qwen3-30B-A3B-Instruct-2507
meta-llama/Llama-3.1-8B-Instruct

Environment requirements:

ServerlessBackend: require WANDB_API_KEY
RULER: require OPENAI_API_KEY

6. Hyperparameters

Ask whether to use these starting defaults or customize them:

Learning rate: 1e-5
Rollouts per group: 4
Groups per step: 2

Iteration defaults:

For fixed datasets, prefer iterate_dataset(..., initial_step=await model.get_step()).
For non-fixed/open-ended generation, use a manual step loop.

Implementation Guardrails

These are the main ART-specific rules that matter in practice:

Reuse the real environment/agent entrypoints when they already exist.
Prefer building Trajectory.messages_and_choices directly for multi-turn tool use.
Use backend.train(model, trajectory_groups, ...) plus await model.log(...).
Call await backend.close() before exit.
Catch recoverable inference or tool errors after the trajectory has started and return a partial trajectory with numeric or bool metrics.
Pass art.TrajectoryGroup(...) awaitables directly into art.gather_trajectory_groups(...). Do not await them early.
If using RULER rescoring, prefer after_each=lambda group: ruler_score_group(...).
Preserve group.exceptions if you rebuild groups after rollout.
Default max_exceptions to scale with the active batch size, typically args.rollouts_per_group * len(batch.items) for training and the analogous validation batch size. Do not hard-code a small fixed value unless the user explicitly wants that.
For fixed datasets, make resume behavior explicit with initial_step=await model.get_step().
Delete old checkpoints by default after successful training and validation logging unless the user wants to keep all of them.

Minimal Training Pattern

Use this as the default pattern for fixed datasets with RULER:

from art.rewards import ruler_score_group
from art.utils.iterate_dataset import iterate_dataset


async def rollout(model: TrainableModel, scenario: Scenario) -> art.Trajectory:
    ...


for batch in iterate_dataset(
    train_scenarios,
    groups_per_step=args.groups_per_step,
    num_epochs=args.num_epochs,
    initial_step=await model.get_step(),
):
    train_groups = await art.gather_trajectory_groups(
        [
            art.TrajectoryGroup(
                (rollout(model, scenario) for _ in range(args.rollouts_per_group)),
                metadata={"scenario_id": scenario.id},
            )
            for scenario in batch.items
        ],
        after_each=lambda group: ruler_score_group(
            group,
            judge_model=args.judge_model,
        ),
        max_exceptions=args.rollouts_per_group * len(batch.items),
    )

    train_result = await backend.train(
        model,
        train_groups,
        learning_rate=args.learning_rate,
    )

    await model.log(
        train_groups,
        metrics=train_result.metrics,
        step=train_result.step,
        split="train",
    )

    if should_validate(train_result.step):
        val_groups = await art.gather_trajectory_groups(
            [
                art.TrajectoryGroup(
                    (rollout(model, scenario) for _ in range(args.rollouts_per_group)),
                    metadata={"scenario_id": scenario.id},
                )
                for scenario in validation_scenarios
            ],
            after_each=lambda group: ruler_score_group(
                group,
                judge_model=args.judge_model,
            ),
            max_exceptions=args.rollouts_per_group * len(validation_scenarios),
        )
        await model.log(
            val_groups,
            metrics={"reward": ...},
            step=train_result.step,
            split="val",
        )
        await model.delete_checkpoints()

Final Script Requirements

Every generated script should:

Validate required environment variables early.
Follow the repo's existing env-loading convention.
Use the selected backend.
Print the final inference model name and a short usage example.
Close the backend cleanly.

If you fail to find enough information from the repo, say what is missing and ask the next single blocking question. Do not fabricate environment behavior, reward logic, or dataset structure.

train-rl

이 저장소의 다른 Skills

이 저장소의 다른 Skills

ART RL Workflow

Blocking Decisions

Required Questions

1. Replayability

2. Task Shape

3. Reward Choice

4. Data Splits and Validation

5. Base Parameters

6. Hyperparameters

Implementation Guardrails

Minimal Training Pattern

Final Script Requirements

ART RL Workflow

Blocking Decisions

Required Questions

1. Replayability

2. Task Shape

3. Reward Choice

4. Data Splits and Validation

5. Base Parameters

6. Hyperparameters

Implementation Guardrails

Minimal Training Pattern

Final Script Requirements