rewardkit
Write Harbor task verifiers using Reward Kit. Use when creating or editing a task's tests/ directory, adding grading criteria, setting up LLM/agent judges, or designing verifiers that produce a reward score.
Help the user write task verifiers with Reward Kit. Reward Kit is a lightweight Python package that turns a directory of criteria files into a reward score. Each criterion is a Python function call or a TOML judge file; folders become separate rewards.
Put criteria alongside test.sh in the task's tests/ directory:
tests/
├── test.sh
├── checks.py # programmatic criteria
└── judge.toml # optional LLM/agent judge
tests/test.sh:
#!/bin/bash
uvx --from 'harbor-rewardkit==0.1.*' rewardkit /tests
This runs all criteria in /tests/ against the workspace at /app and writes
/logs/verifier/reward.json. Defaults match Harbor's conventions — no extra config needed.
If judge criteria need API keys, pass them through task.toml:
[verifier.env]
ANTHROPIC_API_KEY = "${ANTHROPIC_API_KEY}"
Ask whether Reward Kit should run in the agent's shared environment or in a separate verifier environment. Prefer a separate verifier environment when judge prompts, grading dependencies, API keys, or clean-room checks should not be available to the agent:
[environment]
network_mode = "no-network" # Agent env baseline — offline during agent.run()
[verifier]
environment_mode = "separate"
[verifier.environment]
network_mode = "public" # Verifier env baseline — LLM judge API calls
docker_image = "python:3.12-slim"
In shared mode, the verifier runs in the agent container and inherits
[environment].network_mode. Set [verifier].network_mode only when verify()
needs different network access than the agent phase (a phase override, not a
baseline). If agent and verifier need different baselines without runtime
switching, use environment_mode = "separate" and set
[verifier.environment].network_mode.
Judge criteria that call external APIs need a public baseline or allowlist on
the verifier environment. Programmatic checks that only read local files can use
no-network.
In separate mode, tests/ is the verifier image build context and must provide
/tests/test.sh at runtime; Harbor does not upload tests/ into the running
verifier container.
Call built-ins from any .py file in tests/:
import rewardkit as rk
rk.file_exists("output.txt")
rk.file_contains("output.txt", "hello")
rk.command_succeeds("python main.py", weight=2.0)
rk.json_key_equals("result.json", "status", "ok")
All criteria accept weight (default 1.0) and isolated (default False; when True,
the criterion runs in an overlayfs so its side effects don't leak).
file_exists, file_not_exists, file_contains, file_contains_regex, file_matches, files_equal, diff_ratio
command_succeeds, command_output_contains, command_output_matches, command_output_matches_regex (30s default timeout, optional cwd)
json_key_equals, json_path_equals, csv_cell_equals, xlsx_cell_equals (needs [office] extra), sqlite_query_equals
http_status_equals, http_response_contains
image_similarity, image_size_equals (needs [image] extra)
trajectory_tool_used, trajectory_tool_not_used, trajectory_turn_count
For extras, install with uv tool install harbor-rewardkit[all].
Use the @criterion decorator to define custom criteria. The first parameter is
always workspace: Path, and the function returns bool or float:
from pathlib import Path

from rewardkit import criterion


@criterion
def has_valid_output(workspace: Path) -> bool:
    return (workspace / "output.txt").read_text().strip() != ""
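The criterion above raises FileNotFoundError when output.txt is missing, and this section does not say how Reward Kit treats an exception raised inside a criterion. A defensive variant that returns False instead may be safer. The sketch below is plain Python with the @criterion decorator left off so that it runs without the package installed; the names has_nonempty_output and demo are hypothetical.

```python
import tempfile
from pathlib import Path


def has_nonempty_output(workspace: Path) -> bool:
    """Return True only if output.txt exists and holds non-whitespace text."""
    target = workspace / "output.txt"
    if not target.is_file():
        return False
    return target.read_text().strip() != ""


def demo() -> tuple[bool, bool, bool]:
    """Exercise the check against a throwaway workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        missing = has_nonempty_output(workspace)  # no file yet
        (workspace / "output.txt").write_text("  \n")
        blank = has_nonempty_output(workspace)  # whitespace only
        (workspace / "output.txt").write_text("hello\n")
        filled = has_nonempty_output(workspace)  # real content
    return missing, blank, filled
```

Testing the function body against a temporary directory first keeps the verifier loop short; the decorator can be added once the logic behaves as intended.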
Criteria with no parameters beyond workspace auto-register. Criteria with extra arguments must be called via rk:
@criterion(description="output has at least {n} lines")
def has_n_lines(workspace: Path, n: int) -> bool:
    return len((workspace / "output.txt").read_text().splitlines()) >= n

rk.has_n_lines(10, weight=2.0)
rk.has_n_lines(50, weight=1.0)
For criteria shared across reward subdirs, define with shared=True in a root-level file
and call from subdirs.
For subjective checks (quality, readability, edge cases), create a TOML file:
[judge]
judge = "anthropic/claude-sonnet-4-6" # LiteLLM model string
files = ["/app/main.py"]
[[criterion]]
description = "Is the code correct?"
type = "binary"
[[criterion]]
description = "How readable is the code?"
type = "likert"
points = 5
weight = 2.0
Criterion types:
binary: yes/no → 1.0 or 0.0
likert: 1..points, normalized to [0, 1]
numeric: min..max, normalized to [0, 1]
Agent judges shell out to a CLI and can explore the filesystem:
[judge]
judge = "claude-code"
model = "anthropic/claude-sonnet-4-6"
isolated = true
[[criterion]]
description = "Does the solution handle edge cases?"
type = "binary"
Slower and more expensive than LLM judges, but they can run commands and inspect files.
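For either kind of judge, the criterion types listed earlier turn a raw answer into a score in [0, 1]. The exact formula Reward Kit uses is not given in this section; the sketch below assumes plain linear scaling between the endpoints, and all three function names are hypothetical.

```python
def normalize_binary(answer: bool) -> float:
    """Map a yes/no answer to 1.0 or 0.0."""
    return 1.0 if answer else 0.0


def normalize_likert(score: int, points: int) -> float:
    """Map a 1..points rating onto [0, 1] (assumed linear scaling)."""
    if points < 2:
        raise ValueError("points must be at least 2")
    if not 1 <= score <= points:
        raise ValueError(f"score must be in 1..{points}")
    return (score - 1) / (points - 1)


def normalize_numeric(value: float, low: float, high: float) -> float:
    """Map a low..high value onto [0, 1] (assumed linear scaling)."""
    if high <= low:
        raise ValueError("high must exceed low")
    if not low <= value <= high:
        raise ValueError(f"value must be in {low}..{high}")
    return (value - low) / (high - low)
```

Under this assumption, a rating of 4 on a 5-point likert criterion would score 0.75.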
[judge] options
timeout (default 300), reasoning_effort (low|medium|high), reference (path to
reference solution), atif-trajectory (evaluate the agent's trajectory), weight,
prompt_template (custom prompt with {criteria} placeholder).
[scoring]
aggregation = "all_pass" # weighted_mean | all_pass | any_pass | threshold
threshold = 0.7 # only for threshold
Only affects aggregation within this TOML file.
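The four aggregation names can be read as follows. This is a sketch under stated assumptions rather than Reward Kit's implementation: it assumes a criterion "passes" only with a full score of 1.0, and that threshold compares the weighted mean against the configured value. The function name aggregate is hypothetical.

```python
def aggregate(
    scores: list[float],
    weights: list[float],
    mode: str,
    threshold: float = 0.7,
) -> float:
    """Combine per-criterion scores under assumed aggregation semantics."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_mean = sum(s * w for s, w in zip(scores, weights)) / total_weight

    if mode == "weighted_mean":
        return weighted_mean
    if mode == "all_pass":
        # Assumption: a criterion passes only with a full score of 1.0.
        return 1.0 if all(s >= 1.0 for s in scores) else 0.0
    if mode == "any_pass":
        return 1.0 if any(s >= 1.0 for s in scores) else 0.0
    if mode == "threshold":
        # Assumption: the weighted mean is compared against the threshold.
        return 1.0 if weighted_mean >= threshold else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")
```

With a binary criterion scoring 1.0 at weight 1.0 and a likert criterion scoring 0.75 at weight 2.0, the weighted mean is 2.5 / 3, which clears a 0.7 threshold but fails all_pass.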
Put criteria in subdirectories — each becomes a separate reward:
tests/
├── test.sh
├── correctness/
│ └── check.py
├── structure/
│ └── files_exist.py
└── quality/
    └── quality.toml
Produces:
{ "correctness": 0.75, "structure": 1.0, "quality": 0.6 }
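A reward file shaped like the example above is a flat JSON object, so it is easy to inspect after a run. The sketch below writes the sample scores to a temporary file and finds the weakest reward; lowest_reward is a hypothetical helper, and in a real run the path would be /logs/verifier/reward.json.

```python
import json
import tempfile
from pathlib import Path


def lowest_reward(reward_path: Path) -> tuple[str, float]:
    """Return the (name, score) pair with the smallest score."""
    rewards = json.loads(reward_path.read_text())
    name = min(rewards, key=rewards.get)
    return name, rewards[name]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for /logs/verifier/reward.json, using the sample scores.
    sample = Path(tmp) / "reward.json"
    sample.write_text(
        json.dumps({"correctness": 0.75, "structure": 1.0, "quality": 0.6})
    )
    weakest = lowest_reward(sample)
```

Sorting or filtering by score this way gives a quick view of which reward subdirectory needs attention.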
/logs/verifier/reward.json: per-reward scores
/logs/verifier/reward-details.json: per-criterion results, judge reasoning, errors
In a multi-step task, each step has its own tests/ under
steps/{name}/tests/, and the verifier runs once per step. Reward Kit behaves
the same as in a single-step task: for each step it reads /tests, runs the
criteria against /app, and writes /logs/verifier/reward.json for that step.
Harbor then aggregates per-step results into a trial-level reward via
multi_step_reward_strategy in task.toml — aggregation happens outside
Reward Kit, so don't try to encode cross-step logic in your criteria.
A task-level tests/ directory (at the task root) is uploaded to /tests
first, then the step's own tests/ is layered on top (same-name files win).
Put shared helpers (common checks.py functions with shared=True, fixture
files, a fallback test.sh) at the task level, and step-specific criteria
under each step.
Multi-reward subdirectories still work within a step: steps/foo/tests/
can contain correctness/, structure/, quality/ — each produces a
separate reward key for that step, and multi_step_reward_strategy = "mean"
averages each key across steps. Use "final" when the last step is an
end-to-end check whose rewards already represent the full task.
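The difference between "mean" and "final" can be sketched outside Harbor. The function combine_steps is hypothetical and only mirrors the behavior described above; how Harbor handles a reward key that appears in some steps but not others is not stated here, so this sketch requires every step to report the same keys.

```python
def combine_steps(
    step_rewards: list[dict[str, float]], strategy: str
) -> dict[str, float]:
    """Combine per-step reward dicts the way the text describes."""
    if not step_rewards:
        raise ValueError("at least one step is required")
    if strategy == "final":
        # Only the last step's rewards count.
        return dict(step_rewards[-1])
    if strategy == "mean":
        keys = step_rewards[0].keys()
        if any(step.keys() != keys for step in step_rewards):
            # Assumption of this sketch, not a documented Harbor rule.
            raise ValueError("this sketch expects identical reward keys in every step")
        return {
            key: sum(step[key] for step in step_rewards) / len(step_rewards)
            for key in keys
        }
    raise ValueError(f"unknown strategy: {strategy}")


steps = [
    {"correctness": 0.5, "quality": 1.0},
    {"correctness": 1.0, "quality": 0.5},
]
mean_rewards = combine_steps(steps, "mean")
final_rewards = combine_steps(steps, "final")
```

Here "mean" yields 0.75 for both keys, while "final" reports only the second step's scores.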
Use @criterion when logic is task-specific but still programmatic.
Use isolated=True for any criterion that runs mutating commands, so it doesn't
corrupt the workspace for other criteria.
See examples/tasks/reward-kit-example/ in the Harbor repo.