designing-bench-task
| name | designing-bench-task |
| description | Use when designing a new bench_env task suite, adding several new tasks to an existing suite, or critiquing a task-set proposal for a mobile-gym App — before any `class FooTask(...)` is written under `bench_env/task/`. |
Rushing from "here's an App" to "here are 5 task classes" produces low-difficulty suites whose judge logic can't actually verify completion. Design must precede code.
Authoritative reference: `bench_env/docs/task/TASK_AUTHORING_GUIDE.md` (reading §1 + §2 once is required; this skill enforces its gates).
Produce both artifacts, the audit table and the per-task pre-simulation answers, as plain text in the conversation before writing any task class. If you catch yourself opening `tasks.py` / `defs/<TaskName>.py`, stop and produce them first.
The audit is a table with one row per distinct feature area. Columns:

| Page/feature | Source file(s) | User-visible actions | Observable state path |
|---|---|---|---|
You must actually read: `manifest.ts`, `navigation.declaration.ts`, `data/defaults.json`, `state.ts`, `pages/*`, and the suite's `app.py` accessor if it exists. No skipping "because the app looks simple."
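One way to keep the audit honest is to hold each row as structured data and render the table from it, so an empty cell is obvious. The following is a minimal sketch only: `AuditRow`, `render_audit`, and the example "Cart" row (its page, actions, and state path) are hypothetical and not part of bench_env; only the four column headers come from this skill.

```python
from dataclasses import dataclass


@dataclass
class AuditRow:
    """One audit row; the fields mirror the four audit columns (names are hypothetical)."""
    feature: str
    source_files: list[str]
    actions: list[str]
    state_path: str


def render_audit(rows: list[AuditRow]) -> str:
    """Render audit rows as a markdown table under the audit column headers."""
    lines = [
        "| Page/feature | Source file(s) | User-visible actions | Observable state path |",
        "|---|---|---|---|",
    ]
    for row in rows:
        lines.append(
            f"| {row.feature} | {', '.join(row.source_files)} "
            f"| {'; '.join(row.actions)} | {row.state_path} |"
        )
    return "\n".join(lines)


# Hypothetical example row: the page, actions, and state path are invented.
rows = [
    AuditRow(
        feature="Cart",
        source_files=["pages/cart.tsx", "state.ts"],
        actions=["add item", "remove item"],
        state_path="state.cart.items",
    )
]
audit_table = render_audit(rows)
print(audit_table)
```

The rendered text can be pasted into the conversation as the audit artifact; a row whose `state_path` cannot be filled in is a sign that the feature has no observable state for a judge to check.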
For every function you plan to parameterize, confirm defaults.json / state.ts provides ≥3 varied entries. If it doesn't, either propose expanding defaults, or drop parameterization for that function.
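The "≥3 varied entries" check can be done mechanically once the seed data is loaded. The sketch below assumes `defaults.json` holds lists of flat records keyed by collection name; that shape, the `contacts` collection, and the helper names `distinct_values` and `can_parameterize` are all assumptions for illustration, not the real app schema.

```python
import json


def distinct_values(entries: list[dict], field: str) -> set[str]:
    """Collect the distinct values a field takes across seed entries.

    Values are serialized with json.dumps so that lists and dicts can be compared.
    """
    return {json.dumps(e[field], sort_keys=True) for e in entries if field in e}


def can_parameterize(entries: list[dict], field: str, minimum: int = 3) -> bool:
    """True if the field has at least `minimum` distinct values in the seed data."""
    return len(distinct_values(entries, field)) >= minimum


# Stand-in for the parsed contents of data/defaults.json (assumed shape).
defaults = {
    "contacts": [
        {"name": "Ana", "group": "work"},
        {"name": "Bo", "group": "work"},
        {"name": "Cy", "group": "family"},
    ]
}

name_ok = can_parameterize(defaults["contacts"], "name")    # three distinct names
group_ok = can_parameterize(defaults["contacts"], "group")  # only two distinct groups
print(name_ok, group_ok)
```

In this invented data, `name` has enough variety to parameterize on while `group` does not, so a task parameterized by group would need expanded defaults or a fixed value.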
For each proposed task, answer each of the 4 pre-simulation questions in 1-2 lines before writing code (this is the soundness/completeness audit later enforced by `TASK_AUTHORING_GUIDE.md` §2.7 "Reliability requirements").
If any answer surfaces a flaw (common: initial state already equals criteria; ground truth not unique; answer requires subjective judgement), iterate the design in text — do not defer the fix to code review.
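Two of the flaws named above, initial state already equal to the criteria and non-unique ground truth, can be illustrated with a few lines of code. This is a rough sketch under assumptions: state is modelled as a flat dict, and `init_equals_goal`, `ground_truth_candidates`, and the sample data are hypothetical, not bench_env APIs.

```python
from typing import Callable


def init_equals_goal(initial_state: dict, goal: dict) -> bool:
    """True if every goal key already holds its target value before the agent acts."""
    return all(k in initial_state and initial_state[k] == v for k, v in goal.items())


def ground_truth_candidates(records: list[dict], predicate: Callable[[dict], bool]) -> list[dict]:
    """Return every record that satisfies the task's answer criterion."""
    return [r for r in records if predicate(r)]


# Hypothetical task 1: "turn on Wi-Fi", but the seed state already has it on.
initial_state = {"wifi_enabled": True, "volume": 3}
goal = {"wifi_enabled": True}
trivially_solved = init_equals_goal(initial_state, goal)

# Hypothetical task 2: "what is the label of the 6 o'clock alarm?"
alarms = [
    {"label": "Gym", "hour": 6},
    {"label": "Work", "hour": 6},
    {"label": "Nap", "hour": 14},
]
candidates = ground_truth_candidates(alarms, lambda a: a["hour"] == 6)
ambiguous = len(candidates) != 1

print(trivially_solved, ambiguous)
```

Both flags come out true for this invented data: the first task would pass without any action, and the second has two valid answers. Either result should send the design back to text before any task class is written.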
| Excuse | Reality |
|---|---|
| "App is tiny, audit is overkill" | Audit surfaces the data gap so you can close it before writing code. |
| "Judge predict is slow, I'll see issues when coding" | Design bugs (init=goal, non-unique ground truth) are 10× cheaper to fix in text. |
| "These tasks are obvious, pre-sim is busywork" | A task obvious enough to skip pre-sim is obvious enough to answer the 4 questions in 30 seconds. |
| "I'll produce both artifacts and code in one pass" | Then when code inherits a design flaw, you've wasted the coding pass. Gate is gate. |
`class FooTask(...)`

`TASK_AUTHORING_GUIDE.md` §2.7 "Reliability requirements" + §4.7 "Authoring check_goals"

For actual code discipline: see the `writing-bench-task-judge` skill and `bench_env/docs/task/TASK_CODE_SPEC.md`. For tests: see the `testing-bench-task` skill and `bench_env/docs/task/TASK_TESTING_GUIDE.md`.