Use when designing a new bench_env task suite, adding several new tasks to an existing suite, or critiquing a task-set proposal for a mobile-gym App — before any `class FooTask(...)` is written under `bench_env/task/`.
Use when adding or modifying offline judge tests for bench_env tasks — specifically entries in `OFFLINE_JUDGE_POSITIVE_CASES` / `OFFLINE_JUDGE_NEGATIVE_CASES` in `bench_env/tests/<suite>/test_tasks.py`, or writing live tests. Triggers after a new task is added, or when tightening judge coverage.
Use when writing or modifying `check_goals()` / `get_answer()` / App `check_*` methods in `bench_env/task/`, or when reviewing a draft task's judge correctness. Triggers include adding a new task, editing a judge method, or diagnosing a judge false-positive/negative.