| name | add-benchmark |
| description | Add a new simulation benchmark to the VLA evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new benchmark or simulation environment — e.g. 'add ManiSkill3', 'integrate OmniGibson', 'hook up a new sim'. Also use when they ask how benchmarks are structured or want to understand the benchmark interface. |
Add Benchmark
Integrate a new simulation benchmark into vla-eval. Benchmarks run inside Docker containers and communicate with model servers over WebSocket + msgpack.
1. Gather requirements
Ask the user for (if not already provided):
- Benchmark name (e.g. maniskill3)
- Simulation framework (e.g. MuJoCo, SAPIEN, PyBullet, Isaac Sim)
- Key pip dependencies needed inside Docker
- Observation format — cameras, resolution, whether to include proprioceptive state
- Action space — dimension, format (e.g. 7-DoF delta EEF + gripper)
- Success condition — how to detect task completion
- Max steps per episode
2. Create the benchmark module
Create src/vla_eval/benchmarks/<name>/:
src/vla_eval/benchmarks/<name>/
├── __init__.py # empty
├── benchmark.py # main implementation
└── utils.py # optional helpers
Subclass StepBenchmark from vla_eval.benchmarks.base and implement the required methods:
from typing import Any

import numpy as np

from vla_eval.benchmarks.base import StepBenchmark, StepResult
from vla_eval.specs import DimSpec
from vla_eval.types import Action, EpisodeResult, Observation, Task


class MyBenchmark(StepBenchmark):
    def __init__(self, **kwargs: Any) -> None:
        # kwargs come from the `params` dict in the config YAML.
        super().__init__()
        ...

    def get_tasks(self) -> list[Task]:
        ...

    def reset(self, task: Task) -> Any:
        ...

    def step(self, action: Action) -> StepResult:
        # Convert the raw model action to the sim-specific format here.
        ...

    def make_obs(self, raw_obs: Any, task: Task) -> Observation:
        # Handle non-standard images (flipped, wrong resolution) here.
        ...

    def get_step_result(self, step_result: StepResult) -> EpisodeResult:
        ...

    def check_done(self, step_result: StepResult) -> bool:
        return step_result.done

    def get_action_spec(self) -> dict[str, DimSpec]:
        ...

    def get_observation_spec(self) -> dict[str, DimSpec]:
        ...

    def get_metric_keys(self) -> dict[str, str]:
        return {"success": "mean"}

    def get_metadata(self) -> dict[str, Any]:
        # May include "max_steps", used when the config omits max_steps.
        return {}

    def cleanup(self) -> None:
        ...
Async bridge (automatic)
StepBenchmark auto-bridges your sync methods to the async Benchmark parent interface. The orchestrator/runners call the async methods (start_episode, apply_action, get_observation, is_done, get_result) — you never implement those directly.
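To make the bridging idea concrete, here is a minimal, self-contained sketch. It does not reproduce the harness code: SyncCounterBenchmark and AsyncBridge are hypothetical stand-ins, and running the sync calls in a worker thread via asyncio.to_thread is an assumption about one way such a bridge could work, not a statement of how StepBenchmark does it.

```python
import asyncio
from typing import Any


class SyncCounterBenchmark:
    """Hypothetical stand-in for a StepBenchmark subclass: sync methods only."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps = 0

    def reset(self, task: dict[str, Any]) -> dict[str, Any]:
        self.steps = 0
        return {"step": self.steps, "task": task["name"], "done": False}

    def step(self, action: int) -> dict[str, Any]:
        self.steps += 1
        return {"step": self.steps, "done": self.steps >= self.max_steps}


class AsyncBridge:
    """Hypothetical bridge exposing async methods over the sync ones."""

    def __init__(self, bench: SyncCounterBenchmark) -> None:
        self._bench = bench
        self._last: dict[str, Any] = {}

    async def start_episode(self, task: dict[str, Any]) -> None:
        # Assumption: blocking sim calls run in a worker thread so the
        # event loop stays free for network traffic.
        self._last = await asyncio.to_thread(self._bench.reset, task)

    async def apply_action(self, action: int) -> None:
        self._last = await asyncio.to_thread(self._bench.step, action)

    async def is_done(self) -> bool:
        return bool(self._last.get("done", False))


async def run_episode(bridge: AsyncBridge, task: dict[str, Any]) -> int:
    """Drive one episode through the async interface; return steps taken."""
    await bridge.start_episode(task)
    taken = 0
    while not await bridge.is_done():
        await bridge.apply_action(0)
        taken += 1
    return taken


steps_taken = asyncio.run(
    run_episode(AsyncBridge(SyncCounterBenchmark(max_steps=3)), {"name": "demo"})
)
```

The point of the sketch is the division of labour: the benchmark author writes only reset() and step(), and the caller sees only awaitable methods.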
Key patterns from existing implementations
- Lazy imports: Put heavy sim imports (torch, robosuite, sapien) inside methods, not at module level.
- Env reuse: LIBERO reuses envs across episodes of the same task. SimplerEnv creates fresh envs per episode. Choose based on your sim's reset semantics.
- Action processing: Model servers output raw continuous actions. Convert to sim-specific format in step() (e.g. discretize gripper, convert euler→axis-angle).
- Image preprocessing: Handle non-standard images (flipped, wrong resolution) in make_obs().
- EGL headless rendering: Add os.environ.setdefault("PYOPENGL_PLATFORM", "egl") at module top if the sim uses OpenGL.
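As an illustration of the action-processing pattern, the sketch below converts a raw 7-value action into a sim-style action using only the standard library. Several details are assumptions made for the example rather than harness behaviour: the layout [dx, dy, dz, roll, pitch, yaw, gripper], the XYZ-extrinsic Euler convention, the gripper threshold of 0.0, and the helper names themselves. Check your simulator's conventions before reusing any of it.

```python
import math


def euler_to_axis_angle(roll: float, pitch: float, yaw: float) -> tuple[float, float, float]:
    """Convert Euler angles (radians) to a rotation vector (axis * angle).

    Assumed convention: R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
    # Unit quaternion (w, x, y, z) for the rotation above.
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    norm = math.sqrt(x * x + y * y + z * z)
    if norm < 1e-12:
        return (0.0, 0.0, 0.0)  # identity rotation
    angle = 2.0 * math.atan2(norm, w)
    scale = angle / norm
    return (x * scale, y * scale, z * scale)


def binarize_gripper(value: float, threshold: float = 0.0) -> float:
    """Map a continuous gripper command to -1.0 or +1.0 (assumed encoding)."""
    return 1.0 if value > threshold else -1.0


def to_sim_action(raw: list[float]) -> list[float]:
    """Convert [dx, dy, dz, roll, pitch, yaw, gripper] to position + rotvec + gripper."""
    if len(raw) != 7:
        raise ValueError(f"expected 7 values, got {len(raw)}")
    position = list(raw[:3])
    rotation = euler_to_axis_angle(raw[3], raw[4], raw[5])
    return [*position, *rotation, binarize_gripper(raw[6])]


# A quarter turn about z with a slightly positive gripper command.
sim_action = to_sim_action([0.1, 0.0, -0.1, 0.0, 0.0, math.pi / 2, 0.3])
```

In a real benchmark this conversion would live inside step(), before the action is handed to the simulator.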
Optional: external dataset acquisition
If the benchmark needs licence-restricted scene/data files that can't ship in the docker image (e.g. ToS-gated downloads), do the lazy fetch inside _init_*() / reset() using the shared primitives in vla_eval.dirs:
from pathlib import Path

from vla_eval.dirs import assets_cache, ensure_license


# Defined as a method on your benchmark class.
def _ensure_assets(self, data_path: Path) -> None:
    # Nothing to do once the marker file exists.
    if (data_path / "ready_marker").exists():
        return
    ensure_license(
        "my-dataset-tos",
        url="https://example.com/license",
        description="My benchmark dataset ToS (~N GiB).",
    )
    data_path.mkdir(parents=True, exist_ok=True)
ensure_license reads stdin in interactive contexts and falls back to the VLA_EVAL_ACCEPTED_LICENSES env var (forwarded by vla-eval run --accept-license <id>). The eval YAML's volume mount should resolve the host path with the same XDG-aware precedence so vla-eval run and the in-container fetch agree:
volumes:
- "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/<bench>:<container_data_path>"
Reference: Behavior1KBenchmark._ensure_assets() in benchmarks/behavior1k/benchmark.py.
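The nested interpolation in the volume mount above is easier to check when written out as plain Python. The sketch below mirrors the same precedence (VLA_EVAL_ASSETS_CACHE, then VLA_EVAL_HOME, then XDG_CACHE_HOME, then HOME). The function name resolve_assets_cache is hypothetical and this is not the implementation in vla_eval.dirs; it only restates the order the YAML expresses.

```python
from collections.abc import Mapping
from pathlib import PurePosixPath


def resolve_assets_cache(env: Mapping[str, str]) -> PurePosixPath:
    """Return the assets cache directory using the YAML's precedence."""
    # 1. An explicit override wins outright.
    if "VLA_EVAL_ASSETS_CACHE" in env:
        return PurePosixPath(env["VLA_EVAL_ASSETS_CACHE"])
    # 2. Otherwise use <VLA_EVAL_HOME>/assets.
    if "VLA_EVAL_HOME" in env:
        return PurePosixPath(env["VLA_EVAL_HOME"]) / "assets"
    # 3. Otherwise fall back to the XDG cache dir, or ~/.cache if unset.
    if "XDG_CACHE_HOME" in env:
        cache_root = PurePosixPath(env["XDG_CACHE_HOME"])
    else:
        cache_root = PurePosixPath(env["HOME"]) / ".cache"
    return cache_root / "vla-eval" / "assets"


# Example environments (made up for illustration).
default_path = resolve_assets_cache({"HOME": "/home/u"})
xdg_path = resolve_assets_cache({"HOME": "/home/u", "XDG_CACHE_HOME": "/xdg"})
home_path = resolve_assets_cache({"HOME": "/home/u", "VLA_EVAL_HOME": "/data/vla"})
override_path = resolve_assets_cache(
    {"HOME": "/home/u", "VLA_EVAL_HOME": "/data/vla", "VLA_EVAL_ASSETS_CACHE": "/fast/assets"}
)
```

The host side of the mount is this directory with the benchmark's own subdirectory appended, so the in-container fetch and the host path refer to the same files.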
3. Create config YAML
Create configs/<name>_eval.yaml:
server:
  url: "ws://localhost:8000"
docker:
  image: ghcr.io/allenai/vla-evaluation-harness/<name>:latest
  env: []
  volumes: []
output_dir: "./results"
benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:MyBenchmark"
    mode: sync
    episodes_per_task: 50
    params:
      suite: default
      seed: 7
- benchmark: import string of the form module.path:ClassName
- params: arbitrary dict passed to the constructor; no schema enforcement
- max_steps: omit to use get_metadata()["max_steps"], or set explicitly to override
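The import-string format and the max_steps rule can be sketched in a few lines. The helpers load_class and resolve_max_steps are hypothetical names, and the harness's own loader may differ in details such as error handling. The standard-library class collections.OrderedDict stands in for a benchmark class so the example runs without vla-eval installed.

```python
import importlib
from typing import Any, Optional


def load_class(import_string: str) -> type:
    """Resolve a 'module.path:ClassName' string to the class object."""
    module_path, _, class_name = import_string.partition(":")
    if not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {import_string!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def resolve_max_steps(config_value: Optional[int], metadata: dict[str, Any]) -> int:
    """An explicit config value overrides the benchmark's metadata default."""
    if config_value is not None:
        return config_value
    return metadata["max_steps"]


# Stand-in for "vla_eval.benchmarks.<name>.benchmark:MyBenchmark".
cls = load_class("collections:OrderedDict")

# `params` from the YAML are passed to the constructor as keyword arguments.
params = {"suite": "default", "seed": 7}
instance = cls(**params)

from_metadata = resolve_max_steps(None, {"max_steps": 300})
from_config = resolve_max_steps(120, {"max_steps": 300})
```

Because params has no schema, a misspelled key reaches the constructor unchanged, so it is worth validating kwargs in __init__.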
4. Create Dockerfile
Create docker/Dockerfile.<name>:
ARG BASE_IMAGE
FROM ${BASE_IMAGE}
# Install benchmark-specific dependencies
RUN pip install <benchmark-packages>
# Copy benchmark code
COPY src/vla_eval/benchmarks/<name>/ src/vla_eval/benchmarks/<name>/
All benchmark Dockerfiles inherit from the base image (docker/Dockerfile.base) which already installs the harness. Your Dockerfile only needs to add benchmark-specific dependencies and code.
5. Register in build/push scripts
Add to the BENCHMARKS array in docker/build.sh and the IMAGES array in docker/push.sh:
BENCHMARKS=(... <name> ...)
Underscores in names are auto-converted to hyphens for Docker image names (e.g. mikasa_robo → mikasa-robo).
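The naming rule can be stated as a one-line transformation. The sketch below is in Python for illustration only (the actual scripts are shell, and their exact logic is not shown here). The function name image_name is hypothetical, and the registry prefix is taken from the config example in step 3.

```python
REGISTRY = "ghcr.io/allenai/vla-evaluation-harness"  # prefix from the config example


def image_name(benchmark: str, tag: str = "latest") -> str:
    """Build the Docker image reference, replacing underscores with hyphens."""
    docker_name = benchmark.replace("_", "-")
    return f"{REGISTRY}/{docker_name}:{tag}"


mikasa_image = image_name("mikasa_robo")
libero_image = image_name("libero")
```

Note that the Python package directory keeps the underscore (benchmarks/mikasa_robo/) while the image name in the config YAML uses the hyphenated form.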
6. Verify
make check
make test
vla-eval test --validate
vla-eval test -c configs/<name>_eval.yaml
Don't add tests/test_<name>_benchmark.py with mocked sim modules.
tests/ is for harness mechanics, not per-sim integration. Fake
omnigibson / sapien / mujoco modules drift from upstream each
release and miss the real bugs (import paths, action encoding,
physics determinism). Verify via the smoke test above.
Reference implementations
| Benchmark | File | Key patterns |
|---|---|---|
| LIBERO | benchmarks/libero/benchmark.py | MuJoCo tabletop, env reuse, suite-specific max_steps, image flip |
| SimplerEnv | benchmarks/simpler/benchmark.py | SAPIEN+Vulkan, new env per episode, euler→axis-angle conversion |
| CALVIN | benchmarks/calvin/benchmark.py | PyBullet, chained subtasks, delta actions, hardcoded normalization |