Name: Add Benchmark
Author: allenai

Add a new simulation benchmark to the VLA evaluation harness. Use this skill whenever the user wants to integrate, create, or add a new benchmark or simulation environment, e.g. 'add ManiSkill3', 'integrate OmniGibson', 'hook up a new sim'. Also use when they ask how benchmarks are structured or want to understand the benchmark interface.


updated: April 30, 2026, 03:26

package.json

"author": "allenai"

"repository": "allenai/vla-evaluation-harness"

Add Benchmark

Integrate a new simulation benchmark into vla-eval. Benchmarks run inside Docker containers and communicate with model servers over WebSocket + msgpack.

1. Gather requirements

Ask the user for (if not already provided):

Benchmark name (e.g. maniskill3)
Simulation framework (e.g. MuJoCo, SAPIEN, PyBullet, Isaac Sim)
Key pip dependencies needed inside Docker
Observation format — cameras, resolution, whether to include proprioceptive state
Action space — dimension, format (e.g. 7-DoF delta EEF + gripper)
Success condition — how to detect task completion
Max steps per episode

2. Create the benchmark module

Create src/vla_eval/benchmarks/<name>/:

src/vla_eval/benchmarks/<name>/
├── __init__.py      # empty
├── benchmark.py     # main implementation
└── utils.py         # optional helpers

Subclass StepBenchmark from vla_eval.benchmarks.base and implement the required methods:

from typing import Any

import numpy as np

from vla_eval.benchmarks.base import StepBenchmark, StepResult
from vla_eval.specs import DimSpec
from vla_eval.types import Action, EpisodeResult, Observation, Task


class MyBenchmark(StepBenchmark):
    def __init__(self, **kwargs: Any) -> None:
        super().__init__()
        # Accept benchmark-specific params from config YAML `params:` section.
        # Lazily import heavy deps (MuJoCo, SAPIEN) — NOT at module level,
        # because the registry resolves the class without loading the sim.
        ...

    # --- Required methods ---

    def get_tasks(self) -> list[Task]:
        # Return list of task dicts. Each MUST have a "name" key.
        # May include "suite" for task filtering.
        ...

    def reset(self, task: Task) -> Any:
        # Reset env for task. Store env on self. Return initial raw observation.
        # task dict has "episode_idx" (int) injected by the orchestrator.
        ...

    def step(self, action: Action) -> StepResult:
        # action dict has "actions" key (np.ndarray from model server).
        # Return StepResult(obs, reward, done, info).
        ...

    def make_obs(self, raw_obs: Any, task: Task) -> Observation:
        # Convert raw env observation to dict for model server.
        # Convention:
        #   {"images": {"cam_name": np.ndarray HWC uint8},
        #    "task_description": str}
        # Optionally add "state": np.ndarray for proprioception.
        ...

    def get_step_result(self, step_result: StepResult) -> EpisodeResult:
        # Extract episode result from the final StepResult.
        # Must return at least {"success": bool}.
        ...

    # --- Optional overrides ---

    def check_done(self, step_result: StepResult) -> bool:
        # Default: step_result.done. Override for custom termination logic.
        return step_result.done

    def get_action_spec(self) -> dict[str, DimSpec]:
        # Declare the action format this benchmark's env consumes.
        # The orchestrator compares this against the model server's spec
        # and warns on mismatches — catching convention bugs early.
        ...

    def get_observation_spec(self) -> dict[str, DimSpec]:
        # Declare the observation format this benchmark produces.
        ...

    def get_metric_keys(self) -> dict[str, str]:
        # Declare which metrics from get_step_result() to aggregate.
        # Default: {"success": "mean"} (= success rate).
        # Aggregation options: "mean", "sum", "max", "min".
        return {"success": "mean"}

    def get_metadata(self) -> dict[str, Any]:
        # Return {"max_steps": N} for benchmark default.
        return {}

    def cleanup(self) -> None:
        # Release resources (envs, renderers). Called at end of evaluation.
        ...
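
To make the skeleton concrete, here is a toy sketch that follows the same method shapes without importing vla_eval. The StepResult dataclass, CountdownBenchmark, and the run_episode driver are hypothetical stand-ins written for illustration only; the real base class and orchestrator loop may differ.

```python
from dataclasses import dataclass, field
from typing import Any

import numpy as np


@dataclass
class StepResult:
    # Local stand-in mirroring StepResult(obs, reward, done, info) from the skeleton.
    obs: Any
    reward: float
    done: bool
    info: dict[str, Any] = field(default_factory=dict)


class CountdownBenchmark:
    """Toy env: an episode succeeds once the summed action magnitude reaches a goal."""

    def __init__(self, goal: float = 3.0, **kwargs: Any) -> None:
        self.goal = goal
        self.progress = 0.0

    def get_tasks(self) -> list[dict[str, Any]]:
        # Every task dict carries a "name" key; "suite" is optional.
        return [{"name": "reach_goal", "suite": "default"}]

    def reset(self, task: dict[str, Any]) -> Any:
        self.progress = 0.0
        return {"frame": np.zeros((4, 4, 3), dtype=np.uint8)}

    def step(self, action: dict[str, Any]) -> StepResult:
        self.progress += float(np.abs(action["actions"]).sum())
        success = self.progress >= self.goal
        raw = {"frame": np.zeros((4, 4, 3), dtype=np.uint8)}
        return StepResult(raw, float(success), success, {"success": success})

    def make_obs(self, raw_obs: Any, task: dict[str, Any]) -> dict[str, Any]:
        # Follows the observation convention: named HWC uint8 images plus a description.
        return {"images": {"front": raw_obs["frame"]}, "task_description": task["name"]}

    def get_step_result(self, step_result: StepResult) -> dict[str, Any]:
        return {"success": bool(step_result.info.get("success", False))}


def run_episode(bench: CountdownBenchmark, task: dict[str, Any], max_steps: int) -> dict[str, Any]:
    # Hypothetical driver loop; the real orchestrator queries a model server instead.
    result = StepResult(bench.reset(task), 0.0, False)
    for _ in range(max_steps):
        obs = bench.make_obs(result.obs, task)  # would be sent to the model server
        action = {"actions": np.ones(1)}        # stand-in for the server's reply
        result = bench.step(action)
        if result.done:
            break
    return bench.get_step_result(result)
```

With these stand-ins, an episode that runs out of steps before reaching the goal reports {"success": False}, which is the minimum an EpisodeResult must contain.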

Async bridge (automatic)

StepBenchmark auto-bridges your sync methods to the async Benchmark parent interface. The orchestrator/runners call the async methods (start_episode, apply_action, get_observation, is_done, get_result) — you never implement those directly.
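
As an illustration of what such a bridge could look like, the sketch below wraps a sync object behind the async method names listed above. AsyncBridge and SyncCounter are hypothetical, and running sync calls through asyncio.to_thread is an assumption made for this example; the source does not say how StepBenchmark implements its bridge.

```python
import asyncio
from typing import Any


class SyncCounter:
    """Stand-in for a sync benchmark: the observation is just a step counter."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self, task: dict[str, Any]) -> int:
        self.steps = 0
        return self.steps

    def step(self, action: dict[str, Any]) -> int:
        self.steps += 1
        return self.steps


class AsyncBridge:
    """Hypothetical adapter exposing async method names over sync calls."""

    def __init__(self, bench: SyncCounter, horizon: int = 3) -> None:
        self.bench = bench
        self.horizon = horizon
        self._last = 0

    async def start_episode(self, task: dict[str, Any]) -> None:
        # Run the blocking reset in a worker thread so the event loop stays free.
        self._last = await asyncio.to_thread(self.bench.reset, task)

    async def apply_action(self, action: dict[str, Any]) -> None:
        self._last = await asyncio.to_thread(self.bench.step, action)

    async def get_observation(self) -> int:
        return self._last

    async def is_done(self) -> bool:
        return self._last >= self.horizon

    async def get_result(self) -> dict[str, bool]:
        return {"success": self._last >= self.horizon}


async def run_demo() -> dict[str, bool]:
    bridge = AsyncBridge(SyncCounter())
    await bridge.start_episode({"name": "demo"})
    while not await bridge.is_done():
        await bridge.apply_action({"actions": [0.0]})
    return await bridge.get_result()
```

Calling asyncio.run(run_demo()) drives one episode through the async surface while the benchmark code itself stays synchronous.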

Key patterns from existing implementations

Lazy imports: Put heavy sim imports (torch, robosuite, sapien) inside methods, not at module level.
Env reuse: LIBERO reuses envs across episodes of the same task. SimplerEnv creates fresh envs per episode. Choose based on your sim's reset semantics.
Action processing: Model servers output raw continuous actions. Convert to sim-specific format in step() (e.g. discretize gripper, convert euler→axis-angle).
Image preprocessing: Handle non-standard images (flipped, wrong resolution) in make_obs().
EGL headless rendering: Add os.environ.setdefault("PYOPENGL_PLATFORM", "egl") at module top if the sim uses OpenGL.
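
The action-processing and image-preprocessing patterns can be sketched as below. The 7-value layout [dx, dy, dz, roll, pitch, yaw, gripper], the zero threshold for the gripper, the ±1 gripper encoding, and the helper names are all assumptions chosen for illustration; each sim defines its own conventions.

```python
import numpy as np


def euler_to_axis_angle(roll: float, pitch: float, yaw: float) -> np.ndarray:
    """Convert roll/pitch/yaw, with R = Rz(yaw) @ Ry(pitch) @ Rx(roll), to a rotation vector."""
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    quat = np.array([
        cr * cp * cy + sr * sp * sy,  # w
        sr * cp * cy - cr * sp * sy,  # x
        cr * sp * cy + sr * cp * sy,  # y
        cr * cp * sy - sr * sp * cy,  # z
    ])
    if quat[0] < 0:
        quat = -quat  # pick the equivalent quaternion with rotation angle <= pi
    angle = 2.0 * np.arccos(np.clip(quat[0], -1.0, 1.0))
    sin_half = np.sqrt(max(1.0 - quat[0] ** 2, 0.0))
    if sin_half < 1e-8:
        return np.zeros(3)  # no rotation, so the axis is undefined
    return quat[1:] / sin_half * angle


def to_sim_action(raw: np.ndarray) -> np.ndarray:
    """Map [dx, dy, dz, roll, pitch, yaw, gripper] to [dx, dy, dz, rx, ry, rz, grip]."""
    raw = np.asarray(raw, dtype=np.float64)
    rotvec = euler_to_axis_angle(*raw[3:6])
    grip = 1.0 if raw[6] > 0.0 else -1.0  # assumed binary gripper encoding
    return np.concatenate([raw[:3], rotvec, [grip]])


def fix_image(img: np.ndarray) -> np.ndarray:
    """Undo a vertical flip and return a contiguous HWC uint8 array."""
    return np.ascontiguousarray(img[::-1], dtype=np.uint8)
```

In a benchmark, to_sim_action would be called at the top of step() on action["actions"], and fix_image inside make_obs() before the frame goes into the "images" dict.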

Optional: external dataset acquisition

If the benchmark needs license-restricted scene/data files that can't ship in the Docker image (e.g. ToS-gated downloads), fetch them lazily inside _init_*() / reset() using the shared primitives in vla_eval.dirs:

from pathlib import Path

from vla_eval.dirs import assets_cache, ensure_license

def _ensure_assets(self, data_path: Path) -> None:
    # Skip the fetch when a previous run already completed it.
    if (data_path / "ready_marker").exists():
        return
    ensure_license(
        "my-dataset-tos",                # can also be accepted via --accept-license <id>
        url="https://example.com/license",
        description="My benchmark dataset ToS (~N GiB).",
    )
    data_path.mkdir(parents=True, exist_ok=True)
    # ... download into data_path with whatever helper your sim provides

ensure_license reads stdin in interactive contexts and otherwise falls back to the VLA_EVAL_ACCEPTED_LICENSES env var (forwarded by vla-eval run --accept-license <id>). The eval YAML's volume mount should resolve the host path with the same XDG-aware precedence, so that vla-eval run and the in-container fetch agree on the location:

volumes:
  - "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/<bench>:<container_data_path>"

Reference: Behavior1KBenchmark._ensure_assets() in benchmarks/behavior1k/benchmark.py.
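
The nested interpolation in the volume mount is easier to follow as a fallback chain. The sketch below mirrors that chain in plain Python; resolve_assets_dir is a hypothetical helper written for illustration, not a function from vla_eval.dirs, and it treats a variable as set whenever it is present in the mapping.

```python
from pathlib import Path
from typing import Mapping


def resolve_assets_dir(bench: str, env: Mapping[str, str]) -> Path:
    """Mirror the fallback chain of the volume mount's host path."""
    # 1. An explicit assets cache wins outright.
    if "VLA_EVAL_ASSETS_CACHE" in env:
        return Path(env["VLA_EVAL_ASSETS_CACHE"]) / bench
    # 2. Otherwise use <VLA_EVAL_HOME>/assets.
    if "VLA_EVAL_HOME" in env:
        return Path(env["VLA_EVAL_HOME"]) / "assets" / bench
    # 3. Otherwise fall back to the XDG cache dir, defaulting to ~/.cache.
    if "XDG_CACHE_HOME" in env:
        cache_root = Path(env["XDG_CACHE_HOME"])
    else:
        cache_root = Path(env["HOME"]) / ".cache"
    return cache_root / "vla-eval" / "assets" / bench
```

Passing os.environ as env gives the host directory that the mount expression would select for a given benchmark name.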

3. Create config YAML

Create configs/<name>_eval.yaml:

server:
  url: "ws://localhost:8000"

docker:
  image: ghcr.io/allenai/vla-evaluation-harness/<name>:latest
  env: []     # e.g. ["NVIDIA_DRIVER_CAPABILITIES=all"] for Vulkan
  volumes: [] # e.g. ["/path/to/data:/data:ro"]

output_dir: "./results"

benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:MyBenchmark"
    mode: sync
    episodes_per_task: 50
    params:
      # All keys here passed as **kwargs to MyBenchmark.__init__()
      suite: default
      seed: 7

benchmark field: module.path:ClassName import string
params: arbitrary dict passed to constructor — no schema enforcement
max_steps: omit to use get_metadata()["max_steps"], or set explicitly to override
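
The three notes above can be sketched in a few lines of Python. This is an illustration of the conventions, not the harness's actual loader: resolve_import_string, build_benchmark, and effective_max_steps are hypothetical names, and the standard-library classes in the usage note stand in for a real benchmark class.

```python
import importlib
from typing import Any, Optional


def resolve_import_string(spec: str) -> type:
    """Resolve a 'module.path:ClassName' string to the class it names."""
    module_path, _, class_name = spec.partition(":")
    if not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def build_benchmark(spec: str, params: Optional[dict[str, Any]] = None) -> Any:
    """Instantiate the class, passing every params key as a keyword argument."""
    cls = resolve_import_string(spec)
    return cls(**(params or {}))


def effective_max_steps(config_value: Optional[int], metadata: dict[str, Any]) -> Optional[int]:
    """An explicit max_steps in the config overrides the benchmark's own default."""
    if config_value is not None:
        return config_value
    return metadata.get("max_steps")
```

For example, resolve_import_string("collections:OrderedDict") returns the OrderedDict class. Because params has no schema enforcement, a misspelled key only surfaces as a TypeError when the constructor is called, unless the constructor accepts **kwargs.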

4. Create Dockerfile

Create docker/Dockerfile.<name>:

ARG BASE_IMAGE
FROM ${BASE_IMAGE}

# Install benchmark-specific dependencies
RUN pip install <benchmark-packages>

# Copy benchmark code
COPY src/vla_eval/benchmarks/<name>/ src/vla_eval/benchmarks/<name>/

All benchmark Dockerfiles inherit from the base image (docker/Dockerfile.base), which already installs the harness. Your Dockerfile only needs to add benchmark-specific dependencies and code.

5. Register in build/push scripts

Add <name> to the BENCHMARKS array in docker/build.sh and to the IMAGES array in docker/push.sh:

BENCHMARKS=(... <name> ...)

Underscores in names are auto-converted to hyphens for Docker image names (e.g. mikasa_robo → mikasa-robo).
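
The naming rule amounts to a single string replacement. The helper below is a hypothetical illustration that combines it with the registry prefix used in the config YAML; the build scripts themselves are shell and may assemble the name differently.

```python
def docker_image_name(benchmark: str, tag: str = "latest") -> str:
    """Build the image reference for a benchmark, converting underscores to hyphens."""
    registry = "ghcr.io/allenai/vla-evaluation-harness"
    return f"{registry}/{benchmark.replace('_', '-')}:{tag}"
```

A benchmark registered as mikasa_robo would therefore be referenced in the config's docker.image field with the hyphenated form.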

6. Verify

make check                                    # lint + format + type check
make test                                     # existing tests still pass
vla-eval test --validate                      # validate all config import strings
vla-eval test -c configs/<name>_eval.yaml     # smoke-test (1 episode, EchoModelServer, no GPU needed — requires Docker + image)

Don't add tests/test_<name>_benchmark.py with mocked sim modules. tests/ is for harness mechanics, not per-sim integration. Fake omnigibson / sapien / mujoco modules drift from upstream each release and miss the real bugs (import paths, action encoding, physics determinism). Verify via the smoke test above.

Reference implementations

| Benchmark | File | Key patterns |
|---|---|---|
| LIBERO | `benchmarks/libero/benchmark.py` | MuJoCo tabletop, env reuse, suite-specific max_steps, image flip |
| SimplerEnv | `benchmarks/simpler/benchmark.py` | SAPIEN+Vulkan, new env per episode, euler→axis-angle conversion |
| CALVIN | `benchmarks/calvin/benchmark.py` | PyBullet, chained subtasks, delta actions, hardcoded normalization |
