| name | capx-agentic-robotics |
| description | Agentic robotics with CaP-X — LLM-driven robot manipulation via code generation. Use when: (1) Setting up CaP-X / CaP-Gym environments for robot manipulation benchmarks, (2) Running CaP-Bench evaluations across LLMs/VLMs on robotic tasks, (3) Building or extending CaP-Agent0 agentic harnesses (skill libraries, visual differencing, parallel reasoning), (4) Training robot coding agents with CaP-RL (GRPO on code generation), (5) Developing perception APIs (SAM3, Molmo, depth, point clouds) or control APIs (IK solvers, grasp planners), (6) Sim-to-real transfer for Franka Panda, R1Pro humanoid, or other robot platforms, (7) Designing auto-synthesized skill libraries for physical manipulation (Voyager-style), (8) Integrating agentic robotics with Life Agent OS (Arcan orchestration, Spaces agent networking, Lago persistence), (9) Any task involving LLM-based robot control, manipulation benchmarks, robotic code synthesis, or embodied AI agents.
|
CaP-X Agentic Robotics
LLM agents that write Python code to control real robots — from zero-shot manipulation to RL-trained coding agents with sim-to-real transfer.
Based on CaP-X (NVIDIA, Berkeley, Stanford, CMU). arXiv: 2603.22435. MIT License.
Quick Start
1. Clone and Install
git clone https://github.com/capgym/cap-x.git
cd cap-x
uv sync
2. Start Perception Services
CaP-X runs perception models as microservices:
python -m capx.perception.sam3_server
python -m capx.perception.molmo_server
python -m capx.perception.grasp_server
python -m capx.perception.owlvit_server
Requires CUDA GPU. For IK solvers:
- PyRoKi (port 8116) — CPU-friendly inverse kinematics
- cuRobo — GPU-accelerated motion planning (NVIDIA, requires CUDA)
3. Run a Benchmark Task
python scripts/eval.py \
--task cube_lift \
--model openai/gpt-5.2 \
--tier S1 \
--num_trials 100
python scripts/eval_capbench.py --model anthropic/claude-opus-4.5
Architecture
CaP-X
├── CaP-Gym ─── Gymnasium interface wrapping 187 tasks
│ ├── RoboSuite (7 core) ── Franka Panda tabletop/bimanual
│ ├── LIBERO-PRO (130+) ── Franka Panda kitchen/living
│ └── BEHAVIOR (50) ────── R1Pro humanoid, Isaac Sim
│
├── CaP-Bench ── 8 tiers (S1-S4 single, M1-M4 multi-turn)
│ ├── Varies: perception noise, API abstraction, visual grounding
│ └── 12 frontier LLMs/VLMs benchmarked
│
├── CaP-Agent0 ── Training-free agentic harness
│ ├── Visual Differencing Module (VDM)
│ ├── Auto-synthesized skill library (Voyager lineage)
│ └── Parallel ensembled reasoning (multi-model)
│
└── CaP-RL ──── GRPO post-training on code generation
├── 7B model: 25% → 80% in 50 iterations
└── Zero-shot sim-to-real transfer
Control Flow
The agent never outputs raw joint commands. Instead:
LLM → generates Python code → composes perception + control APIs → robot executes
API abstraction levels (Franka example):
FrankaControlApi — Full high-level (perception + IK control)
FrankaControlPrivilegedApi — Oracle state (no perception noise)
FrankaControlReducedApi — Low-level primitives
FrankaControlReducedSkillLibraryApi — Low-level + auto-synthesized skills
CaP-Bench Tiers
| Tier | Mode | Perception | Abstraction | Visual Grounding |
|---|
| S1 | Single | Noiseless | High | -- |
| S2 | Single | Noisy | High | -- |
| S3 | Single | Noisy | Low | -- |
| S4 | Single | Noisy | Low | -- |
| M1 | Multi | Noisy | High | -- |
| M2 | Multi | Noisy | High | Multimodal feedback |
| M3 | Multi | Noisy | Low | -- |
| M4 | Multi | Noisy | Low | VDM |
Run specific tiers:
python scripts/eval.py --task cube_stack --model google/gemini-3-pro --tier M4
CaP-Agent0: Training-Free Harness
Three components inspired by Voyager (Wang et al., 2023):
1. Visual Differencing Module (VDM)
Convert before/after scene images into structured text describing what changed. Solves cross-modal alignment failures found in M2 tier.
2. Auto-Synthesized Skill Library
Reusable functions discovered from successful execution traces. Persist across trials. Compilation pipeline:
python scripts/skill_library_compilation/compile.py \
--eval_outputs results/cube_lift/ \
--output skills/
python scripts/eval.py --task cube_stack --tier M4 --skill_library skills/
9 task-agnostic skills discovered (geometric utilities, grasp filters, quaternion helpers).
3. Parallel Ensembled Reasoning
Multiple models generate candidate solutions:
python scripts/eval_agent0.py \
--task cube_stack \
--models "google/gemini-3-pro,openai/gpt-5.2,anthropic/claude-opus-4.5" \
--ensemble
Result: Matches/exceeds human expert code on 4/7 tasks. Competitive with trained VLA policies (OpenVLA, pi_0) despite being training-free.
CaP-RL: RL on Code Generation
GRPO (Group Relative Policy Optimization) applied to Qwen2.5-Coder-7B:
python scripts/train_rl.py \
--task cube_lift \
--base_model Qwen/Qwen2.5-Coder-7B-Instruct \
--group_size 15 \
--train_iterations 50 \
--gpu_type h100
| Task | Base (7B) | +CaP-RL (50 iter) | Human Expert |
|---|
| Cube Lift (sim) | 25% | 80% | 93% |
| Cube Stack (sim) | 4% | 44% | 73% |
| Spill Wipe (sim) | 30% | 93% | 100% |
| Cube Lift (real) | 24% | 84% | 92% |
| Cube Stack (real) | 12% | 76% | 84% |
Transfer zero-shot to real robots because reasoning is over abstract APIs, not raw pixels.
See references/caprl-training.md for full GRPO configuration, GPU requirements, and training recipes.
Sim-to-Real Transfer
Simulation (CaP-Gym) ──[abstract APIs]──> Real Robot
├── Franka Panda (primary)
├── AgiBot G1 (bimanual demos)
└── R1Pro (mobile manipulation)
Real-world requirements:
- Stereo camera with calibrated metric-scale depth (e.g., ZED)
robots_realtime package for Franka Panda control
- Same perception service stack (SAM3, Molmo, ContactGraspNet)
Extending CaP-X
Add a New Task
class MyTask(CaPGymTask):
"""Gymnasium-compatible task wrapper."""
def __init__(self):
self.perception = PerceptionStack(sam3=True, molmo=True, depth=True)
self.control = FrankaControlApi(ik_solver="pyroki")
def get_prompt(self) -> str:
return "Pick up the red cube and place it on the green platform."
def compute_reward(self, obs, action, next_obs) -> float:
return float(self._check_cube_on_platform(next_obs))
Add a New Robot
Implement the control API interface. See references/api-spec.md for the full specification:
goto_pose(pos, quat) — Move end-effector to target pose via IK
grasp() / open_gripper() / close_gripper() — Gripper control
get_ee_pose() — Current end-effector state
Add a Perception Module
Register as a microservice with a standard interface:
- Input: RGB image (+ optional depth)
- Output: Structured detection/segmentation result
- Follow the pattern in
capx/perception/ servers
Integration with Life Agent OS
CaP-X connects to the Broomva Agent OS stack at three points:
Arcan Orchestration
Orchestrate multi-step robotic workflows as Arcan task graphs:
let pipeline = arcan::Pipeline::new()
.step("perceive", capx_perception_step)
.step("plan", capx_agent_step)
.step("execute", capx_control_step)
.step("verify", capx_vdm_step)
.on_failure("reflect", capx_reflect_step)
.build();
Spaces Agent Networking
Robotic agents publish events to Spaces channels:
#robot-logs <- Execution traces, success/failure, skill library updates
#agent-logs <- Session summaries (standard bstack integration)
#perception <- Scene descriptions, object detections, grasp candidates
Multiple robot agents share skill libraries via Spaces — a skill discovered by one robot transfers to others.
Lago Persistence
Store and version:
- Skill libraries — SHA-256 hashed Python functions as Lago blobs
- Evaluation traces — Execution logs for CaP-RL training data
- Benchmark results — CaP-Bench scores across models and tiers
- Trained checkpoints — RL-tuned model weights
Companion Skills
npx skills add broomva/skills --skill orcahand -g -y
npx skills add broomva/skills --skill remote-gpu -g -y
Resources
references/
references/api-spec.md — Full perception and control API specification
references/capbench-results.md — Benchmark results across 12 models, 8 tiers
references/caprl-training.md — CaP-RL training guide (GRPO config, GPU requirements, recipes)
scripts/
scripts/setup-perception.sh — Start all perception microservices
scripts/run-benchmark.sh — Full CaP-Bench evaluation sweep
Key Papers