| name | rerun-lerobot |
| description | Ingest a LeRobot (HuggingFace) dataset into Rerun. Read when converting a LeRobot dataset to RRDs, splitting it into per-episode segments, or registering it on a Rerun catalog. Covers the built-in directory importer (log_file_from_path), the RrdReader + send_chunks per-episode split, and when to drop to ParquetReader for custom control. |
| user_invocable | true |
| allowed-tools | Read, Grep, Bash, WebFetch |
Rerun LeRobot Ingestion
Rerun has a built-in LeRobot importer: point log_file_from_path (or the
viewer, or rerun <dir> on the CLI) at the dataset directory and it ingests
episodes, camera videos, and state/action tables with no conversion code.
There is no chunk-level LeRobotReader; the chunk-processing route is to
import first, then reprocess the resulting RRD with RrdReader. The complete
working pipeline this skill follows is prepare_dataset.py in
https://github.com/rerun-io/rerun/tree/main/examples/python/dataloader.
Verified against rerun-sdk 0.34.0a1. The download step needs
huggingface_hub.
Step 1: dataset → one combined RRD
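from pathlib import Path

from huggingface_hub import snapshot_download
import rerun as rr

# Placeholder paths; adjust to your own layout.
dest = Path("data/so101-pick-and-place")
combined_rrd = Path("data/combined.rrd")

dataset_dir = snapshot_download(repo_id="rerun/so101-pick-and-place",
                                repo_type="dataset", local_dir=dest)

# Import the whole dataset directory into a single RRD file.
with rr.RecordingStream("lerobot") as rec:
    rec.save(str(combined_rrd))
    rec.log_file_from_path(str(dataset_dir))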
The importer emits one recording per episode (recording ids like
episode_1), plus a metadata-only root recording, all into the single RRD.
Step 2: split into per-episode RRDs
Catalog segments are one-recording-per-file, and recording_id becomes the
segment id on registration. Split with RrdReader:
# zero_pad (pads the episode number) and rrd_dir (output directory) must be
# defined by the caller.
reader = rr.experimental.RrdReader(str(combined_rrd))
for entry in reader.recordings():
    store = reader.store(store=entry)
    # Skip the metadata-only root recording.
    if not store.schema().entity_paths():
        continue
    episode_id = zero_pad(entry.recording_id)
    with rr.RecordingStream("lerobot", recording_id=episode_id, send_properties=False) as rec:
        rec.save(str(rrd_dir / f"{episode_id}.rrd"))
        rec.send_chunks(store)
Two non-obvious moves, both from prepare_dataset.py:
- Zero-pad the episode id. episode_10 sorts before episode_2
lexicographically; segment tables and viewers sort lexicographically. Pad to
a fixed width when re-assigning recording_id.
- send_properties=False on the new stream, so the copy doesn't inject
fresh recording properties on top of the copied chunks.
send_chunks does not preserve the source store's identity; the new stream's
recording_id wins, which is exactly what makes the rename work.
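The split loop calls zero_pad without showing it. A minimal sketch of such a helper, assuming ids end in a decimal episode number (as in episode_1) and that a width of 6 digits is enough; both the implementation and the default width are illustrative, not necessarily what prepare_dataset.py does:

```python
import re


def zero_pad(recording_id: str, width: int = 6) -> str:
    """Pad the trailing integer of an id like 'episode_10' to a fixed width."""
    match = re.fullmatch(r"(.*?)(\d+)", recording_id)
    if match is None:
        # No trailing number: leave the id untouched.
        return recording_id
    prefix, number = match.groups()
    return f"{prefix}{int(number):0{width}d}"


ids = ["episode_10", "episode_2", "episode_1"]
print(sorted(ids))                       # lexicographic order puts 10 before 2
print(sorted(zero_pad(i) for i in ids))  # padded ids sort in episode order
```

Pick the width from the largest episode count you expect; numbers wider than the width are left unpadded rather than truncated.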
If episodes need cleanup (drop topics, fix data, add derived components), run
the store through lenses between read and write:
reader.stream(store=entry).drop(...).lenses(...) then
collect(optimize=OptimizationProfile.OBJECT_STORE).write_rrd(..., recording_id=episode_id)
(see rerun-chunk-processing). Use the OBJECT_STORE profile whenever the RRDs
are headed for a catalog.
Step 3: register on a catalog
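# `client` is an already-connected catalog client.
# Path.as_uri() raises ValueError on relative paths, so resolve first.
dataset = client.create_dataset("my_lerobot_set")
dataset.register_prefix(rrd_dir.resolve().as_uri())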
Computed layers and per-episode properties then follow the standard patterns
in rerun-data-model (layer recording_id must equal the episode segment
id).
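One detail of the registration call is plain pathlib behavior rather than anything Rerun-specific: Path.as_uri() only accepts absolute paths. A small sketch, where prefix_uri is a hypothetical helper and "rrds" is a placeholder directory name:

```python
from pathlib import Path


def prefix_uri(directory):
    """Return a file:// URI for `directory`, resolving relative paths first."""
    return Path(directory).resolve().as_uri()


try:
    Path("rrds").as_uri()  # relative path: pathlib refuses to build a URI
except ValueError as err:
    print(f"relative path rejected: {err}")

print(prefix_uri("rrds"))  # absolute file:// URI under the current directory
```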
When to bypass the importer
The importer decides the entity layout for you. Reach for ParquetReader
(see rerun-parquet) on the dataset's data/*.parquet files when you need a
different model: custom entity paths, selected columns only, typed
Transform3D assembly from flat columns, or ingestion without materializing
videos. You then own video handling and episode boundaries yourself; prefer
the importer unless its layout actually blocks you.
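Owning episode boundaries mostly means mapping each episode to its row range before logging. A stdlib-only sketch, assuming the parquet rows carry a per-row episode_index column (the usual LeRobot column name, not stated in this document) and that rows are stored in episode order; episode_ranges is a hypothetical helper and the column below is made-up data:

```python
from itertools import groupby


def episode_ranges(episode_index):
    """Return {episode: (start_row, end_row_exclusive)} for contiguous runs."""
    ranges = {}
    row = 0
    for episode, run in groupby(episode_index):
        length = sum(1 for _ in run)
        if episode in ranges:
            # The same episode appearing in two separate runs means the
            # rows are interleaved, which this sketch does not handle.
            raise ValueError(f"episode {episode} is not contiguous")
        ranges[episode] = (row, row + length)
        row += length
    return ranges


column = [0, 0, 0, 1, 1, 2, 2, 2, 2]
print(episode_ranges(column))  # {0: (0, 3), 1: (3, 5), 2: (5, 9)}
```

Each range can then be sliced out of the table and written as its own recording, mirroring the one-recording-per-episode layout the importer produces.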
Gotchas
- log_file_from_path must target the dataset root directory, not a file
inside it.
- Unpadded episode ids mis-sort downstream; pad before registering.
- The combined RRD contains a metadata-only root recording; skip stores with
no entity paths or you register an empty segment.
- Re-registering after a fix: the catalog keys segments by recording_id;
reuse the same ids to update in place rather than minting new ones.
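The first gotcha can be caught before the import runs. A minimal pre-flight sketch, assuming a LeRobot dataset root contains meta/info.json (an assumption about the dataset layout, not something the importer documents here); check_dataset_root is a hypothetical helper:

```python
from pathlib import Path


def check_dataset_root(path):
    """Return the resolved dataset root, or raise if `path` is not one."""
    root = Path(path).expanduser().resolve()
    if not root.is_dir():
        raise ValueError(f"{root} is not a directory; pass the dataset root")
    # Assumption: a LeRobot dataset root has a meta/info.json file.
    if not (root / "meta" / "info.json").is_file():
        raise ValueError(f"{root} has no meta/info.json; is this the dataset root?")
    return root
```

Calling it on dataset_dir right before log_file_from_path turns a confusing import failure into an explicit error message.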
References
- https://github.com/rerun-io/rerun/tree/main/examples/python/dataloader :
prepare_dataset.py (download → import → split → register, complete and
runnable) and train.py (training-side consumption via
rerun.experimental.dataloader)
- rerun-parquet (direct parquet route), rerun-chunk-processing (RrdReader,
lenses, OBJECT_STORE), rerun-data-model (segments, layers, properties)