| name | add-dataset |
| description | Register or inspect a Hugging Face dataset for Marin pipelines. |
Inspect a Hugging Face dataset schema with Marin's schema inspection tool, then
register an ExecutorStep so the dataset can be downloaded in Marin pipelines.
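Register the step in one of these places:
- experiments/pretraining_datasets/__init__.py (see the fineweb_edu entry).
- A dedicated module (e.g. experiments/pretraining_datasets/nemotron.py); add the step there.
- For a dataset with several subsets, follow nemotron.py, defining a separate step per subset.

To run the schema inspection tool:
- Do not use pip install.
- Run uv sync --all-packages, then run uv run lib/marin/tools/get_hf_dataset_schema.py.
- Alternatively: uv run --with datasets --with pyyaml lib/marin/tools/get_hf_dataset_schema.py ...

Command line: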
uv run lib/marin/tools/get_hf_dataset_schema.py <dataset_name> [options]
Python import:
from marin.tools.get_hf_dataset_schema import get_schema
schema = get_schema(dataset_name="wikitext", config_name="wikitext-103-v1")
- If the tool returns {"error": "Config name is required.", "available_configs": [...]}, select an appropriate config from the list and retry with --config_name.
- To choose the text field, prefer a field named text; fall back to fields containing text; consider string-type fields if no obvious text field exists. Examine sample_row to verify field contents.
- Some datasets require remote code (--trust_remote_code).
- sample_row may be empty for some datasets.

The tool returns a JSON object:
{
  "splits": ["train", "validation", ...],
  "text_field_candidates": ["text", "content", ...],
  "features": {
    "text": "string",
    "label": "int64",
    ...
  },
  "sample_row": {
    "text": "Example content...",
    ...
  }
}
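The text-field heuristic described above can be sketched against this output shape. This is a minimal illustration, not part of the Marin tool: the helper name pick_text_field is hypothetical, and it assumes the "features" mapping uses the type string "string" as shown in the example output.

```python
from typing import Optional


def pick_text_field(schema: dict) -> Optional[str]:
    """Return the most likely text field name, or None if nothing qualifies.

    Hypothetical helper: assumes `schema` has the shape shown above,
    with "features" mapping field names to type strings.
    """
    features = schema.get("features", {})

    # 1. Prefer a field named exactly "text".
    if "text" in features:
        return "text"

    # 2. Fall back to the first field whose name contains "text".
    for name in features:
        if "text" in name.lower():
            return name

    # 3. Otherwise consider the first string-typed field.
    for name, dtype in features.items():
        if dtype == "string":
            return name

    return None
```

Whatever the helper returns, it is still worth checking sample_row by eye, since a string-typed field may hold an ID or URL rather than document text.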
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext
{
  "error": "Config name is required.",
  "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1", ...]
}

$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext --config_name wikitext-103-v1
{
  "splits": ["train", "validation", "test"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Article content..."}
}
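The two-step flow above (call once, read available_configs, retry with a config) could be automated along these lines. This is a sketch under assumptions: the helper fetch_schema_with_config is hypothetical, it takes the schema function as a parameter, and it assumes that function returns the same dictionary the command line prints and can be called without config_name. Falling back to the first listed config is a placeholder policy, not a recommendation.

```python
from typing import Callable, Optional


def fetch_schema_with_config(
    fetch: Callable[..., dict],
    dataset_name: str,
    preferred_config: Optional[str] = None,
) -> dict:
    """Fetch a schema, retrying once with a config name if one is required.

    `fetch` is any callable accepting dataset_name= and config_name=
    keyword arguments (assumed to match get_schema's interface).
    """
    result = fetch(dataset_name=dataset_name)

    configs = result.get("available_configs")
    if "error" not in result or not configs:
        # Either the call succeeded or the error is not a missing config.
        return result

    # Use the preferred config when it is available; otherwise fall back
    # to the first one listed (placeholder policy for this sketch).
    config = preferred_config if preferred_config in configs else configs[0]
    return fetch(dataset_name=dataset_name, config_name=config)
```

With the real tool, the call would presumably look like fetch_schema_with_config(get_schema, "wikitext", preferred_config="wikitext-103-v1"), provided get_schema behaves as assumed here.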
For datasets needing remote code, add --trust_remote_code.
Once the schema is inspected and the dataset is registered, copy the pattern of existing dataset configs to set up tokenization:
lib/marin/tools/get_hf_dataset_schema.py