add-dataset
Register or inspect a Hugging Face dataset for Marin pipelines.
| name | add-dataset |
| description | Register or inspect a Hugging Face dataset for Marin pipelines. |
Inspect a Hugging Face dataset schema with Marin's schema inspection tool, then
register an ExecutorStep so the dataset can be downloaded in Marin pipelines.
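Register the step in one of these places:
- experiments/pretraining_datasets/__init__.py (see the fineweb_edu entry).
- A dedicated file (for example experiments/pretraining_datasets/nemotron.py); add the step there.
- For datasets with several subsets, follow nemotron.py, defining a separate step per subset.

To run the schema inspection tool:
- Do not use pip install.
- Run uv sync --all-packages, then run uv run lib/marin/tools/get_hf_dataset_schema.py.
- Alternatively, run uv run --with datasets --with pyyaml lib/marin/tools/get_hf_dataset_schema.py ...

Command line: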
uv run lib/marin/tools/get_hf_dataset_schema.py <dataset_name> [options]
Python import:
from marin.tools.get_hf_dataset_schema import get_schema
schema = get_schema(dataset_name="wikitext", config_name="wikitext-103-v1")
Notes:
- If the tool returns {"error": "Config name is required.", "available_configs": [...]}, select an appropriate config from the list and retry with --config_name.
- To choose the text field, prefer a field named text; fall back to fields containing text; consider string-type fields if no obvious text field exists. Examine sample_row to verify field contents.
- Some datasets need remote code (--trust_remote_code).
- sample_row may be empty for some datasets.

The tool returns a JSON object:
{
  "splits": ["train", "validation", ...],
  "text_field_candidates": ["text", "content", ...],
  "features": {
    "text": "string",
    "label": "int64",
    ...
  },
  "sample_row": {
    "text": "Example content...",
    ...
  }
}
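The text-field guidance above can be sketched as a small helper that works on the parsed JSON object. This is a minimal illustration, not part of the Marin tool: the function name pick_text_field is hypothetical, and it assumes feature types are reported as plain strings such as "string", as in the sample output.

```python
from typing import Any, Optional


def pick_text_field(schema: dict[str, Any]) -> Optional[str]:
    """Pick a likely text field from the tool's JSON output (hypothetical helper)."""
    features = schema.get("features", {})

    # 1. Prefer a field named exactly "text".
    if "text" in features:
        return "text"

    # 2. Fall back to fields whose name contains "text".
    for name in features:
        if "text" in name.lower():
            return name

    # 3. Otherwise consider string-typed fields (assumes dtype is reported as "string").
    for name, dtype in features.items():
        if dtype == "string":
            return name

    # No plausible candidate; inspect the schema by hand.
    return None


if __name__ == "__main__":
    example = {"features": {"id": "int64", "raw_text": "string"}, "sample_row": {}}
    print(pick_text_field(example))
```

Whatever the helper returns, it is still worth checking sample_row by eye when it is present, since a string field is not necessarily the document body.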
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext
{
  "error": "Config name is required.",
  "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1", ...]
}

$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext --config_name wikitext-103-v1
{
  "splits": ["train", "validation", "test"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Article content..."}
}
For datasets needing remote code, add --trust_remote_code.
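The retry flow in the wikitext example can be sketched as two small functions that only build command lines and never run them. The names build_command and retry_command are hypothetical, the sketch assumes --trust_remote_code is a boolean flag that takes no value, and picking the first listed config when no preference is given is an arbitrary default rather than anything the tool prescribes.

```python
from typing import Any, Optional

TOOL = "lib/marin/tools/get_hf_dataset_schema.py"


def build_command(
    dataset: str,
    config_name: Optional[str] = None,
    trust_remote_code: bool = False,
) -> list[str]:
    """Build the argv list for the schema tool (hypothetical helper)."""
    cmd = ["uv", "run", TOOL, dataset]
    if config_name is not None:
        cmd += ["--config_name", config_name]
    if trust_remote_code:
        # Assumption: the option is a plain flag with no argument.
        cmd.append("--trust_remote_code")
    return cmd


def retry_command(
    dataset: str,
    result: dict[str, Any],
    preferred: Optional[str] = None,
) -> Optional[list[str]]:
    """Return a retry command if the tool asked for a config, else None."""
    configs = result.get("available_configs")
    if "error" not in result or not configs:
        return None
    # Use the caller's preference when it is offered; otherwise take the
    # first listed config (an arbitrary default, review before using).
    choice = preferred if preferred in configs else configs[0]
    return build_command(dataset, config_name=choice)


if __name__ == "__main__":
    first = {
        "error": "Config name is required.",
        "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1"],
    }
    print(" ".join(retry_command("wikitext", first, preferred="wikitext-103-v1")))
```

Keeping the command as a list rather than a single string means it can be passed to subprocess.run without shell quoting concerns.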
Once the schema is inspected and the dataset is registered, model the tokenization config on existing dataset configs.
Schema inspection tool: lib/marin/tools/get_hf_dataset_schema.py