add-dataset
Register or inspect a Hugging Face dataset for Marin pipelines.
| name | add-dataset |
| description | Register or inspect a Hugging Face dataset for Marin pipelines. |
Inspect a Hugging Face dataset schema with Marin's schema inspection tool, then
register an ExecutorStep so the dataset can be downloaded in Marin pipelines.
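Where to register the step:
- For a single dataset, add the step to experiments/pretraining_datasets/__init__.py (see the fineweb_edu entry).
- For a dataset that warrants its own file, create one (for example experiments/pretraining_datasets/nemotron.py) and add the step there.
- For a dataset with subsets, follow nemotron.py, defining a separate step per subset.

Running the schema inspection tool:
- Do not use pip install. Run uv sync --all-packages, then run uv run lib/marin/tools/get_hf_dataset_schema.py.
- Alternatively, supply the dependencies inline: uv run --with datasets --with pyyaml lib/marin/tools/get_hf_dataset_schema.py ...

Command line: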
uv run lib/marin/tools/get_hf_dataset_schema.py <dataset_name> [options]
Python import:
from marin.tools.get_hf_dataset_schema import get_schema
schema = get_schema(dataset_name="wikitext", config_name="wikitext-103-v1")
Notes:
- If the dataset needs a config, the tool returns {"error": "Config name is required.", "available_configs": [...]}. Select an appropriate config from the list and retry with --config_name.
- To choose the text field, prefer a field named text; fall back to fields containing text; consider string-type fields if no obvious text field exists. Examine sample_row to verify field contents.
- Some datasets require remote code (--trust_remote_code).
- sample_row may be empty for some datasets.

The tool returns a JSON object:
{
  "splits": ["train", "validation", ...],
  "text_field_candidates": ["text", "content", ...],
  "features": {
    "text": "string",
    "label": "int64",
    ...
  },
  "sample_row": {
    "text": "Example content...",
    ...
  }
}
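The field-selection heuristic can be sketched as a small helper that works on a dict shaped like the JSON above. This is an illustration only: pick_text_field is a hypothetical name, not part of Marin, and reading "fields containing text" as "field names containing the substring text" is an assumption.

```python
from typing import Any, Optional


def pick_text_field(schema: dict[str, Any]) -> Optional[str]:
    """Pick a likely text field from a schema dict (hypothetical helper, not a Marin API)."""
    features: dict[str, str] = schema.get("features", {})
    sample_row: dict[str, Any] = schema.get("sample_row") or {}

    # Priority order: exact "text", then names containing "text", then any string-typed field.
    exact = [name for name in features if name == "text"]
    containing = [name for name in features if "text" in name.lower() and name != "text"]
    strings = [name for name, dtype in features.items() if dtype == "string"]
    ordered = list(dict.fromkeys(exact + containing + strings))  # de-duplicate, keep order

    for name in ordered:
        # sample_row may be empty for some datasets; then trust the ordering alone.
        if not sample_row:
            return name
        # Otherwise verify that the sample value really is non-empty text.
        value = sample_row.get(name)
        if isinstance(value, str) and value.strip():
            return name
    return None
```

The helper only narrows the choice. Inspecting sample_row by eye is still the final check.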
$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext
{
  "error": "Config name is required.",
  "available_configs": ["wikitext-103-raw-v1", "wikitext-103-v1", ...]
}

$ uv run lib/marin/tools/get_hf_dataset_schema.py wikitext --config_name wikitext-103-v1
{
  "splits": ["train", "validation", "test"],
  "text_field_candidates": ["text"],
  "features": {"text": "string"},
  "sample_row": {"text": "Article content..."}
}
For datasets needing remote code, add --trust_remote_code.
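The retry flow shown in the wikitext example can also be scripted. The sketch below is a rough illustration and rests on several assumptions: that the tool prints only the JSON object to stdout, that --trust_remote_code is a bare flag, and that picking the first config is an acceptable placeholder (in practice, choose the config deliberately). inspect_dataset, run_tool, and choose_config are hypothetical names, not part of Marin.

```python
import json
import subprocess
from typing import Any, Callable, Sequence

TOOL = "lib/marin/tools/get_hf_dataset_schema.py"


def run_tool(args: Sequence[str]) -> dict[str, Any]:
    """Run the schema tool through uv and parse stdout as JSON (assumes stdout is pure JSON)."""
    result = subprocess.run(
        ["uv", "run", TOOL, *args], capture_output=True, text=True, check=False
    )
    return json.loads(result.stdout)


def inspect_dataset(
    dataset_name: str,
    choose_config: Callable[[list[str]], str] = lambda configs: configs[0],
    trust_remote_code: bool = False,
    run: Callable[[Sequence[str]], dict[str, Any]] = run_tool,
) -> dict[str, Any]:
    """Inspect a dataset, retrying once with --config_name if the tool asks for a config."""
    extra = ["--trust_remote_code"] if trust_remote_code else []
    response = run([dataset_name, *extra])

    # The tool reports a missing config as an error plus the list of available configs.
    configs = response.get("available_configs")
    if "error" in response and configs:
        config_name = choose_config(configs)  # placeholder policy; pick deliberately in practice
        response = run([dataset_name, "--config_name", config_name, *extra])
    return response
```

For example, inspect_dataset("wikitext", choose_config=lambda c: "wikitext-103-v1") would issue the same two commands as the example above. The run parameter exists so the flow can be exercised without invoking uv.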
Once the schema is inspected and the dataset is registered, cargo-cult existing dataset configs for tokenization.
Schema inspection tool: lib/marin/tools/get_hf_dataset_schema.py