add-benchmark
| name | add-benchmark |
| description | Add a new benchmark to sgl-eval by vendoring its NeMo-Skills dataset module and registering it in `_TABLE`. Use when the user asks to "add <benchmark>" / "support <benchmark>" / "register <benchmark>" inside the sgl-eval repo. |
Don't skip step 1. If NeMo-Skills (NS) doesn't have the benchmark, the work
becomes a heavier add-metrics-type task, which this skill does not cover.
SHA=$(python -c "import yaml; print(yaml.safe_load(open('sgl_eval/_vendored/nemo_skills/SOURCES.yaml'))['synced_from_sha'])")
gh api "repos/NVIDIA/NeMo-Skills/contents/nemo_skills/dataset/<name>?ref=$SHA" \
--jq '.[] | .name'
gh api "repos/NVIDIA/NeMo-Skills/contents/nemo_skills/dataset/<name>/__init__.py?ref=$SHA" \
--jq '.content' | base64 -d
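As a rough aid for the confirmation that follows, the two checks can be run mechanically on the fetched `__init__.py` text. This is a sketch only: it assumes `METRICS_TYPE` and `GENERATION_ARGS` are top-level string literals in that module, and the helper names `read_module_strings` and `check_upstream_init` are hypothetical, not part of sgl-eval or NeMo-Skills.

```python
import ast

# The wired-up types named in this skill.
WIRED_METRICS_TYPES = {"math", "multichoice"}


def read_module_strings(source):
    """Collect top-level `NAME = "string literal"` assignments from module source."""
    found = {}
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Assign) and len(node.targets) == 1):
            continue
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            continue
        try:
            value = ast.literal_eval(node.value)
        except (ValueError, TypeError):
            continue  # not a plain literal; ignore it
        if isinstance(value, str):
            found[target.id] = value
    return found


def check_upstream_init(source):
    """Return the ++prompt_config value, or raise if the skill cannot proceed."""
    values = read_module_strings(source)
    metrics_type = values.get("METRICS_TYPE")
    if metrics_type not in WIRED_METRICS_TYPES:
        raise ValueError(f"METRICS_TYPE {metrics_type!r} is not wired up; stop and ask the user")
    for token in values.get("GENERATION_ARGS", "").split():
        if token.startswith("++prompt_config="):
            return token.split("=", 1)[1]
    raise ValueError("GENERATION_ARGS has no ++prompt_config=...")
```

Pipe the output of the second `gh api` call into a file and pass its contents to `check_upstream_init`; a `ValueError` means stop here.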
Confirm:
- `METRICS_TYPE` is one of `math`, `multichoice` (the wired-up types). Otherwise stop and ask the user; this skill cannot proceed.
- `GENERATION_ARGS` contains a `++prompt_config=...` (the prompt yaml basename will be auto-derived).

`SOURCES.yaml`

Always append the dataset `__init__.py`:
- src: nemo_skills/dataset/<name>/__init__.py
  dst: dataset/<name>/__init__.py
Then exactly one data-source entry, depending on what the upstream benchmark ships:
If upstream has dataset/<name>/prepare.py (benchmark downloads
from HuggingFace / a URL at prepare time):
- src: nemo_skills/dataset/<name>/prepare.py
  dst: dataset/<name>/prepare.py
Else if upstream has dataset/<name>/test.txt (questions bundled
as jsonl-in-txt):
- src: nemo_skills/dataset/<name>/test.txt
  dst: dataset/<name>/test.txt
  binary: true
Else stop -- this skill only covers the two loaders sgl-eval wires.
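The prepare.py / test.txt / stop decision above can be sketched as a small function over the file names returned by the first `gh api` listing. The function name `data_source_entry` is hypothetical; the paths and the `binary` flag mirror the manifest entries shown above.

```python
def data_source_entry(name, upstream_files):
    """Pick the single data-source manifest entry for dataset/<name>.

    `upstream_files` is the list of file names in the upstream dataset
    directory (e.g. the output of the `--jq '.[] | .name'` listing).
    """
    base = f"nemo_skills/dataset/{name}"
    if "prepare.py" in upstream_files:
        # Benchmark downloads its data at prepare time.
        return {"src": f"{base}/prepare.py", "dst": f"dataset/{name}/prepare.py"}
    if "test.txt" in upstream_files:
        # Questions are bundled as jsonl-in-txt.
        return {
            "src": f"{base}/test.txt",
            "dst": f"dataset/{name}/test.txt",
            "binary": True,
        }
    raise ValueError(f"{name}: neither prepare.py nor test.txt upstream; stop")
```

Note that `prepare.py` wins when both files exist, matching the "if ... else if" order of the rule.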
If the prompt yaml referenced by GENERATION_ARGS is one we already
vendor (math.yaml, mcq-4choices.yaml, mcq-4choices-boxed.yaml),
no further yaml entry is needed. Otherwise, also add the prompt:
- src: nemo_skills/prompt/config/<path>.yaml
  dst: prompts/<basename>.yaml
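A minimal sketch of the prompt rule, assuming the `++prompt_config=` value maps directly onto `<path>` in the entry above (that mapping is an assumption, as is the helper name `prompt_entry`). It returns `None` when the prompt yaml is one of the three already vendored.

```python
from pathlib import PurePosixPath

# Prompt yamls this skill lists as already vendored.
ALREADY_VENDORED = {"math.yaml", "mcq-4choices.yaml", "mcq-4choices-boxed.yaml"}


def prompt_entry(prompt_path):
    """Return the manifest entry for a prompt yaml, or None if none is needed."""
    basename = PurePosixPath(prompt_path).name + ".yaml"
    if basename in ALREADY_VENDORED:
        return None
    return {
        "src": f"nemo_skills/prompt/config/{prompt_path}.yaml",
        "dst": f"prompts/{basename}",
    }
```

Only the last path component survives in `dst`, so two upstream prompts with the same basename in different directories would collide; check for that by hand.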
python scripts/sync_vendored.py
`_registry.py:_TABLE`

Append a row. Only set fields that are non-default for this benchmark;
`metrics_type` + prompt yaml basename are auto-derived from the upstream
`__init__.py:GENERATION_ARGS`, so do not hardcode them.
Two minimal shapes -- pick the one matching step 2:
Bundled (vendored test.txt):
{
    "name": "<name>",
    "loader": "bundled",
    "thinking": True,          # True for reasoning, False for knowledge
    "default_n_repeats": 16,   # 1 single-shot, 8/16 stochastic
    "description": "<one-liner>",
},
Prepare (vendored prepare.py):
{
    "name": "<name>",
    "loader": "prepare",
    "save_args": ("test",),               # see invariant below
    "save_kwargs": {"random_seed": 42},   # only if upstream save_data takes non-default kwargs
    "thinking": False,
    "default_n_repeats": 1,
    "description": "<one-liner>",
},
`save_args[0]` invariant. `_loader.py` uses `output_basename = f"{save_args[0]}.jsonl"` to find the file `save_data` wrote. So `save_args[0]` must equal the basename (sans `.jsonl`) of upstream's output. E.g. `gpqa.save_data("diamond")` writes `diamond.jsonl` -> `save_args=("diamond",)`. If upstream's signature differs, the loader can't be reused as-is.
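The invariant can be checked before committing a row. The sketch below reuses the `f"{save_args[0]}.jsonl"` expression quoted above; the function names are hypothetical and this is not the code in `_loader.py`.

```python
def expected_output_basename(save_args):
    """File name the loader will look for, per the invariant above."""
    return f"{save_args[0]}.jsonl"


def check_save_args(save_args, upstream_output):
    """Raise if `save_args[0]` does not match the file upstream's save_data writes."""
    expected = expected_output_basename(save_args)
    if expected != upstream_output:
        raise ValueError(
            f"save_args[0] gives {expected!r} but upstream writes {upstream_output!r}"
        )
```

For the gpqa case, `check_save_args(("diamond",), "diamond.jsonl")` passes, while `("test",)` against the same file raises.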
pytest # vendor audit + NS tests still pass
sgl-eval list -v # new benchmark appears with correct defaults
For a smoke test against a real server (only if user has one running):
sgl-eval run <name> --base-url http://localhost:30000/v1 --num-examples 3
One commit per benchmark. Title: add <name> benchmark.
The commit should contain:
- `SOURCES.yaml` change (manifest entries)
- `sgl_eval/_vendored/nemo_skills/dataset/<name>/`
- `_TABLE` row in `_registry.py`

Nothing else. If the sync also produced unrelated changes (e.g., the
upstream commit has been bumped meanwhile), separate that into its own
vendor-update commit first.