| name | review-vendor-coverage |
| description | Audit whether all score-deciding logic in sgl-eval is vendored from NeMo-Skills, or whether some has crept into SE code. Use when the user asks "are we vendoring enough", "review vendor coverage", "audit vendoring", "is our vendoring complete", or before a release. |
Review vendor coverage
The principle (per CLAUDE.md): anything that decides a score is
vendored verbatim from NeMo-Skills. SE code is allowed only for
transport, glue, default config, and CLI / output formatting.
This skill audits whether SE code has leaked scoring logic. It does NOT
auto-fix; vendoring more is a design call the user makes after seeing
the report.
1. Enumerate SE files
find sgl_eval scripts tests -name "*.py" \
| grep -v _vendored | grep -v __pycache__ | sort
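For callers that prefer to stay in Python, the enumeration above can be sketched roughly as follows. The function name `enumerate_se_files` and its parameters are hypothetical, not part of sgl-eval; the sketch only mirrors the substring filtering of the `grep -v` pipeline, and Python's sort order may differ from `sort` under some locales.

```python
from pathlib import Path


def enumerate_se_files(root=".", top_dirs=("sgl_eval", "scripts", "tests")):
    """Return sorted relative paths of .py files, skipping vendored and cache paths.

    Hypothetical helper: approximates the find/grep pipeline shown above.
    """
    root = Path(root)
    found = []
    for top in top_dirs:
        base = root / top
        if not base.is_dir():
            # A missing directory is skipped here rather than reported as an error.
            continue
        for path in base.rglob("*.py"):
            rel = path.relative_to(root).as_posix()
            # Same idea as `grep -v _vendored | grep -v __pycache__`:
            # drop any path containing either substring.
            if "_vendored" in rel or "__pycache__" in rel:
                continue
            found.append(rel)
    return sorted(found)
```

Calling `enumerate_se_files()` from the repository root should give the list to classify in step 2.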
2. Classify each into one of four buckets
For each file (and each function within larger files), ask:
"Does this code decide a score?"
| Bucket | Description | Examples | Action |
|---|---|---|---|
| Transport | Wraps an external service or a Python primitive. No scoring decisions. | sampler, runner, cli, types, registry, metrics formatter | OK as SE. |
| Glue | Bridges between vendored API and our runtime model. Necessary because of architectural difference (sync vs async, in-memory vs file, single vs batch). | _predictions.py, _loader.py, _eval_single_sync, _score_via_eval_mcq | OK as SE. Document the upstream API it bridges. |
| Default config | Per-benchmark sampling / repeat / thinking defaults. Model-dependent, not a scoring decision. | _TABLE rows in _registry.py | OK as SE. Confirm it tracks NS-aligned defaults. |
| Score-deciding | Code that parses generation, compares answers, computes pass@k, renders prompts beyond str.format. | (should be empty in SE!) | RED FLAG. See step 4. |
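As a first pass before reading each function by hand, a name-based triage can narrow down where to look. This is only a sketch: `flag_candidates` and the `SCORE_HINTS` keyword list are hypothetical, and a name match is a prompt for manual review, not a verdict. For example, `_score_via_eval_mcq` would match `score` even though the table above lists it as glue.

```python
import ast

# Hypothetical name hints; tune them for the codebase under audit.
SCORE_HINTS = ("score", "grade", "judge", "extract", "compare", "pass_at", "majority")


def flag_candidates(source, hints=SCORE_HINTS):
    """Return (function_name, lineno) pairs whose names suggest scoring logic."""
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        # Cover nested functions and methods as well as top-level definitions.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            name = node.name.lower()
            if any(hint in name for hint in hints):
                flagged.append((node.name, node.lineno))
    # Order by position in the file so the output reads top to bottom.
    return sorted(flagged, key=lambda item: item[1])
```

Each flagged function still needs the question from this step asked of it directly: does this code decide a score?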
3. Specific spots to scrutinize (grey areas)
Known borderline pieces -- not violations, but watch for drift:
| Location | What it does | Why grey |
|---|---|---|
| evals/_prompts.py:render_prompt | yaml.safe_load + str.format on vendored yaml | Re-implements upstream's prompt subsystem (~500 lines). Equivalent today because templates only use {problem} / {examples}. Drift if upstream adds richer templating. |
| evals/_math.py:aggregate_with_math_metrics (padding) | while len < n_repeats: append dup | NS doesn't pad; we pad to keep MathMetrics happy under partial sample failure. Affects score only when sampler fails partway. |
| evals/_math.py:_flatten_math_metrics and _multichoice.py:_flatten | Picks fields out of MathMetrics.get_metrics() nested dict | Display selection, not computation. Safe if upstream keeps the same field names. |
| _TABLE thinking / n_repeats | Per-benchmark sgl-eval defaults | Not a violation; runtime config. CLI override always wins. |
If a new grey area surfaces during the audit, add a row.
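To see why the padding row is grey rather than harmless, here is a toy illustration. It assumes the duplicate appended is the last available sample, which the table does not specify, and it uses a plain per-sample mean instead of MathMetrics; both function names are hypothetical.

```python
def pad_with_duplicates(samples, n_repeats):
    """Pad a partial sample list up to n_repeats by repeating the last item.

    Assumption: the real padding may choose a different duplicate.
    """
    padded = list(samples)
    if not padded:
        return padded  # nothing to duplicate
    while len(padded) < n_repeats:
        padded.append(padded[-1])
    return padded


def mean_correct(flags):
    """Toy metric: fraction of samples marked correct (not MathMetrics)."""
    return sum(flags) / len(flags)


# Sampler returned 2 of 4 requested samples: one correct, one wrong.
partial = [True, False]
padded = pad_with_duplicates(partial, 4)

print(mean_correct(partial))  # 0.5
print(mean_correct(padded))   # 0.25
```

When all `n_repeats` samples arrive, the padding loop never runs and the two numbers agree, which matches the claim that the score is affected only when the sampler fails partway.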
4. If you find score-deciding code in SE
Don't auto-edit. Report to the user:
- What it computes (the scoring decision in plain words).
- Where in NS upstream the equivalent lives (or "not in NS").
- What it would take to vendor it: file path(s), import rewrites,
any drop_imports/drop_functions needed.
Then ask the user whether to:
- vendor (run the vendor-update workflow with an extended SOURCES.yaml),
- keep as SE (document the deviation, e.g., in CLAUDE.md grey-area list),
- or remove the logic entirely.
5. Output format
Write a short report (under 300 words):
SE files: <count> total
transport : <n> (<list>)
glue : <n> (<list>)
default config : <n> (<list>)
score-deciding : <n> (RED if > 0)
Grey areas reviewed:
- <location>: <still acceptable / has drifted because X>
New findings:
- <none> | <list>
Recommendation: <one sentence>
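If the classification is collected in a dict, the report above can be assembled mechanically. The following is a minimal sketch; `render_report` and its argument shapes are assumptions made for illustration, not an existing sgl-eval helper.

```python
def render_report(buckets, grey_areas, findings, recommendation):
    """Render the audit report text.

    buckets: dict mapping bucket name -> list of file names (hypothetical shape).
    grey_areas: list of (location, verdict) pairs.
    findings: list of strings; empty means no new findings.
    """
    total = sum(len(files) for files in buckets.values())
    lines = [f"SE files: {total} total"]
    for name in ("transport", "glue", "default config"):
        files = buckets.get(name, [])
        lines.append(f"{name} : {len(files)} ({', '.join(files)})")
    n_score = len(buckets.get("score-deciding", []))
    # Any score-deciding code in SE is a red flag.
    suffix = " (RED)" if n_score > 0 else ""
    lines.append(f"score-deciding : {n_score}{suffix}")
    lines.append("Grey areas reviewed:")
    for location, verdict in grey_areas:
        lines.append(f"- {location}: {verdict}")
    lines.append("New findings:")
    if findings:
        lines.extend(f"- {item}" for item in findings)
    else:
        lines.append("- none")
    lines.append(f"Recommendation: {recommendation}")
    return "\n".join(lines)
```

The 300-word limit is not enforced by this sketch; long file lists may need trimming by hand.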