name	eval-model
description	Evaluate a trained or downloaded language model with `lm-eval-harness` standard tasks (arc, hellaswag, gsm8k, mmlu, truthfulqa, ifeval, ...). Triggered when the user wants to benchmark, eval, or compare a model — pre- or post-training.

eval-model — lm-eval-harness wrapper

Wrapper: `scripts/eval_lm.py`. Requires `pip install 'lm-eval>=0.4.5'` (or `bash bootstrap.sh --eval`).

Quick eval

python scripts/eval_lm.py \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --tasks arc_easy,hellaswag,piqa \
    --num-fewshot 0 \
    --batch-size auto

`--model` accepts either a local checkpoint directory (for example `outputs/sft-default`) or any Hub repo id. lm-eval handles loading in both cases.
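When the eval is launched from another script, it can help to build the command line programmatically. The sketch below only assembles the argument list from the flags shown in the quick-eval command. `build_eval_command` is a hypothetical helper, not part of the wrapper, and it assumes `scripts/eval_lm.py` is invoked from the repository root.

```python
import shlex


def build_eval_command(model, tasks, num_fewshot=0, batch_size="auto"):
    """Return the argv list for scripts/eval_lm.py (hypothetical helper).

    `model` is either a local checkpoint directory or a Hub repo id;
    the wrapper receives it unchanged through --model.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    return [
        "python", "scripts/eval_lm.py",
        "--model", model,
        "--tasks", ",".join(tasks),      # the wrapper takes a comma-separated list
        "--num-fewshot", str(num_fewshot),
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    # Same flags for a local checkpoint and for a Hub repo id.
    for target in ["outputs/sft-default", "HuggingFaceTB/SmolLM2-360M-Instruct"]:
        print(shlex.join(build_eval_command(target, ["arc_easy", "hellaswag", "piqa"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the eval.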

Task picks by training goal

Goal	Tasks
Instruct-quality smoke	`arc_easy,hellaswag,piqa,winogrande`
Math reasoning	`gsm8k,minerva_math`
Knowledge	`mmlu`
Truthfulness	`truthfulqa_mc2`
Instruction following	`ifeval`
Code	`humaneval` (needs sandbox)
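The table above can also be kept as a lookup so that several goals combine into one `--tasks` value. This is a minimal sketch: the goal keys (`smoke`, `math`, and so on) and the `tasks_for` helper are illustrative names, and only the task names come from the table.

```python
# Task names copied from the table above; the goal keys are illustrative.
TASKS_BY_GOAL = {
    "smoke": ["arc_easy", "hellaswag", "piqa", "winogrande"],
    "math": ["gsm8k", "minerva_math"],
    "knowledge": ["mmlu"],
    "truthfulness": ["truthfulqa_mc2"],
    "instruction_following": ["ifeval"],
    "code": ["humaneval"],  # needs a sandbox, per the table
}


def tasks_for(goals):
    """Merge the task lists for several goals into one --tasks string.

    Order is preserved and duplicates are dropped.
    """
    merged = []
    for goal in goals:
        for task in TASKS_BY_GOAL[goal]:  # KeyError on an unknown goal
            if task not in merged:
                merged.append(task)
    return ",".join(merged)


if __name__ == "__main__":
    print(tasks_for(["math", "knowledge"]))
```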

Reporting

After the run, surface:

Task → score → baseline comparison. Use the published numbers from the paper that motivated the recipe (the research-recipe output should include these).
The model URL on the Hub.
The Trackio / wandb run URL for training, if applicable.

Do not just dump the JSON. Pick the 3-5 numbers the user actually cares about and put them in a small table.
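One way to turn the raw results into that small table is sketched below. It assumes the results follow the `results[task][metric]` layout with keys such as `acc,none`, as lm-eval 0.4.x typically writes them. Check the actual JSON, because some tasks report a different metric (pass it via `metric`). `summarize` and `format_table` are hypothetical helpers, and the numbers in the example are placeholders, not real scores.

```python
def summarize(results, baselines, metric="acc,none"):
    """Return (task, score, baseline, delta) rows for tasks that report `metric`.

    `results` is assumed to map task name -> {metric name -> value}.
    `baselines` maps task name -> published baseline score; tasks without
    a baseline get None for both baseline and delta.
    """
    rows = []
    for task, metrics in results.items():
        if metric not in metrics:
            continue  # task reports a different metric; handle it separately
        score = metrics[metric]
        base = baselines.get(task)
        delta = None if base is None else score - base
        rows.append((task, score, base, delta))
    return rows


def format_table(rows):
    """Render rows as a small tab-separated table."""
    lines = ["Task\tScore\tBaseline\tDelta"]
    for task, score, base, delta in rows:
        base_s = "n/a" if base is None else f"{base:.3f}"
        delta_s = "n/a" if delta is None else f"{delta:+.3f}"
        lines.append(f"{task}\t{score:.3f}\t{base_s}\t{delta_s}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    results = {"arc_easy": {"acc,none": 0.5}, "piqa": {"acc,none": 0.75}}
    baselines = {"arc_easy": 0.25}
    print(format_table(summarize(results, baselines)))
```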
