| name | add-eval-dataset-config |
| description | Guide for adding and validating evaluation dataset configuration in slime. Use when user wants to configure eval datasets via --eval-config or --eval-prompt-data, add per-dataset overrides, or customize evaluation rollout behavior. |
Add Eval Dataset Config
Configure evaluation datasets in slime with explicit dataset-level overrides and predictable runtime behavior.
When to Use
Use this skill when:
- User asks to add evaluation datasets for periodic eval
- User asks to migrate from
--eval-prompt-data to structured --eval-config
- User asks for per-dataset eval overrides (sampling params, keys, rm_type, metadata)
Step-by-Step Guide
Step 1: Choose Config Entry Method
Supported inputs:
- Structured config file:
--eval-config <yaml>
- Legacy CLI pairs:
--eval-prompt-data <name1> <path1> <name2> <path2> ...
If --eval-interval is set, eval datasets must be configured.
Step 2: Build YAML with Required Fields
Each dataset needs at least:
Example:
eval:
defaults:
n_samples_per_eval_prompt: 1
temperature: 0.7
top_p: 1.0
datasets:
- name: aime
path: /path/to/aime.jsonl
rm_type: math
input_key: prompt
label_key: answer
metadata_overrides:
split: test
Step 3: Understand Override Priority
slime/utils/eval_config.py resolves fields in this order:
- Dataset-level values in
eval.datasets[*]
eval.defaults
- CLI args fallback (for example eval_* or rollout_* fields)
Common overridable fields include:
- Runtime:
n_samples_per_eval_prompt, temperature, top_p, top_k, max_response_len
- Sample keys:
input_key, label_key, tool_key, metadata_key
- Extra:
rm_type, custom_generate_function_path, metadata_overrides
Step 4: Wire Eval Function if Needed
By default, eval uses --eval-function-path (defaults to rollout function path).
Use a separate eval function when inference/eval behavior must differ from training rollout.
Step 5: Validate Parsing and Runtime
- Start with config parsing sanity by running a short launch command.
- Confirm dataset entries are loaded into
args.eval_datasets.
- Verify output keys match eval logging/metrics expectations.
Common Mistakes
- Missing
name in dataset entries
- Odd-length
--eval-prompt-data pairs
- Setting
--eval-interval without any eval dataset
- Mixing reward dict outputs without
eval_reward_key configuration
Reference Locations
- Eval config model:
slime/utils/eval_config.py
- Eval config resolution:
slime/utils/arguments.py
- Eval rollout path:
slime/rollout/sglang_rollout.py
- Customization docs:
docs/en/get_started/customization.md