slm-lab-benchmark
// Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots.
| name | slm-lab-benchmark |
| description | Run SLM-Lab deep RL benchmarks, monitor dstack jobs, extract results, and update BENCHMARKS.md. Use when asked to run benchmarks, check run status, extract scores, update benchmark tables, or generate plots. |
`${max_frame}` variable in specs: never hardcode

Every completed run MUST go through ALL of these steps. No exceptions. Do not skip any step.
When a run completes (`dstack ps` shows `exited (0)`):

1. `dstack logs NAME | grep "trial_metrics"` → get `total_reward_ma`
2. `dstack logs NAME 2>&1 | grep "Uploading data/"` → extract the folder name from the upload log line
3. HF link for BENCHMARKS.md: `[FOLDER](https://huggingface.co/datasets/SLM-Lab/benchmark-dev/tree/main/data/FOLDER)`
4. `source .env && huggingface-cli download SLM-Lab/benchmark-dev --local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"`
5. Find the folders (`ls data/benchmark-dev/data/ | grep -i envname`), then generate with ONLY the folders matching BENCHMARKS.md entries:
   `uv run slm-lab plot -t "EnvName" -d data/benchmark-dev/data -f FOLDER1,FOLDER2,...`
   NOTE: `-d` sets the base data dir, `-f` takes folder names (NOT full paths).
   If some folders are in `data/` (local runs) and some in `data/benchmark-dev/data/`, use `data/` as base (it has the `info/` subfolder needed for metrics).
6. `docs/plots/`

A row in BENCHMARKS.md is NOT complete until it has: score, HF link, and plot.
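The link format used in BENCHMARKS.md can be generated rather than typed by hand. The following is a minimal sketch; `hf_link` is a hypothetical helper name (not part of SLM-Lab), and `ppo_pong_example` is a made-up folder name used only for illustration.

```shell
# Hypothetical helper: print a BENCHMARKS.md link for a run folder.
# $1 = folder name, $2 = HF dataset repo (defaults to the dev repo).
hf_link() {
  folder="$1"
  repo="${2:-SLM-Lab/benchmark-dev}"
  printf '[%s](https://huggingface.co/datasets/%s/tree/main/data/%s)\n' \
    "$folder" "$repo" "$folder"
}

# Example with a made-up folder name:
hf_link ppo_pong_example
```

Passing `SLM-Lab/benchmark` as the second argument yields the public-repo form of the same link, which is what an entry should carry after graduation.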
After intake, graduate each finalized run to the public HF benchmark:

1. `source .env && huggingface-cli upload SLM-Lab/benchmark data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset`
2. Update the BENCHMARKS.md link: `SLM-Lab/benchmark-dev` → `SLM-Lab/benchmark` for that entry
3. `source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset`
4. `source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset`
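The link update for a single entry (`SLM-Lab/benchmark-dev` → `SLM-Lab/benchmark`) can be scripted so that other, not yet graduated, entries stay untouched. This is a sketch under assumptions: `graduate_link` is a hypothetical helper, and it assumes links follow the `.../tree/main/data/FOLDER)` form shown earlier.

```shell
# Hypothetical helper: point one entry's link at the public repo.
# $1 = markdown file (e.g. the benchmarks file), $2 = folder name.
graduate_link() {
  file="$1"
  folder="$2"
  tmp="$(mktemp)"
  # Only URLs ending in exactly this folder name are rewritten;
  # the trailing ")" is the end of the markdown link.
  sed "s#SLM-Lab/benchmark-dev/tree/main/data/${folder})#SLM-Lab/benchmark/tree/main/data/${folder})#g" \
    "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back in place so the original file permissions are kept.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Review the resulting diff before uploading `docs/`, since the substitution is purely textual.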
# Launch a run
source .env && uv run slm-lab run-remote --gpu \
-s env=ALE/Pong-v5 SPEC_FILE SPEC_NAME train -n NAME
# Monitor
dstack ps # running jobs
dstack logs NAME | grep "trial_metrics" # extract score at completion
# Score = total_reward_ma from trial_metrics line
# trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...
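To pull just the number out of the `trial_metrics` line, the grep can be extended slightly. A minimal sketch, assuming the field is formatted as in the sample line above; `extract_score` is a hypothetical helper name.

```shell
# Hypothetical helper: read log lines on stdin and print the last
# total_reward_ma value found.
extract_score() {
  grep -o 'total_reward_ma:[^ |]*' | tail -n 1 | cut -d: -f2
}

# Demonstration on the sample line from above:
printf '%s\n' 'trial_metrics: frame:1.00e+07 | total_reward_ma:816.18 | ...' \
  | extract_score   # prints 816.18

# Intended use (requires dstack and a finished run named NAME):
#   dstack logs NAME 2>&1 | extract_score
```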
Remote GPU run → auto-uploads to benchmark-dev (HF)
↓ Pull to local data/
↓ Generate plots (docs/plots/)
↓ Update BENCHMARKS.md (scores, links, plots)
↓ Graduate to public benchmark (HF)
↓ Update links: benchmark-dev → benchmark
↓ Upload docs/ to public benchmark (HF)
# Pull full dataset (fast, single request — avoids rate limits)
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset
# Or pull specific folder
source .env && hf download SLM-Lab/benchmark-dev \
--local-dir data/benchmark-dev --repo-type dataset --include "data/FOLDER/*"
# KEEP this data — needed for plots AND graduation upload later
# Find folders for a game (check both local data/ and benchmark-dev)
ls data/ | grep -i pong
ls data/benchmark-dev/data/ | grep -i pong
# Generate comparison plot — use -d for base dir, -f for folder names only
# Use data/ as base (has info/ subfolder with trial_metrics)
uv run slm-lab plot -t "Pong-v5" -f ppo_pong_folder,sac_pong_folder,crossq_pong_folder
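Because `-f` takes folder names relative to the base dir, a typo or a folder that lives under the other base silently drops a curve from the comparison. The sketch below checks each name before building the comma-separated list; `plot_folders` is a hypothetical helper, and `FOLDER1`/`FOLDER2` are placeholders.

```shell
# Hypothetical helper: check that each folder exists under the base dir,
# then print the names joined by commas for the -f flag.
# $1 = base data dir, remaining args = folder names (not full paths).
plot_folders() {
  base="$1"
  shift
  list=""
  for name in "$@"; do
    if [ ! -d "$base/$name" ]; then
      echo "missing folder: $base/$name" >&2
      return 1
    fi
    # Append with a comma separator, except before the first name.
    list="${list:+$list,}$name"
  done
  printf '%s\n' "$list"
}

# Intended use, with placeholder folder names:
#   uv run slm-lab plot -t "Pong-v5" -d data/benchmark-dev/data \
#     -f "$(plot_folders data/benchmark-dev/data FOLDER1 FOLDER2)"
```

The helper only validates the names it is given; choosing the folders that match BENCHMARKS.md entries remains a manual step.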
When a run is finalized, graduate individually from benchmark-dev → benchmark:
# Upload individual folder
source .env && huggingface-cli upload SLM-Lab/benchmark \
data/benchmark-dev/data/FOLDER data/FOLDER --repo-type dataset
# Update BENCHMARKS.md link for that entry: benchmark-dev → benchmark
# Then upload docs/ (includes updated plots + BENCHMARKS.md)
source .env && huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
source .env && huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
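The three uploads can be wrapped so that the exact commands are previewed before anything is pushed to the public repo. This is only a sketch: `graduate`, `run_or_print`, and the `DRY_RUN` variable are hypothetical names introduced here, and the wrapper assumes `source .env` has already been run in the current shell.

```shell
# Hypothetical dry-run switch: print the command unless DRY_RUN=0.
run_or_print() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the three upload commands above.
# Update the BENCHMARKS.md link for this entry before a real run,
# so the docs upload carries the new link.
graduate() {
  folder="$1"
  run_or_print huggingface-cli upload SLM-Lab/benchmark \
    "data/benchmark-dev/data/$folder" "data/$folder" --repo-type dataset
  run_or_print huggingface-cli upload SLM-Lab/benchmark docs docs --repo-type dataset
  run_or_print huggingface-cli upload SLM-Lab/benchmark README.md README.md --repo-type dataset
}

# Preview only (default); set DRY_RUN=0 to actually upload:
graduate FOLDER
```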
| Repo | Purpose |
|---|---|
| `SLM-Lab/benchmark-dev` | Development: noisy, iterative |
| `SLM-Lab/benchmark` | Public: finalized, validated |
Only when an algorithm fails to reach its target:
source .env && uv run slm-lab run-remote --gpu SPEC_FILE SPEC_NAME search -n NAME
Budget: ~3-4 trials per dimension. After search: update spec with best params, run train, use that result.
Work continuously when benchmarking. Use `sleep 300 && dstack ps` to actively wait (5 min intervals); never delegate monitoring to background processes or scripts. Stay engaged in the conversation.
Workflow loop (repeat every 5-10 minutes):
1. `dstack ps`: identify completed/failed/running

Key principle: Work continuously, check in regularly, iterate immediately on failures. Never idle. Keep reminding yourself to continue without pausing: check on tasks, update, plan, and pick up the next task immediately until all tasks are completed.
--include patterns