numerai-experiment-design
// Design and manage Numerai experiments in this repo for any model idea.
| name | numerai-experiment-design |
| description | Design and manage Numerai experiments in this repo for any model idea. |
Use this workflow to plan, run, and report Numerai experiments for any model idea.
Note: run commands from `numerai/` (so `agents` is importable), or from the repo root with `PYTHONPATH=numerai`.
This skill is not complete after a single promising run. You must run experiments in rounds (typically 4–5 configs per round), synthesize results, and decide what to try next. Only finalize when you reach a plateau and additional rounds stop improving the primary metric.
- Default baseline: `deep_lgbm_ender20_baseline` (`feature_set=all`) unless the user explicitly requests the small baseline; keep each experiment's `feature_set` aligned with the chosen baseline.
- Primary metrics: BMC (`bmc_mean` and `bmc_last_200_eras`), where BMC = Benchmark Model Contribution vs. the official `v52_lgbm_ender20`.

If the user's request is unclear or underspecified:
- `bmc_mean` and `bmc_last_200_eras`.
- `experiment.md`.

Core loop (repeat for each experiment round):
- Run `PYTHONPATH=numerai python3 -m agents.code.modeling --config <config> --output-dir <experiment_dir>`, which calls `pipeline.py` for CV/OOF + results.
- Rank by `bmc_last_200_eras.mean` (primary), with `bmc_mean` as a tie-breaker.
- Check `corr_mean` and `avg_corr_with_benchmark` (avoid “high corr, low BMC” traps).
- Update `experiment.md` with: what changed this round, the metrics table, and the next-round decision.
- Use `v5.2/downsampled_full.parquet` + `v5.2/downsampled_full_benchmark_models.parquet` to save memory and time when experimenting.

Stop iterating only when at least two consecutive rounds fail to beat the current best `bmc_last_200_eras.mean` by a meaningful margin (rule of thumb: ~1e-4–3e-4), and the remaining untried knobs are either redundant with what you already swept or likely to increase overfit/benchmark-correlation.
If you plateau on downsampled data, do one confirmatory scale step (bigger feature set and/or more data) before concluding the idea is maxed out.
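The ranking and stopping rules above can be sketched in a few lines of Python. This is a minimal illustration, not code from the repo: the metric keys (`bmc_last_200_eras_mean`, `bmc_mean`), the function names, and the shape of the results mapping are assumptions, and the real results files may name or nest these values differently.

```python
from typing import Mapping, Sequence

# Hypothetical metric keys; the real results files may name or nest them differently.
PRIMARY = "bmc_last_200_eras_mean"
TIE_BREAKER = "bmc_mean"


def rank_round(results: Mapping[str, Mapping[str, float]]) -> list[str]:
    """Return config names best-first: primary metric, then tie-breaker."""
    return sorted(
        results,
        key=lambda name: (results[name][PRIMARY], results[name][TIE_BREAKER]),
        reverse=True,
    )


def has_plateaued(round_bests: Sequence[float], margin: float = 1e-4, patience: int = 2) -> bool:
    """True when the last `patience` rounds each failed to beat the running best by `margin`."""
    if not round_bests:
        return False
    best = round_bests[0]
    stale = 0  # consecutive rounds without a meaningful improvement
    for score in round_bests[1:]:
        if score > best + margin:
            stale = 0
        else:
            stale += 1
        best = max(best, score)
    return stale >= patience
```

Here `round_bests` holds the best primary-metric value from each round, in order. A `True` result only covers the numeric half of the stopping rule; whether the remaining knobs are redundant or likely to overfit is still a judgement call.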
Note that these are examples only. Each idea will call for different sweeps, or none at all. Treat them as guidelines, and use your judgement to determine the best experiments to answer the core question: "Does/can this core idea produce a model that has high `bmc_mean`?"
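As one illustration of a small round, the sketch below varies a single knob over a handful of values while copying everything else, including `feature_set`, from a base config. The config schema (`name`, `feature_set`, `model_params`) and the helper `build_round` are hypothetical and are not the repo's actual config format; `learning_rate` and `num_leaves` are ordinary LightGBM parameters used only as examples.

```python
from typing import Any


def build_round(base: dict[str, Any], knob: str, values: list[Any]) -> list[dict[str, Any]]:
    """Build one config per value of a single knob.

    Everything except the swept knob is copied from `base`, so feature_set
    stays aligned with the baseline. The schema here is hypothetical.
    """
    configs = []
    for value in values:
        config = {
            **base,
            "name": f"{base['name']}_{knob}_{value}",
            # Copy model_params so the base config is never mutated.
            "model_params": {**base["model_params"], knob: value},
        }
        configs.append(config)
    return configs


# Example: a four-config round sweeping learning_rate only.
base_config = {
    "name": "base",
    "feature_set": "all",
    "model_params": {"learning_rate": 0.01, "num_leaves": 64},
}
round_configs = build_round(base_config, "learning_rate", [0.005, 0.01, 0.02, 0.05])
```

Sweeping one knob at a time keeps a round within the usual 4–5 configs and makes it easier to attribute a change in BMC to a single decision.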
- Keep `feature_set` aligned with the baseline for comparisons.
- Use `v52_lgbm_ender20` as the benchmark reference and plot baseline, even when sweeping; only use the small baseline when explicitly requested.
- Experiment folder under `agents/experiments/`:
  - `configs/` for configs
  - `logs/` for run logs
  - `predictions/` + `results/` from OOF CV
  - `experiment.md` for summary and decisions. Declare the baseline in `experiment.md` and update it as you progress.
- Run `PYTHONPATH=numerai python3 -m agents.code.analysis.show_experiment benchmark <best_model> --base-benchmark-model v52_lgbm_ender20 --benchmark-data-path numerai/v5.2/full_benchmark_models.parquet --start-era 575 --dark --output-dir <experiment_dir> --baselines-dir numerai/agents/baselines` to generate the cumulative corr + BMC plot (share the output path).
- Use `python -m agents.code.analysis.plot_benchmark_corrs` only when comparing official benchmark model columns, not for experiment BMC curves.
- Metrics: `bmc` (full) and `bmc_last_200_eras`; `corr_mean` and `avg_corr_with_benchmark` (corr vs the official benchmark predictions).
- Update `experiment.md` after each run.
- Datasets: `python -m agents.code.data.build_full_datasets`.
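The experiment folder layout above (`configs/`, `logs/`, `predictions/`, `results/`, `experiment.md`) could be scaffolded with a small helper like the one below. This is a sketch under assumptions: the function `scaffold_experiment`, its arguments, and the initial contents of `experiment.md` are hypothetical and not part of the repo.

```python
from pathlib import Path


def scaffold_experiment(experiments_root: str, name: str, baseline: str) -> Path:
    """Create the standard subfolders and a starter experiment.md.

    Hypothetical helper: the repo may already create these folders itself.
    """
    exp_dir = Path(experiments_root) / name
    for sub in ("configs", "logs", "predictions", "results"):
        (exp_dir / sub).mkdir(parents=True, exist_ok=True)

    report = exp_dir / "experiment.md"
    if not report.exists():  # never overwrite an existing report
        report.write_text(f"# {name}\n\nBaseline: {baseline}\n", encoding="utf-8")
    return exp_dir
```

For example, `scaffold_experiment("agents/experiments", "my_idea", "deep_lgbm_ender20_baseline")` would create the folders and declare the baseline at the top of the report, where `my_idea` is a placeholder name.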
- Full data: `numerai/v5.2/full.parquet`, `numerai/v5.2/full_benchmark_models.parquet`
- Downsampled data: `numerai/v5.2/downsampled_full.parquet`, `numerai/v5.2/downsampled_full_benchmark_models.parquet`
- `PYTHONPATH=numerai python3 -m agents.code.modeling` (training + metrics)
- `agents/code/metrics/numerai_metrics.py` (BMC/corr summaries)
- `PYTHONPATH=numerai python3 -m agents.code.analysis.show_experiment` (compare runs)
- `PYTHONPATH=numerai python3 -m agents.code.data.build_full_datasets` (full + downsampled datasets)

Once you have finalized your best model and created a pkl file using the numerai-model-upload skill:
Offer deployment: Ask the user if they want to deploy the pkl to Numerai for automated submissions.
Deployment options (via the Numerai MCP server):
- Use `create_model` to create a new model slot, then upload the pkl.

Follow the numerai-model-upload skill for the complete deployment workflow using the MCP server tools (`create_model`, `upload_model`, `graphql_query`).
This allows the full research-to-deployment workflow to happen in a single session.