| name | write-kaggle-benchmarks |
| description | Write, push, run, publish, and manage Kaggle Benchmark tasks using the kaggle CLI and the kaggle-benchmarks Python SDK. Use when the user wants to create or push a benchmark task (optionally with attached Kaggle datasets), run benchmarks against LLM models, check task/run status, stream or fetch execution logs, download results and source notebooks, publish a task to make it public, or troubleshoot benchmark workflows. |
Write Kaggle Benchmarks
Keywords
Kaggle benchmarks, write a benchmark, benchmark task, kbench, push task, run task.
Official Resources
Command Hierarchy
kaggle benchmarks (alias: kaggle b)
├── auth                  - Fetch Model Proxy credentials
├── init                  - Fetch credentials + setup local dev environment
└── tasks (alias: t)      - Manage benchmark tasks
    ├── push              - Upload a task from a .py file
    ├── run               - Run a task against model(s)
    ├── list              - List your benchmark tasks
    ├── status            - Show task details and per-model run status
    ├── download          - Download completed run outputs (and optionally source notebooks)
    ├── log (logs)        - Show execution logs for run(s) (streams live for RUNNING runs)
    ├── publish           - Make a task public (publishes the backing notebook by default)
    ├── models            - List available benchmark models
    └── delete            - Delete a task (not yet supported by server)
Setup
kaggle b init -y   # fetch credentials and set up the local dev environment
kaggle b auth -y   # fetch Model Proxy credentials only
Custom paths: --env-file <FILE> and --example-file <FILE> for init.
Env vars written by init:
MODEL_PROXY_URL
MODEL_PROXY_API_KEY
MODEL_PROXY_EXPIRY_TIME
LLM_DEFAULT
LLM_DEFAULT_EVAL
LLMS_AVAILABLE
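One way to confirm that init wrote everything is to parse the env file and look for the six variables listed above. The sketch below is a simplified stand-in for a dotenv loader, not the SDK's own loading code: it assumes plain KEY=VALUE lines and lets a later duplicate key override an earlier one. The helper names parse_env and missing_vars are hypothetical.

```python
from pathlib import Path

# Variables this document lists as written by `kaggle b init`.
REQUIRED_VARS = [
    "MODEL_PROXY_URL",
    "MODEL_PROXY_API_KEY",
    "MODEL_PROXY_EXPIRY_TIME",
    "LLM_DEFAULT",
    "LLM_DEFAULT_EVAL",
    "LLMS_AVAILABLE",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; a later duplicate key overrides an earlier one."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(env: dict[str, str]) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")  # adjust if --env-file pointed somewhere else
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    print("missing:", missing_vars(env) or "none")
```

An empty result only shows that the variables exist; it says nothing about whether the API key has expired.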
Core workflow: Init → Write → Validate → Push → Run → Status → Download
Pacing: check in at every stage
Do NOT chain the full pipeline. Treat each numbered step below as a checkpoint:
- State what you are about to do for the current step (one sentence, including the exact command you intend to run).
- Wait for the user's go-ahead before executing, including for steps that look "obvious" like init or list.
- After the step completes, show the relevant output, then stop. Do not auto-advance to the next step.
- Ask the user how they want to proceed: continue to the next documented step, change parameters, or branch off.
If the user explicitly asks for "the whole pipeline" or "do everything", you may chain, but summarize the planned chain in advance and ask for one confirmation covering the lot, instead of skipping the per-step checkpoints silently.
0. Init (once per environment, re-run when creds expire)
init fetches Model Proxy credentials, writes .env, and drops an
example_task.py + kaggle_benchmarks_reference.md next to it. Every
later step depends on the MODEL_PROXY_* vars it writes, so run it
before anything else, and re-run it any time python task.py or
kaggle b t run fails with an auth error (the API key is short-lived).
kaggle b init -y
kaggle b auth -y
1. Write a task file
A task file must:
- Import kaggle_benchmarks as kbench
- Define at least one function decorated with @kbench.task(...)
- Call .run(kbench.llm) (or .evaluate(...)) on the task function (see Gotchas)
- Use # %% cell markers (jupytext percent format)
Minimal example:
# %%
import kaggle_benchmarks as kbench

@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")
    kbench.assertions.assert_in("4", response, expectation="Should contain 4")

my_test_task.run(kbench.llm)
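The four requirements listed above can be checked mechanically before a push. The following is a heuristic text scan, not the CLI's own push validation. The function name check_task_source is hypothetical, and the patterns assume the kbench import alias used throughout this document.

```python
import re

# Each check is a (description, pattern) pair applied to the raw source text.
CHECKS = [
    ("imports kaggle_benchmarks as kbench",
     re.compile(r"^\s*import\s+kaggle_benchmarks\s+as\s+kbench\b", re.M)),
    ("has a @kbench.task decorator",
     re.compile(r"^\s*@kbench\.task\b", re.M)),
    ("calls .run(...) or .evaluate(...)",
     re.compile(r"\.(run|evaluate)\s*\(")),
    ("uses '# %%' cell markers",
     re.compile(r"^# %%", re.M)),
]


def check_task_source(source: str) -> list[str]:
    """Return descriptions of the requirements the source appears to miss."""
    return [desc for desc, pattern in CHECKS if not pattern.search(source)]
```

A comment that merely mentions .run( would also satisfy the third check, so a clean result is a hint that the file is well-formed, not proof.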
LLM resolution precedence (highest → lowest):
- Explicit model in code: task.run(llm=kbench.llms["google/gemini-3.5-flash"])
- Default in code: task.run(llm=kbench.llm) (resolves to LLM_DEFAULT)
- Env vars from .env (LLM_DEFAULT, LLMS_AVAILABLE, MODEL_PROXY_*)
2. Validate locally
Run the task end-to-end before pushing. This catches the silent-no-op gotcha and broken prompts before the push → run → wait → download round-trip.
kaggle b init -y
python task.py
ls -1 *.run.json
If python task.py exits cleanly and *.run.json appears, the task is safe to push. If validation fails, fix and re-run before proceeding to Step 3.
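The same check can be scripted. This is a minimal sketch, assuming the task writes its *.run.json into the working directory (as the ls command above implies). The helper name validate_task is hypothetical. It compares modification times before and after the run, so a file overwritten by a re-run still counts as produced.

```python
import subprocess
import sys
from pathlib import Path


def _snapshot(cwd: Path) -> dict[Path, int]:
    """Map each *.run.json in cwd to its modification time in nanoseconds."""
    return {p: p.stat().st_mtime_ns for p in cwd.glob("*.run.json")}


def validate_task(task_file: str, workdir: str = ".") -> list[Path]:
    """Run the task file and return the *.run.json files it created or updated.

    A relative task_file is resolved against workdir by the child process.
    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    cwd = Path(workdir)
    before = _snapshot(cwd)
    subprocess.run([sys.executable, task_file], cwd=cwd, check=True)
    after = _snapshot(cwd)
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)
```

An empty list after a clean exit is the silent no-op case: the script ran, but no task recorded a run.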
3. Push
kaggle b t push my-task -f task.py --wait
kaggle b t push my-task -f task.py -d owner/dataset1 -d owner/dataset2
--wait [TIMEOUT] blocks until server-side creation finishes (no arg = indefinite). --poll-interval <SECONDS> caps the polling interval (default 60s; polling starts at 5s and grows adaptively). Repeat -d/--kaggle-dataset once per dataset (do not space-separate; see the Gotchas).
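To make the polling behaviour concrete, here is one shape such a schedule could take. Only the 5s starting point and the cap come from the description above. The doubling growth factor is an assumption chosen for illustration, and the CLI's actual growth rule may differ.

```python
from itertools import islice
from typing import Iterator


def poll_delays(cap: float = 60.0, start: float = 5.0, factor: float = 2.0) -> Iterator[float]:
    """Yield wait times that begin at `start`, grow by `factor`, and never exceed `cap`.

    `factor` is a hypothetical growth rate, not taken from the CLI.
    """
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


print(list(islice(poll_delays(), 6)))  # [5.0, 10.0, 20.0, 40.0, 60.0, 60.0]
```

Under this assumption, a lower --poll-interval shortens only the later waits; the first poll still happens after 5 seconds.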
4. Run
kaggle b t run my-task
kaggle b t run my-task -m google/gemini-3.5-flash
kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5
kaggle b t run my-task -m google/gemini-3.5-flash --wait
List available models: kaggle b t models.
5. Status
kaggle b t status my-task
kaggle b t status my-task -m google/gemini-3.5-flash
Prints task metadata (slug, version, state, created timestamp, public flag, task URL) and a per-model run table. Errored runs render their final exception line under an Errors: section.
6. Download
kaggle b t download my-task
kaggle b t download my-task -o ./results
kaggle b t download my-task -m google/gemini-3.5-flash
kaggle b t download my-task -s
kaggle b t download my-task -f
Output layout: <output>/<task>/<version>/<model>/<run_id>/.... Already-downloaded runs are skipped unless --force/-f is passed. With --include-source/-s, each run's directory also contains __notebook__.ipynb and __notebook_source__.ipynb alongside the regular outputs (useful for debugging the kernel session).
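After a download with -s, it can help to list which run directories actually received the source notebooks. The sketch below searches recursively instead of assuming a fixed directory depth, because how a model slug such as owner/model maps onto directories is not stated here. The function name runs_with_source is hypothetical; the two notebook file names are the ones given above.

```python
from pathlib import Path


def runs_with_source(output_dir: str) -> list[Path]:
    """Return directories under output_dir that contain both source notebooks."""
    root = Path(output_dir)
    return sorted(
        nb.parent
        for nb in root.rglob("__notebook__.ipynb")
        if (nb.parent / "__notebook_source__.ipynb").is_file()
    )
```

For example, runs_with_source("./results") returns an empty list if the download ran without -s.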
7. Log
kaggle b t log my-task
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t log my-task -m model-a -m model-b
RUNNING runs stream live via SSE; COMPLETED/ERRORED runs print the persisted log in one shot; QUEUED runs print a "No logs available" message (the server returns 404) and the command continues with the next run.
8. Publish
kaggle b t publish my-task
kaggle b t publish my-task --no-publish-backing-notebook
Publishes both the task and the backing notebook by default. If the task is already public the command is a no-op for the task itself but will still publish the notebook unless --no-publish-backing-notebook is passed.
Quick Recipes
Reminder: these are reference snippets, not invocations to chain automatically. Per the "Pacing" section above, run them one at a time with user confirmation between each, unless the user explicitly asks you to chain them.
kaggle b t push my-task -f task.py --wait
kaggle b t run my-task -m google/gemini-3.5-flash --wait
kaggle b t download my-task -o ./results
kaggle b t list --name-regex "^math" --status errored
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t download my-task -m google/gemini-3.5-flash -s -f
Gotchas
Most of these are silent failures the agent will not detect on its own. Review them before generating any task file or CLI invocation.
- No .run() call: silent no-op. The push will succeed even if the file has no .run() (push validation only checks for @task decorators). The task will then execute on the server and produce no .run.json, so nothing is recorded. Every task function must end with task_fn.run(kbench.llm) (or .evaluate(...)).
- MODEL_PROXY_API_KEY is short-lived. If python task.py fails with an auth error, re-run kaggle b auth -y (or kaggle b init -y) to refresh.
- init / auth append to the env file. The file is loaded via dotenv, where the last entry for a key wins, so re-running is safe, but the file accumulates duplicate entries over time.
- Task slug must match a @task decorator. kaggle b t push <SLUG> -f file.py fails if <SLUG> doesn't match the slugified name of some @kbench.task(name=...) (or function name) in the file. Names are normalized: My Task → my-task, my_task → my-task.
- The server sometimes returns model slugs with an @default suffix (e.g. google/gemini-3.5-flash@default). The CLI normalizes @ → - for matching; user-facing commands should use the plain owner/model form.
- delete is not implemented server-side. The command exists but currently prints "Delete is not supported by the server yet."
- Repeated flags, not space-separated. For multi-value flags (-m, -d/--kaggle-dataset), pass the flag once per value: -m a -m b, not -m a b. The space-separated form is not supported and will error.
- CLI scope is tasks only, not benchmarks. A benchmark is a curated collection of tasks. The CLI lets you create, push, and run individual tasks, but creating or managing benchmarks (collections) must be done on the Kaggle web UI.
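Three of the gotchas above (slug normalization, the @default suffix, and repeated flags) are easy to encode when building commands programmatically. The helpers below are a sketch with hypothetical names. The slug rule is inferred from the two examples given (My Task and my_task both become my-task) and may not match the CLI's normalization for other inputs.

```python
import re


def slugify(name: str) -> str:
    """Approximate task-name normalization: lowercase, non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def plain_model_slug(slug: str) -> str:
    """Drop a trailing '@default' so the plain owner/model form is used."""
    return slug.removesuffix("@default")


def run_command(task_slug: str, models: list[str]) -> list[str]:
    """Build a `kaggle b t run` argv, repeating -m once per model."""
    argv = ["kaggle", "b", "t", "run", task_slug]
    for model in models:
        argv += ["-m", plain_model_slug(model)]
    return argv
```

The resulting list can be passed to subprocess.run(..., check=True); building an argv list instead of a shell string avoids quoting problems and makes the one-flag-per-value rule explicit.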