arena
// Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling.
Create and execute scenario YAML files for architecture what-if predictions.
Reset local experiment state to a fresh slate by clearing runs, prediction outputs, scaling snapshots, and database index files under data/.
Create or repair Brainqub3 task packages that must pass evaluator tests before runs, including both fabricated instances and user-provided data workflows.
Build or repair task evaluators and evaluator tests with deterministic-first strategy.
Generate markdown reports with observed metrics, predicted metrics, and architecture recommendations.
| name | arena |
| description | Prepare and run evaluator-gated SAS and MAS experiments with explicit batch control and mandatory elasticity calibration for scenario scaling. |
| allowed-tools | ["Read","Write","Edit","Bash","Glob","Grep","WebSearch","WebFetch"] |
- Pass `--batch-id` explicitly.
- `eta_n` and `eta_T`.
- `(n_agents, T)` point. Use controlled grids.
- `default_zero`.
- Do not use `Batch=all`; select the explicit elasticity batch.
- Pass `--allowed-tools` explicitly for reproducible tool policy; runtime enforcement is SDK-based (`can_use_tool`), not metadata-only.
- Tool-count tiers (`--tool-count-grid 6,8`): core is `Read,Write,Edit,Bash,Glob,Grep`; full adds `WebSearch` and `WebFetch`.

Use this unless the user explicitly requests a quick smoke test that skips scaling calibration.
Compare batch id: `<task>_compare_<YYYYMMDD>`
Elasticity batch id: `<task>_elasticity_<YYYYMMDD>`

uv run brainqub3 run sas --task <task> --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch independent --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch centralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch decentralised --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch hybrid --model <model> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --batch-id <compare_batch_id> --require-live --no-dashboard
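
The SAS and MAS compare commands above differ only in the subcommand and the `--arch` value, so they can be generated from one loop that shares a single compare batch id. The following is a minimal dry-run sketch: it only prints the commands and does not execute them. The `TASK` and `MODEL` values, the `print_compare_cmds` helper, and the use of `date +%Y%m%d` for the `<YYYYMMDD>` suffix are illustrative assumptions, not part of brainqub3.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the SAS and MAS compare commands for one batch.
# TASK and MODEL are placeholder values (assumptions); substitute your own.
TASK="my_task"
MODEL="my_model"
TOOLS="Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch"

# Batch id following the <task>_compare_<YYYYMMDD> pattern.
COMPARE_BATCH_ID="${TASK}_compare_$(date +%Y%m%d)"

# Hypothetical helper: emits one command per line, SAS first, then each MAS architecture.
print_compare_cmds() {
  common="--model $MODEL --allowed-tools $TOOLS --batch-id $COMPARE_BATCH_ID --require-live --no-dashboard"
  echo "uv run brainqub3 run sas --task $TASK $common"
  for arch in independent centralised decentralised hybrid; do
    echo "uv run brainqub3 run mas --task $TASK --arch $arch $common"
  done
}

print_compare_cmds
```

Once the printed commands look right, they could be executed by piping the output to a shell, for example `print_compare_cmds | sh`.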
Run elasticity sweep once per MAS architecture:
uv run brainqub3 run elasticity \
--task <task> \
--arch independent \
--model <model> \
--batch-id <elasticity_batch_id> \
--n-agents-grid 3,4 \
--tool-count-grid 6,8 \
--instances <N> \
--require-live \
--no-dashboard
uv run brainqub3 run elasticity --task <task> --arch centralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch decentralised --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
uv run brainqub3 run elasticity --task <task> --arch hybrid --model <model> --batch-id <elasticity_batch_id> --n-agents-grid 3,4 --tool-count-grid 6,8 --instances <N> --require-live --no-dashboard
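
Because the sweep runs once per MAS architecture with the same grids and the same elasticity batch id, the four commands above can also be produced by a loop. This is a dry-run sketch under stated assumptions: `TASK`, `MODEL`, and the `INSTANCES` value are placeholders, `print_elasticity_cmds` is a hypothetical helper, and the batch id is built with `date +%Y%m%d`. Only flags that appear in the commands above are used.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one elasticity sweep command per MAS architecture.
# TASK, MODEL and INSTANCES are placeholder values (assumptions).
TASK="my_task"
MODEL="my_model"
INSTANCES="5"

# Batch id following the <task>_elasticity_<YYYYMMDD> pattern.
ELASTICITY_BATCH_ID="${TASK}_elasticity_$(date +%Y%m%d)"

# Hypothetical helper: every architecture shares the same batch id and grids.
print_elasticity_cmds() {
  for arch in independent centralised decentralised hybrid; do
    echo "uv run brainqub3 run elasticity --task $TASK --arch $arch --model $MODEL --batch-id $ELASTICITY_BATCH_ID --n-agents-grid 3,4 --tool-count-grid 6,8 --instances $INSTANCES --require-live --no-dashboard"
  done
}

print_elasticity_cmds
```

Keeping the batch id in one variable makes it harder to scatter the sweep across several batches by accident, which matters when the dashboard is later filtered to the explicit elasticity batch.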
Only skip elasticity if the user explicitly says to skip scaling/scenario calibration.
Smoke-test pattern:
uv run brainqub3 run sas --task <task> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
uv run brainqub3 run mas --task <task> --arch <arch> --model <model> --batch-id <smoke_batch_id> --require-live --no-dashboard
uv run brainqub3 dashboard
Set `Batch` to the explicit elasticity batch id (not `all`).