offload-benchmark
Run local vs Offload benchmarks for mng and sculptor, then update the Benchmarks section of the Offload README.
| name | offload-benchmark |
| description | Run local vs Offload benchmarks for mng and sculptor, then update the Benchmarks section of the Offload README. |
Run local and Offload test suites for mng and sculptor, collect timing data, and update the ## Benchmarks section of the Offload README with fresh numbers.
- Offload installed: `cargo install offload` (or built from source in the offload repo).
- Modal credentials valid: `modal token new` (if credentials are expired).
- mng repo at `~/imbue/mng` (or wherever the monorepo is checked out).
- sculptor repo at `~/imbue/sculptor`.

Run `just install` from the offload repo root to ensure the binary is up to date.

Verify all four before proceeding. If any prerequisite is missing, stop and tell the user.
Do not hardcode commands. Read the justfile in each repo at invocation time, since commands change:
Read `~/imbue/mng/justfile` and locate:

- `test-integration` -- the local baseline (unit + integration tests via pytest with xdist)
- `test-offload` -- the Offload run on Modal

Read `~/imbue/sculptor/justfile` and locate:

- `test-integration` -- the local baseline (Playwright integration tests via pytest with xdist)
- `test-integration-offload` -- the Offload run on Modal

Record the exact commands from each justfile. If any target is missing or has changed significantly, stop and tell the user.
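
The target lookup above can be sketched in Python. This is a minimal illustration, not a full justfile parser: `find_recipes` and `missing_targets` are hypothetical helper names, and the header pattern is an assumption that covers simple recipes (a name, optional parameters, then a colon) while skipping `:=` assignments.

```python
import re

# A recipe header: optional "@", a name, optional parameters, then ":" (but not ":=").
RECIPE_HEADER = re.compile(r"^@?([A-Za-z_][A-Za-z0-9_-]*)[^:\n]*:(?!=)")


def find_recipes(justfile_text):
    """Map each recipe name to its indented body lines."""
    recipes = {}
    current = None
    for line in justfile_text.splitlines():
        match = RECIPE_HEADER.match(line)
        if match:
            current = match.group(1)
            recipes[current] = []
        elif current is not None and line[:1] in (" ", "\t"):
            # Indented lines belong to the recipe that was opened last.
            recipes[current].append(line.strip())
        elif line.strip():
            # Any other non-blank line (comment, setting, assignment) ends the recipe.
            current = None
    return recipes


def missing_targets(recipes, required):
    """Return the required target names that the justfile does not define."""
    return [name for name in required if name not in recipes]
```

If `missing_targets(find_recipes(text), ["test-integration", "test-offload"])` returns a non-empty list, that is the "stop and tell the user" case. The recorded command is whatever the recipe body contains at invocation time.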
Also read each justfile to determine:
- The default xdist worker count (`-n auto --maxprocesses N` or `-n N`)

CRITICAL RULES:

- Use `time` (bash builtin) to measure wall-clock time for each run. Wrap each command in `time (...)` and record the `real` time.

Run these three commands sequentially from the mng repo root (`~/imbue/mng`):
Local baseline (default xdist workers):
time just test-integration
Record the wall-clock time.
Local high-xdist (override to higher worker count):
Read the default -n value from the justfile. Run with double that value (capped at 8). For example, if the default is -n 4:
time PYTEST_NUMPROCESSES=8 just test-integration
If the justfile uses PYTEST_NUMPROCESSES or a similar env var for xdist workers, use that. Otherwise, pass the appropriate flag or env var. Read the justfile carefully to determine the correct override mechanism.
Offload (warm cache): Run the offload command TWICE. Discard the first run's timing (it warms the image cache). Record only the second run's timing.
just test-offload # warm-up run (discard timing)
time just test-offload # benchmark run (record timing)
Record the number of tests collected (visible in pytest output) and the xdist -n values used.
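
The worker-count rule for the second run (double the default, capped at 8) can be sketched as below. Both function names are hypothetical, and the patterns assume the count appears literally in the recipe as `-n N` or `--maxprocesses N`. A justfile that builds the value from a variable still has to be read by hand.

```python
import re

# The two forms named in the text: "-n auto --maxprocesses N" and "-n N".
_MAXPROCESSES = re.compile(r"--maxprocesses[ =](\d+)")
_DASH_N = re.compile(r"(?:^|\s)-n[ =]?(\d+)")


def default_workers(recipe_command):
    """Return the literal default worker count in a pytest command, or None."""
    for pattern in (_MAXPROCESSES, _DASH_N):
        match = pattern.search(recipe_command)
        if match:
            return int(match.group(1))
    return None


def high_worker_count(default_n, cap=8):
    """Double the default worker count, never exceeding the cap."""
    return min(default_n * 2, cap)
```

With a default of `-n 4` this gives 8, matching the example command above. If the default already equals the cap, the second run would repeat the baseline, which is worth flagging to the user.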
Run these three commands sequentially from the sculptor repo root (~/imbue/sculptor):
Local baseline (default xdist workers):
time just test-integration
Record the wall-clock time.
Local high-xdist (override to higher worker count):
Read the default --maxprocesses value from the justfile (currently 3). Run with a higher value:
time XDIST_WORKERS=8 just test-integration
Use the override mechanism from the justfile (XDIST_WORKERS env var).
Offload (warm cache): Run the offload command TWICE. Discard the first run's timing. Record only the second.
just test-integration-offload # warm-up run (discard timing)
time just test-integration-offload # benchmark run (record timing)
Record the number of tests collected and the xdist -n values used.
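
The warm-cache rule (run twice, keep only the second timing) is easy to get wrong, so here is a small sketch of the logic. It is an illustration only: the skill itself asks for the bash `time` builtin, and `time_call`, `time_after_warmup`, and `just_target` are hypothetical names.

```python
import os
import subprocess
import time


def time_call(run):
    """Return the wall-clock seconds taken by one call to run()."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def time_after_warmup(run):
    """Call run() twice and return only the second timing."""
    time_call(run)          # warm-up run: timing is discarded
    return time_call(run)   # benchmark run: timing is recorded


def just_target(target, repo_root):
    """Build a callable that runs `just <target>` from the given repo root."""
    cwd = os.path.expanduser(repo_root)
    return lambda: subprocess.run(["just", target], cwd=cwd, check=True)
```

For example, `time_after_warmup(just_target("test-integration-offload", "~/imbue/sculptor"))` would mirror the two Offload commands above, while the local runs would use `time_call` directly.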
For each project, compute:
| Metric | Formula |
|---|---|
| Time (s) | Wall-clock seconds from time, rounded to 1 decimal |
| Time (%) | (run_time / baseline_time) * 100, rounded to 1 decimal |
| Speedup | baseline_time / run_time, rounded to 2 decimals |
| Bar width (px) | round((run_time / baseline_time) * 150) -- baseline is always 150px |
The baseline is always the first run (default xdist) for each project.
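
The formulas in the table can be written as one small function. This is a sketch: `RunMetrics` and `compute_metrics` are hypothetical names, and Python's built-in `round` resolves exact ties to the nearest even number, which may differ by one pixel from other rounding conventions.

```python
from dataclasses import dataclass

BASELINE_BAR_PX = 150  # the baseline bar is always 150px wide


@dataclass(frozen=True)
class RunMetrics:
    time_s: float
    time_pct: float
    speedup: float
    bar_width_px: int


def compute_metrics(run_time, baseline_time):
    """Apply the table's formulas to one run, relative to the project's baseline."""
    if run_time <= 0 or baseline_time <= 0:
        raise ValueError("times must be positive")
    ratio = run_time / baseline_time
    return RunMetrics(
        time_s=round(run_time, 1),
        time_pct=round(ratio * 100, 1),
        speedup=round(baseline_time / run_time, 2),
        bar_width_px=round(ratio * BASELINE_BAR_PX),
    )
```

Passing the baseline as its own run gives 100.0%, a speedup of 1.0, and a 150px bar, which matches the fixed first row of each table in the template.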
Replace the entire ## Benchmarks section in the Offload README (from ## Benchmarks up to but not including the next ## heading) with the template below, filled in with measured values.
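Use the correct labels for each project's test suite, as they appear in the template headings: "Sculptor Integration Tests (Playwright)" and "Mng Unit + Integration Tests".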
The bar chart images reference docs/bar-local.svg (gray, #444) and docs/bar-offload.svg (green, #22a355). These are 1x1 SVG rectangles scaled via the width attribute.
## Benchmarks
Speedups measured on Imbue projects using Offload with the Modal provider. All local baselines were run on a {MACHINE_DESCRIPTION}.
### Sculptor Integration Tests (Playwright)
| Run Kind | Time (s) | Time (%) | Speedup |
|----------|----------|----------|---------|
| pytest with xdist, n={SCULPTOR_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {SCULPTOR_BASELINE_TIME} | 100.0% | 1.00x |
| pytest with xdist, n={SCULPTOR_HIGH_N} | <img src="docs/bar-local.svg" width="{SCULPTOR_HIGH_BAR_WIDTH}" height="4"> {SCULPTOR_HIGH_TIME} | {SCULPTOR_HIGH_PCT}% | {SCULPTOR_HIGH_SPEEDUP}x |
| Offload (Modal, max {SCULPTOR_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{SCULPTOR_OFFLOAD_BAR_WIDTH}" height="4"> {SCULPTOR_OFFLOAD_TIME} | {SCULPTOR_OFFLOAD_PCT}% | **{SCULPTOR_OFFLOAD_SPEEDUP}x** |
<details>
<summary><strong>Notes</strong></summary>
{SCULPTOR_TEST_COUNT} Playwright integration tests (browser-based, each launching a full Sculptor instance).
Individual tests are heavyweight (Chromium + backend server per worker), so the default xdist cap is n={SCULPTOR_BASELINE_N}.
Offload bypasses xdist entirely, fanning out across up to {SCULPTOR_MAX_PARALLEL} isolated Modal sandboxes -- each running a single test against its own Sculptor instance. The high per-test cost makes Offload's per-sandbox overhead negligible, yielding a {SCULPTOR_OFFLOAD_SPEEDUP}x speedup.
</details>
### Mng Unit + Integration Tests
| Run Kind | Time (s) | Time (%) | Speedup |
|----------|----------|----------|---------|
| pytest with xdist, n={MNG_BASELINE_N} (baseline) | <img src="docs/bar-local.svg" width="150" height="4"> {MNG_BASELINE_TIME} | 100.0% | 1.00x |
| pytest with xdist, n={MNG_HIGH_N} | <img src="docs/bar-local.svg" width="{MNG_HIGH_BAR_WIDTH}" height="4"> {MNG_HIGH_TIME} | {MNG_HIGH_PCT}% | {MNG_HIGH_SPEEDUP}x |
| Offload (Modal, max {MNG_MAX_PARALLEL}) | <img src="docs/bar-offload.svg" width="{MNG_OFFLOAD_BAR_WIDTH}" height="4"> {MNG_OFFLOAD_TIME} | {MNG_OFFLOAD_PCT}% | **{MNG_OFFLOAD_SPEEDUP}x** |
<details>
<summary><strong>Notes</strong></summary>
{MNG_TEST_COUNT} tests collected (unit + integration, excluding acceptance and release).
Individual tests are lightweight and fast-running, so the default xdist cap is n={MNG_BASELINE_N}.
Offload bypasses xdist entirely, fanning out across up to {MNG_MAX_PARALLEL} isolated Modal sandboxes. The low per-test cost makes Offload's per-sandbox overhead proportionally larger, yielding a more modest {MNG_OFFLOAD_SPEEDUP}x speedup vs Sculptor's {SCULPTOR_OFFLOAD_SPEEDUP}x.
</details>
| Placeholder | Source |
|---|---|
| `{MACHINE_DESCRIPTION}` | Ask the user, or read from the existing README intro paragraph |
| `{*_BASELINE_N}` | Default xdist `-n` value from the justfile |
| `{*_HIGH_N}` | The higher xdist count used in run 2 |
| `{*_BASELINE_TIME}` | Wall-clock seconds from run 1 |
| `{*_HIGH_TIME}` | Wall-clock seconds from run 2 |
| `{*_OFFLOAD_TIME}` | Wall-clock seconds from run 3 (second invocation only) |
| `{*_HIGH_PCT}` | `(high_time / baseline_time) * 100` |
| `{*_OFFLOAD_PCT}` | `(offload_time / baseline_time) * 100` |
| `{*_HIGH_SPEEDUP}` | `baseline_time / high_time` |
| `{*_OFFLOAD_SPEEDUP}` | `baseline_time / offload_time` |
| `{*_HIGH_BAR_WIDTH}` | `round((high_time / baseline_time) * 150)` |
| `{*_OFFLOAD_BAR_WIDTH}` | `round((offload_time / baseline_time) * 150)` |
| `{*_MAX_PARALLEL}` | Read from the relevant `offload*.toml` config (`max_parallel` field) |
| `{*_TEST_COUNT}` | Number of tests collected, from pytest output |
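
Filling the template is a plain substitution over the `{NAME}` placeholders. The sketch below assumes every placeholder is uppercase with underscores, as in the table. `fill_template` is a hypothetical helper that refuses to produce output while any placeholder is still unassigned.

```python
import re

# Matches placeholders such as {MNG_BASELINE_N} or {MACHINE_DESCRIPTION}.
PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def fill_template(template, values):
    """Replace every {NAME} in the template with str(values[NAME])."""
    missing = sorted(
        {name for name in PLACEHOLDER.findall(template) if name not in values}
    )
    if missing:
        raise KeyError("no value for placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)
```

Values are inserted as given, so format numbers beforehand to match the template's style, for example `f"{speedup:.2f}"` for `1.00x` and `f"{pct:.1f}"` for `100.0%`.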
- Read `offload/README.md`.
- Locate the `## Benchmarks` section (starts at `## Benchmarks`, ends just before the next `## ` heading).

Summarize to the user:

- The updated `## Benchmarks` section.
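
The section boundaries described above (from `## Benchmarks` up to, but not including, the next `## ` heading) can be expressed as a single regular expression. This is a sketch under one assumption: no line inside the section starts with `## ` other than a real heading, for example inside a fenced code block. `replace_section` is a hypothetical name.

```python
import re


def replace_section(readme, new_section, heading="## Benchmarks"):
    """Swap the section under `heading` for new_section, keeping the rest intact."""
    pattern = re.compile(
        # Heading line, then everything up to the next level-2 heading or end of file.
        # "### " subheadings do not match "^## " and so stay inside the section.
        rf"^{re.escape(heading)}[ \t]*\n.*?(?=^## |\Z)",
        flags=re.MULTILINE | re.DOTALL,
    )
    if not pattern.search(readme):
        raise ValueError(f"section {heading!r} not found")
    replacement = new_section.rstrip("\n") + "\n\n"
    # A function replacement avoids backslash interpretation in the new text.
    return pattern.sub(lambda _match: replacement, readme, count=1)
```

Raising when the heading is absent keeps a missing section from turning into a silent no-op.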