paper-benchmark-execution

Name: Paper Benchmark Execution
Author: albumentations-team

Executes the paper benchmark plan for RGB, multichannel, DataLoader, and video benchmarks. Use when the user mentions the paper benchmark, deadline plan, machine matrix, RGB micro, multichannel, DataLoader, video GPU, c4/c4d/g2 machines, or what to run next.

stars:87

forks:3

updated: May 6, 2026, 15:45

related-skills.json

Same repository

paper-coverage-validator.md

from "albumentations-team/benchmark"

Validates whether benchmark artifacts cover the paper's required RGB micro and RGB DataLoader sections. Use when checking missing RGB runs, deciding what to run next, validating gcp_runs/output folders, or preparing paper tables.

updated: 2026-05-06, stars: 87

benchmark-runner.md

from "albumentations-team/benchmark"

Automates running image/video augmentation benchmarks for single or multiple libraries, validates outputs, generates comparison reports, and updates documentation. Use when running benchmarks, comparing library performance, or when the user mentions benchmark, benchmark.cli, pyperf, GCP benchmark runs, or performance testing.

updated: 2026-05-06, stars: 87

documentation-generator.md

from "albumentations-team/benchmark"

Updates benchmark documentation with latest results including README tables, speedup plots, and library metadata. Use when updating documentation, generating comparison tables, or when the user mentions update_docs.sh or documentation generation.

updated: 2026-05-06, stars: 87

gcp-benchmark-triage.md

from "albumentations-team/benchmark"

Triage detached GCP benchmark runs, DONE/FAILED sentinels, VM cleanup, vm.log, gcp_last_run.json, and partial result downloads. Use when GCP benchmark logs mention DONE, FAILED, exit_code.txt, VM disappeared, STOPPING, gcloud machine type errors, or missing artifacts.

updated: 2026-05-06, stars: 87

library-integration.md

from "albumentations-team/benchmark"

Guides adding support for new image/video augmentation libraries to the benchmark suite. Use when integrating a new library, adding library support, or when the user mentions adding a new augmentation library to test.

updated: 2026-05-06, stars: 87

performance-analysis.md

from "albumentations-team/benchmark"

Analyzes benchmark results to identify slow transforms, warmup issues, and performance regressions. Compares speedups across libraries and generates optimization recommendations. Use when analyzing performance, investigating slow benchmarks, or comparing library results.

updated: 2026-05-06, stars: 87

package.json

"author": "albumentations-team"

"repository": "albumentations-team/benchmark"

Paper Benchmark Execution

Use _internal/plans/paper_benchmark_execution_plan.md as the source of truth.

Rules

Use benchmark/matrix.py as the source of truth for built-in paper scenario/library/mode support, spec paths, requirement groups, paper transform-set files, device support, pipeline scopes, and backend selection.
Use benchmark/policy.py as the source of truth for slow-transform thresholds and media defaults. Do not patch separate image/video defaults in micro or DataLoader runners.
Use checked-in YAML configs under configs/paper/ for paper and GCP runs. Run python -m benchmark.cli plan --config ... before launch, then use python -m benchmark.cli run --config ...; use small overrides such as --num-items, --num-runs, --device, --workers, --batch-size, and --output instead of rebuilding long flag commands.
BenchmarkRunConfig in benchmark/config/models.py is the typed source of truth for run shape. resolved_config.yaml and result metadata must contain the resolved config, including expanded paper transform names.
benchmark/config/resolve.py owns YAML loading and CLI overrides, while benchmark/config/plan.py owns dry-run job and expected-output expansion.
Keep result filename changes in benchmark/output_naming.py, and detached GCP path changes in benchmark/cloud/paths.py.
Paper run command construction should flow through benchmark/jobs.py, and backend-specific execution should flow through benchmark/orchestrator.py. Do not add paper-only command branches in benchmark/cli.py.
Do not run every benchmark on every CPU.
- CPU rows run on CPU-only machines, usually c4-standard-16.
- GPU rows run only for GPU libraries/paths, usually g2-standard-16 with L4.
- Do not run CPU-only rows on GPU VMs for hardware symmetry; label hardware per row instead.
Respect the current 128-vCPU all-regions quota, 96-vCPU C4-family quota in us-central1, and 1-GPU quota. In practice, run up to six c4-standard-16 CPU jobs if no GPU job is active, or up to five C4 jobs plus one g2-standard-16 GPU job. Keep only one G2 job active because L4 quota remains one GPU.
Respect the current 500 GB Hyperdisk Balanced quota in us-central1. Production C4 configs use 100 GB boot disks; a 200 GB C4 boot disk can block the third parallel C4 VM with HDB_TOTAL_GB quota errors.
Treat RGB micro as a profiler, not the main user-facing training throughput table.
Keep micro specs native: no Normalize, ToTensor, axis conversion, or DataLoader collation work in micro rows.
DataLoader pipeline rows use recipe specs with Normalize+ToTensor; the conversion belongs in *_pipeline_impl.py, not in pipeline_runner.py.
Video DataLoader rows also use dedicated *_video_pipeline_impl.py recipe specs. Do not run video DataLoader through the transform-only *_video_impl.py micro specs.
Keep slow-transform preflight enabled for micro and DataLoader runs. Image transforms slower than the practical floor (0.05 sec/image or more, i.e. 20 img/s or less) should early-stop instead of consuming full paper sweep time, because such transforms are not usable in practical DataLoader training pipelines.
DataLoader paper sweeps should default to epoch-based timing (--min-time 0) and rely on --num-runs, full dataset size, and slow-preflight guards rather than a fixed 30-second minimum per recipe.
Before cloud runs, reduced local production-path runs should show visible tqdm progress for library loops, media loading, micro transforms, and pipeline transforms. Missing or anonymous progress bars are a benchmark UX bug because long paper sweeps must be diagnosable while running.
Do not run every transform from benchmark/transforms/specs.py for the paper. Use only transforms that exist in at least two selected libraries. The paper transform sets live in docs/paper_transform_sets/rgb.md, docs/paper_transform_sets/9ch.md, and docs/paper_transform_sets/video.md.
Use --transform-set paper for paper micro/pipeline runs unless explicitly testing a smaller transform subset with --transforms.
Prefer the checked-in production configs over raw commands for current paper runs:
- configs/paper/prod_c4_rgb_micro_cpu.yaml
- configs/paper/prod_c4_rgb_dataloader_cpu.yaml
- configs/paper/prod_c4_9ch_micro_cpu.yaml
- configs/paper/prod_c4_9ch_dataloader_cpu.yaml
- configs/paper/prod_g2_rgb_micro_gpu.yaml
- configs/paper/prod_g2_rgb_dataloader_gpu.yaml
- configs/paper/prod_g2_9ch_micro_gpu.yaml
- configs/paper/prod_g2_9ch_dataloader_gpu.yaml
Keep gcp_*_smoke.yaml configs for fast path checks and reruns only.
- configs/paper/gcp_g2_video_smoke.yaml
Use gs://imagenet_validation/ucf101/ucf101.tar for paper video cloud runs; uploaded object size is 14136559616 bytes.
Cloud paper runs should use one dataset tarball per dataset (val.tar, ucf101.tar) rather than GCS directories full of individual media files. Create tarballs on macOS with COPYFILE_DISABLE=1, tar --no-xattrs, and excludes for .DS_Store, AppleDouble ._*, and __MACOSX; detached GCP staging filters those entries again while extracting.
benchmark/cloud/stage_dataset.py runs in the VM bootstrap before the control venv exists. Keep it stdlib-only: no Pydantic, no benchmark.config, and no imports that require benchmark dependencies.
If paper scenario support changes, update docs/benchmark_architecture.md, docs/benchmark_scope.md, .cursor/skills/benchmark-runner/SKILL.md, config examples, and matrix/config/job tests in the same patch.
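
The macOS-entry filtering described in the dataset tarball rule above can be sketched with the standard library alone, in keeping with the stdlib-only constraint on benchmark/cloud/stage_dataset.py. This is an illustrative sketch under that assumption, not the repository's implementation; the function names is_macos_junk and clean_members are hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def is_macos_junk(member_name: str) -> bool:
    """True for .DS_Store, AppleDouble ._* files, and anything under __MACOSX."""
    parts = PurePosixPath(member_name).parts
    if "__MACOSX" in parts:
        return True
    basename = parts[-1] if parts else ""
    return basename == ".DS_Store" or basename.startswith("._")


def clean_members(archive: tarfile.TarFile):
    """Yield only the archive members that are not macOS metadata (hypothetical helper)."""
    for member in archive:
        if not is_macos_junk(member.name):
            yield member
```

Passing clean_members(archive) as the members argument of TarFile.extractall would skip the filtered entries during extraction, so a tarball that was created without the excludes still stages cleanly.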

Core Matrix

Already done:

MacBook M4 RGB micro.
n2-standard-16 RGB micro.
n2d-standard-16 RGB micro.
Reduced g2-standard-16 video smoke:
- 821ae79852204f5cb4d5bea42fab99b1: video micro, torchvision kornia, DONE, exit_code=0.
- 861bd4a840a84ec28ff711f3f68c81a8: video pipeline CUDA batch-copy smoke, albumentationsx torchvision kornia, DONE, exit_code=0.
- b7e8cdf6fd154357be68a0b38d134136: repeat video pipeline CUDA batch-copy smoke, albumentationsx torchvision kornia, DONE, exit_code=0.

Core remaining:

c4-standard-16: CPU-only paper tables.
c4d-standard-16: RGB micro AMD sanity check only.
g2-standard-16: final torchvision/Kornia/DALI GPU video rows only.

Required Paper Runs

Main CPU suite on c4-standard-16 or equivalent modern Intel CPU:

RGB micro: start from configs/paper/prod_c4_rgb_micro_cpu.yaml.
9ch micro: start from configs/paper/prod_c4_9ch_micro_cpu.yaml.
RGB DataLoader memory: start from configs/paper/prod_c4_rgb_dataloader_cpu.yaml.
9ch DataLoader memory: start from configs/paper/prod_c4_9ch_dataloader_cpu.yaml.
Video rows: transforms from docs/paper_transform_sets/video.md; run CPU/GPU subsets according to the machine plan.

Deadline-first DataLoader config fields:

execution:
  batch_size: 256  # RGB; use 64 for the current 9-channel candidate
  workers: 8
  num_runs: 1
  min_time: 0
  thread_policy: pipeline-default

Use batch_size: 64 for all 9-channel DataLoader libraries in the current main-table candidate. The first 9-channel CPU DataLoader attempts at batch_size: 256 and batch_size: 128 OOM-killed Kornia, so partial b256 and b128 9-channel rows are exploratory and should not be mixed into the main table.
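
The batch-size choice above can be written down as a small lookup. A minimal sketch, assuming scenario labels "rgb" and "9ch" (hypothetical labels chosen to mirror the config filenames) and a hypothetical helper name dataloader_execution; the field values are the ones listed in the config fragment above.

```python
# Deadline-first DataLoader batch sizes stated in the text above.
# The scenario labels are assumptions, not identifiers from the repository.
BATCH_SIZE_BY_SCENARIO = {"rgb": 256, "9ch": 64}


def dataloader_execution(scenario: str) -> dict:
    """Build the `execution` mapping for a DataLoader config (hypothetical helper)."""
    if scenario not in BATCH_SIZE_BY_SCENARIO:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return {
        "batch_size": BATCH_SIZE_BY_SCENARIO[scenario],
        "workers": 8,
        "num_runs": 1,
        "min_time": 0,  # epoch-based timing, no fixed minimum per recipe
        "thread_policy": "pipeline-default",
    }
```

Keeping the 9-channel value in one place makes it harder to accidentally mix exploratory b256 or b128 rows into the main table.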

After the full one-run matrix is covered and validated, add two more runs for important rows and aggregate them into 3-run statistics. Use 5 total runs only for high-variance or close-call conclusions.
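
One way to aggregate repeated runs of a row is sketched below. The text above does not say which statistics the paper reports, so the choice of mean, median, sample standard deviation, and coefficient of variation is an assumption, and summarize_runs is a hypothetical helper name.

```python
import statistics


def summarize_runs(throughputs: list[float]) -> dict:
    """Summarize per-run throughput values (e.g. img/s) for one table row."""
    if len(throughputs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)  # sample standard deviation
    return {
        "runs": len(throughputs),
        "mean": mean,
        "median": statistics.median(throughputs),
        "stdev": stdev,
        "cv": stdev / mean,  # relative spread; a large value suggests adding runs
    }


# Illustrative values only, not benchmark results.
summary = summarize_runs([100.0, 110.0, 90.0])
```

A row whose relative spread stays large after three runs is a natural candidate for the 5-run treatment reserved for high-variance or close-call conclusions.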

AMD sanity on c4d-standard-16 or equivalent:

RGB micro only, full selected transform set.
Optional reduced RGB DataLoader sanity with --num-items 1000; do not run the full CPU matrix on AMD unless studying CPU-vendor effects.

GPU/video suite on g2-standard-16 with L4 or equivalent:

GPU image micro and DataLoader production rows for torchvision and kornia on RGB and 9-channel images, starting from the prod_g2_* configs. 9-channel GPU micro uses num_items: 1000 because num_items: 2000 OOMs during Kornia device-resident preload on L4. Keep this labeled as a memory-limited GPU micro row.
Video micro on the G2 machine for albumentationsx, torchvision, and kornia, labeled by execution device: host CPU for AlbumentationsX, L4 GPU for torchvision/Kornia.
GPU video pipeline/DataLoader for GPU-capable paths.
DALI video pipeline when DALI is available.

Kornia image GPU rows intentionally exclude Shear in micro and DataLoader modes because Kornia's current CUDA shear parameter generator can fail with mixed CPU/CUDA tensors when moved to GPU. Keep Shear in the global RGB/9-channel paper transform sets for AlbumentationsX, Pillow, torchvision where supported, and Kornia CPU rows; mention this as a benchmark methodology limitation.

Kornia 9-channel image GPU rows also exclude MedianBlur. The L4 9-channel GPU micro run OOMed on Kornia's median-blur temporary allocation after device-resident preload. Keep MedianBlur in RGB GPU, CPU, and other-library rows; mention this as a Kornia 9-channel GPU memory limitation.

TorchVision image GPU DataLoader rows use a per-sample GPU loop for the measured transform, then batch normalization, because TorchVision v2 does not expose a same_on_batch=False equivalent for per-image random parameters in batched image transforms. Label this explicitly in paper tables.

TorchVision JpegCompression uses torchvision.transforms.v2.JPEG, which requires uint8 CPU input. Exclude it from TorchVision GPU image rows; keep it in CPU TorchVision rows and other libraries that support it. Mention this JPEG-compression device constraint when interpreting GPU tables.

Kornia RGB GPU DataLoader can record GaussianIllumination as unsupported because the current recipe path can fail with mixed CPU/CUDA tensors. Keep it as a Kornia GPU methodology limitation, not a global transform-set removal.

CUDA DataLoader rows record per-transform peak GPU memory during timed runs. Use gpu_memory.peak_allocated_bytes and gpu_memory.peak_reserved_bytes as paper-facing cost columns for GPU augmentation. Pyperf micro rows do not report peak memory because pyperf executes timed loops in worker processes.
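
Turning the two memory fields into table columns might look like the sketch below. It assumes each result row is a mapping with a nested gpu_memory entry, which is inferred from the dotted field names above rather than confirmed; gpu_memory_columns and the MiB column names are hypothetical.

```python
from typing import Optional

BYTES_PER_MIB = 2**20


def gpu_memory_columns(row: dict) -> dict:
    """Return peak GPU memory in MiB, or None where a row has no measurement."""
    memory = row.get("gpu_memory") or {}

    def to_mib(key: str) -> Optional[float]:
        value = memory.get(key)
        return None if value is None else round(value / BYTES_PER_MIB, 1)

    return {
        "peak_allocated_mib": to_mib("peak_allocated_bytes"),
        "peak_reserved_mib": to_mib("peak_reserved_bytes"),
    }
```

Rows without a gpu_memory entry, such as pyperf micro rows, come back with None in both columns, so a table generator can render them as blank rather than zero.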

Do not rerun CPU-only image rows on GPU machines for hardware symmetry. Label hardware per row instead.

Execution Order

Inventory existing results and avoid rerunning completed n2/n2d baselines.
Run production Wave 1 RGB: CPU micro plus GPU micro, then CPU DataLoader plus GPU DataLoader.
Run production Wave 2 9-channel: CPU micro plus GPU micro, then CPU DataLoader plus GPU DataLoader.
Pull and validate artifacts after each run before starting top-up repeats.
Run video production sizing only after image tables are secured, unless video becomes central to the paper claim.
Pull and validate artifacts before generating plots/tables.

Validation

After pulling results, run:

python -m tools.check_paper_coverage gcp_runs output

Use --require-optional-libraries only when DALI must be present.
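
For scripted validation, the command above can be assembled programmatically. A minimal sketch, assuming --require-optional-libraries is a plain switch that takes no value and that the check is run from the repository root; coverage_command and run_coverage_check are hypothetical helper names.

```python
import subprocess
import sys


def coverage_command(require_optional_libraries: bool = False) -> list[str]:
    """Build the coverage-check command line shown above."""
    command = [sys.executable, "-m", "tools.check_paper_coverage", "gcp_runs", "output"]
    if require_optional_libraries:
        # Assumed to be a boolean switch; use only when DALI must be present.
        command.append("--require-optional-libraries")
    return command


def run_coverage_check(require_optional_libraries: bool = False) -> int:
    """Run the check and return its exit code without raising on failure."""
    result = subprocess.run(coverage_command(require_optional_libraries), check=False)
    return result.returncode
```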
