dataset-manager
// Use this skill to generate benchmark datasets (TPC-H, TPC-DS, etc.). Trigger when the user needs test data at a specific scale factor for benchmarking or testing. Supports parquet and duckdb output formats.
| Field | Value |
|---|---|
| name | dataset-manager |
| description | Use this skill to generate benchmark datasets (TPC-H, TPC-DS, etc.). Trigger when the user needs test data at a specific scale factor for benchmarking or testing. Supports parquet and duckdb output formats. |
| argument-hint | [benchmark] [scale_factor] [--format duckdb\|parquet] [--output <path>] |
Generate benchmark datasets by running the corresponding shell script under test/<benchmark>_performance/.
Parse $ARGUMENTS for:
- benchmark (required)
- scale factor (required)
- --format duckdb|parquet (optional; each benchmark has a default)
- --output <path> (optional; scripts have sensible defaults)

If any required parameter (benchmark, scale factor) is missing, ask the user.
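The parsing rules above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function name `parse_dataset_args` and the variable names are hypothetical, and the positional order (benchmark first, then scale factor) is assumed from the argument hint.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): parse
#   <benchmark> <scale_factor> [--format duckdb|parquet] [--output <path>]
# Sets BENCHMARK, SCALE_FACTOR, FORMAT, OUTPUT.
# Returns 1 if a required value is missing or an argument is invalid.
parse_dataset_args() {
  BENCHMARK=""
  SCALE_FACTOR=""
  FORMAT=""   # empty means "use the benchmark's default"
  OUTPUT=""   # empty means "use the script's default"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --format)
        [ "$#" -ge 2 ] || { echo "--format needs a value" >&2; return 1; }
        FORMAT="$2"; shift 2 ;;
      --output)
        [ "$#" -ge 2 ] || { echo "--output needs a value" >&2; return 1; }
        OUTPUT="$2"; shift 2 ;;
      *)
        # Positional arguments: benchmark first, then scale factor (assumed order).
        if [ -z "$BENCHMARK" ]; then
          BENCHMARK="$1"
        elif [ -z "$SCALE_FACTOR" ]; then
          SCALE_FACTOR="$1"
        else
          echo "unexpected argument: $1" >&2; return 1
        fi
        shift ;;
    esac
  done
  case "$FORMAT" in
    ""|duckdb|parquet) ;;
    *) echo "unsupported format: $FORMAT" >&2; return 1 ;;
  esac
  if [ -z "$BENCHMARK" ] || [ -z "$SCALE_FACTOR" ]; then
    # In the skill, this is the point where the user would be asked.
    echo "missing required parameter (benchmark, scale factor)" >&2
    return 1
  fi
}

# Usage: parse a typical argument string and show the result.
parse_dataset_args tpch 1 --format parquet
echo "$BENCHMARK sf=$SCALE_FACTOR format=${FORMAT:-default} output=${OUTPUT:-default}"
```

A non-zero return corresponds to the "ask the user" case rather than a hard failure.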
Each entry follows the same structure: script location, command template, supported formats, defaults, and prerequisites.
| Field | Value |
|---|---|
| Script | test/tpch_performance/generate_tpch_data.sh |
| Default format | parquet |
| Formats | parquet (tpchgen-rs), duckdb (DuckDB dbgen()) |
| Default output (parquet) | test_datasets/tpch_parquet_sf<SF> |
| Default output (duckdb) | test_datasets/tpch_sf<SF>.duckdb |
| Prerequisites | Parquet: pixi env (rust, python, pyarrow). DuckDB: build/release/duckdb |
cd test/tpch_performance && pixi run bash generate_tpch_data.sh <SF> --format <FORMAT> [--output <path>]
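As a rough illustration of how the TPC-H defaults in the table combine with the command template, the sketch below only prints the command and the default output location; it does not run the generator. The function names `tpch_default_output` and `tpch_command` are hypothetical, and the source does not state which directory the default output paths are relative to.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository) mirroring the TPC-H table.

# Print the documented default output location for a scale factor and format.
tpch_default_output() {
  local sf="$1" format="${2:-parquet}"   # parquet is the documented default
  case "$format" in
    parquet) echo "test_datasets/tpch_parquet_sf${sf}" ;;
    duckdb)  echo "test_datasets/tpch_sf${sf}.duckdb" ;;
    *) echo "unsupported format: $format" >&2; return 1 ;;
  esac
}

# Print (do not execute) the command line from the template above.
tpch_command() {
  local sf="$1" format="${2:-parquet}" output="${3:-}"
  local cmd="cd test/tpch_performance && pixi run bash generate_tpch_data.sh ${sf} --format ${format}"
  if [ -n "$output" ]; then
    cmd="${cmd} --output ${output}"
  fi
  echo "$cmd"
}

# Usage: scale factor 1 with all defaults.
tpch_command 1
tpch_default_output 1
```

Printing the command first makes it easy to confirm the scale factor and format with the user before starting a long generation run.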
Notes:
| Field | Value |
|---|---|
| Script | test/tpcds_performance/generate_tpcds_data.sh |
| Default format | duckdb |
| Formats | duckdb, parquet |
| Default output (duckdb) | test_datasets/tpcds_sf<SF>.duckdb |
| Default output (parquet) | test_datasets/tpcds_parquet_sf<SF> |
| Prerequisites | build/release/duckdb |
cd test/tpcds_performance && bash generate_tpcds_data.sh <SF> --format <FORMAT> [--output <path>]
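The TPC-DS entry differs from TPC-H in its default format (duckdb) and in not going through pixi. A minimal sketch of resolving those defaults, again only printing the command, might look like this; `tpcds_default_output` and `tpcds_command` are hypothetical names used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository) mirroring the TPC-DS table.

# Print the documented default output location for a scale factor and format.
tpcds_default_output() {
  local sf="$1" format="${2:-duckdb}"   # duckdb is the documented default
  case "$format" in
    duckdb)  echo "test_datasets/tpcds_sf${sf}.duckdb" ;;
    parquet) echo "test_datasets/tpcds_parquet_sf${sf}" ;;
    *) echo "unsupported format: $format" >&2; return 1 ;;
  esac
}

# Print (do not execute) the command line from the template above.
tpcds_command() {
  local sf="$1" format="${2:-duckdb}" output="${3:-}"
  local cmd="cd test/tpcds_performance && bash generate_tpcds_data.sh ${sf} --format ${format}"
  if [ -n "$output" ]; then
    cmd="${cmd} --output ${output}"
  fi
  echo "$cmd"
}

# Usage: scale factor 10, parquet output, default location.
tpcds_command 10 parquet
tpcds_default_output 10 parquet
```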
Notes:
test/tpcds_performance/queries/q{1..99}.sql

For any benchmark that requires the DuckDB binary, check before running:
test -x build/release/duckdb
If missing, tell the user to build first: CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make
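The prerequisite check can be wrapped in a small guard, sketched below. The function names `needs_duckdb` and `require_duckdb` are hypothetical; the mapping of benchmark and format to the DuckDB requirement is read off the Prerequisites rows above (TPC-H parquet uses the pixi environment, everything else lists build/release/duckdb).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository).

# Succeed if the given benchmark/format combination lists the DuckDB binary
# as a prerequisite. Per the tables: only TPC-H parquet does not.
needs_duckdb() {
  local benchmark="$1" format="$2"
  case "${benchmark}:${format}" in
    tpch:parquet) return 1 ;;
    *)            return 0 ;;
  esac
}

# Succeed if the binary exists and is executable; otherwise print the
# build hint from the text and fail.
require_duckdb() {
  local bin="${1:-build/release/duckdb}"
  if [ -x "$bin" ]; then
    return 0
  fi
  echo "DuckDB binary not found at ${bin}; build first: CMAKE_BUILD_PARALLEL_LEVEL=\$(nproc) make" >&2
  return 1
}

# Usage: only check the binary when the chosen combination needs it.
if needs_duckdb tpch parquet; then
  require_duckdb || echo "build required before generating data"
fi
```

Running the check before changing into the benchmark directory keeps the relative path build/release/duckdb valid, assuming the command is issued from the repository root.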