| name | runtime-errors |
| description | Use this skill when a Sirius query crashes, segfaults, hangs, throws an exception, or unexpectedly falls back to CPU. Also use when you see CUDA errors, std::bad_alloc, or the process gets killed. Diagnoses issues using log analysis, cuda-gdb, AddressSanitizer, and NVIDIA Compute Sanitizer.
|
| argument-hint | [sql-query-or-file] [--timeout 30] |
| disable-model-invocation | true |
Runtime Error Analyzer
Diagnose runtime errors (crashes, segfaults, exceptions, hangs, fallback triggers) by analyzing log files, inserting targeted debug logs, using cuda-gdb and NVIDIA Compute Sanitizer, and re-running queries.
Reference: See .claude/skills/_shared/build-and-query.md for shared infrastructure (build modes, query execution, autonomy mode, change tracking, debug log conventions).
Triage
-
Gather context:
- Ask the user for the SQL query (or accept via
$ARGUMENTS) and the error description
- Ask about data format (DuckDB or Parquet) -- see shared build-and-query.md
- Parse optional
--timeout argument (default: 30 seconds)
- Determine autonomy mode:
interactive (default), autonomous, or semi-autonomous
-
Classify the error type to determine which workflow path to follow:
- Hang / timeout (query never returns, process stuck) -> Hang Path
- Segmentation fault (SIGSEGV) (process killed by signal 11) -> Segfault Path
- Runtime error (exception, error message, unexpected fallback, wrong exit code) -> Runtime Error Path
Each path below is self-contained with its own phased workflow. All paths share the same fix iteration and log cleanup steps at the end.
Hang Path
When a query never returns or the process appears stuck. Hangs are typically caused by deadlocks, infinite loops, missing CUDA stream synchronization, or a thread waiting on a condition that is never signaled.
Phase 1: Reproduce and capture last-known state
-
Run the query with a timeout to confirm the hang:
export SIRIUS_LOG_LEVEL=trace
export SIRIUS_LOG_DIR=build/release/log/run_$(date +%s)
mkdir -p $SIRIUS_LOG_DIR
timeout <TIMEOUT> build/release/duckdb <db_path> <<'EOF'
CALL gpu_execution('<QUERY>');
EOF
- If exit code is 124, the process was killed by timeout -- confirmed hang
SIRIUS_LOG_LEVEL=trace triggers spdlog::flush_on(trace) so all log entries before the hang are flushed
-
Read the log file and identify the last log entry before the hang:
- Which pipeline stage was active?
- Which thread(s) logged last?
- Was the last operation a lock acquisition, a CUDA kernel launch, a memory allocation, or a queue wait?
-
Look for patterns:
- Deadlock indicators: Multiple threads each waiting on a lock/condition, no progress. Look for
mutex, lock, wait, condition_variable in the last entries.
- Infinite loop indicators: The same log entry repeated many times, or a loop counter growing without bound.
- CUDA sync hang: Last entry is a
cudaStreamSynchronize, cudaDeviceSynchronize, or cudaEventSynchronize -- the GPU kernel may be hung or a stream dependency is unsatisfied.
- Queue starvation: Task Creator waiting for completed tasks, but no tasks are being submitted or completed (empty queue).
Phase 2: Targeted debug logging (ask user before proceeding)
If Phase 1 narrows the hang to a code region but not the exact cause:
-
Insert SIRIUS_LOG_TRACE("[SIRIUS_DIAG] <location>: reached"); statements at:
- Entry/exit of suspect functions
- Before/after lock acquisitions and releases
- Before/after CUDA synchronization calls
- Inside loops (with iteration counter) to detect infinite loops
- Task queue push/pop operations
-
Rebuild and re-run with timeout. Compare new logs against Phase 1 logs:
- Which new log entries appear? Which don't?
- If a "before lock" appears but "after lock" doesn't -> deadlock on that lock
- If loop counter grows without bound -> infinite loop
- If "before cudaSync" appears but "after cudaSync" doesn't -> GPU hang
-
Repeat up to 3 iterations, narrowing the scope each time.
Phase 3: cuda-gdb live attach (ask user before proceeding)
For deadlocks and thread-level hangs where log analysis is insufficient:
-
Run the query without timeout (let it hang). Use relwithdebinfo or clang-debug:
build/<preset>/duckdb <db_path> <<'EOF' &
CALL gpu_execution('<QUERY>');
EOF
HANG_PID=$\!
-
Wait for it to hang (check with ps and log activity), then attach:
cuda-gdb -p $HANG_PID -batch \
-ex "thread apply all bt" \
-ex "info threads" \
-ex quit
-
Analyze the thread backtraces:
- Identify threads blocked in
pthread_mutex_lock, pthread_cond_wait, futex, or CUDA sync calls
- Map thread IDs back to Sirius components (Thread Coordinator, Task Creator, Pipeline Executor, Scan Executor, GPU scheduling threads)
- For deadlocks: identify the lock cycle (Thread A holds Lock 1, waits on Lock 2; Thread B holds Lock 2, waits on Lock 1)
- For CUDA hangs: check if a GPU kernel is still running or if a stream is waiting on an event that was never recorded
-
Kill the hung process: kill $HANG_PID
Hang analysis checklist
Common hang causes in Sirius:
- Deadlock between Thread Coordinator and Task Creator: Both waiting on each other's condition variables
- GPU kernel infinite loop: A CUDA kernel that doesn't terminate (e.g., hash table probe with broken termination condition)
- Missing pipeline completion signal: A pipeline finishes but doesn't notify the Task Creator, so it waits forever
- Memory reservation deadlock: Thread waiting for memory reservation while holding a lock that blocks the thread that would free memory
- cuCascade eviction deadlock: Eviction callback tries to acquire a lock already held by the requesting thread
- Stream dependency cycle: CUDA stream A waits on event from stream B, stream B waits on event from stream A
Segfault Path
When the error is a segmentation fault (SIGSEGV) or bus error (SIGBUS). Sirius has a built-in backtrace handler (src/util/segfault_backtrace_handler.cpp) that automatically prints a demangled stack trace on crash, so the first step is always to capture that output.
Phase 1: Capture the automatic backtrace
Sirius installs a signal handler (via install_segfault_backtrace_handler()) that catches SIGSEGV/SIGBUS and prints a demangled C++ backtrace to stderr. If SIRIUS_LOG_DIR is set, it also writes the backtrace to $SIRIUS_LOG_DIR/segfault_backtrace.txt.
-
Run the query with trace logging and capture stderr:
export SIRIUS_LOG_LEVEL=trace
export SIRIUS_LOG_DIR=build/release/log/run_$(date +%s)
mkdir -p $SIRIUS_LOG_DIR
build/release/duckdb <db_path> <<'EOF' 2>&1 | tee /tmp/segfault_output.txt
CALL gpu_execution('<QUERY>');
EOF
-
Read the backtrace output. Look for the block between *** SIGSEGV — backtrace *** and *** end backtrace ***. The output includes:
- Faulting thread ID -- identifies which thread crashed
- Demangled stack frames -- shows the full call chain at the crash point
- The backtrace is also saved to
$SIRIUS_LOG_DIR/segfault_backtrace.txt
-
Analyze the backtrace:
- Identify the top frames in
namespace sirius -- this is the crash location
- Read the source file at that location to understand the crash context
- Determine whether the crash is CPU-side or GPU-side based on the call chain
- Cross-reference with known patterns in
common-segfaults.md
-
Also read the trace log ($SIRIUS_LOG_DIR/*.log) to understand the execution path leading up to the crash -- which pipeline, operator, and data was being processed.
If the backtrace clearly identifies the crash location, skip to the Crash Analysis Checklist below and then to the Fix Iteration section.
If the backtrace is insufficient (e.g., frames are in external libraries without symbols, the crash is in a GPU kernel, or the backtrace points to a generic location like a memory allocator), continue to Phase 1b.
Phase 1b: Binary-search log insertion (when backtrace is insufficient)
When the automatic backtrace doesn't pinpoint the root cause:
- Insert
SIRIUS_LOG_TRACE("[SIRIUS_DIAG] <location>: reached"); at strategic points around the area identified by the backtrace
- For data inspection at the suspected crash point, also insert debug utility calls:
sirius::debug_schema(*batch, stream);
sirius::debug_head(*batch, 5, stream);
sirius::debug_nulls(*batch, stream);
- Binary search strategy:
- Insert log statements at function entry points across the suspected code path
- Rebuild, run -- see which log entry was the last to appear before crash
- Insert more log statements within the identified function
- Rebuild, run -- narrow down further
- Repeat until the exact crashing line is identified
SIRIUS_LOG_LEVEL=trace triggers spdlog::flush_on(trace) ensuring no log data is lost before crash (see "Why Immediate Flush Matters" below)
After Phase 1/1b, use the analysis to determine whether the crash is likely CPU-side or GPU-side, then branch:
- CPU-side crash (backtrace/last log entry is in CPU orchestration code, pipeline scheduling, memory management, etc.) -> Phase 2a: ASan
- GPU-side crash (backtrace/last log entry is before/during a CUDA kernel launch, cuDF call, or GPU memory operation) -> Phase 2b: Compute Sanitizer
- Unclear -> Try Phase 2a first (faster), then Phase 2b if ASan finds nothing
Phase 2a: AddressSanitizer -- for CPU-side crashes (ask user before proceeding)
Summarize Phase 1 findings. Offer ASan when the crash appears to be a CPU-side memory error (buffer overflow, use-after-free, dangling pointer). ASan pinpoints the exact memory violation with stack traces for both the bad access and the original allocation.
Note: ASan only detects CPU-side memory errors. If the crash is GPU-side, skip to Phase 2b (Compute Sanitizer).
- Build with
clang-debug (ASan is on by default in Debug builds):
CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) make clang-debug
- Run the reproduction case:
export SIRIUS_LOG_LEVEL=trace
export SIRIUS_LOG_DIR=build/clang-debug/log/run_$(date +%s)
mkdir -p $SIRIUS_LOG_DIR
ASAN_OPTIONS="detect_leaks=1:halt_on_error=1" build/clang-debug/duckdb <db_path> <<'EOF'
CALL gpu_execution('<QUERY>');
EOF
- Parse ASan output. ASan reports include:
- Error type: heap-buffer-overflow, stack-buffer-overflow, heap-use-after-free, double-free, etc.
- Bad access location: file, line, and full stack trace of the invalid read/write
- Allocation context: where the memory was originally allocated (and freed, for use-after-free)
- Read the source files at both the access and allocation locations
- Trace the data flow to understand how the invalid state arose
ASan overhead: ~2x slowdown, ~2-3x memory. See _shared/build-and-query.md for full ASan configuration details.
Phase 2b: NVIDIA Compute Sanitizer memcheck -- for GPU-side crashes (ask user before proceeding)
Use when Phase 1 points to a GPU memory error (crash during/after a CUDA kernel launch or cuDF operation). Compute Sanitizer catches the exact GPU memory violation with kernel name and line info.
Build with debug symbols (clang-debug or relwithdebinfo):
compute-sanitizer --tool memcheck build/<preset>/duckdb <<'EOF'
CALL gpu_execution('<QUERY>');
EOF
Parse output for: out-of-bounds accesses, misaligned accesses, use-after-free on device memory.
Phase 3: cuda-gdb (ask user before proceeding)
Fallback when ASan and Compute Sanitizer don't find the issue, or when you need a full backtrace / interactive stepping to understand the crash context.
Crash analysis checklist
- Read the source file at the crash point
- Check for common causes:
- Null pointer dereference
- Dangling reference (object lifetime issue)
- Buffer overflow / out-of-bounds access
- Use-after-free
- Iterator invalidation
- GPU memory access violation
- cuDF column lifetime issues (column destroyed while view still active)
- DuckDB vector invalidation (vector recycled during scan)
- Trace the data flow to find where the invalid state originated
- Cross-reference with known patterns in
common-segfaults.md in this directory
Runtime Error Path
When the query produces an exception, error message, unexpected fallback to CPU, or a non-zero exit code without a segfault. This covers C++ exceptions, DuckDB errors, CUDA runtime errors (cudaError), cuDF exceptions, and Sirius fallback triggers.
Phase 1: Log analysis
-
Run the query with trace logging to capture the full execution path:
export SIRIUS_LOG_LEVEL=trace
export SIRIUS_LOG_DIR=build/release/log/run_$(date +%s)
mkdir -p $SIRIUS_LOG_DIR
build/release/duckdb <db_path> <<'EOF'
CALL gpu_execution('<QUERY>');
EOF
Capture both stdout/stderr output AND the log file.
-
Read the log file and analyze:
- Find the error message in both stderr and the log. Search for
[error], [critical], Exception, Error, CUDA error, Fallback.
- Trace the execution path leading up to the error:
- Which pipelines were created?
- Which operators were executed before the error?
- What data was being processed (table names, row counts, column types)?
- Identify the error category:
- C++ exception:
std::runtime_error, std::logic_error, etc. -- look for the throw site in the log/stderr
- CUDA runtime error:
cudaErrorXxx -- indicates GPU-side failure (OOM, invalid config, driver error)
- cuDF exception:
cudf::logic_error, cudf::cuda_error -- cuDF operation failed
- DuckDB error: Error propagated from DuckDB's execution engine
- Sirius fallback: Query or operator fell back to CPU execution -- search log for
fallback entries. Check src/fallback.cpp to understand which condition triggered it.
- Assertion failure:
assert() or DCHECK failure -- the code hit an unexpected state
-
Read the source file at the error location. Understand the function, the expected preconditions, and what state would cause the error.
Phase 2: Targeted debug logging (ask user before proceeding)
If Phase 1 identifies the general area but not the root cause:
-
Insert SIRIUS_LOG_TRACE("[SIRIUS_DIAG] ...") statements to capture:
- Variable values at the error site (function arguments, object state)
- Conditional branch outcomes leading to the error (which
if path was taken?)
- Data characteristics: row counts, column types, null counts, string lengths
- Memory state: allocation sizes, reservation amounts, available GPU memory
- Data state at the error site via debug utilities:
sirius::debug_schema(*batch, stream);
sirius::debug_head(*batch, 10, stream);
sirius::debug_nulls(*batch, stream);
-
Rebuild and re-run with a new log directory. Compare against Phase 1 logs.
-
Repeat up to 3 iterations, each time narrowing the scope:
- Iteration 1: Function-level -- which function is the error in?
- Iteration 2: Block-level -- which code block within the function?
- Iteration 3: Line-level -- which exact line and what variable values?
Phase 3: cuda-gdb (ask user before proceeding)
For errors that are hard to reproduce or where the exception obscures the true origin:
- Build with debug symbols. Ask the user which preset:
relwithdebinfo (recommended) -- optimized with debug symbols, faster reproduction
clang-debug -- unoptimized, best for stepping through code line-by-line
- Run with cuda-gdb to catch the exception at throw time (replace
<preset> with chosen preset):
cuda-gdb --batch \
-ex "catch throw" \
-ex run \
-ex bt \
-ex quit \
--args build/<preset>/duckdb
This catches C++ exceptions at the throw site, before unwinding obscures the call stack.
- For CUDA errors, set a breakpoint on the CUDA error handler:
cuda-gdb --batch \
-ex "break cudaGetLastError" \
-ex run \
-ex bt \
-ex quit \
--args build/<preset>/duckdb
Runtime error analysis checklist
Common runtime error causes in Sirius:
- Unsupported data type: A column type (e.g.,
INTERVAL, BLOB, nested types) reached a GPU operator that doesn't handle it. Should trigger fallback but may throw instead.
- cuDF operation failure: cuDF throws on invalid input (e.g., mismatched column lengths in join, unsupported aggregation type). Read the cuDF error message carefully -- it usually says exactly what's wrong.
- GPU OOM:
cudaErrorMemoryAllocation -- the GPU ran out of memory. Check memory reservation logic, consider reducing data batch size or GPU region sizes.
- Unexpected fallback: A query falls back to CPU when the user expected GPU execution. Check
src/fallback.cpp for the fallback condition. Common triggers: unsupported operator, data too large, unsupported type.
- Type mismatch: Sirius type conversion (DuckDB types <-> cuDF types) failed. Check type mapping in
src/include/cudf/ and src/include/data/.
- Pipeline dependency error: A pipeline tried to read from a Data Repository that hasn't been populated yet (dependency ordering issue in
sirius_meta_pipeline).
- Row count overflow: cuDF uses
int32_t for row indices (~2B row limit). Large tables can overflow. Should trigger fallback.
Shared: Fix Iteration
After identifying the root cause (from any path above):
-
Suggest a fix with explanation of why it resolves the issue.
-
Apply and verify (behavior depends on autonomy mode):
- Apply the fix (with git checkpoint)
- Rebuild and re-run the reproduction case
- Verify the error no longer occurs
- If the error persists or a new error appears, repeat the relevant path
- Optionally compare output against DuckDB CPU baseline to ensure correctness
- Continue until: issue resolved, max iterations (default: 5) reached, or user intervenes
-
Present final summary: all attempted fixes, which worked, which didn't.
Shared: Log Management Summary
At the end of any path, present which [SIRIUS_DIAG] log statements should be:
- Promoted: Useful long-term -- change from
[SIRIUS_DIAG] tag to a proper log message at appropriate level (SIRIUS_LOG_DEBUG, SIRIUS_LOG_INFO, etc.)
- Removed: Only useful for this investigation -- delete entirely
Debug Utilities
Sirius provides structured debug utility functions in src/include/debug_utils.hpp (implementation in src/debug_utils.cpp). Use these for data inspection at suspected fault points instead of writing ad-hoc log statements. All functions output to SIRIUS_LOG_DEBUG with [SIRIUS_DIAG] prefix, are thread-safe (buffered single-call output), and are wrapped in try/catch (never crash the pipeline).
Function Signatures
#include "debug_utils.hpp"
sirius::debug_schema(batch, stream);
sirius::debug_schema(batch, stream, {"col_a", "col_b"});
sirius::debug_nulls(batch, stream);
sirius::debug_head(batch, 10, stream);
sirius::debug_head(batch, 5, stream, sirius::DebugFormat::CSV);
sirius::debug_sample(batch, 10, stream);
sirius::debug_stats(batch, stream);
sirius::debug_checksum(batch, stream);
sirius::debug_diff(batch_a, batch_b, stream);
Usage for Runtime Error Diagnosis
At suspected fault points (before/after the crashing operator):
sirius::debug_schema(*batch, stream);
sirius::debug_nulls(*batch, stream);
sirius::debug_head(*batch, 10, stream);
sirius::debug_stats(*batch, stream);
When narrowing down a crash to an operator:
sirius::debug_schema(*input_batch, stream);
sirius::debug_head(*input_batch, 5, stream);
sirius::debug_schema(*output_batch, stream);
sirius::debug_head(*output_batch, 5, stream);
Key benefits over ad-hoc logging:
- Thread-safe: output buffered into single log call (no interleaving)
- Null-aware: NULL values displayed explicitly, not as garbage
- Try/catch wrapped: debug call never crashes the pipeline (even if the data is corrupted)
- Supports all Sirius types: INT, FLOAT, STRING, DECIMAL, TIMESTAMP, DATE
Known Segfault Patterns
See common-segfaults.md in this directory for a catalog of known segfault patterns and their fixes.
Scope
Only analyze code in namespace sirius plus exceptions listed in shared build-and-query.md. Ignore legacy namespace duckdb code.