一键导入
query
// Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files.
// Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files.
Register a data file as a persistent external table in the DataFusion session. Supports Parquet, CSV, JSON, Arrow IPC, and Avro files. Explores the schema and writes to the session state file for reuse across skills.
Search Apache DataFusion documentation, user guide, and API reference. Returns relevant documentation for a question or keyword. Searches the official DataFusion repository and website.
Visualize and analyze DataFusion query execution plans. Shows logical and physical plans, identifies performance bottlenecks, and suggests optimizations. Supports EXPLAIN and EXPLAIN ANALYZE.
Install or update datafusion-cli. Supports installation via cargo install, Homebrew, or pre-built binaries. Checks the current version and offers to upgrade if outdated.
Create and manage materialized views using DataFusion. Persist SQL query results as Parquet files for fast repeated access. Track source dependencies and refresh when data changes. Powered by datafusion-cli's COPY TO.
Read and explore data files (Parquet, CSV, JSON, Arrow IPC, Avro) locally or from S3/GCS. Auto-detects format by extension. Uses datafusion-cli for schema inspection and data preview.
| name | query |
| description | Run SQL queries against registered tables or ad-hoc against files using datafusion-cli. Accepts raw SQL or natural language questions. Supports Parquet, CSV, JSON, and Arrow IPC files. |
| argument-hint | <SQL or question> [--file path] |
| allowed-tools | Bash |
You are helping the user query data using Apache DataFusion.
Input: $@
Follow these steps in order.
Look for an existing state file:
STATE_DIR=""
test -f .datafusion-skills/state.sql && STATE_DIR=".datafusion-skills"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || echo "$PWD")"
PROJECT_ID="$(echo "$PROJECT_ROOT" | tr '/' '-')"
test -f "$HOME/.datafusion-skills/$PROJECT_ID/state.sql" && STATE_DIR="$HOME/.datafusion-skills/$PROJECT_ID"
If found, verify the state file is valid:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SELECT 1;"
Now determine the mode:
--file flag is present, or the SQL references file paths (e.g. FROM 'data.parquet'), or STATE_DIR is empty.STATE_DIR is set and the input references table names, is natural language, or is SQL without file references.If no state file exists and no file is referenced, fall back to ad-hoc mode.
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
If the input is natural language (not valid SQL), generate SQL.
In session mode, first retrieve the schema to inform query generation:
datafusion-cli --file "$STATE_DIR/state.sql" -c "SHOW TABLES;"
Then for relevant tables:
datafusion-cli --file "$STATE_DIR/state.sql" -c "DESCRIBE <table_name>;"
When generating SQL, keep these DataFusion-specific features in mind:
SELECT * FROM 'path/to/file.parquet' works for Parquet files directlyCREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 'data.parquet';
CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'data.csv' OPTIONS ('has_header' 'true');
CREATE EXTERNAL TABLE t STORED AS JSON LOCATION 'data.json';
CREATE EXTERNAL TABLE t STORED AS ARROW LOCATION 'data.arrow';
CREATE EXTERNAL TABLE t STORED AS AVRO LOCATION 'data.avro';
COPY (SELECT * FROM t WHERE x > 10) TO 'output.parquet' STORED AS PARQUET;
ROW_NUMBER, RANK, LAG, LEAD, etc.WITH clauses are fully supportedCOUNT, SUM, AVG, MIN, MAX, MEDIAN, APPROX_PERCENTILE_CONT, etc.LENGTH, LOWER, UPPER, TRIM, SUBSTR, REGEXP_MATCH, etc.NOW(), DATE_TRUNC, DATE_PART, INTERVAL, TO_TIMESTAMP, etc.EXPLAIN or EXPLAIN ANALYZE to see query plansBefore executing, check if the query might produce too many rows.
If the query has no LIMIT and might return a large result, wrap it:
SELECT COUNT(*) AS total_rows FROM (<original_query>);
If total_rows > 1000, add LIMIT 100 and inform the user.
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
Session mode:
datafusion-cli --file "$STATE_DIR/state.sql" -c "<SQL_QUERY>"
Ad-hoc mode with file:
DataFusion v44+ can query Parquet, CSV, and JSON files directly by path:
datafusion-cli -c "SELECT * FROM '<file_path>' LIMIT 10;"
For Arrow IPC or Avro, create a temporary external table using a .sql file:
cat > /tmp/_df_query.sql << SQL
CREATE EXTERNAL TABLE _tmp STORED AS ARROW LOCATION '<file_path>';
<SQL_QUERY using _tmp>;
SQL
datafusion-cli --file /tmp/_df_query.sql
Ad-hoc mode without file:
datafusion-cli -c "<SQL_QUERY>"
datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retryDESCRIBE on the table/file and show the user available columns/datafusion-skills:datafusion-docs <error keywords> to search documentationPresent the query results clearly. If the result is a table, format it nicely.
Suggest next steps if appropriate:
/datafusion-skills:create-table to register the file for repeated use/datafusion-skills:materialized-view to persist resultsstate.sql exists, tables registered by /datafusion-skills:create-table are automatically available/datafusion-skills:datafusion-docs for persistent errors/datafusion-skills:explain-plan for slow queries