| name | data-analysis |
| description | Use this skill when the user uploads Excel (.xlsx/.xls) or CSV files and wants to perform data analysis, generate statistics, create summaries, pivot tables, SQL queries, or any form of structured data exploration. Supports multi-sheet Excel workbooks, aggregation, filtering, joins, and exporting results to CSV/JSON/Markdown. |
Data Analysis Skill
Overview
This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.
Core Capabilities
- Inspect Excel/CSV file structure (sheets, columns, types, row counts)
- Execute arbitrary SQL queries against uploaded data
- Generate statistical summaries (mean, median, stddev, percentiles, nulls)
- Support multi-sheet Excel workbooks (each sheet becomes a table)
- Export query results to CSV, JSON, or Markdown
- Handle large files efficiently with DuckDB's columnar engine
Workflow
Step 1: Understand Requirements
When a user uploads data files and requests analysis, identify:
- File location: Path(s) to uploaded Excel/CSV files under
/mnt/user-data/uploads/
- Analysis goal: What insights the user wants (summary, filtering, aggregation, comparison, etc.)
- Output format: How results should be presented (table, CSV export, JSON, etc.)
- You don't need to check the folder under
/mnt/user-data
Step 2: Inspect File Structure
First, inspect the uploaded file to understand its schema:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action inspect
This returns:
- Sheet names (for Excel) or filename (for CSV)
- Column names, data types, and non-null counts
- Row count per sheet/file
- Sample data (first 5 rows)
Step 3: Perform Analysis
Based on the schema, construct SQL queries to answer the user's questions.
Run SQL Query
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action query \
--sql "SELECT category, COUNT(*) as count, AVG(amount) as avg_amount FROM Sheet1 GROUP BY category ORDER BY count DESC"
Generate Statistical Summary
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action summary \
--table Sheet1
This returns for each numeric column: count, mean, std, min, 25%, 50%, 75%, max, null_count.
For string columns: count, unique, top value, frequency, null_count.
Export Results
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.xlsx \
--action query \
--sql "SELECT * FROM Sheet1 WHERE amount > 1000" \
--output-file /mnt/user-data/outputs/filtered-results.csv
Supported output formats (auto-detected from extension):
.csv — Comma-separated values
.json — JSON array of records
.md — Markdown table
Parameters
| Parameter | Required | Description |
|---|
--files | Yes | Space-separated paths to Excel/CSV files |
--action | Yes | One of: inspect, query, summary |
--sql | For query | SQL query to execute |
--table | For summary | Table/sheet name to summarize |
--output-file | No | Path to export results (CSV/JSON/MD) |
[!NOTE]
Do NOT read the Python file, just call it with the parameters.
Table Naming Rules
- Excel files: Each sheet becomes a table named after the sheet (e.g.,
Sheet1, Sales, Revenue)
- CSV files: Table name is the filename without extension (e.g.,
data.csv → data)
- Multiple files: All tables from all files are available in the same query context, enabling cross-file joins
- Special characters: Sheet/file names with spaces or special characters are auto-sanitized (spaces → underscores). Use double quotes for names that start with numbers or contain special characters, e.g.,
"2024_Sales"
Analysis Patterns
Basic Exploration
SELECT COUNT(*) FROM Sheet1
SELECT DISTINCT category FROM Sheet1
SELECT category, COUNT(*) as cnt FROM Sheet1 GROUP BY category ORDER BY cnt DESC
SELECT MIN(date_col), MAX(date_col) FROM Sheet1
Aggregation & Grouping
SELECT category, DATE_TRUNC('month', order_date) as month,
SUM(revenue) as total_revenue
FROM Sales
GROUP BY category, month
ORDER BY month, total_revenue DESC
SELECT customer_name, SUM(amount) as total_spend
FROM Orders GROUP BY customer_name
ORDER BY total_spend DESC LIMIT 10
Cross-file Joins
SELECT s.order_id, s.amount, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
WHERE s.amount > 500
Window Functions
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) as running_total,
RANK() OVER (ORDER BY amount DESC) as amount_rank
FROM Sales
Pivot-style Analysis
SELECT category,
SUM(CASE WHEN MONTH(date) = 1 THEN revenue END) as Jan,
SUM(CASE WHEN MONTH(date) = 2 THEN revenue END) as Feb,
SUM(CASE WHEN MONTH(date) = 3 THEN revenue END) as Mar
FROM Sales
GROUP BY category
Complete Example
User uploads sales_2024.xlsx (with sheets: Orders, Products, Customers) and asks: "Analyze my sales data — show top products by revenue and monthly trends."
Step 1: Inspect the file
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action inspect
Step 2: Top products by revenue
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action query \
--sql "SELECT p.product_name, SUM(o.quantity * o.unit_price) as total_revenue, SUM(o.quantity) as total_units FROM Orders o JOIN Products p ON o.product_id = p.id GROUP BY p.product_name ORDER BY total_revenue DESC LIMIT 10"
Step 3: Monthly revenue trends
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action query \
--sql "SELECT DATE_TRUNC('month', order_date) as month, SUM(quantity * unit_price) as revenue FROM Orders GROUP BY month ORDER BY month" \
--output-file /mnt/user-data/outputs/monthly-trends.csv
Step 4: Statistical summary
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales_2024.xlsx \
--action summary \
--table Orders
Present results to the user with clear explanations of findings, trends, and actionable insights.
Multi-file Example
User uploads orders.csv and customers.xlsx and asks: "Which region has the highest average order value?"
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/orders.csv /mnt/user-data/uploads/customers.xlsx \
--action query \
--sql "SELECT c.region, AVG(o.amount) as avg_order_value, COUNT(*) as order_count FROM orders o JOIN Customers c ON o.customer_id = c.id GROUP BY c.region ORDER BY avg_order_value DESC"
Output Handling
After analysis:
- Present query results directly in conversation as formatted tables
- For large results, export to file and share via
present_files tool
- Always explain findings in plain language with key takeaways
- Suggest follow-up analyses when patterns are interesting
- Offer to export results if the user wants to keep them
Caching
The script automatically caches loaded data to avoid re-parsing files on every call:
- On first load, files are parsed and stored in a persistent DuckDB database under
/mnt/user-data/workspace/.data-analysis-cache/
- The cache key is a SHA256 hash of all input file contents — if files change, a new cache is created
- Subsequent calls with the same files will use the cached database directly (near-instant startup)
- Cache is transparent — no extra parameters needed
This is especially useful when running multiple queries against the same data files (inspect → query → summary).
Enhanced Capabilities (v2.0)
The following enhanced capabilities are available in addition to the core features above:
Advanced Analytics
1. Time Series Forecasting
Predict future values based on historical data:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action forecast \
--column revenue \
--periods 12
Parameters:
--column: Numeric column to forecast
--periods: Number of periods to predict (e.g., 12 for next 12 months)
--frequency: Time frequency (daily/weekly/monthly/quarterly, default: infer from data)
Returns: Forecast values with confidence intervals
2. Anomaly Detection
Automatically detect outliers and anomalies:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/transactions.csv \
--action anomaly \
--column amount \
--method iqr
Parameters:
--column: Column to analyze
--method: Detection method (iqr/zscore/isolation_forest)
--threshold: Sensitivity threshold (default: auto)
Returns: Rows identified as anomalies with scores
3. Correlation Analysis
Find relationships between variables:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.csv \
--action correlate
Returns: Correlation matrix for all numeric columns, heatmap data
4. Trend Analysis
Identify trends, seasonality, and patterns:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/timeseries.csv \
--action trend \
--column value \
--period 12
Parameters:
--column: Numeric column to analyze
--period: Period for seasonality detection (e.g., 12 for monthly data)
Returns: Trend direction, seasonality strength, pattern components
5. Clustering Analysis
Group similar data points:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/customers.csv \
--action cluster \
--columns age,income,spending_score \
--k 4
Parameters:
--columns: Comma-separated columns to use for clustering
--k: Number of clusters (default: auto-detect using elbow method)
Returns: Cluster assignments, cluster centers, silhouette score
6. Data Quality Assessment
Comprehensive data quality report:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/data.csv \
--action quality
Returns:
- Missing value analysis
- Duplicate detection
- Data type validation
- Consistency checks
- Quality score (0-100)
Usage Guidelines
When to Use Basic Analysis
- Simple aggregations and summaries
- Filtering and grouping
- Cross-file joins
- Ad-hoc queries
When to Use Advanced Analysis
- Forecasting: "predict next quarter sales", "forecast demand"
- Anomaly Detection: "find unusual transactions", "detect fraud"
- Correlation: "relationship between variables", "what affects X"
- Trend Analysis: "growth pattern", "seasonality"
- Clustering: "customer segmentation", "group similar items"
- Quality: "data health check", "cleanliness report"
Combining Basic + Advanced
For comprehensive analysis, combine both:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action inspect
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action quality
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action query \
--sql "SELECT region, SUM(revenue) as total FROM sales GROUP BY region"
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action forecast \
--column revenue \
--periods 6
Output Formats
Advanced analysis outputs can be:
- Formatted tables (default)
- CSV export with
--output-file
- JSON for programmatic use
- Markdown for reports
Example with export:
python /mnt/skills/public/data-analysis/scripts/analyze.py \
--files /mnt/user-data/uploads/sales.csv \
--action forecast \
--column revenue \
--periods 12 \
--output-file /mnt/user-data/outputs/forecast-results.csv
Notes
- Advanced analysis requires sufficient data points (minimum 30 for forecasting, 50+ for clustering)
- Forecasting works best with regular time intervals
- Anomaly detection sensitivity can be adjusted with
--threshold
- Clustering automatically suggests optimal k if not specified
- All advanced methods include confidence scores where applicable
Notes
- DuckDB supports full SQL including window functions, CTEs, subqueries, and advanced aggregations
- Excel date columns are automatically parsed; use DuckDB date functions (
DATE_TRUNC, EXTRACT, etc.)
- For very large files (100MB+), DuckDB handles them efficiently without loading everything into memory
- Column names with spaces are accessible using double quotes:
"Column Name"