| name | data-analysis |
| description | Analyze, explore, clean, and visualize datasets with statistical rigor. Use when user asks to analyze data, find patterns, compute statistics, create visualizations, clean messy data, or explore a dataset. Trigger when user says things like "analyze this data", "what trends do you see", "find patterns in", "create a chart", "clean this dataset", "run statistics on", "what does this data tell us", or provides CSV/Excel/JSON data for exploration. Also trigger for A/B test analysis, cohort analysis, and data quality assessments. Do NOT trigger for simple data format conversions, database query writing without analysis, or ETL pipeline design. |
| metadata | {"version":"2.0.0","author":"Upsonic","tags":["data","analysis","statistics","visualization"]} |
Data Analysis
Explore, clean, analyze, and communicate findings from data. The goal is always to answer a question — start with what the user wants to know and work backward to the analysis that answers it.
Before You Analyze
Understand the Question
Before touching the data, clarify:
- What question are we answering? ("Is our conversion rate improving?" is an answerable question. "Analyze this data" is not — help the user sharpen it.)
- Who needs the answer? (Engineer debugging an issue? Executive making a budget decision? Researcher testing a hypothesis?)
- What decisions will this inform? (This determines how precise you need to be and what format the answer should take.)
- What's the timeline? (A quick sanity check and a thorough statistical analysis require different approaches.)
If the user says "analyze this data" without a specific question, help them formulate one:
- "What would be most useful to know from this data?"
- "Are you looking for trends over time, comparisons between groups, or something else?"
- "Is there a specific business question this should answer?"
Reference Materials and Scripts
- Execute
profile_data.py with a data file path to get a quick profile of any CSV, Excel, or JSON dataset — it reports shape, types, missing values, stats, and value distributions. Run with --help for usage.
- Load
statistical-tests-guide.md when choosing statistical tests — it has a decision matrix for test selection, effect size interpretation tables, and sample size guidelines.
Understand the Data
Before analysis, get your bearings:
- Source and context: Where did this data come from? How was it collected? What time period does it cover?
- Schema: What are the columns/fields? What do they represent? What are the data types?
- Scale: How many rows/records? What's the granularity? (Per user? Per day? Per transaction?)
- Known issues: Is the data known to be incomplete, biased, or have quality problems?
import pandas as pd
df = pd.read_csv("data.csv")
print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst rows:\n{df.head()}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")
Analysis Workflow
Step 1: Clean and Validate
Data quality determines analysis quality. Don't skip this.
Handle Missing Values
- Count them first: What percentage of each column is missing?
- Understand why: Are they random? Systematic? (e.g., optional fields vs data collection failures)
- Choose a strategy and document it:
- Drop rows: When missing data is rare and random (less than 5%)
- Impute with median/mode: When missing data is moderate and the distribution is known
- Flag as separate category: When missingness itself is informative
- Leave as-is: When the analysis method handles nulls natively
missing_pct = df.isnull().sum() / len(df) * 100
print("Missing data percentage per column:")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
Handle Outliers
- Detect: Use IQR method, z-scores, or domain knowledge
- Investigate: Are they errors or legitimate extreme values?
- Document your decision: Keep, cap, or remove — and explain why
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['value'] < Q1 - 1.5 * IQR) | (df['value'] > Q3 + 1.5 * IQR)]
print(f"Found {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")
Validate Data Types and Ranges
- Dates should be dates, numbers should be numbers
- Check for impossible values (negative ages, future dates, percentages over 100)
- Verify categorical values are consistent (watch for "USA", "US", "United States")
Step 2: Explore
Start broad, then focus on what's interesting.
Descriptive Statistics
Always start here — understand the basics before going deeper.
print(df.describe())
for col in df.select_dtypes(include='object').columns:
print(f"\n{col}: {df[col].nunique()} unique values")
print(df[col].value_counts().head(10))
Distributions
Understanding shape matters for choosing the right tests.
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for i, col in enumerate(['metric_a', 'metric_b', 'metric_c']):
df[col].hist(ax=axes[i], bins=30)
axes[i].set_title(col)
axes[i].axvline(df[col].median(), color='red', linestyle='--', label='median')
axes[i].legend()
plt.tight_layout()
plt.savefig("distributions.png")
Correlations and Relationships
Look for patterns between variables.
corr = df.select_dtypes(include='number').corr()
print("Strong correlations (|r| > 0.5):")
for i in range(len(corr.columns)):
for j in range(i+1, len(corr.columns)):
if abs(corr.iloc[i, j]) > 0.5:
print(f" {corr.columns[i]} vs {corr.columns[j]}: {corr.iloc[i,j]:.3f}")
Trends Over Time
If the data has a time dimension, always look at trends.
df['date'] = pd.to_datetime(df['date'])
daily = df.groupby('date')['metric'].agg(['mean', 'count'])
daily['mean'].plot(figsize=(12, 4), title='Daily Average')
plt.savefig("trend.png")
Step 3: Analyze
Choose the right method for the question.
Comparison Questions ("Is A different from B?")
Use the right statistical test:
- Two groups, continuous outcome: t-test (if normal) or Mann-Whitney U (if not)
- Multiple groups: ANOVA (if normal) or Kruskal-Wallis (if not)
- Two groups, categorical outcome: Chi-squared test
- Before/after with same subjects: Paired t-test or Wilcoxon signed-rank
Always report:
- Sample sizes for each group
- Effect size (not just p-value) — Cohen's d, odds ratio, or percentage difference
- Confidence intervals
- Practical significance, not just statistical significance
from scipy import stats
group_a = df[df['variant'] == 'A']['metric']
group_b = df[df['variant'] == 'B']['metric']
_, p_normal_a = stats.shapiro(group_a.sample(min(5000, len(group_a))))
_, p_normal_b = stats.shapiro(group_b.sample(min(5000, len(group_b))))
if p_normal_a > 0.05 and p_normal_b > 0.05:
stat, p_value = stats.ttest_ind(group_a, group_b)
test_name = "t-test"
else:
stat, p_value = stats.mannwhitneyu(group_a, group_b)
test_name = "Mann-Whitney U"
pooled_std = ((group_a.std()**2 + group_b.std()**2) / 2) ** 0.5
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std
print(f"Test: {test_name}")
print(f"Group A: mean={group_a.mean():.3f}, n={len(group_a)}")
print(f"Group B: mean={group_b.mean():.3f}, n={len(group_b)}")
print(f"Difference: {group_a.mean() - group_b.mean():.3f} ({(group_a.mean() - group_b.mean())/group_b.mean()*100:.1f}%)")
print(f"Cohen's d: {cohens_d:.3f}")
print(f"p-value: {p_value:.4f}")
Trend Questions ("Is this changing over time?")
- Use rolling averages to smooth noise
- Look for seasonality (day-of-week, monthly, quarterly patterns)
- Segment by key dimensions — overall trends can mask group-level differences
- Compare period-over-period (WoW, MoM, YoY)
Composition Questions ("What's the breakdown?")
- Use percentages and proportions
- Show both absolute numbers and percentages
- Look for the Pareto principle (80/20 rule)
- Segment into meaningful groups
A/B Test Analysis
For experiment analysis, follow this checklist:
- Sample size: Was the test adequately powered?
- Duration: Did it run long enough for weekly patterns?
- Randomization: Were groups balanced on key dimensions?
- Metric definition: Is the metric clearly defined and correctly computed?
- Statistical test: Use the appropriate test for the metric type
- Multiple comparisons: Correct for multiple testing if checking many metrics
- Practical significance: Is the effect large enough to matter?
Step 4: Communicate Findings
Structure matters as much as the analysis itself.
Lead With the Answer
Start with what the user asked. Then support it.
Pattern:
[Answer to the question in 1-2 sentences]
[Key supporting evidence: 2-4 bullet points with specific numbers]
[Methodology note: How you arrived at this, in 1-2 sentences]
[Caveats and limitations: What could affect this conclusion]
[Recommended next steps: What to do with this information]
Visualization Guidelines
Choose the right chart type:
- Trend over time: Line chart
- Comparison between categories: Bar chart (horizontal for many categories)
- Distribution: Histogram or box plot
- Relationship between two variables: Scatter plot
- Part-of-whole: Stacked bar or pie chart (only for 2-5 categories)
- Multiple dimensions: Heatmap or small multiples
For every chart:
- Title that states the finding, not just the data ("Conversion drops 40% on weekends" not "Conversion by day")
- Label axes with units
- Include sample size or time period
- Use consistent colors across related charts
What to Always Include
- Sample sizes: How much data backs each claim
- Time periods: When the data is from
- Definitions: How key metrics are calculated
- Confidence: How certain you are (confidence intervals, p-values where appropriate)
- Limitations: What the data can't tell you
Common Pitfalls
- Survivorship bias: Only analyzing data from things that succeeded/persisted
- Simpson's paradox: A trend that appears in groups but reverses when combined
- Confounding variables: Correlation doesn't mean causation — look for third variables
- Small sample sizes: Don't draw strong conclusions from small N. State the limitation.
- Cherry-picking timeframes: Make sure the time period is representative
- Precision theater: Reporting "23.847%" when the data only supports "roughly 24%"
- Missing baseline: A number without context is meaningless ("10,000 errors" — is that a lot?)
Handling Common Requests
"Just give me a quick look"
Provide: shape, column summary, missing data count, top-level descriptive stats, one key insight. Keep it to one screen of output.
"What's interesting in this data?"
Run the full exploration workflow (Step 2) and surface the 3-5 most notable patterns: unexpected distributions, strong correlations, outliers, trends.
"Is this statistically significant?"
Clarify what comparison they mean, choose the right test, report effect size and p-value, and explain what it means in practical terms.
"Create a dashboard / visualization"
Ask what decisions the dashboard supports. Design 3-5 charts that answer those questions. Use consistent styling and clear titles that state findings.