| name | data-analyst |
| description | 数据分析专业指南。当用户提到"分析"、"看看"、"查看"、"统计"、"计算"、"图表"、"报表"、"数据"、"趋势"、"对比"、"汇总"、"平均"、"占比"、"增长"等任何数据相关词汇时必须加载。包含统计方法选择、加权平均计算、数据质量检查、业务洞察提取等最佳实践。适用于 Excel、CSV 等所有数据源的专业分析。 |
# Professional Data Analyst Guidelines
##############################################################################
🚨🚨🚨 MANDATORY RULES - 必须严格遵循 🚨🚨🚨
##############################################################################
你已加载此 skill,现在必须像法律一样严格遵循以下所有规则!
⛔ 忽略这些规则将导致统计错误(如用错平均方法、忽略数据权重等)
✅ 严格遵循这些规则才能得出正确的分析结论
##############################################################################
⚠️ CRITICAL: 必须加载此 Skill 的场景
当用户提到以下任何词汇时,你必须先加载此 skill:
- 分析、查看、看看、统计、计算、汇总
- 平均、均值、占比、增长、趋势、变化
- 对比、比较、相关、影响因素
- 图表、报表、可视化
- 任何涉及数据处理、数值计算的任务
IMPORTANT: This skill provides data analysis best practices and guidelines. Use it alongside excel-python for actual data processing.
##############################################################################
🔴 STOP! 在执行任何计算之前,先检查以下关键点:
##############################################################################
- 分母是否相同? 不同分母的百分比不能简单平均!
- 权重是否相等? 不同规模的数据需要加权平均!
- 数据类型是什么? 百分比、绝对值、时间序列各有不同处理方法!
##############################################################################
Core Principles
When users ask to "analyze" (分析) data or request data insights, follow these professional standards:
1. Data Type Recognition
Identify Data Types Before Analysis
| Data Type | Description | Analysis Approach |
|---|
| Percentage/Rate | Ratios like completion rate, pass rate | Check if denominators are the same |
| Absolute Value | Amounts, quantities, revenue | Check for scale differences |
| Time Series | Annual, quarterly, monthly data | Use trend analysis |
| Categorical | Regions, product lines, departments | Use grouping/segmentation |
Chinese Unit Conversions
- 万 (wan) = 10,000
- 亿 (yi) = 100,000,000
- Always convert to consistent units for analysis
2. Statistical Method Selection
When to Use Mean (Average)
✅ Use Simple Average:
- Same denominator percentages (e.g., consecutive quarter growth rates)
- Same population scores (e.g., same employee's quarterly performance)
- Directly comparable values with same scale
❌ DO NOT Use Simple Average:
- Different denominator percentages (e.g., different class pass rates)
- Different scale absolute values (e.g., different department sizes)
Weighted Average Formula
def weighted_average(values, weights):
"""Calculate weighted average"""
if sum(weights) == 0:
return None
return sum(v * w for v, w in zip(values, weights)) / sum(weights)
pass_rates = [0.80, 0.90, 0.75]
student_counts = [50, 20, 40]
weighted_rate = weighted_average(pass_rates, student_counts)
print(f"全校及格率: {weighted_rate*100:.2f}%")
Statistical Measure Selection
| Measure | Best For | Not Suitable For |
|---|
| Mean | Normal distribution, no outliers | Skewed data, extreme values |
| Median | Outliers present, skewed | Calculating totals |
| Mode | Categorical data, most common | Numerical calculations |
| Std Dev | Measuring dispersion | Different scale comparison |
3. Business Analysis Dimensions
Trend Analysis
YoY Growth (同比):
增长率 = (本期 - 同期) / 同期 × 100%
- Compare current period to same period last year
- Used for year-over-year performance tracking
MoM Growth (环比):
增长率 = (本期 - 上期) / 上期 × 100%
- Compare current month/quarter to previous month/quarter
- Used for short-term trend tracking
Comparative Analysis
- By Product Line: Compare performance across products
- By Region: Compare regional performance
- By Time Period: Compare Q1 vs Q2 vs Q3 vs Q4
- By Customer Segment: Compare different customer groups
4. Data Quality Checks
Before analysis, ALWAYS check for:
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
print("数据形状:", df.shape)
print("数据类型:\n", df.dtypes)
print("前5行:\n", df.head())
print("\n空值统计:\n", df.isnull().sum())
print("\n数值列统计:\n", df.describe())
for col in df.select_dtypes(include=[np.number]).columns:
if (df[col] < 0).any():
print(f"警告: {col} 包含负数")
Data Quality Checklist
5. Percentage/Rate Handling Rules
CRITICAL: The Percentage Trap
WRONG Example:
班级A: 80% 及格率 (40/50 = 80%)
班级B: 90% 及格率 (18/20 = 90%)
简单平均: (80% + 90%) / 2 = 85% ← WRONG!
CORRECT Example:
实际及格率 = (40 + 18) / (50 + 20) = 58/70 ≈ 82.86% ← CORRECT!
When to Use Which Method
Same Denominator → Simple Average OK:
Q1增长率: 10%
Q2增长率: 15%
Q3增长率: 12%
平均增长率: (10% + 15% + 12%) / 3 = 12.33% ← OK
Different Denominators → Must Use Weighted Average:
班级A: 80% 及格率 (50人)
班级B: 90% 及格率 (20人)
班级C: 75% 及格率 (40人)
全校及格率: (50×0.8 + 20×0.9 + 40×0.75) / (50+20+40) ≈ 80% ← CORRECT
6. Business Insight Framework
Four-Layer Output Structure
Layer 1 - Data Presentation:
- Display tables/charts with raw data
- Show key metrics clearly
Layer 2 - Basic Conclusions:
- Rising/Falling/Stable trend
- High/Low performance
- Anomalies detected
Layer 3 - Business Insights:
- Root cause analysis
- Impact assessment
- Pattern recognition
Layer 4 - Actionable Recommendations:
- Specific next steps
- Risk mitigation strategies
- Opportunity identification
Example Output Format
📊 销售数据分析报告
【数据概览】
2024年总营收: 5000万元
同比增长: +15.2%
各季度分布: Q1=1200万, Q2=1100万, Q3=1300万, Q4=1400万
【基本结论】
✅ 整体呈上升趋势
✅ Q4达到全年峰值
✅ 同比增长显著
【业务洞察】
🔍 Q4增长主要受节假日促销带动
🔍 Q2受市场影响略有下滑
🔍 线上渠道占比从40%提升至55%
【行动建议】
1. 加强Q1-Q3淡季营销力度
2. 继续扩大线上渠道投入
3. 提前备货应对Q4旺季
7. Common Analysis Templates
Excel Data Analysis Template
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
print("=== 数据概览 ===")
print(df.info())
print("\n=== 基本统计 ===")
print(df.describe())
if '日期' in df.columns:
df['日期'] = pd.to_datetime(df['日期'])
df.sort_values('日期', inplace=True)
df['环比增速'] = df['金额'].pct_change() * 100
print("\n=== 环比增速 ===")
print(df[['日期', '金额', '环比增速']])
if '产品线' in df.columns:
grouped = df.groupby('产品线')['金额'].agg(['sum', 'count', 'mean'])
print("\n=== 按产品线汇总 ===")
print(grouped)
Percentage Analysis Template
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
if '及格人数' in df.columns and '总人数' in df.columns:
df['实际及格率'] = df['及格人数'] / df['总人数'] * 100
print("=== 各班及格率 ===")
print(df[['班级', '及格人数', '总人数', '实际及格率']])
total_pass = df['及格人数'].sum()
total_students = df['总人数'].sum()
overall_rate = total_pass / total_students * 100
print(f"\n全校及格率: {overall_rate:.2f}%")
Trend Analysis Template
import pandas as pd
df = pd.read_excel('/path/to/file.xlsx')
df['日期'] = pd.to_datetime(df['日期'])
df.sort_values('日期', inplace=True)
df['同比增速'] = df.groupby(df['日期'].dt.month)['金额'].pct_change(periods=12) * 100
df['环比增速'] = df['金额'].pct_change() * 100
print("=== 趋势分析 ===")
print(df[['日期', '金额', '环比增速', '同比增速']])
latest_growth = df['环比增速'].iloc[-1]
if latest_growth > 10:
print("\n📈 近期增长强劲")
elif latest_growth > 0:
print("\n📊 稳步增长")
elif latest_growth > -10:
print("\n📉 小幅下滑")
else:
print("\n⚠️ 需关注下滑趋势")
8. Self-Check Before Output
Before presenting analysis results, ask yourself:
9. Common Pitfalls to Avoid
Wrong Analysis Examples
❌ WRONG: Averaging different denominator percentages
"各班及格率平均为 85%"
Why wrong: Different classes have different student counts.
✅ CORRECT:
"全校及格率为 82.86%(计算方式:总及格人数/总人数)"
❌ WRONG: Not explaining methodology
"平均增长率为 15%"
Why wrong: User doesn't know what was averaged.
✅ CORRECT:
"Q1-Q4平均环比增长率为 15%(基于4个季度的环比增速计算)"
❌ WRONG: No context
"销售额下降"
Why wrong: No timeframe or magnitude.
✅ CORRECT:
"2024年Q2销售额下降8.5%,环比Q1的5000万降至4575万"
10. Output Style Guidelines
Professional Communication
- Data-Driven: State facts first, then analysis
- Layered: Present → Analyze → Insight → Action
- Clear: Use tables, bold text, and structure
- Actionable: Provide concrete next steps
- Contextual: Explain methodology when needed
Chinese Terminology
| Term | Usage |
|---|
| 同比 | Compare to same period last year |
| 环比 | Compare to previous period |
| 营收 | Revenue |
| 毛利 | Gross profit |
| 净利 | Net profit |
| 及格率 | Pass rate |
| 完成率 | Completion rate |
Formatting for Readability
# 📊 销售数据分析报告
## 数据概览
| 指标 | 数值 | 同比 |
|------|------|------|
| 总营收 | 5000万 | +15.2% |
## 主要结论
✅ 整体呈上升趋势
⚠️ Q2略有下滑
## 业务洞察
1. Q4受节假日带动增长显著
2. 线上渠道持续扩张
## 行动建议
1. 加强淡季营销
2. 继续扩大线上投入
11. Analysis Workflow
When analyzing data, follow this sequence:
- Read Data - Load and inspect structure
- Quality Check - Identify nulls, outliers, issues
- Select Methods - Choose appropriate statistical approach
- Calculate - Perform analysis with Python
- Verify - Check results for reasonableness
- Interpret - Explain business meaning
- Present - Structure output with insights and recommendations
文件操作安全规则
分析时的文件处理:
- 只读分析:直接读取原文件(
data_only=True),不做修改
- 需要修改时:先复制原文件 → 在副本上修改 → 原文件保持不变
- 保存分析结果:保存到新文件(如
原文件名_分析结果.xlsx),不覆盖原文件
wb = openpyxl.load_workbook('/path/to/原文件.xlsx', data_only=True)
import shutil
shutil.copy('/path/to/原文件.xlsx', '/path/to/原文件_分析结果.xlsx')
wb = openpyxl.load_workbook('/path/to/原文件_分析结果.xlsx')
wb.save('/path/to/原文件_分析结果.xlsx')
汇报时说明文件位置:
12. Advanced Analysis Topics
Outlier Detection
import pandas as pd
import numpy as np
df = pd.read_excel('/path/to/file.xlsx')
Q1 = df['金额'].quantile(0.25)
Q3 = df['金额'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['金额'] < lower_bound) | (df['金额'] > upper_bound)]
print("异常值:", outliers)
Correlation Analysis
correlation_matrix = df.corr()
print("相关性分析:\n", correlation_matrix)
high_corr = correlation_matrix[correlation_matrix.abs() > 0.7]
print("高相关性 (>0.7):\n", high_corr)
Summary Checklist
Before finalizing any data analysis:
✅ Identified data types correctly
✅ Used appropriate statistical methods
✅ Handled percentages with correct weighting
✅ Checked data quality
✅ Provided clear methodology
✅ Structured output (Data → Conclusion → Insight → Action)
✅ Used professional Chinese terminology
✅ Included actionable recommendations
✅ Verified results are reasonable
✅ 原文件保持不变(如需修改,先复制再改)
✅ 告诉用户结果文件在哪里
Runtime
No additional dependencies needed. Use with:
pandas for data analysis
numpy for statistical calculations
- Excel files via
openpyxl (excel-python skill)