| name | sn-da-excel-workflow |
| description | Excel 数据分析多步编排器。覆盖:(1) 读取多 Sheet Excel 文件并统计行数,(2) 大文件检测(≥10k 行自动 Parquet 优化),(3) 数据清洗(缺失值、文本标准化、无效字符),(4) 条件筛选与分类提取,(5) 跨 Sheet 统计聚合,(6) 导出 Excel/CSV 并提供下载链接。覆盖从数据读取到报告生成全流程,按步骤编排 capability 子 skill。**遇到以下任一情况就主动使用本 skill,不要自行写几行 pandas 就回答**:①用户出现触发词:Excel 分析 / 表格分析 / 数据分析 / 数据清洗 / 数据统计 / 数据筛选 / 数据可视化 / 数据导出 / 汇总统计 / 透视表 / 分组统计 / 交叉分析 / 趋势分析 / 对比分析 / 异常值检测 / 去重 / 缺失值处理 / Excel 报告 / 生成报表 / analyze Excel / data analysis / data cleaning / pivot table;②用户上传或指定了 .xlsx / .xls / .csv 文件并要求分析、清洗、统计或可视化;③任务涉及多 Sheet 读取、条件筛选、分类汇总、图表生成中的任意一项;④用户要求导出带格式的 Excel 报告或下载链接。仅不用于:不涉及表格数据的纯文本处理、图片分析(使用 sn-da-image-caption)、单个公式计算的简单问答。 |
Excel Data Analysis Workflow
End-to-end workflow for structured Excel analysis. Each step maps to a
capability sub-skill that can be loaded for detailed patterns.
Workflow
Step 1 — Count rows across all sheets (lightweight, no full load)
Count rows per sheet without loading data into memory. Use openpyxl
read_only mode — this works for any file size.
import openpyxl, gc
wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
total_rows = 0
sheet_info = {}
for name in wb.sheetnames:
ws = wb[name]
row_count = sum(1 for _ in ws.iter_rows(min_row=2, values_only=True))
total_rows += row_count
sheet_info[name] = row_count
print(f"Sheet '{name}': {row_count} rows")
wb.close()
print(f"总行数={total_rows}")
⚠️ Do NOT use pd.read_excel() to count rows — it loads all data into
memory, which will OOM on large files.
→ capability: excel-reading/multi-sheet-reading
Step 2 — Large file gate (CRITICAL — choose strategy by row count)
| total_rows | Strategy | What to do |
|---|
| < 10k | Direct read | df = pd.read_excel(file_path, sheet_name=target_sheet) |
| 10k – 100k | Parquet cache | pd.read_excel() once → df.to_parquet() → all later reads from Parquet |
| >= 100k | STOP. Load sn-da-large-file-analysis skill | Read its SKILL.md, then follow its streaming read + Parquet pattern. Do NOT use pd.read_excel() at all — it will OOM or timeout on 100k+ rows. |
For >= 100k rows:
read_file(path="<skills_base>/sn-da-large-file-analysis/SKILL.md")
Then use stream_excel_to_parquet() from that skill — it reads via
openpyxl iter_rows in 50k-row chunks with constant memory.
For 10k – 100k rows (only):
import pandas as pd
parquet_path = "/tmp/_auto_parquet.parquet"
df = pd.read_excel(file_path, sheet_name=target_sheet)
df.to_parquet(parquet_path, engine="pyarrow")
del df; gc.collect()
df = pd.read_parquet(parquet_path)
→ capability: excel-reading/large-excel-reading
Step 3 — Inspect schema & data types
Preview target sheet structure. For large files (>= 10k rows), only read
a small sample — never full load just to inspect.
df_head = pd.read_excel(file_path, sheet_name=target_sheet, nrows=20)
print(f"Columns: {df_head.columns.tolist()}")
print(f"Dtypes:\n{df_head.dtypes}")
print(df_head.head(10))
→ capability: excel-reading/range-reading
Step 4 — Data cleaning
Handle missing values, normalize text, clean invalid characters.
null_count = df[col].isna().sum()
import re
def clean_text(val):
if pd.isna(val): return val
return "".join(re.findall(r"[\u4e00-\u9fff]", str(val))) or ""
df[col] = df[col].apply(clean_text)
⚠️ Large file rule: When total_rows >= 100k, do NOT use df.apply(lambda...).
Use vectorized operations or np.where() instead. See sn-da-large-file-analysis skill
for the vectorized cheat sheet.
→ capabilities:
excel-data-cleaning/missing-value-handling
excel-data-cleaning/invalid-data-cleaning
excel-data-cleaning/text-normalization
Step 5 — Filter & extract
Apply condition or category filters, aggregate results.
mask = df[col].astype(str).str.strip() == target_value
filtered = df[mask]
df_raw = pd.read_excel(file_path, sheet_name=sheet, header=None)
→ capabilities:
excel-data-filtering/condition-filtering
excel-data-filtering/category-filtering
excel-data-filtering/threshold-filtering
Step 6 — Export results
Save filtered/cleaned data as Excel or CSV. Provide download link.
output_path = "/mnt/data/result.xlsx"
result_df.to_excel(output_path, index=False)
print(f"[Download](sandbox:{output_path})")
→ capabilities:
excel-result-export/single-sheet-export
excel-result-export/formatted-export
Key rules
- Always count rows first — gate large-file logic on the 10k threshold.
- >= 100k rows → MUST load
sn-da-large-file-analysis skill — do not attempt to handle with pd.read_excel().
- Column names may contain spaces (e.g.
'是否通 过') — use exact string indexing.
- Headerless sheets — use
header=None and positional indexing.
- Prohibited on large files (>= 100k rows):
pd.read_excel() for full load (use streaming read → Parquet)
df.apply(lambda...) or df.iterrows() (use vectorized ops or itertuples())
fc-list, find ... fonts, subprocess to search fonts, or pip install (use fixed font paths below)
- Printing all unique values or full DataFrames (use
.head(), .value_counts().head())
CJK Font Setup (mandatory for charts)
When generating charts with matplotlib, copy this block as-is. Do NOT search for fonts.
import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
_FONT_PATHS = [
'/mnt/afs_agents/SimHei.ttf',
'/mnt/afs_agents/mnt/data/SimHei.ttf',
os.path.expanduser('~/.fonts/SimHei.ttf'),
'/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
'/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
if os.path.exists(_p):
fm.fontManager.addfont(_p)
matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
break
matplotlib.rcParams['axes.unicode_minus'] = False
How to load sub-skills
Each workflow step references one or more capability sub-skills. When you
need the detailed code pattern for a step, load the sub-skill on demand:
read_file(path="<base_path>/<category>/<sub-skill-name>/SKILL.md")
Rules:
- Only load the sub-skill(s) needed for your current step.
- Do NOT load all sub-skills at once — it wastes context.
- The top-level workflow (this file) is your guide; sub-skills provide
detailed implementation patterns.
Available capability sub-skills
Base path: <skills_root>/sn-da-excel-workflow/capability/{category}/{sub-skill}/SKILL.md
excel-reading — 读取与解析
| Sub-skill | 功能 |
|---|
| single-sheet-reading | 读取单个工作表,支持合并单元格处理、交叉分析及多维度可视化 |
| multi-sheet-reading | 读取多工作表,动态评估数据量启用Parquet优化,支持正则清洗、分类汇总与线性拟合 |
| range-reading | 特定区域数据提取,根据数据量动态选择处理策略 |
| large-excel-reading | 大型Excel文件处理,支持Parquet转换提速,生成带条件高亮的格式化报告 |
| multi-file-reading | 多文件读取与统计,支持大文件Parquet转换与可视化报告 |
| specific-sheet-reading | 跨Sheet特定字段统计、数据清洗与交叉分析,生成带下载链接的汇总报告 |
| structured-header-reading | 动态识别目标列进行统计,正则清洗文本字段提取中文字符 |
excel-data-cleaning — 数据清洗
| Sub-skill | 功能 |
|---|
| missing-value-handling | 多Sheet智能清洗、跨表核对与可视化分析 |
| duplicate-removal | 多Sheet去重统计,生成摘要与明细报表 |
| invalid-data-cleaning | 正则清洗指定文本列(如保留中文字符),大文件自动Parquet加速 |
| text-normalization | 文本标准化清洗(去除异常前缀、提取纯中文字符等) |
| numeric-format-normalization | 数值格式标准化,支持关键指标合计核对与结果文件导出 |
| outlier-detection | IQR异常值检测,结合偏度/峰度分析数据分布,适用于非正态数据预处理 |
excel-data-filtering — 数据筛选
| Sub-skill | 功能 |
|---|
| condition-filtering | 根据数据规模动态选择处理策略进行条件筛选 |
| category-filtering | 自定义分类统计、交叉分析,支持文本长度/术语密度/正则匹配等综合评分与分级 |
| range-filtering | 根据多维数值条件筛选并导出,支持大规模数据自动性能优化 |
| threshold-filtering | 数值列清洗、条件过滤,使用openpyxl对符合条件的单元格进行样式标记 |
excel-data-analysis — 数据分析
| Sub-skill | 功能 |
|---|
| comparison-analysis | 两类分类数据对比分析,统计数量差异与比例关系并生成可视化 |
| group-by-analysis | 多Sheet数据清洗及分组聚合分析,生成带样式标记的统计表与图表 |
| kpi-metric-analysis | 提取关键指标进行单位一致性验证与排序分析 |
| pivot-table-analysis | 交叉表与热力图进行多维度占比分析,适用于奖项分布/绩效评估/市场占有率 |
| time-series-analysis | 时间序列趋势分析、百分比清洗、绩效分级建模与预测,生成高分辨率可视化报告 |
| trend-analysis | 多维度分级评估与趋势预测,差异化增长率计算,适用于绩效评估/目标设定 |
excel-data-statistics — 统计计算
| Sub-skill | 功能 |
|---|
| basic-statistics | 基础统计,支持按条件筛选计算均值,指定行区间提取数据去重求和 |
| category-statistics | 各类别数量与占比统计,生成柱状图/饼图等组合可视化报告 |
| grouped-statistics | 多Sheet数据合并与前向填充,分组统计 |
| percentage-calculation | 逐行扫描或列匹配提取关键指标并计算占比/均值,输出结构化报告及图表 |
excel-data-visualization — 数据可视化
| Sub-skill | 功能 |
|---|
| bar-chart-visualization | 处理合并单元格,交叉分组统计,生成支持中英文字体的美化柱状图 |
| histogram-visualization | 数值型分布分析与异常值检测,支持正则提取误差项,生成箱线图与直方图 |
| line-chart-visualization | 特征清洗与聚类分析,生成趋势对比/分布特征/参数敏感性多维度图表 |
| pie-chart-visualization | 分类汇总统计,自动识别关键字段生成包含占比/数值的美化饼图 |
| scatter-plot-visualization | 多维度统计分析与散点图可视化 |
| stacked-chart-visualization | 百分比字符串数据处理,补全缺失维度,生成堆叠柱状图展示构成变化趋势 |
excel-cell-coloring — 单元格着色
| Sub-skill | 功能 |
|---|
| category-coloring | 提取目标指标计算最大值,对特定行进行高亮标注 |
| duplicate-value-coloring | 对比多表中的特定系数并对异常值进行颜色标记 |
| outlier-coloring | 识别超限数值与错误单元格并进行高亮标注 |
| threshold-cell-coloring | 计算时间序列平均值,使用openpyxl输出带条件格式(如低于均值标绿)的报告 |
| top-value-coloring | 根据数据规模动态选择策略,多表合并、统计筛选,关键指标自动化样式高亮 |
excel-conditional-formatting — 条件格式
| Sub-skill | 功能 |
|---|
| data-bar-formatting | 从带单位字符串列提取数值并清洗,生成直方图/饼图/条形图/累积分布图 |
excel-result-export — 结果导出
| Sub-skill | 功能 |
|---|
| single-sheet-export | 多Sheet数据探查与条件过滤导出,重命名字段后生成带下载链接的Excel |
| formatted-export | 条件筛选记录并以整行标红格式导出Excel |
| chart-embedded-export | 分类分布清洗与统计,生成多维度交叉分析与高分辨率嵌入式图表报告 |
| report-generation-export | 从Excel提取多类型数据,生成包含可视化图表与下载链接的综合分析报告 |
excel-table-styling — 表格样式
| Sub-skill | 功能 |
|---|
| table-theme-styling | 大文件Parquet加速读取,条件筛选/分类汇总与结果导出 |