| name | sn-da-image-caption |
| description | 图片理解与数据提取 skill。当图片文件(.png/.jpg/.jpeg/.gif/.webp/.bmp)是主要输入且用户需要理解、提取数据或分析图片内容时使用。提供预配置的 caption 脚本(scripts/caption.py),通过 vision 模型将图片转为文本描述,无需额外配置 API Key。覆盖:(1) 通过 scripts/caption.py 对图表/表格/截图/流程图进行 caption,(2) 将 caption 文本解析为结构化 DataFrame,(3) 基于提取数据重新生成可视化图表,(4) 导出为 Excel/CSV。**遇到以下任一情况就主动使用本 skill,不要自行猜测图片内容**:①用户出现触发词:图片分析 / 图表提取 / 表格识别 / OCR / 图片描述 / 截图分析 / 图表数据 / 提取图片中的数据 / 图片转表格 / 识别图片 / image caption / extract data from image / chart analysis / table OCR;②用户上传或指定了图片文件(.png / .jpg / .jpeg / .gif / .webp / .bmp)并要求理解、提取数据或分析内容;③任务需要从图表截图、表格截图、UI 截图、流程图中提取结构化信息;④用户要求将图片中的数据转为 Excel/CSV 或重新生成可视化图表。仅不用于:图片编辑(裁剪、滤镜、缩放)、图片生成、不含数据的风景/人物照片描述。 |
Image Caption Analysis — 图片描述与数据提取
Overview
Analyze, extract data from, or understand image files (.png, .jpg, .jpeg, .gif, .webp, .bmp). The core workflow:
- Run
scripts/caption.py to get a text description of the image
- Parse the description into structured data (DataFrame, etc.)
- Analyze, visualize, or export
scripts/caption.py — Image Caption
The script converts images to text descriptions via a vision model. Configure via SN_API_KEY (minimum required), or use SN_VISION_API_KEY / SN_VISION_BASE_URL / SN_VISION_MODEL for fine-grained control. See the project environment variable spec for the full fallback chain.
Usage
python3 scripts/caption.py /mnt/data/image.png
python3 scripts/caption.py /mnt/data/chart.png --prompt "提取所有数值,Markdown 表格格式"
python3 scripts/caption.py /mnt/data/image.png --json
python3 scripts/caption.py /mnt/data/images/ --batch --output /mnt/data/captions.json
python3 scripts/caption.py /mnt/data/image.png --model gemini-3.1-flash-lite-preview
Options
| Option | Description |
|---|
--prompt, -p | Custom prompt (overrides auto-detection) |
--model, -m | Vision model (default: sensenova-6.7-flash-lite) |
--json | Output structured JSON instead of plain text |
--batch | Process all images in a directory |
--output, -o | Output file for batch results |
--no-cache | Skip MD5 cache |
What it does automatically
- Type detection: Detects image type from filename (chart/table/UI/diagram/general) and picks the best prompt
- Compression: Images >5MB or >2048px are compressed before sending
- Caching: Same image + same prompt → instant cached result, no API cost
- Error handling: Retries on failure, returns error message on permanent failure
JSON output format
{
"file": "/mnt/data/image.png",
"type": "chart",
"description": "这是一张柱状图...",
"usage": {"prompt_tokens": 1100, "completion_tokens": 400},
"cached": false
}
Calling from Python
import subprocess, json
CAPTION = "/path/to/skills/sn-da-image-caption/scripts/caption.py"
result = subprocess.run(
["python3", CAPTION, "/mnt/data/chart.png", "--json",
"--prompt", "提取图表数据,Markdown 表格输出"],
capture_output=True, text=True, timeout=60
)
data = json.loads(result.stdout)
description = data["description"]
result = subprocess.run(
["python3", CAPTION, "/mnt/data/images/", "--batch",
"--output", "/mnt/data/captions.json"],
capture_output=True, text=True, timeout=300
)
with open("/mnt/data/captions.json") as f:
all_captions = json.load(f)
Prompt Strategy
Different image types need different prompts. The script auto-detects, but specifying --prompt gives better results.
| Image Type | When | Recommended --prompt |
|---|
| Data chart | 柱状图/折线图/饼图 | "提取图表标题、坐标轴、每个数据点数值、图例。Markdown 表格输出。" |
| Table screenshot | 表格截图 | "提取表格所有内容,Markdown 表格格式,保持行列结构,数值不四舍五入。" |
| UI screenshot | 界面截图 | "以前端开发者视角描述:布局、组件、文字、颜色。" |
| Diagram | 流程图/架构图 | "描述所有节点、连接关系(A→B)、分支条件。" |
| General | 照片、其他 | 不传 --prompt,用默认 |
Parsing Caption Results
Caption 通常返回 Markdown 表格,解析为 DataFrame:
import pandas as pd
def parse_markdown_table(text):
lines = text.strip().split('\n')
table_lines = []
in_table = False
for line in lines:
stripped = line.strip()
if '|' in stripped:
in_table = True
table_lines.append(stripped)
elif in_table:
break
data_lines = []
for l in table_lines:
cells = [c.strip() for c in l.split('|') if c.strip()]
if cells and not all(set(c) <= set('-: ') for c in cells):
data_lines.append(cells)
if len(data_lines) < 2:
return None
header = data_lines[0]
rows = [r for r in data_lines[1:] if len(r) == len(header)]
df = pd.DataFrame(rows, columns=header)
for col in df.columns:
try:
cleaned = df[col].str.replace(',', '').str.strip()
if cleaned.str.endswith('%').any():
df[col] = pd.to_numeric(cleaned.str.rstrip('%'), errors='coerce')
else:
converted = pd.to_numeric(cleaned, errors='coerce')
if converted.notna().sum() > len(df) * 0.5:
df[col] = converted
except Exception:
pass
return df
Visualization
Chinese Font Setup (MANDATORY)
import matplotlib.pyplot as plt
import matplotlib
import os
font_path = '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc'
if os.path.exists(font_path):
matplotlib.rcParams['font.family'] = 'WenQuanYi Zen Hei'
matplotlib.rcParams['axes.unicode_minus'] = False
Color Palette
COLORS = ['#4C72B0', '#55A868', '#C44E52', '#8172B2', '#CCB974', '#64B5CD']
Save & Display
plt.savefig('/mnt/data/chart.png', dpi=150, bbox_inches='tight')
plt.show()
print("")
Export to Excel
from openpyxl.styles import Font, PatternFill, Alignment
output_path = "/mnt/data/result.xlsx"
with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
df.to_excel(writer, index=False, sheet_name='提取数据')
ws = writer.sheets['提取数据']
fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
for cell in ws[1]:
cell.font = Font(bold=True, color='FFFFFF')
cell.fill = fill
cell.alignment = Alignment(horizontal='center')
for i, col in enumerate(df.columns, 1):
w = max(df[col].astype(str).str.len().max(), len(str(col))) + 2
ws.column_dimensions[chr(64 + i)].width = min(w * 1.2, 40)
print(f"[下载](sandbox:{output_path})")
Multi-Image Processing
import glob
image_files = sorted(glob.glob("/mnt/data/*.png"))
all_dfs = []
for img in image_files:
r = subprocess.run(
["python3", CAPTION, img, "--json", "--prompt", "提取数据,Markdown 表格"],
capture_output=True, text=True, timeout=60
)
desc = json.loads(r.stdout)["description"]
df = parse_markdown_table(desc)
if df is not None:
all_dfs.append(df)
combined = pd.concat(all_dfs, ignore_index=True) if all_dfs else None
Or batch mode:
python3 scripts/caption.py /mnt/data/images/ --batch --output /mnt/data/captions.json
Common Pitfalls
- Always caption first — don't guess image content from filenames
- Use --prompt for precision — auto-detect is OK, explicit prompt is better
- Verify extracted data — check sums, percentages, row counts after parsing
- Large tables truncate — caption in two passes:
"提取前半部分" + "提取后半部分"
- Chinese font — must set before any matplotlib call, or output is garbled
- Timeout — single image ~10-30s, batch set timeout accordingly