pdf-analysis
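PDF document parsing. Automatically distinguishes text-based PDFs from scanned PDFs. Covers text/table extraction, full multi-page scanning, captioning of embedded charts, and unit-aware numeric calculation.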
| name | pdf-analysis |
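| description | PDF document parsing. Automatically distinguishes text-based PDFs from scanned PDFs. Covers text/table extraction, full multi-page scanning, captioning of embedded charts, and unit-aware numeric calculation. |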
Critical first step: determine whether the PDF has extractable text or is a scanned image. Never skip this — using the wrong parser wastes time and produces empty results.
import fitz  # PyMuPDF

def detect_pdf_type(pdf_path, sample_pages=3):
    """
    Returns 'text' if PDF has extractable text, 'scanned' if image-based.
    Checks first N pages (or all if fewer).
    """
    doc = fitz.open(pdf_path)
    total_chars = 0
    pages_checked = min(sample_pages, len(doc))
    for i in range(pages_checked):
        page = doc[i]
        text = page.get_text("text")
        total_chars += len(text.strip())
    doc.close()
    # Fewer than ~50 characters per page on average is treated as image-based
    avg_chars = total_chars / max(pages_checked, 1)
    pdf_type = 'text' if avg_chars > 50 else 'scanned'
    print(f"PDF type: {pdf_type} (avg {avg_chars:.0f} chars/page, checked {pages_checked} pages)")
    return pdf_type
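The same 50-character cutoff can also be applied page by page instead of to a sample average, which helps when the first few pages are not representative of the rest. The following is a minimal sketch, assuming the per-page character counts have already been collected (for example with `len(page.get_text("text").strip())`). The helper name `classify_pages` and the `'hybrid'` label are illustrative and not part of the skill.

```python
def classify_pages(char_counts, threshold=50):
    """Label each page 'text' or 'scanned' and summarise the document.

    char_counts: stripped-text length of each page, in page order.
    threshold:   same 50-character cutoff used by detect_pdf_type().
    """
    labels = ['text' if n > threshold else 'scanned' for n in char_counts]
    kinds = set(labels)
    if not kinds:
        overall = 'scanned'   # empty document: nothing extractable
    elif len(kinds) == 1:
        overall = labels[0]   # every page agrees
    else:
        overall = 'hybrid'    # hypothetical label for mixed documents
    return overall, labels


overall, labels = classify_pages([812, 0, 12, 640])
print(overall, labels)
```

A mixed result suggests handling the file page by page, as `extract_hybrid_pdf()` does further down.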
import fitz

def extract_text_pdf(pdf_path):
    """Extract text from all pages of a text-based PDF."""
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    print(f"Total pages: {total_pages}")
    all_text = []
    for i, page in enumerate(doc):
        text = page.get_text("text").strip()
        if text:
            all_text.append(f"=== Page {i+1} ===\n{text}")
        else:
            print(f"  Page {i+1}: no text (may be image; will caption later)")
    doc.close()
    return '\n\n'.join(all_text)

# ⚠️ MUST iterate ALL pages; never stop at page 1
full_text = extract_text_pdf(pdf_path)
print(f"Total text length: {len(full_text)} chars")
For PDFs with tables, pdfplumber gives better table structure than fitz:
import pdfplumber
import pandas as pd

def extract_tables_pdf(pdf_path):
    """Extract all tables from all pages as DataFrames."""
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Total pages: {len(pdf.pages)}")
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for j, tbl in enumerate(tables):
                if not tbl:
                    continue
                # First row as header
                df = pd.DataFrame(tbl[1:], columns=tbl[0])
                # Clean: strip whitespace in string cells, drop fully empty rows.
                # Column-wise Series.map is used because DataFrame.applymap
                # is deprecated since pandas 2.1.
                df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
                df = df.dropna(how='all').reset_index(drop=True)
                all_tables.append({'page': i+1, 'table_idx': j, 'df': df})
                print(f"  Page {i+1}, Table {j}: {df.shape[0]}r × {df.shape[1]}c")
                print(df.head(3))
    return all_tables

# Verify table alignment after extraction:
# Print column headers and first 3 rows to confirm row/col mapping is correct
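The alignment check described in the comment above can be partly automated before any numbers are computed. This is a rough sketch that works on the raw list-of-rows structure that `page.extract_tables()` returns. The helper name `check_table_alignment` is hypothetical, and the two checks (blank header cells, rows whose width differs from the header) are only a starting point, not a complete validation.

```python
def check_table_alignment(tbl):
    """Return a list of problems found in a raw table (a list of row lists)."""
    if not tbl:
        return ['table is empty']
    problems = []
    header = tbl[0]
    width = len(header)
    # Blank or missing header cells often indicate merged or split columns
    for c, name in enumerate(header):
        if name is None or not str(name).strip():
            problems.append(f'header column {c} is blank')
    # Every data row should have exactly as many cells as the header
    for r, row in enumerate(tbl[1:], start=1):
        if len(row) != width:
            problems.append(f'row {r} has {len(row)} cells, expected {width}')
    return problems


# Illustrative data, not taken from a real PDF
raw = [['Item', None, 'Amount'], ['Widget', '2', '10.00'], ['Gadget', '5.00']]
for problem in check_table_alignment(raw):
    print(problem)
```

An empty result does not prove the mapping is right, so printing the header and first rows is still worthwhile.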
For scanned PDFs (image-based pages), render each page as PNG and caption:
import fitz
import subprocess, json

CAPTION = "/path/to/skills/sn-da-image-caption/scripts/caption.py"

def extract_scanned_pdf(pdf_path, prompt=None, dpi=150):
    """Render each page as image, then caption for text extraction."""
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    print(f"Scanned PDF: {total_pages} pages, captioning each...")
    all_text = []
    for i, page in enumerate(doc):
        # Render page to PNG (PDF user space is 72 points per inch)
        mat = fitz.Matrix(dpi/72, dpi/72)
        pix = page.get_pixmap(matrix=mat)
        img_path = f"/tmp/pdf_page_{i+1}.png"
        pix.save(img_path)
        # Caption the page image
        cmd = ["python3", CAPTION, img_path, "--json"]
        if prompt:
            cmd += ["--prompt", prompt]
        else:
            cmd += ["--prompt", "提取页面中所有文字和表格内容,保持原始结构,Markdown格式输出。"]
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
        if r.returncode == 0:
            desc = json.loads(r.stdout).get("description", "")
            all_text.append(f"=== Page {i+1} ===\n{desc}")
            print(f"  Page {i+1}: {len(desc)} chars extracted")
        else:
            print(f"  Page {i+1}: caption failed: {r.stderr[:100]}")
    doc.close()
    return '\n\n'.join(all_text)

# Usage for scanned invoice PDFs, bank statements, org charts, etc.
text = extract_scanned_pdf(pdf_path)
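The caption step above trusts that a zero return code always comes with valid JSON. A slightly more defensive parse might look like the sketch below. It assumes only what the code above already relies on, namely that the caption script prints a JSON object with a `description` field. The helper name `parse_caption_output` is hypothetical.

```python
import json


def parse_caption_output(stdout, key="description"):
    """Return (text, error); exactly one of the two is None."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return None, "JSON payload is not an object"
    text = payload.get(key)
    if not isinstance(text, str) or not text.strip():
        return None, f"missing or empty '{key}' field"
    return text, None


text, err = parse_caption_output('{"description": "| A | B |"}')
print(text if err is None else f"caption unusable: {err}")
```

Recording the error per page keeps one bad page from aborting a long scan.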
# Reuses fitz, subprocess, json and CAPTION from the scanned-PDF code above
def extract_hybrid_pdf(pdf_path, text_prompt=None, image_prompt=None):
    """Handle PDFs where some pages have text, others are scanned."""
    # Note: text_prompt is accepted but not used; text pages are returned verbatim
    doc_fitz = fitz.open(pdf_path)
    all_text = []
    for i, page in enumerate(doc_fitz):
        raw_text = page.get_text("text").strip()
        if len(raw_text) > 50:
            # Text page: use directly
            all_text.append(f"=== Page {i+1} (text) ===\n{raw_text}")
        else:
            # Image page: render and caption
            mat = fitz.Matrix(150/72, 150/72)
            pix = page.get_pixmap(matrix=mat)
            img_path = f"/tmp/hybrid_page_{i+1}.png"
            pix.save(img_path)
            cmd = ["python3", CAPTION, img_path, "--json"]
            prompt = image_prompt or "提取页面中所有文字和表格内容,Markdown格式输出。"
            cmd += ["--prompt", prompt]
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
            if r.returncode == 0:
                desc = json.loads(r.stdout).get("description", "")
                all_text.append(f"=== Page {i+1} (image→caption) ===\n{desc}")
            else:
                all_text.append(f"=== Page {i+1} (caption failed) ===")
    doc_fitz.close()
    return '\n\n'.join(all_text)
import fitz
from PIL import Image

def extract_pdf_images(pdf_path, min_width=100, min_height=100):
    """Extract all embedded images from a PDF (charts, diagrams, photos)."""
    doc = fitz.open(pdf_path)
    image_paths = []
    for page_num, page in enumerate(doc):
        for img_idx, img in enumerate(page.get_images(full=True)):
            xref = img[0]  # cross-reference number of the image object
            base = doc.extract_image(xref)
            img_bytes = base["image"]
            ext = base["ext"]
            img_path = f"/tmp/pdf_img_p{page_num+1}_{img_idx}.{ext}"
            with open(img_path, 'wb') as f:
                f.write(img_bytes)
            # Only keep images above size threshold (skip icons/logos)
            with Image.open(img_path) as im:
                w, h = im.size
            if w >= min_width and h >= min_height:
                image_paths.append({'page': page_num+1, 'path': img_path, 'size': (w, h)})
                print(f"  Page {page_num+1}, img {img_idx}: {w}×{h} → {img_path}")
    doc.close()
    return image_paths

# After extracting, caption each image:
# for img_info in image_paths:
#     caption_image(img_info['path'], prompt="提取图表数据,Markdown 表格输出。")
# When PDF contains multiple invoices (one per page):
tables_by_page = extract_tables_pdf(pdf_path)
invoices = []
for item in tables_by_page:
    df = item['df']
    # Find key fields (flexible column name matching)
    for col in df.columns:
        if '金额' in str(col) or 'amount' in str(col).lower():
            invoices.append({'page': item['page'], 'amount_col': col, 'data': df})
            break
# Each entry is one table, so a page with two matching tables is counted twice
print(f"Found {len(invoices)} tables with amount data")
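Once an amount column has been located, its cells are still strings and usually need cleaning before they can be totalled. Below is a small sketch of that step, using `Decimal` to avoid floating-point drift on currency values. The helper name `sum_amounts` and the set of stripped symbols are assumptions for illustration, and the sample cells are made up.

```python
from decimal import Decimal, InvalidOperation


def sum_amounts(cells, strip_chars=',,¥$ '):
    """Sum the parseable cells; return (total, cells that were skipped)."""
    total = Decimal('0')
    skipped = []
    for cell in cells:
        cleaned = str(cell or '')
        for ch in strip_chars:
            cleaned = cleaned.replace(ch, '')
        try:
            value = Decimal(cleaned)
        except InvalidOperation:
            skipped.append(cell)      # header text, blanks, labels
            continue
        if not value.is_finite():
            skipped.append(cell)      # 'NaN' or 'Infinity' strings
            continue
        total += value
    return total, skipped


# Illustrative cells such as df[amount_col].tolist() might return
total, skipped = sum_amounts(['1,234.50', '¥99.5', None, '合计'])
print(total, skipped)
```

Returning the skipped cells makes it easy to spot a subtotal row or a misaligned column that would otherwise be dropped silently.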
import re

def extract_number_with_unit(text_snippet):
    """
    Extract value and unit from text like '1,760 千港元' or '95,975,196,217.52元'.
    Returns (numeric_value, unit_string); numeric_value is scaled to base units.
    """
    # Remove thousands separators (ASCII and full-width commas)
    text_snippet = text_snippet.replace(',', '').replace(',', '')
    # The number pattern requires at least one digit, so a stray '.' never reaches float()
    match = re.search(r'(\d*\.?\d+)\s*(千|万|亿|百万)?\s*(元|港元|美元|人民币|%|percent)?', text_snippet)
    if not match:
        return None, None
    value = float(match.group(1))
    multiplier_map = {'千': 1000, '万': 10000, '亿': 1e8, '百万': 1e6}
    mult = multiplier_map.get(match.group(2), 1)
    unit = match.group(3) or ''
    return value * mult, f"{match.group(2) or ''}{unit}"

# Always verify unit matches what the question asks:
# "多几多" in HKD → answer in 千港元 if source says 千港元
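To make the unit check concrete, here is a short worked sketch of comparing two figures that use different multipliers and reporting the difference in the unit the source uses. The multiplier table mirrors `multiplier_map` above. The helper name `to_unit` and the second figure (2.5 百万港元) are illustrative assumptions, not data from any document.

```python
# Same multipliers as multiplier_map, plus '' for plain base units
MULTIPLIERS = {'': 1, '千': 1_000, '万': 10_000, '百万': 1_000_000, '亿': 100_000_000}


def to_unit(base_value, target_multiplier=''):
    """Express a base-unit value in the requested multiplier (e.g. '千')."""
    return base_value / MULTIPLIERS[target_multiplier]


# Two figures scaled to base units (港元) before any arithmetic
a = 1760 * MULTIPLIERS['千']     # '1,760 千港元'
b = 2.5 * MULTIPLIERS['百万']    # '2.5 百万港元' (made-up figure)
diff = b - a

# Report in 千港元 because that is the unit the source table uses
print(f"{to_unit(diff, '千'):,.0f} 千港元")
```

Scaling everything to base units first, then converting once at the end, avoids mixing 千 and 百万 figures in the same subtraction.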
def find_in_pdf(pdf_path, keyword, context_chars=200):
    """Search for keyword across all pages, return context snippets."""
    text = extract_text_pdf(pdf_path)
    results = []
    start = 0
    while True:
        idx = text.find(keyword, start)
        if idx < 0:
            break
        # Keep half the context before the hit and the full amount after it
        snippet = text[max(0, idx-context_chars//2): idx+context_chars]
        results.append({'pos': idx, 'context': snippet})
        start = idx + 1
    print(f"Found '{keyword}' {len(results)} times")
    return results
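`find_in_pdf()` reports character offsets, but it is often more useful to know which page a hit came from. Since the extraction functions above prefix every page with a `=== Page N ===` marker, the page can be recovered from the joined text. This is a minimal sketch under that assumption. The helper name `find_with_page` is hypothetical, and it would be confused by a page whose own text contains a line shaped like a marker.

```python
import re

# Matches '=== Page 3 ===' as well as '=== Page 3 (text) ==='
PAGE_MARKER = re.compile(r'^=== Page (\d+)[^=]*===$', re.MULTILINE)


def find_with_page(full_text, keyword):
    """Return the page numbers whose text contains the keyword."""
    markers = list(PAGE_MARKER.finditer(full_text))
    pages = []
    for k, m in enumerate(markers):
        # A page body runs from its marker to the next marker (or end of text)
        end = markers[k + 1].start() if k + 1 < len(markers) else len(full_text)
        if keyword in full_text[m.end():end]:
            pages.append(int(m.group(1)))
    return pages


sample = (
    "=== Page 1 ===\nRevenue 100\n\n"
    "=== Page 2 (text) ===\nCosts 40\n\n"
    "=== Page 3 ===\nRevenue 120"
)
print(find_with_page(sample, "Revenue"))
```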
| Pitfall | Fix |
|---|---|
| Using pdfplumber on a scanned PDF returns an empty result | Detect the type first (Method 0, detect_pdf_type()); use the OCR/caption path for scanned PDFs |
| Only page 1 is read, so remaining invoices/data are missed | Always loop with for page in doc; never read only index [0] |
| Table columns misaligned after extraction | Print headers + first 3 rows to verify before computing |
| Number reported as % when the question asks for an absolute value | Read the question carefully; extract_number_with_unit() returns the unit alongside the value |
| Chart data embedded as an image, so pdfplumber returns nothing | Extract images (Method 5, extract_pdf_images()), then caption each |
| Long doc loses cross-page context | Use find_in_pdf() for keyword search across full text |
| .pdf input actually contains multiple scanned docs (zip of PDFs) | Check whether the input is a directory or an archive; unzip first |
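The last pitfall can be handled with a small pre-processing step that normalises whatever was supplied into a list of PDF paths. The sketch below uses only the standard library. The helper name `collect_pdfs` and the choice to unpack into a temporary directory are assumptions, and it does not guard against untrusted archives.

```python
import tempfile
import zipfile
from pathlib import Path


def _pdfs_under(folder):
    """All files below folder with a .pdf extension (case-insensitive)."""
    return sorted(q for q in folder.rglob('*')
                  if q.is_file() and q.suffix.lower() == '.pdf')


def collect_pdfs(input_path, extract_dir=None):
    """Return PDF paths from a single file, a directory, or a zip archive."""
    p = Path(input_path)
    if p.is_dir():
        return _pdfs_under(p)
    # A zip archive is detected by content, so a mislabeled '.pdf' is caught too
    if p.is_file() and zipfile.is_zipfile(p):
        target = Path(extract_dir or tempfile.mkdtemp(prefix='pdf_unzip_'))
        with zipfile.ZipFile(p) as zf:
            zf.extractall(target)
        return _pdfs_under(target)
    return [p]
```

Each returned path can then go through `detect_pdf_type()` and the matching extraction function on its own.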