document-converter
Convert Office documents (PPTX, DOCX, XLSX, PDF, HTML, CSV, JSON, XML, images) to Markdown using Microsoft MarkItDown. Provides the agent with conversion strategies for academic and research workflows.
| name | document-converter |
| description | Convert Office documents (PPTX, DOCX, XLSX, PDF, HTML, CSV, JSON, XML, images) to Markdown using Microsoft MarkItDown. Provides the agent with conversion strategies for academic and research workflows. |
| tags | ["document","conversion","markdown","office","pdf","pptx","docx"] |
| version | 1.0.0 |
| requirements | ["markitdown","markitdown-mcp"] |
| install | pip install 'markitdown[all]' markitdown-mcp |
Convert Office and structured documents to Markdown for LLM consumption using Microsoft MarkItDown.
MarkItDown is a Microsoft open-source tool that converts a wide range of document formats into clean Markdown text. This is essential for academic research workflows where source materials arrive in diverse formats (PDFs from journals, PPTX from conferences, DOCX from collaborators, XLSX data tables).
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction; scanned PDFs require OCR |
| Word | .docx | Paragraphs, tables, lists, headings preserved |
| PowerPoint | .pptx | Slide-by-slide with speaker notes |
| Excel | .xlsx, .xls | Sheet-by-sheet, tables as Markdown tables |
| HTML | .html, .htm | Cleaned content extraction |
| CSV | .csv | Converted to Markdown table |
| JSON | .json | Pretty-printed structured output |
| XML | .xml | Structured text extraction |
| Images | .jpg, .png, .gif, .bmp, .tiff | OCR text extraction (requires optional deps) |
| Audio | .mp3, .wav | Transcription (requires optional deps) |
| ZIP | .zip | Recursively converts contained files |
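
The table above can double as a pre-filter that decides whether a file is worth sending to the converter. The following is a minimal sketch under that assumption: `SUPPORTED` and `detect_format` are hypothetical names introduced here for illustration, not part of MarkItDown, and the mapping is copied from the table rather than queried from the tool.

```python
from pathlib import Path
from typing import Optional

# Extension -> format label, copied by hand from the table above.
# This is an illustrative lookup, not something MarkItDown exposes.
SUPPORTED = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".html": "HTML",
    ".htm": "HTML",
    ".csv": "CSV",
    ".json": "JSON",
    ".xml": "XML",
    ".jpg": "Images",
    ".png": "Images",
    ".gif": "Images",
    ".bmp": "Images",
    ".tiff": "Images",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".zip": "ZIP",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format label for a path, or None if its extension is not listed."""
    return SUPPORTED.get(Path(path).suffix.lower())
```

Matching on the lower-cased suffix means `talk.PPTX` and `talk.pptx` are treated alike; files whose extension is not in the table return `None` and can be skipped or reported.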
When the markitdown-mcp server is configured, use the convert_to_markdown tool:
```
Tool: convert_to_markdown
Parameters:
  uri: "file:///absolute/path/to/document.pptx"
```
The uri parameter accepts:
- Local files: `file:///path/to/file.docx`
- Remote URLs: `https://example.com/paper.pdf`

Use the `file://` URI scheme for local files.

If the MCP server is not available, use the CLI directly via shell:
```bash
# Convert a single file
markitdown document.pptx

# Save output to file
markitdown document.pptx > output.md

# Convert from URL
markitdown https://example.com/paper.pdf
```
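
The two access paths described above (a `uri` for the MCP tool, or the `markitdown` CLI as a fallback) could be wrapped in small helpers. This is a sketch, not the skill's own implementation: `to_uri` and `convert_with_cli` are hypothetical names, and the only external behavior assumed is the one shown in this document, namely that `markitdown <file>` writes Markdown to standard output.

```python
import subprocess
from pathlib import Path
from typing import Callable


def to_uri(source: str) -> str:
    """Pass URLs through unchanged; turn a local path into an absolute file:// URI."""
    if source.startswith(("http://", "https://", "file://")):
        return source
    # as_uri() requires an absolute path, so make the path absolute first.
    return Path(source).absolute().as_uri()


def convert_with_cli(
    source: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run `markitdown <source>` and return whatever it prints to stdout.

    `run` is injectable so the function can be exercised without the CLI installed.
    With the default runner, a non-zero exit raises CalledProcessError (check=True).
    """
    result = run(["markitdown", source], capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.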
Convert a PPTX to Markdown, then analyze the content:
- `convert_to_markdown`

When processing multiple documents:

- `library_add_paper`

| Format | Quality | Caveats |
|---|---|---|
| DOCX | Excellent | Complex layouts may lose formatting |
| PPTX | Good | Diagrams/charts become text descriptions |
| XLSX | Good | Merged cells may not render perfectly |
| PDF | Variable | Depends on PDF type (text vs scanned) |
| HTML | Good | JavaScript-rendered content not captured |
| CSV | Excellent | Direct table conversion |
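
One way to act on these ratings in a batch is to separate files whose output can usually be used directly from files that deserve a manual check. The sketch below is an assumption about how one might apply the table, not a rule stated by MarkItDown: `QUALITY`, `needs_review`, and `triage` are hypothetical names, and treating anything other than "Excellent" as review-worthy is an arbitrary illustrative policy.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Ratings copied from the table above, keyed by file extension.
QUALITY = {
    ".docx": "Excellent",
    ".pptx": "Good",
    ".xlsx": "Good",
    ".pdf": "Variable",
    ".html": "Good",
    ".csv": "Excellent",
}


def needs_review(path: str) -> bool:
    """True when the format is unrated or rated anything other than "Excellent"."""
    return QUALITY.get(Path(path).suffix.lower()) != "Excellent"


def triage(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (trusted, review) lists, preserving input order."""
    trusted: List[str] = []
    review: List[str] = []
    for p in paths:
        (review if needs_review(p) else trusted).append(p)
    return trusted, review
```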
| Issue | Solution |
|---|---|
| markitdown not found | Run pip install 'markitdown[all]' |
| MCP tool not available | Ensure markitdown-mcp is installed and configured in openclaw.json |
| Empty output from PDF | PDF may be scanned/image-only; needs OCR dependencies |
| Encoding errors | Ensure file is not corrupted; try re-saving from source application |
| Large file timeout | Convert via CLI instead of MCP for very large files |
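
Two of the rows above can be checked automatically before or after a conversion. The following sketch assumes a hypothetical `diagnose` helper (not part of MarkItDown) that receives the converted text, the file suffix, and the result of locating the CLI; the messages simply restate the table.

```python
import shutil
from typing import Optional


def diagnose(markdown: str, suffix: str, cli_path: Optional[str]) -> Optional[str]:
    """Return a hint matching the troubleshooting table, or None if nothing stands out."""
    if cli_path is None:
        return "markitdown not found: run pip install 'markitdown[all]'"
    if suffix.lower() == ".pdf" and not markdown.strip():
        return "Empty output from PDF: file may be scanned/image-only and need OCR dependencies"
    return None


def find_cli() -> Optional[str]:
    """Locate the markitdown executable on PATH; None means it is not installed."""
    return shutil.which("markitdown")
```

A typical call would be `diagnose(text, ".pdf", find_cli())`, run after a conversion whose result looks suspicious.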
```bash
# Install MarkItDown and the MCP server
pip install 'markitdown[all]' markitdown-mcp

# Verify the installation
markitdown --help
```
MarkItDown is pre-installed in the Research-Claw Docker image.