| name | docx-editor-cn |
| description | Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of "Word doc", "word document", ".docx", or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a "report", "memo", "letter", "template", or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation. |
| license | Proprietary. LICENSE.txt has complete terms |
DOCX creation, editing, and analysis
Overview
A .docx file is a ZIP archive containing XML files.
Quick Reference
| Task | Approach |
|---|---|
| Read/analyze content | pandoc or unpack for raw XML |
| Create new document | node scripts/new_doc.js (edit CONTENT section first) |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
| Insert 三线表 (XML editing) | python scripts/table.py unpacked/ "1-1" "标题" --headers … --rows … |
| Insert block formula (XML editing) | python scripts/formula.py unpacked/ "LaTeX" 1 --anchor "锚文本" |
Converting .doc to .docx
Legacy .doc files must be converted before editing:
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content
pandoc --track-changes=all document.docx -o output.md
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: npm install -g docx
Setup
const fs = require('fs'); // needed for fs.writeFileSync / fs.readFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak, LineRuleType } = require('docx');
const doc = new Document({ sections: [{ children: [] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
python scripts/office/validate.py doc.docx
Page Size
sections: [{
properties: {
page: {
size: {
width: 11906,
height: 16838
},
margin: { top: 1418, right: 1418, bottom: 1418, left: 1418 }
}
},
children: []
}]
Common page sizes (DXA units, 1440 DXA = 1 inch, 567 DXA = 1 cm):
| Paper | Width | Height | Content Width (2.5cm margins) |
|---|---|---|---|
| A4 (Chinese standard) | 11,906 | 16,838 | 9,070 |
| US Letter | 12,240 | 15,840 | 9,404 |
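The figures in the table follow from the two conversion factors quoted above. A minimal sketch of that arithmetic, where `cmToDxa` and `contentWidth` are illustrative helper names rather than functions from docx-js or `scripts/new_doc.js`:

```javascript
// Conversion factor quoted in this guide: 567 DXA = 1 cm.
const DXA_PER_CM = 567;

// Convert centimetres to DXA, rounded to a whole unit.
function cmToDxa(cm) {
  return Math.round(cm * DXA_PER_CM);
}

// Usable text width: page width minus the left and right margins.
function contentWidth(pageWidthDxa, marginDxa) {
  return pageWidthDxa - 2 * marginDxa;
}

const margin = cmToDxa(2.5);
console.log(margin);                       // 1418
console.log(contentWidth(11906, margin));  // 9070 (A4)
console.log(contentWidth(12240, margin));  // 9404 (US Letter)
```

Deriving the content width this way keeps table and formula widths consistent if the margins ever change.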
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
size: {
width: 12240,
height: 15840,
orientation: PageOrientation.LANDSCAPE
},
Styles (Academic Thesis Formatting)
CRITICAL: When generating academic papers, course designs, or mathematical documents, strictly adhere to the following Chinese academic formatting standards.
1. Global Page & Normal Text (【全局页面与正文规范】)
- Page: A4 (11906 × 16838 DXA), margins all 2.5cm (1418 DXA).
- Fonts: Chinese = SimSun (宋体), English/Math/Code = Cambria Math (default); Times New Roman is an acceptable alternative for English text.
- Size: 12pt (小四) -> size: 24 in docx-js (half-points).
- Paragraph: First-line indent 2 characters (firstLine: 480 DXA), line spacing single (line: 240, lineRule: LineRuleType.AUTO) by default; alternatives: fixed 20pt (line: 400, lineRule: LineRuleType.EXACT) or 1.5x (line: 360, lineRule: LineRuleType.AUTO). Before/After spacing 0pt.
2. Headings (【标题规范】)
- Heading 1 (一级标题): SimHei (黑体), 16pt (三号, size: 32), Bold, Centered, 1.2x line spacing (line: 288). Auto-numbering: Arabic 1, 2, 3 (displayed in heading text or prefixed manually as 一、二、). See Heading & Reference Auto-Numbering below.
- Heading 2 (二级标题): SimHei (黑体), 14pt (四号, size: 28), Bold, Left-aligned, 1.5x line spacing (line: 360). Auto-numbering: 1.1, 1.2, 2.1 (chapter-tracking Arabic decimal).
- Heading 3 (三级标题): SimHei (黑体), 12pt (小四, size: 24), Bold, Left-aligned, 1.1x line spacing (line: 264). Auto-numbering: 1.1.1, 2.4.1.
⚠️ Heading Numbering Rule: Use a single all-decimal multi-level numbering config for all three heading levels. This is the ONLY way to get correct cross-chapter tracking (1.1, 2.1, 2.4 etc.) while keeping H2/H3 numbers purely Arabic. If CHINESE_COUNTING is used for H1 level, H2 will render as 二.4 instead of 2.4 — this is a known OOXML rendering issue.
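The size and line values in the specs above all come from three unit conversions: font sizes are half-points, proportional line spacing is a multiple of 240, and fixed line spacing is in twentieths of a point. A small sketch of those conversions, using hypothetical helper names that are not part of docx-js:

```javascript
// Font size: docx-js expects half-points, so 12pt becomes 24.
function ptToHalfPoints(pt) {
  return pt * 2;
}

// Proportional spacing (lineRule AUTO): 240 means single spacing.
function lineMultiple(multiple) {
  return Math.round(multiple * 240);
}

// Fixed spacing (lineRule EXACT): twentieths of a point.
function lineExactPt(pt) {
  return pt * 20;
}

console.log(ptToHalfPoints(16)); // 32  (Heading 1, 三号)
console.log(ptToHalfPoints(11)); // 22  (captions)
console.log(lineMultiple(1.2));  // 288 (Heading 1)
console.log(lineMultiple(1.1));  // 264 (Heading 3)
console.log(lineExactPt(20));    // 400 (fixed 20pt body text)
```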
3. Figures, Tables & Math (【图表与公式规范】)
- Figure Captions (图标题): SimSun (宋体), 11pt (size: 22), Centered, Bold. Spacing: Single spacing, 0.5 lines before (before: 120), 3pt after (after: 60). Position: Below the figure. Format: "图 章-图序" e.g. "图 1-1".
- Table Captions (表标题): SimSun (宋体), 11pt (size: 22), Centered, no indent, Bold. Spacing: Single spacing, 0.5 lines before (before: 120), 3pt after (after: 60). Position: Above the table. Format: "表 章-表序" e.g. "表 1-1". Tables MUST use 三线表 (three-line table) style (thick top/bottom, thin after header, no other borders).
- Block Math Formulas: Use the formula table layout, a 3-column borderless table: [1cm spacer | formula centered | equation number right-aligned 1cm]. Use scripts/formula.py to generate the XML. Inline math: write as plain text with italic formatting.
Creating a new document — use the template script:
node scripts/new_doc.js
scripts/new_doc.js provides ready-to-use helper functions: h1/h2/h3(text), body(text), tableCaption/figCaption(label), threeLineTable(headers, rows, colWidths), ref(text), blank(). The STYLES / NUMBERING / page constants at the top of the file are the canonical implementation of all specs below; refer to that file for exact docx-js values.
Lists (NEVER use unicode bullets)
new Paragraph({ children: [new TextRun("• Item")] })
new Paragraph({ children: [new TextRun("\u2022 Item")] })
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{
children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 },
children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 },
children: [new TextRun("Numbered item")] }),
]
}]
});
Tables (三线表 / Three-Line Table)
Chinese academic papers MUST use 三线表: thick top/bottom (1.5pt), thin after header (0.75pt), no other borders. Caption goes above the table.
For new documents — use threeLineTable() helper in scripts/new_doc.js:
tableCaption('符号说明'),
threeLineTable(
['符号', '说明'],
[['S', '状态空间'], ['A', '动作空间']],
[1800, 7270]
),
For editing existing documents (XML) — use scripts/table.py:
python scripts/table.py unpacked/ "1-1" "符号说明" \
--headers "符号,说明" \
--rows '[["S","状态空间"],["A","动作空间"]]' \
--anchor "以下是符号说明"
python scripts/table.py unpacked/ "2-1" "参数" \
--headers "参数,取值,说明" \
--rows '[["lr","0.001","学习率"]]' \
--widths "1800,1500,5770"
python scripts/table.py --caption "示例" \
--headers "列1,列2" --rows '[["a","b"]]'
Key rules:
- Content width = 9070 DXA (A4 2.5cm margins); columnWidths must sum exactly to 9070
- Always use WidthType.DXA, never WidthType.PERCENTAGE
- Table width = sum of columnWidths; set matching width on each cell too
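Because the widths must sum to exactly 9070, dividing by hand is prone to off-by-one rounding errors. One way to avoid that is sketched below; `fitColumnWidths` is a hypothetical helper, not something provided by docx-js or `scripts/new_doc.js`:

```javascript
const CONTENT_WIDTH_DXA = 9070; // A4 with 2.5 cm margins, as stated above

// Turn relative column ratios into integer DXA widths that sum exactly to `total`.
function fitColumnWidths(ratios, total = CONTENT_WIDTH_DXA) {
  const ratioSum = ratios.reduce((a, b) => a + b, 0);
  const widths = ratios.map(r => Math.floor((total * r) / ratioSum));
  const used = widths.reduce((a, b) => a + b, 0);
  widths[widths.length - 1] += total - used; // give the rounding remainder to the last column
  return widths;
}

console.log(fitColumnWidths([1, 1, 1]));    // [ 3023, 3023, 3024 ]
console.log(fitColumnWidths([1800, 7270])); // [ 1800, 7270 ]
```

The resulting array can serve both as columnWidths and as the per-cell widths, which keeps the two in sync.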
Images
new Paragraph({
children: [new ImageRun({
type: "png",
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" }
})]
})
Page Breaks
new Paragraph({ children: [new PageBreak()] })
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
Table of Contents
new TableOfContents("目录", { hyperlink: true, headingStyleRange: "1-3" })
new Paragraph({
heading: HeadingLevel.HEADING_1,
children: [new TextRun("一、引言")]
})
new Paragraph({
heading: HeadingLevel.HEADING_2,
children: [new TextRun("1.1 研究背景")]
})
Headers/Footers
sections: [{
properties: {
page: { margin: { top: 1418, right: 1418, bottom: 1418, left: 1418 } }
},
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: []
}]
Heading Auto-Numbering (Word per-chapter reset)
H1 uses Chinese numerals written literally in text (一、二、三). H2/H3 use Word auto-numbering via per-chapter numbering references (sections_c1, sections_c2, ...). Each chapter boundary resets H2/H3 counters automatically — adding/removing sections within a chapter triggers full auto-update.
CRITICAL — numbering format rule:
NEVER use LevelFormat.CHINESE_COUNTING for H1 in a multi-level config — %1 in H2 text expands to "二", giving 二.4. Always use LevelFormat.DECIMAL for every level.
Numbering references are generated dynamically by buildNumberingConfig(chapterCount). Each chapter gets its own sections_c{N} reference with 2 levels (H2, H3). This ensures H2/H3 reset per chapter while Word handles all numbering.
h1Chinese('一、引言')
h2('研究背景')
h2('文献综述')
h3('研究现状')
h3('存在不足')
h1Chinese('二、方法')
h2('算法设计')
h3('数据结构')
The _chapter counter is incremented by h1Chinese() and used by h2(), h3(), figCaption(), and tableCaption() to select the correct numbering reference / SEQ identifier.
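To make the per-chapter reset concrete, here is a rough sketch of what a generator like buildNumberingConfig(chapterCount) might return. It is not the code from `scripts/new_doc.js`: the level text patterns (`N.%1`, `N.%1.%2`) and the function name `buildNumberingConfigSketch` are assumptions, and the strings `'decimal'` and `'left'` stand in for what are assumed to be the values behind LevelFormat.DECIMAL and AlignmentType.LEFT.

```javascript
// Hypothetical sketch; the real buildNumberingConfig() lives in scripts/new_doc.js.
function buildNumberingConfigSketch(chapterCount) {
  const config = [];
  for (let n = 1; n <= chapterCount; n++) {
    config.push({
      reference: `sections_c${n}`,
      levels: [
        // Level 0 = H2: the chapter number is literal text, %1 is Word's own counter.
        { level: 0, format: 'decimal', text: `${n}.%1`, alignment: 'left' },
        // Level 1 = H3: %2 restarts whenever a new H2 appears.
        { level: 1, format: 'decimal', text: `${n}.%1.%2`, alignment: 'left' },
      ],
    });
  }
  return config;
}

const config = buildNumberingConfigSketch(2);
console.log(config.map(c => c.reference)); // [ 'sections_c1', 'sections_c2' ]
console.log(config[1].levels[0].text);     // 2.%1
```

Since each chapter has its own reference, Word starts its H2 counter at 1 in every chapter without any restart logic in the script.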
Caption Auto-Numbering (SEQ Fields)
Captions use Word SEQ fields for per-chapter auto-numbering. The chapter number is a literal prefix, and the sequence counter is chapter-specific (e.g., figure_c3, table_c1).
figCaption('系统架构图')
figCaption('数据流图')
tableCaption('符号说明')
Pass ONLY the description text. Never include manual "图 3-1" or "表 1-1" prefixes — the SEQ field handles numbering automatically.
Markdown Number Stripping
When converting markdown, use these helpers to strip manual numbers from input text:
stripH1Number('一、引言')
stripH2Number('1.1 研究背景')
stripCaptionNumber('图 3-1 系统')
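The three calls above are expected to return the text with its manual number removed. The sketch below shows one plausible regex-based way to get that behavior. These are illustrative re-implementations under that assumption; the helpers in `scripts/new_doc.js` may use different patterns.

```javascript
// Hypothetical re-implementations for illustration only.

// Remove a leading Chinese numeral followed by "、", e.g. "一、".
function stripH1Number(text) {
  return text.replace(/^[一二三四五六七八九十百]+、\s*/, '');
}

// Remove a leading dotted decimal such as "1.1 " or "2.4.1 ".
function stripH2Number(text) {
  return text.replace(/^\d+(?:\.\d+)+\s*/, '');
}

// Remove a leading "图 3-1 " or "表 1-1 " caption prefix.
function stripCaptionNumber(text) {
  return text.replace(/^[图表]\s*\d+[-.]\d+\s*/, '');
}

console.log(stripH1Number('一、引言'));         // 引言
console.log(stripH2Number('1.1 研究背景'));     // 研究背景
console.log(stripCaptionNumber('图 3-1 系统')); // 系统
```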
Block Formula Layout
Block formulas use a 3-column borderless table so the equation number can be right-aligned while the formula is centered. Use scripts/formula.py to generate and insert formula blocks automatically.
python scripts/formula.py unpacked/ "Q_n(x,a) = r + \\gamma \\max_{a'} Q_{n-1}" 1 --anchor "由此可得"
python scripts/formula.py --latex "E=mc^2" --number 2
Layout (A4, 9070 DXA content width):
┌──────────┬──────────────────────────┬──────────┐
│ 1cm gap │ formula (centered) │ (n) │
│ 567 DXA │ 7936 DXA │ 567 DXA │
└──────────┴──────────────────────────┴──────────┘
no border throughout; single line spacing
If direct XML editing is preferred, the skeleton is:
<w:tbl>
<w:tblPr>
<w:tblW w:w="9070" w:type="dxa"/>
<w:tblBorders>
<w:top w:val="none"/><w:left w:val="none"/>
<w:bottom w:val="none"/><w:right w:val="none"/>
<w:insideH w:val="none"/><w:insideV w:val="none"/>
</w:tblBorders>
</w:tblPr>
<w:tblGrid>
<w:gridCol w:w="567"/><w:gridCol w:w="7936"/><w:gridCol w:w="567"/>
</w:tblGrid>
<w:tr>
<w:tc><w:tcPr><w:tcW w:w="567" w:type="dxa"/></w:tcPr>
<w:p/></w:tc>
<w:tc><w:tcPr><w:tcW w:w="7936" w:type="dxa"/></w:tcPr>
<w:p><w:pPr><w:jc w:val="center"/>
<w:spacing w:line="240" w:lineRule="auto"/></w:pPr>
<m:oMath xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
</m:oMath>
</w:p></w:tc>
<w:tc><w:tcPr><w:tcW w:w="567" w:type="dxa"/></w:tcPr>
<w:p><w:pPr><w:jc w:val="right"/>
<w:spacing w:line="240" w:lineRule="auto"/></w:pPr>
<w:r><w:t>(%NUMBER%)</w:t></w:r>
</w:p></w:tc>
</w:tr>
</w:tbl>
References (GB/T 7714-2015)
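Use a "references" numbering reference, declared in the numbering config alongside "bullets" and "numbers" (the Lists example above does not include it, so it must be added). Format text manually per GB/T 7714-2015 entry type.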
{ id: "Reference", name: "Reference", basedOn: "Normal",
run: { font: { ascii: "Cambria Math", eastAsia: "SimSun" }, size: 24 },
paragraph: { spacing: { line: 240, lineRule: LineRuleType.AUTO },
indent: { left: 480, hanging: 480 } } }
new Paragraph({
style: "Reference",
numbering: { reference: "references", level: 0 },
children: [new TextRun("Author. "),
new TextRun({ text: "Title", italics: true }),
new TextRun("[J]. Journal, Year, Vol(Issue): Pages.")]
})
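The paragraph above relies on a "references" numbering reference. A minimal sketch of such an entry for the numbering.config array follows. It is an assumption rather than the definition in `scripts/new_doc.js`: the bracketed `[%1]` text is chosen to match GB/T 7714 style lists, and `'decimal'` / `'left'` stand in for what are assumed to be the values of LevelFormat.DECIMAL / AlignmentType.LEFT.

```javascript
// Hypothetical "references" entry; add it to numbering.config next to "bullets" and "numbers".
const referencesNumbering = {
  reference: 'references',
  levels: [{
    level: 0,
    format: 'decimal',
    text: '[%1]',        // renders as [1], [2], [3], ...
    alignment: 'left',
    // Hanging indent matching the Reference paragraph style above.
    style: { paragraph: { indent: { left: 480, hanging: 480 } } },
  }],
};

console.log(referencesNumbering.levels[0].text); // [%1]
```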
GB/T 7714-2015 common format templates (fill in fields):
| Type | Tag | Format |
|---|---|---|
| 期刊论文 | [J] | 作者. 题名[J]. 刊名, 年, 卷(期): 起止页. |
| 专著 | [M] | 作者. 书名[M]. 出版地: 出版者, 出版年: 起止页. |
| 学位论文 | [D] | 作者. 题名[D]. 保存地: 保存单位, 年份. |
| 会议论文 | [C] | 作者. 题名[C]//会议名. 出版地: 出版者, 年: 起止页. |
| 网络资源 | [EB/OL] | 作者. 题名[EB/OL]. (发布日期)[引用日期]. URL. |
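Filling a template by hand is easy to get wrong in the punctuation. The sketch below builds the 期刊论文 [J] row from a plain object. The function name and field names are illustrative assumptions, and the sample values are placeholders rather than a real citation.

```javascript
// Hypothetical formatter for the [J] template: 作者. 题名[J]. 刊名, 年, 卷(期): 起止页.
function formatJournalArticle({ authors, title, journal, year, volume, issue, pages }) {
  return `${authors.join(', ')}. ${title}[J]. ${journal}, ${year}, ${volume}(${issue}): ${pages}.`;
}

// Placeholder values only.
console.log(formatJournalArticle({
  authors: ['作者甲', '作者乙'],
  title: '题名',
  journal: '刊名',
  year: 2020,
  volume: 1,
  issue: 2,
  pages: '10-20',
}));
// 作者甲, 作者乙. 题名[J]. 刊名, 2020, 1(2): 10-20.
```

The returned string can be split into TextRun pieces if parts of the entry need separate formatting.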
Footer (Page Numbers)
Center-aligned page number in the footer. Uses A4 2.5cm margins.
const { Footer, Paragraph, TextRun, PageNumber, AlignmentType } = require('docx');
sections: [{
properties: {
page: {
size: { width: 11906, height: 16838 },
margin: { top: 1418, right: 1418, bottom: 1418, left: 1418 }
}
},
footers: {
default: new Footer({
children: [
new Paragraph({
alignment: AlignmentType.CENTER,
children: [new TextRun({ children: [PageNumber.CURRENT] })]
})
]
})
},
children: []
}]
new Paragraph({
alignment: AlignmentType.CENTER,
children: [
new TextRun("第 "),
new TextRun({ children: [PageNumber.CURRENT] }),
new TextRun(" 页 共 "),
new TextRun({ children: [PageNumber.TOTAL_PAGES] }),
new TextRun(" 页")
]
})
Critical Rules for docx-js
- Set page size explicitly - always use A4 (11906 × 16838 DXA) with 2.5cm margins (1418 DXA) for Chinese academic documents
- Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass short edge as width, long edge as height, and set orientation: PageOrientation.LANDSCAPE
- Never use \n - use separate Paragraph elements
- Never use unicode bullets - use LevelFormat.BULLET with numbering config
- PageBreak must be in Paragraph - standalone creates invalid XML
- ImageRun requires type - always specify png/jpg/etc
- Heading numbering: per-chapter Word auto-numbering - H1 uses Chinese numerals in text (一、二、三) via h1Chinese(). H2/H3 use Word auto-numbering via per-chapter sections_c{N} references generated by buildNumberingConfig(chapterCount). NEVER use LevelFormat.CHINESE_COUNTING in a multi-level config. Pass H2/H3 heading text WITHOUT manual number prefixes; Word handles their numbering.
- Figure/Table captions: no manual numbers - use figCaption('描述') without "图 3-1 " prefix. SEQ fields generate "图 章-序 描述" automatically. Same for tableCaption(). Use stripCaptionNumber() to clean markdown input.
- Tables MUST be 三线表 - use thick top/bottom borders (size:12, 1.5pt) and thin header-bottom border (size:6, 0.75pt); set all other borders to BorderStyle.NONE; no fill color on header cells
- Table caption: no first-line indent - override Normal style's firstLine with indent: { firstLine: 0 } on the caption paragraph
- Block formulas use 3-column table - [567 DXA spacer | 7936 DXA formula centered | 567 DXA number right-aligned]; use scripts/formula.py to generate; never use plain $$ LaTeX in the final docx
- Always set table width with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)
- Tables need dual widths - columnWidths array AND cell width, both must match
- Table width = sum of columnWidths - for DXA, ensure they add up exactly
- Always add cell margins - use margins: { top: 80, bottom: 80, left: 120, right: 120 } for readable padding
- Use ShadingType.CLEAR - never SOLID for table shading
- TOC: use heading: HeadingLevel.HEADING_X - applies the overridden Heading style (with outlineLevel), giving TOC recognition AND custom formatting simultaneously
- Override built-in styles - use exact IDs: "Heading1", "Heading2", etc.; always use the font: { ascii, eastAsia, hAnsi } object form, because the font: "SimHei" string shorthand only sets the ASCII slot and leaves Chinese characters falling back to the Normal font
- Include outlineLevel - required for TOC (0 for H1, 1 for H2, etc.)
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
python scripts/office/unpack.py document.docx unpacked/
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (&#x201C; etc.) so they survive editing. Use --merge-runs false to skip run merging.
Step 2: Edit XML
Edit files in unpacked/word/. See XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests use of a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
| Entity | Character |
|---|---|
| &#x2018; | ‘ (left single) |
| &#x2019; | ’ (right single / apostrophe) |
| &#x201C; | “ (left double) |
| &#x201D; | ” (right double) |
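To make the mapping concrete, the sketch below converts the four smart-quote characters into their numeric entity form. It only illustrates the mapping; the edits themselves are still made with the Edit tool as described in this step, and `encodeSmartQuotes` is a hypothetical name.

```javascript
// Smart-quote characters and the numeric entities that represent them in XML.
const SMART_QUOTE_ENTITIES = {
  '\u2018': '&#x2018;', // left single
  '\u2019': '&#x2019;', // right single / apostrophe
  '\u201C': '&#x201C;', // left double
  '\u201D': '&#x201D;', // right double
};

function encodeSmartQuotes(text) {
  return text.replace(/[\u2018\u2019\u201C\u201D]/g, ch => SMART_QUOTE_ENTITIES[ch]);
}

console.log(encodeSmartQuotes('Here\u2019s a quote: \u201CHello\u201D'));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```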
Adding comments: Use comment.py to handle boilerplate across multiple XML files (text must be pre-escaped XML):
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"
Then add markers to document.xml (see Comments in XML Reference).
Step 3: Pack
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates DOCX. Use --validate false to skip.
Auto-repair will fix:
- durableId >= 0x7FFFFFFF (regenerates valid ID)
- Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair won't fix:
- Malformed XML, invalid element nesting, missing relationships, schema violations
Common Pitfalls
- Replace entire <w:r> elements: When adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Don't inject tracked change tags inside a run.
- Preserve <w:rPr> formatting: Copy the original run's <w:rPr> block into your tracked change runs to maintain bold, font size, etc.
- Never regenerate styles.xml: When editing an existing document, all template styles live in word/styles.xml. Edit only document.xml content; do not overwrite or recreate styles.xml unless explicitly asked, because doing so erases the user's template.
- Preserve <w:pStyle> references: When inserting new paragraphs, copy the <w:pStyle w:val="..."/> from an adjacent paragraph of the same type. Omitting <w:pStyle> silently falls back to the document default style, losing all heading/body formatting from the template.
- Smart quotes are XML-encoded after unpack: The unpack step converts “ / ” to &#x201C; / &#x201D; entities. When searching document.xml for heading text that contains Chinese quotation marks, search for the entity form (&#x201C;), not the raw Unicode character.
XML Reference
Schema Compliance
- Element order in <w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, <w:rPr> last
- Whitespace: Add xml:space="preserve" to <w:t> with leading/trailing spaces
- RSIDs: Must be 8-digit hex (e.g., 00AB1234)
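The three rules above are mechanical enough to express as small checks. The sketch below is only an illustration of the rules, with hypothetical function names. It is not how `scripts/office/validate.py` works, and it only knows the six `<w:pPr>` children listed here.

```javascript
// Child order for <w:pPr> as listed in this guide.
const PPR_ORDER = ['w:pStyle', 'w:numPr', 'w:spacing', 'w:ind', 'w:jc', 'w:rPr'];

// True if the listed children appear in schema order; unknown names are skipped.
function isValidPPrOrder(childNames) {
  let last = -1;
  for (const name of childNames) {
    const idx = PPR_ORDER.indexOf(name);
    if (idx === -1) continue;
    if (idx < last) return false;
    last = idx;
  }
  return true;
}

// RSIDs must be exactly eight hexadecimal digits.
function isValidRsid(value) {
  return /^[0-9A-Fa-f]{8}$/.test(value);
}

// <w:t> needs xml:space="preserve" when its text starts or ends with whitespace.
function needsSpacePreserve(text) {
  return /^\s|\s$/.test(text);
}

console.log(isValidPPrOrder(['w:pStyle', 'w:jc', 'w:rPr'])); // true
console.log(isValidPPrOrder(['w:jc', 'w:spacing']));         // false
console.log(isValidRsid('00AB1234'));                        // true
console.log(needsSpacePreserve(' days.'));                   // true
```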
Tracked Changes
Insertion:
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside <w:del>: Use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits - only mark what changes:
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
<w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
<w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>:
<w:p>
<w:pPr>
<w:numPr>...</w:numPr>
<w:rPr>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
</w:del>
</w:p>
Without the <w:del/> in <w:pPr><w:rPr>, accepting changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest deletion inside their insertion:
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Claude" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
Restoring another author's deletion - add insertion after (don't modify their deletion):
<w:del w:author="Jane" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
</w:ins>
Comments
After running comment.py (see Step 2), add markers to document.xml. For replies, use --parent flag and nest markers inside the parent's.
CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:commentRangeStart w:id="0"/>
<w:commentRangeStart w:id="1"/>
<w:r><w:t>text</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
Images
- Add image file to word/media/
- Add relationship to word/_rels/document.xml.rels:
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
- Add content type to [Content_Types].xml:
<Default Extension="png" ContentType="image/png"/>
- Reference in document.xml:
<w:drawing>
<wp:inline>
<wp:extent cx="914400" cy="914400"/>
<a:graphic>
<a:graphicData uri=".../picture">
<pic:pic>
<pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
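The cx and cy values on `<wp:extent>` are in EMUs (English Metric Units), where 914400 EMU equals one inch, so the example above describes a 1 inch × 1 inch image. A short sketch for computing an extent that preserves the aspect ratio, with hypothetical helper names:

```javascript
const EMU_PER_INCH = 914400;
const EMU_PER_CM = 360000; // 914400 / 2.54

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

function cmToEmu(cm) {
  return Math.round(cm * EMU_PER_CM);
}

// Scale an image to a target width in cm while keeping its aspect ratio.
function extentFor(pixelWidth, pixelHeight, targetWidthCm) {
  const cx = cmToEmu(targetWidthCm);
  const cy = Math.round((cx * pixelHeight) / pixelWidth);
  return { cx, cy };
}

console.log(inchesToEmu(1));          // 914400
console.log(extentFor(200, 150, 10)); // { cx: 3600000, cy: 2700000 }
```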
Dependencies
- pandoc: Text extraction AND LaTeX→OMML formula conversion (scripts/formula.py requires pandoc ≥ 2.0)
- docx: npm install docx (new documents)
- temml: npm install temml (LaTeX → MathML conversion for Word native math)
- fast-xml-parser: npm install fast-xml-parser (MathML parsing for docx conversion)
- LibreOffice: PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
- Poppler: pdftoppm for images
Markdown to Word Conversion (Chinese Academic Papers)
This section documents comprehensive solutions for converting Markdown papers with LaTeX formulas, tables, and citations to properly formatted Word documents following Chinese academic standards.
Quick Start
npm install docx temml fast-xml-parser
node scripts/new_doc.js
Critical Issues & Solutions
The following 10 issues (numbered 1-8, 10, and 11; there is no Issue 9) were identified and solved during production usage. All solutions are implemented in scripts/new_doc.js and scripts/mathml-to-docx.js.
Issue 1: Three-Line Table Middle Borders Visible
Problem: Body row borders in 三线表 appeared visible instead of invisible.
Solution: Set ALL body row borders to NONE, only keep:
- Header top: THICK (1.5pt)
- Header bottom: THIN (0.75pt)
- Last row bottom: THICK (1.5pt)
const THICK = { style: BorderStyle.SINGLE, size: 12, color: '000000' };
const THIN = { style: BorderStyle.SINGLE, size: 6, color: '000000' };
const NONE = { style: BorderStyle.NONE, size: 0, color: 'FFFFFF' };
cellOf(h, colWidths[i], { top: THICK, bottom: THIN, left: NONE, right: NONE }, true)
cellOf(cell, colWidths[i], {
top: NONE,
bottom: isLastRow ? THICK : NONE,
left: NONE,
right: NONE,
})
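The border rule can also be written as a single function of the row position. In the sketch below, border sizes are in eighths of a point, which is why 1.5pt is 12 and 0.75pt is 6. The name `threeLineBorders` is hypothetical, and `'single'` / `'none'` stand in for what are assumed to be the values of BorderStyle.SINGLE / BorderStyle.NONE.

```javascript
// Border size is expressed in eighths of a point.
const ptToEighths = pt => Math.round(pt * 8);
const line = pt => ({ style: 'single', size: ptToEighths(pt), color: '000000' });

const THICK = line(1.5);  // size 12
const THIN = line(0.75);  // size 6
const NONE = { style: 'none', size: 0, color: 'FFFFFF' };

// rowIndex 0 is the header row; rowCount includes the header.
function threeLineBorders(rowIndex, rowCount) {
  const isHeader = rowIndex === 0;
  const isLast = rowIndex === rowCount - 1;
  return {
    top: isHeader ? THICK : NONE,
    bottom: isLast ? THICK : (isHeader ? THIN : NONE),
    left: NONE,
    right: NONE,
  };
}

console.log(threeLineBorders(0, 3).bottom.size);  // 6  (thin rule under the header)
console.log(threeLineBorders(1, 3).bottom.style); // none
console.log(threeLineBorders(2, 3).bottom.size);  // 12 (thick bottom rule)
```

Handling the last row first means a header-only table still gets a thick bottom rule.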
Issue 2: Formula Table Borders Visible
Problem: Block formula tables (3-column layout) showed visible borders.
Solution: Set ALL borders including insideHorizontal and insideVertical to NONE:
return new Table({
width: { size: CONTENT_W, type: WidthType.DXA },
columnWidths: [567, 7936, 567],
borders: {
top: NONE,
bottom: NONE,
left: NONE,
right: NONE,
insideHorizontal: NONE,
insideVertical: NONE,
},
rows: [new TableRow({ children: [leftCell, formulaCell, numberCell] })],
});
Issue 3: Heading Numbering — Word Auto-Numbering with Per-Chapter Reset
Problem: Originally, heading numbering was baked into paragraph text via manual JS counters. When content was added/removed in Word, numbers did not update.
Solution: H2/H3 use Word's native auto-numbering via per-chapter numbering references. H1 keeps Chinese numerals in text (一、二、三). The buildNumberingConfig(chapterCount) function generates a sections_c{N} numbering reference for each chapter with 2 levels. At each chapter boundary, _chapter increments and H2/H3 switch to a fresh numbering reference, achieving automatic reset.
let _chapter = 0;
function h1Chinese(text) {
_chapter++;
return new Paragraph({
heading: HeadingLevel.HEADING_1,
indent: { firstLine: 0 },
children: [new TextRun(text)],
});
}
function h2(text) {
return new Paragraph({
heading: HeadingLevel.HEADING_2,
numbering: { reference: `sections_c${_chapter}`, level: 0 },
indent: { firstLine: 0 },
children: [new TextRun(text)],
});
}
function h3(text) {
return new Paragraph({
heading: HeadingLevel.HEADING_3,
numbering: { reference: `sections_c${_chapter}`, level: 1 },
indent: { firstLine: 0 },
children: [new TextRun(text)],
});
}
Issue 4: Heading English/Numbers in Wrong Font (SimSun instead of Cambria Math)
Problem: English text and numbers in headings, figure captions, and table captions displayed in SimSun (宋体) instead of Cambria Math.
Solution: Use mixed font configuration with ascii, eastAsia, and hAnsi properties:
{
id: 'Heading1', name: 'Heading 1', basedOn: 'Normal',
run: {
font: {
ascii: 'Cambria Math',
eastAsia: 'SimHei',
hAnsi: 'Cambria Math',
},
size: 32, bold: true,
},
}
font: { ascii: 'Cambria Math', eastAsia: 'SimSun', hAnsi: 'Cambria Math' }
Issue 5: Block Formulas Not Using Word Equation Editor
Problem: LaTeX formulas were rendered as plain text or images instead of native Word equations.
Solution: Use temml (LaTeX→MathML) + mathml-to-docx.js (MathML→OMML) pipeline:
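const temml = require('temml');
// Math is the docx-js class; within this module it shadows the global Math object.
const { Math, MathRun } = require('docx');
const { mathmlToDocxChildren } = require('./mathml-to-docx');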
function latexToMath(latex) {
const mathml = temml.renderToString(latex, { displayMode: true, throwOnError: false });
const children = mathmlToDocxChildren(mathml);
if (children && children.length) {
return new Math({ children });
}
return new Math({ children: [new MathRun(latex)] });
}
const mathObj = latexToMath('Q_n(x, a) = r + \\gamma V_{n-1}(y)');
Issue 6: Main Title English in Wrong Font
Problem: Document title's English text showed in SimSun instead of Cambria Math.
Solution: Apply same mixed font to title paragraph:
new Paragraph({
alignment: AlignmentType.CENTER,
children: [new TextRun({
text: '论文标题 Paper Title',
bold: true,
size: 36,
font: { ascii: 'Cambria Math', eastAsia: 'SimHei', hAnsi: 'Cambria Math' },
})],
})
Issue 7: Inline Math Detection Too Aggressive
Problem: Inline math regex matched plain numbers (like "1992") and English words (like "Agent", "Watkins"), incorrectly converting them to formula objects.
Solution: Use strict regex that ONLY matches actual mathematical content:
function containsMath(text) {
if (/[αβγδεζηθικλμνξπρστυφχψωΓΔΘΛΞΠΣΦΨΩ]/.test(text)) return true;
if (/[₀₁₂₃₄₅₆₇₈₉ₙₓᵢₜₛ⁰¹²³⁴⁵⁶⁷⁸⁹ⁿⁱ]/.test(text)) return true;
if (/[∞∑∏∫≤≥≠≈→←↔∈∉⊂⊃∀∃∧∨×÷±∓·…⋯′″⟨⟩]/.test(text)) return true;
if (/[A-Z]\*/.test(text)) return true;
if (/\$[^$]+\$/.test(text)) return true;
return false;
}
const mathPattern = /\$([^$]+)\$|([A-Z][₀₁₂₃₄₅₆₇₈₉ₙₓᵢₜₛ⁰¹²³⁴⁵⁶⁷⁸⁹ⁿⁱ]+\*?\s*\([^)]+\))|([A-Z]\s*\([^)]*[αβγδεζηθικλμνξπρστυφχψωΓΔΘΛΞΠΣΦΨΩ₀₁₂₃₄₅₆₇₈₉ₙₓᵢₜₛ][^)]*\))|([αβγδεζηθικλμνξπρστυφχψωΓΔΘΛΞΠΣΦΨΩ][₀₁₂₃₄₅₆₇₈₉ₙₓᵢₜₛ⁰¹²³⁴⁵⁶⁷⁸⁹ⁿⁱ]*\*?)|([A-Za-z][₀₁₂₃₄₅₆₇₈₉ₙₓᵢₜₛ⁰¹²³⁴⁵⁶⁷⁸⁹ⁿⁱ]+\*?)|([A-Z]\*)/g;
Issue 8: Citations Not in Superscript
Problem: Reference citations like [1], [2] appeared as normal text instead of superscript.
Solution: Detect citation pattern and apply superScript: true:
function containsCitation(text) {
return /\[\d+\]/.test(text);
}
const combinedPattern = /(\[\d+\])|...; // Citation first in alternation
if (match[1]) {
children.push(new TextRun({
text: match[1],
superScript: true,
}));
}
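Since the combined pattern above is elided, here is a self-contained sketch of just the citation half. It splits a string into segments and flags the `[n]` markers for superscript. `splitCitations` is a hypothetical helper, not the code in `scripts/new_doc.js`. Each returned object has the `text` / `superScript` shape used in the TextRun call above.

```javascript
// Split text into plain segments and citation segments such as "[1]".
function splitCitations(text) {
  const parts = [];
  const pattern = /\[\d+\]/g;
  let lastIndex = 0;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    if (match.index > lastIndex) {
      parts.push({ text: text.slice(lastIndex, match.index), superScript: false });
    }
    parts.push({ text: match[0], superScript: true });
    lastIndex = pattern.lastIndex;
  }
  if (lastIndex < text.length) {
    parts.push({ text: text.slice(lastIndex), superScript: false });
  }
  return parts;
}

console.log(splitCitations('如文献[1]所述。'));
// [ { text: '如文献', superScript: false },
//   { text: '[1]', superScript: true },
//   { text: '所述。', superScript: false } ]
```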
Issue 10: Page Break After Abstract and Before References
Problem: Abstract page should end after keywords, and references should start on a new page.
Solution: Add pageBreak() helper function:
function pageBreak() {
return new Paragraph({ children: [new PageBreak()] });
}
body('关键词:强化学习;Q-learning'),
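pageBreak(), // end the abstract page right after the keywords
// (main body content goes here)
pageBreak(), // start the references on a new page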
new Paragraph({
heading: HeadingLevel.HEADING_1,
children: [new TextRun('参考文献')],
}),
Important: PageBreak MUST be wrapped in a Paragraph - it cannot be used standalone.
Issue 11: Formula Table Cell Vertical Alignment
Problem: In the 3-column borderless formula table, the formula and equation number (n) are not vertically centered, causing misalignment.
Solution: Add verticalAlign: VerticalAlign.CENTER to all three TableCell definitions:
const {
VerticalAlign,
} = require('docx');
const leftCell = new TableCell({
width: { size: 567, type: WidthType.DXA },
borders: noBorders,
shading: { fill: 'FFFFFF', type: ShadingType.CLEAR },
verticalAlign: VerticalAlign.CENTER,
children: [new Paragraph({ indent: { firstLine: 0 }, children: [] })],
});
const formulaCell = new TableCell({
width: { size: 7936, type: WidthType.DXA },
borders: noBorders,
shading: { fill: 'FFFFFF', type: ShadingType.CLEAR },
verticalAlign: VerticalAlign.CENTER,
children: [new Paragraph({
alignment: AlignmentType.CENTER,
indent: { firstLine: 0 },
children: [mathObj],
})],
});
const numberCell = new TableCell({
width: { size: 567, type: WidthType.DXA },
borders: noBorders,
shading: { fill: 'FFFFFF', type: ShadingType.CLEAR },
verticalAlign: VerticalAlign.CENTER,
children: [new Paragraph({
alignment: AlignmentType.RIGHT,
indent: { firstLine: 0 },
children: [new TextRun(`(${number})`)],
})],
});
Result: The formula and the equation number are both vertically centered in the row, so they line up with each other.
File Structure
scripts/
├── new_doc.js # Main template with Issues 1-8, 10-11 fixes (~820 lines)
├── mathml-to-docx.js # MathML→docx Math converter (~250 lines)
├── formula.py # Block formula XML insertion (legacy)
├── table.py # Table XML insertion (legacy)
└── office/
├── unpack.py # DOCX→XML extraction
├── pack.py # XML→DOCX assembly
└── validate.py # Document validation
Unicode to LaTeX Mapping
The script includes comprehensive Unicode math symbol conversion:
const UNICODE_TO_LATEX = {
'α': '\\alpha', 'β': '\\beta', 'γ': '\\gamma',
'₀': '_0', '₁': '_1', '₂': '_2',
'²': '^2', '³': '^3',
'∞': '\\infty', '∑': '\\sum', '∫': '\\int',
};
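A mapping like this is typically applied character by character. The sketch below shows one way to do that using only the entries listed above. `unicodeToLatex` is a hypothetical name, and the trailing space after control words is a choice made here, not necessarily what the script does.

```javascript
const UNICODE_TO_LATEX = {
  'α': '\\alpha', 'β': '\\beta', 'γ': '\\gamma',
  '₀': '_0', '₁': '_1', '₂': '_2',
  '²': '^2', '³': '^3',
  '∞': '\\infty', '∑': '\\sum', '∫': '\\int',
};

function unicodeToLatex(text) {
  let out = '';
  for (const ch of text) {
    const latex = UNICODE_TO_LATEX[ch];
    if (latex === undefined) {
      out += ch; // unmapped characters pass through unchanged
      continue;
    }
    // A control word such as \gamma needs a separator so that a following
    // letter is not absorbed into the command name.
    out += latex.startsWith('\\') ? latex + ' ' : latex;
  }
  return out;
}

console.log(unicodeToLatex('γx₁²')); // \gamma x_1^2
```

This simple version does not group consecutive subscript digits, so an input like x₁₂ would need extra handling to become x_{12}.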
MathML to DOCX Conversion
scripts/mathml-to-docx.js converts MathML (from temml) to docx Math components:
- Fractions: <mfrac> → MathFraction
- Subscripts: <msub> → MathSubScript
- Superscripts: <msup> → MathSuperScript
- Combined: <msubsup> → MathSubSuperScript
- Radicals: <msqrt>, <mroot> → MathRadical
- Summation: <munderover> with ∑ → MathSum
- Integrals: <munderover> with ∫ → MathIntegral
- Matrices: <mtable> → MathMatrix
Workflow for New Conversions
- Copy new_doc.js template to your project
- Install dependencies: npm install docx temml fast-xml-parser
- Edit CONTENT section with your document structure
- Use helper functions:
  - h1/h2/h3(text) - Headings with auto-numbering
  - body(text) - Body paragraph (auto-detects math/citations)
  - formula(latex, number) - Block formula
  - threeLineTable(headers, rows, colWidths) - Three-line table
  - tableCaption/figCaption(label) - Captions
  - pageBreak() - Page break (Issue 10)
  - ref(text) - Reference entry
- Run: node new_doc.js