docx-shell-extract

Extract text from DOCX files using shell commands when python-docx is unavailable

Install command

npx skills add https://github.com/HKUDS/OpenSpace --skill docx-shell-extract

Copy and paste this command into Claude Code to install the skill

Source

HKUDS/OpenSpace

Stars: 6,420

Forks: 792

Updated: March 24, 2026 at 08:03

name	docx-shell-extract
description	Extract text from DOCX files using shell commands when python-docx is unavailable

DOCX Shell Extraction

When to Use This Skill

Use this pattern when you need to read or extract text from Microsoft Word (.docx) files in constrained environments where:

The python-docx library is not available
You cannot install additional Python packages
You need a quick, reliable shell-based solution

Core Technique

DOCX files are ZIP archives containing XML files. The main document content is stored in word/document.xml. You can extract and parse this using standard shell tools.

Step-by-Step Instructions

Step 1: Extract the document.xml content

unzip -p filename.docx word/document.xml

The -p flag writes the file's contents to stdout instead of extracting it to disk.

Step 2: Strip XML tags to get plain text

unzip -p filename.docx word/document.xml | sed 's/<[^>]*>//g'

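This removes all XML tags, leaving the text content. Word typically writes document.xml with few or no line breaks, so the result usually comes out as one long line.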
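Stripping every tag also removes the markers that separate paragraphs. A minimal sketch of one way to keep them, assuming each paragraph ends with a closing </w:p> tag (the WordprocessingML paragraph element). The function name docx_xml_to_paragraphs is hypothetical, and awk is swapped in for sed here because its "\n" replacement behaves the same across implementations. It reads XML on stdin, so it slots in after unzip -p.

```shell
# Hypothetical helper: turn each closing </w:p> into a line break, then strip tags.
docx_xml_to_paragraphs() {
    awk '{
        gsub(/<\/w:p>/, "\n")    # paragraph end becomes a newline
        gsub(/<[^>]*>/, "")      # drop every remaining tag
        print
    }'
}

# Usage (filename.docx is a placeholder):
#   unzip -p filename.docx word/document.xml | docx_xml_to_paragraphs
```

Some awk implementations limit record length, so very large documents may need a different tool.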

Step 3: Clean up entities and whitespace (optional)

For cleaner output, add more sed stages:

unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed 's/&[^;]*;//g' | \
  sed 's/^[[:space:]]*//' | \
  sed 's/[[:space:]]*$//' | \
  sed '/^$/d'

This removes:

XML tags
XML entities (like &amp;, &lt;); these are deleted, not decoded, so "R&amp;D" comes out as "RD"
Leading/trailing whitespace
Empty lines
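A hedged alternative to deleting entities is to decode the five predefined XML entities, which keeps the characters they stand for. This has to run after the tags are stripped, since decoding first would let a decoded < be mistaken for the start of a tag. The function name decode_xml_entities is hypothetical, and numeric references such as &#8217; are not handled in this sketch.

```shell
# Hypothetical helper: decode the five predefined XML entities on stdin.
# &amp; is handled last so that "&amp;lt;" ends up as the literal text "&lt;".
decode_xml_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Usage (filename.docx is a placeholder):
#   unzip -p filename.docx word/document.xml | sed 's/<[^>]*>//g' | decode_xml_entities
```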

Step 4: Save to a text file (optional)

unzip -p filename.docx word/document.xml | \
  sed 's/<[^>]*>//g' > output.txt

Complete Example

# Extract text from a Word document
DOCX_FILE="report.docx"
OUTPUT_FILE="report_text.txt"

unzip -p "$DOCX_FILE" word/document.xml | \
  sed 's/<[^>]*>//g' | \
  sed 's/&[^;]*;//g' | \
  sed '/^$/d' > "$OUTPUT_FILE"

echo "Extracted text saved to $OUTPUT_FILE"

Verification

After extraction, verify the content was captured:

# Check if output file has content
if [ -s "$OUTPUT_FILE" ]; then
    echo "Successfully extracted $(wc -l < "$OUTPUT_FILE") lines"
    head -5 "$OUTPUT_FILE"
else
    echo "Warning: Output file is empty"
fi
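One gap in the check above: a pipeline reports the exit status of its last command, so a failed unzip goes unnoticed until the empty-file test. The sketch below captures unzip's output first and checks its status. It assumes any non-zero status counts as failure, which may be too strict for archives that only trigger warnings. The function name extract_docx_checked and its return codes are hypothetical.

```shell
# Hypothetical helper: extract text, failing loudly if the archive cannot be read.
# Returns 1 if the file is unreadable, 2 if word/document.xml cannot be extracted.
extract_docx_checked() {
    docx=$1
    [ -r "$docx" ] || { echo "Error: cannot read $docx" >&2; return 1; }

    # Capture the XML first so the exit status of unzip is not masked by sed.
    xml=$(unzip -p "$docx" word/document.xml 2>/dev/null) || {
        echo "Error: could not read word/document.xml from $docx" >&2
        return 2
    }

    printf '%s\n' "$xml" | sed 's/<[^>]*>//g' | sed '/^$/d'
}

# Usage (report.docx is a placeholder):
#   extract_docx_checked report.docx > report_text.txt || echo "Extraction failed"
```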

Limitations

This method extracts raw text without formatting
Table cell text is kept but runs together without row or column structure; layout and images are dropped
Characters written as XML entities (such as & and <) are lost when the entity-stripping step is used
Works best for text-heavy documents

Alternatives to Explore

If this approach fails or the DOCX structure differs:

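Check that word/document.xml exists: unzip -l filename.docx | grep document.xml
Some documents may store the main content under a different name in word/; unzip -l filename.docx lists every part, including headers and footnotes (for example word/header1.xml and word/footnotes.xml)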
Consider pandoc if available: pandoc filename.docx -t plain
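The choice between pandoc and the unzip pipeline can be scripted. A minimal sketch, assuming pandoc is preferred whenever it is on PATH; the function names choose_docx_extractor and extract_docx_any are hypothetical.

```shell
# Hypothetical helper: report which extraction tool is available.
choose_docx_extractor() {
    if command -v pandoc >/dev/null 2>&1; then
        echo pandoc
    elif command -v unzip >/dev/null 2>&1; then
        echo unzip
    else
        echo none
        return 1
    fi
}

# Hypothetical wrapper: extract text with whichever tool was found.
extract_docx_any() {
    case "$(choose_docx_extractor)" in
        pandoc) pandoc "$1" -t plain ;;
        unzip)  unzip -p "$1" word/document.xml | sed 's/<[^>]*>//g' ;;
        *)      echo "Error: neither pandoc nor unzip found" >&2; return 1 ;;
    esac
}

# Usage (filename.docx is a placeholder):
#   extract_docx_any filename.docx > output.txt
```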

