chem-data-extractor

Name: Chem Data Extractor
Author: InternScience

Description: Extracts structured chemical compound characterization data from chemistry supplementary material documents (PDF/Markdown). Use when Kimi needs to extract compound properties including NMR spectra, HRMS, HPLC data, melting points, optical rotation, and yield information from chemistry research papers or supplementary materials. Supports both single compound extraction and batch extraction of all compounds.

stars:48

forks:11

updated: March 27, 2026, 08:51

Chemistry Data Extractor

Extracts structured chemical characterization data from chemistry supplementary materials and returns it in strict JSON format.

Supported Data Fields

compound_name: Full IUPAC or common name (including stereochemistry if given)
structure_image_description: Brief description of the molecular structure
physical_state: e.g., "white solid", "colorless oil"
mass_obtained: in mg
yield_percent: as number only
melting_point_range: in °C, as string like "126.6–127.3"
rf_value: Rf value and solvent system
optical_rotation: [α]D²⁵ value, concentration, solvent
hplc_conditions: column, mobile phase, flow rate, wavelength, retention times (major/minor), ee%
nmr_1H: frequency, solvent, chemical shifts with multiplicity and coupling constants
nmr_13C: frequency, solvent, chemical shifts with notes (e.g., d, JCF)
nmr_19F: frequency, solvent, chemical shift
hrms_data: ion type, calculated m/z, found m/z, formula
racemic_sample_note: if mentioned
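
The field list above can be written down as a typed record. The following is a minimal sketch, not the skill's own code: the class names are hypothetical, the nested key names are taken from the sample output in Example 2, and every field is optional because not all data is reported for every compound.

```python
from typing import TypedDict


class RetentionTimes(TypedDict, total=False):
    major: str
    minor: str


class HplcConditions(TypedDict, total=False):
    column: str
    mobile_phase: str
    flow_rate: str
    wavelength: str
    retention_times: RetentionTimes
    ee_percent: float


class NmrData(TypedDict, total=False):
    frequency: str
    solvent: str
    chemical_shifts: str  # used by nmr_1H and nmr_13C in the sample output
    chemical_shift: str   # used by nmr_19F in the sample output


class HrmsData(TypedDict, total=False):
    ion_type: str
    calculated_mz: float
    found_mz: float
    formula: str


class CompoundRecord(TypedDict, total=False):
    compound_name: str
    structure_image_description: str
    physical_state: str
    mass_obtained: float      # mg
    yield_percent: float      # number only
    melting_point_range: str  # °C, e.g. "126.6–127.3"
    rf_value: str
    optical_rotation: str
    hplc_conditions: HplcConditions
    nmr_1H: NmrData
    nmr_13C: NmrData
    nmr_19F: NmrData
    hrms_data: HrmsData
    racemic_sample_note: str


# A partial record is valid because every key is optional.
record: CompoundRecord = {
    "compound_name": "(R)-N-Benzoyl-4-iodobenzenesulfonimidoyl fluoride",
    "physical_state": "white solid",
    "yield_percent": 85,
}
```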

Workflow

Step 1: Ask User for Extraction Mode

When the user asks to extract chemistry data, first ask which of the following they want:

A specific compound (provide a compound ID such as "3i" or "1a")
All compounds in a single document
Multiple PDF files processed as a batch (creates a folder for each)
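
As a rough illustration of how the answer maps onto the two scripts, the sketch below builds the command line for each choice. The function name and the mode labels ("compound", "all", "batch") are hypothetical; only the script paths and flags come from this document.

```python
from typing import List, Optional


def build_command(mode: str, source: str, compound_id: Optional[str] = None) -> List[str]:
    """Return the command line for one of the three extraction modes."""
    if mode == "compound":
        if not compound_id:
            raise ValueError("a compound ID such as '3i' is required")
        return ["python", "scripts/extract_chem_data.py", source,
                "-c", compound_id, "--compact"]
    if mode == "all":
        return ["python", "scripts/extract_chem_data.py", source, "--compact"]
    if mode == "batch":
        return ["python", "scripts/batch_extract.py", source]
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned list can be passed to `subprocess.run(..., check=True)` from the skill's root directory.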

Mode 1: Batch Process Multiple PDFs

Processes multiple PDF files at once, creating a separate folder of extracted compounds for each PDF.

Usage

python scripts/batch_extract.py \
    /path/to/pdf_folder \
    -o ./output_folder

Options

-o, --output: Output base directory (default: ./chem_extract_output)
--keep-md: Keep intermediate Markdown files (default: delete them after extraction)
--skip-existing: Skip PDFs that already have output folders
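
To make the defaults concrete, here is a sketch of an `argparse` parser that exposes the same flags. It is not the source of `batch_extract.py`, which is not shown here; the positional name `input_path` is an assumption.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Mirror the documented options; input_path is a hypothetical name."""
    parser = argparse.ArgumentParser(prog="batch_extract.py")
    parser.add_argument("input_path", help="PDF file or folder of PDFs")
    parser.add_argument("-o", "--output", default="./chem_extract_output",
                        help="output base directory")
    parser.add_argument("--keep-md", action="store_true",
                        help="keep intermediate Markdown files")
    parser.add_argument("--skip-existing", action="store_true",
                        help="skip PDFs that already have output folders")
    return parser


# With no -o given, the documented default directory is used.
args = make_parser().parse_args(["./pdfs/", "--keep-md", "--skip-existing"])
print(args.output, args.keep_md, args.skip_existing)
```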

Output Structure

output_folder/
├── batch_summary.json          # Overall summary of all processed PDFs
├── paper1/
│   ├── compounds.json          # All extracted compounds
│   └── summary.json            # Brief summary with compound list
├── paper2/
│   ├── compounds.json
│   └── summary.json
└── ...
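
A downstream script could collect the per-PDF results from this layout as sketched below. The function name is hypothetical, and the sketch assumes each compounds.json holds a JSON array of compound records.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_batch_output(base_dir: str) -> Dict[str, List[dict]]:
    """Map each per-PDF folder name to the records in its compounds.json."""
    results: Dict[str, List[dict]] = {}
    for compounds_file in sorted(Path(base_dir).glob("*/compounds.json")):
        with compounds_file.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Assumption: the file holds a JSON array; wrap a lone object just in case.
        results[compounds_file.parent.name] = data if isinstance(data, list) else [data]
    return results
```

Calling `load_batch_output("./output_folder")` would then return one entry per processed paper, keyed by folder name.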

Example: Batch Process

# Process all PDFs in a folder
python scripts/batch_extract.py ./pdfs/ -o ./extracted_data

# Process a single PDF
python scripts/batch_extract.py ./article.pdf -o ./results

# Keep intermediate files, skip existing
python scripts/batch_extract.py ./pdfs/ --keep-md --skip-existing

Mode 2: Single Document Processing

Processes a single document: either a Markdown file or a PDF that has first been converted to Markdown.

Step 2: Prepare Input File

If the input is a PDF file:

Use the mineru-pdf-converter skill to convert the PDF to Markdown first
Use the generated full.md file as input

If the input is already a Markdown file, use it directly.
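
The decision in this step can be expressed as a small check on the file extension. This is a sketch under the assumption that inputs are identified by suffix alone; the function name is hypothetical.

```python
from pathlib import Path


def needs_pdf_conversion(input_path: str) -> bool:
    """True for PDFs (convert first), False for Markdown (use directly)."""
    suffix = Path(input_path).suffix.lower()
    if suffix == ".pdf":
        return True
    if suffix in {".md", ".markdown"}:
        return False
    raise ValueError(f"unsupported input type: {suffix or 'no extension'}")
```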

Step 3: Extract Data

Use the extraction script to parse the data:

# Extract a specific compound
python scripts/extract_chem_data.py \
    /path/to/full.md -c COMPOUND_ID --compact

# Extract all compounds
python scripts/extract_chem_data.py \
    /path/to/full.md --compact

Options:

-c, --compound: Extract specific compound by ID (e.g., "3i", "1a")
--compact: Remove null/empty fields from output
-o, --output: Save output to file instead of stdout
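
The effect of --compact can be pictured as a recursive pruning pass over the record. The sketch below is an illustration only: the script's exact definition of "empty" is not documented here, so this version assumes it means None, empty strings, and empty containers.

```python
from typing import Any

_EMPTY = (None, "", {}, [])


def prune_empty(value: Any) -> Any:
    """Recursively drop None, empty strings, and empty containers."""
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in _EMPTY}
    if isinstance(value, list):
        cleaned_list = [prune_empty(item) for item in value]
        return [item for item in cleaned_list if item not in _EMPTY]
    return value


record = {
    "yield_percent": 85,
    "rf_value": None,
    "hplc_conditions": {"column": ""},
}
print(prune_empty(record))  # {'yield_percent': 85}
```

Numeric zero and False are kept, since they are values rather than missing data.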

Step 4: Return Results

Output ONLY valid JSON without any extra text, unless the user specifically asks for explanations.
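
One way to check this requirement before returning a response is to confirm that the whole text parses as JSON. This is a minimal sketch with a hypothetical function name.

```python
import json


def is_pure_json(text: str) -> bool:
    """True when the whole response parses as a JSON object or array."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))
```

A response such as `Here is the data: {"a": 1}` fails the check because of the leading prose.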

Examples

Example 1: Batch Process Multiple PDFs

# Process all PDFs in a directory
python scripts/batch_extract.py ./supplementary_pdfs/ -o ./extracted_compounds

# Output structure:
# ./extracted_compounds/
# ├── batch_summary.json
# ├── paper1/
# │   ├── compounds.json
# │   └── summary.json
# └── paper2/
#     ├── compounds.json
#     └── summary.json

Example 2: Extract Single Compound from Markdown

python scripts/extract_chem_data.py full.md -c 3i --compact

Output:

{
  "compound_name": "(R)-N-Benzoyl-4-iodobenzenesulfonimidoyl fluoride",
  "physical_state": "white solid",
  "mass_obtained": 33.2,
  "yield_percent": 85,
  "melting_point_range": "126.6–127.3",
  "rf_value": "0.37 (Pet/EtOAc, 5/1, v/v)",
  "optical_rotation": "[α]D25 = +10.3 (c = 0.75, CHCl3)",
  "hplc_conditions": {
    "column": "CHIRALCEL AY-RH",
    "mobile_phase": "n-hexane/2-propanol = 60/40",
    "flow_rate": "1.0 mL/min",
    "wavelength": "256 nm",
    "retention_times": {
      "major": "9.642 min",
      "minor": "12.955 min"
    },
    "ee_percent": 90
  },
  "nmr_1H": {
    "frequency": "400 MHz",
    "solvent": "CDCl3",
    "chemical_shifts": "δ 8.17–8.09 (m, 2H), 8.06–8.00 (m, 2H)..."
  },
  "nmr_13C": {
    "frequency": "100 MHz",
    "solvent": "CDCl3",
    "chemical_shifts": "δ 170.0, 139.2, 134.3 (d, JCF = 22.0 Hz)..."
  },
  "nmr_19F": {
    "frequency": "376 MHz",
    "solvent": "CDCl3",
    "chemical_shift": "δ 65.3 (s, 1F)"
  },
  "hrms_data": {
    "ion_type": "[M+Na]+",
    "calculated_mz": 411.9275,
    "found_mz": 411.9273,
    "formula": "C13H9FINNaO2S"
  }
}
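
Numeric fields in this output lend themselves to quick sanity checks. As a sketch (both helper names are hypothetical), the HRMS mass error in ppm and the melting range bounds can be derived as follows; the range parser assumes non-negative temperatures.

```python
from typing import Tuple


def ppm_error(calculated_mz: float, found_mz: float) -> float:
    """Mass error in parts per million, relative to the calculated m/z."""
    return (found_mz - calculated_mz) / calculated_mz * 1e6


def parse_melting_range(text: str) -> Tuple[float, float]:
    """Split a range such as "126.6–127.3" (en dash or hyphen) into floats."""
    low, high = text.replace("–", "-").split("-")
    return float(low), float(high)


print(round(ppm_error(411.9275, 411.9273), 2))  # -0.49
print(parse_melting_range("126.6–127.3"))       # (126.6, 127.3)
```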

Example 3: Extract All Compounds from Single File

python scripts/extract_chem_data.py full.md --compact -o all_compounds.json

Output is a JSON array containing all extracted compounds.
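
Once the array is saved, it can be summarized with a few lines of Python. The helper below is a hypothetical sketch that lists the name and yield of every record reporting a yield.

```python
import json
from typing import List, Tuple


def yield_table(json_text: str) -> List[Tuple[str, float]]:
    """Return (compound_name, yield_percent) pairs for records with a yield."""
    records = json.loads(json_text)
    return [
        (record.get("compound_name", "unknown"), record["yield_percent"])
        for record in records
        if record.get("yield_percent") is not None
    ]
```

For example, `yield_table(Path("all_compounds.json").read_text(encoding="utf-8"))` with `pathlib.Path` imported.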

Common Compound ID Patterns

Compound IDs typically follow these patterns:

Numbers: 1, 2, 3
Numbers with letters: 1a, 3i, 5b
Numbers with multiple letters: 1aa, 3ba
Numbers with primes: 1', 2''
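
These four patterns can be captured by a single regular expression. This is a sketch, not the pattern the script uses: it assumes lowercase letters and the ASCII apostrophe as the prime mark.

```python
import re

# Digits, then optional lowercase letters, then optional prime marks.
COMPOUND_ID = re.compile(r"\d+[a-z]*'*")


def is_compound_id(text: str) -> bool:
    """True when the whole string looks like a compound ID."""
    return COMPOUND_ID.fullmatch(text) is not None
```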

Tips

Compound identification: The script looks for section headers like (R)-Compound Name (3i) or # Compound Name (1a)
Data completeness: Some fields may be absent for a given compound; missing fields are null (or omitted with --compact)
Stereochemistry: The script preserves stereochemical descriptors like (R), (S), (±) in compound names
Multiple compounds: When extracting all compounds, the output is a JSON array ordered by appearance in the document
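
To illustrate the compound identification tip, the sketch below splits a header line into its name and trailing ID. The script's real matching rules are not shown in this document, so the pattern and function name are assumptions.

```python
import re
from typing import Optional, Tuple

# Optional Markdown "#" prefix, a name, then a compound ID in trailing parentheses.
HEADER = re.compile(r"^\s*(?:#+\s*)?(?P<name>.+?)\s*\((?P<id>\d+[a-z]*'*)\)\s*$")


def parse_header(line: str) -> Optional[Tuple[str, str]]:
    """Return (compound_name, compound_id), or None if the line is not a header."""
    match = HEADER.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("id")


print(parse_header("(R)-Compound Name (3i)"))  # ('(R)-Compound Name', '3i')
print(parse_header("# Compound Name (1a)"))    # ('Compound Name', '1a')
```

The leading (R) is not mistaken for an ID because an ID must start with a digit, so the stereochemical descriptor stays in the name.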

Related skills (same repository, InternScience/ChemClaw)

gjf-to-xyz: Convert Gaussian gjf input files to XYZ format. Use when the agent needs to convert molecular structure files from Gaussian input format (.gjf) to XYZ format for visualization or use with other computational chemistry software.

mineru-pdf-converter: Convert PDF files to Markdown using the MinerU API. Use when Kimi needs to extract structured text, images, tables, and formulas from PDF documents while preserving document layout and formatting. Supports batch conversion and outputs full.md with images/, JSON metadata, and other extracted assets. Now supports large PDFs (600+ pages) by automatic splitting and merging.

pdf-dft-extractor: Extract DFT calculation coordinates from PDF files and generate Gaussian gjf files. Supports batch processing with separate output folders for each PDF.

ms-spectra-simulation: Predict and visualize MS/MS spectra from a single SMILES using the fioRa online app. Use when the user wants a mass spectrum, MGF/MSP output, or a plotted stick spectrum from SMILES, with optional custom Name, precursor type, collision energy, and instrument settings.

nmr-prediction: Predict liquid-phase ¹H and ¹³C NMR chemical shifts from a SMILES string using NMRNet (deep learning, SE(3)-Transformer). Outputs per-atom shift values (ppm) and Lorentzian-broadened spectrum PNG files.

geometry-optimizer: Optimize the geometry of 3D molecular structures with a semi-empirical method (xTB). Supports automatic SMILES-to-3D conversion and XYZ file input; outputs the optimized coordinates, energy, and convergence status.

package.json

"author": "InternScience"

"repository": "InternScience/ChemClaw"
