bible-epub-processing

Name: Bible Epub Processing
Author: findinfinitelabs

Parse JW.org NWT EPUBs to extract individual Bible verses for verse previews, parallel-corpus building, and Bible-coverage analytics. Use when working with NWT EPUB files, building parallel Chuukese↔English Bible data, or wiring scripture preview features.


updated: May 4, 2026 01:56

SKILL.md


name	bible-epub-processing
description	Parse JW.org NWT EPUBs to extract individual Bible verses for verse previews, parallel-corpus building, and Bible-coverage analytics. Use when working with NWT EPUB files, building parallel Chuukese↔English Bible data, or wiring scripture preview features.

Bible EPUB Processing

JW.org publishes the New World Translation as EPUB. We parse it on demand to fetch individual verses (for preview popups and coverage analytics), and bulk-extract for building parallel training data.

The parser

src/utils/nwt_epub_parser.py — minimal, intentional surface:

from src.utils.nwt_epub_parser import NWTEpubParser

parser = NWTEpubParser("config/data/bible/nwt_chk.epub")
text = parser.get_verse(book_num="40", chapter=1, verse=1)
# → "Chón anūn... Jesus Kraist..." or None if not found

That's it. There is no get_book_list(), get_chapters(), get_verses(), or ParallelCorpusBuilder class — older skill docs invented those. The full implementation:

__init__ loads the EPUB and calls _build_chapter_map().
_build_chapter_map() reads toc.xhtml and per-book biblechapternav<N>.xhtml files to map {book_num_str: {"name": str, "chapters": {chapter_int: chapter_file}}}.
get_verse(book_num, chapter, verse) opens the chapter XHTML, finds the element with id chapter{N}_verse{M}, and returns its text.
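The id lookup in the last step can be sketched with the standard library alone. This is a minimal illustration, not the parser's actual code: find_verse_text and SAMPLE_CHAPTER are hypothetical names, the sample markup is invented, and the sketch assumes the verse text sits inside the element that carries the id.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical sample markup; real chapter XHTML from the EPUB is more complex.
SAMPLE_CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><span id="chapter1_verse1">First verse text.</span>
       <span id="chapter1_verse2">Second <i>verse</i> text.</span></p>
  </body>
</html>"""


def find_verse_text(xhtml: str, chapter: int, verse: int) -> Optional[str]:
    """Return the text of the element with id chapter{N}_verse{M}, or None."""
    target_id = f"chapter{chapter}_verse{verse}"
    root = ET.fromstring(xhtml)
    for element in root.iter():
        if element.get("id") == target_id:
            # itertext() also collects text inside nested tags such as <i>;
            # split/join collapses runs of whitespace.
            return " ".join("".join(element.itertext()).split())
    return None


print(find_verse_text(SAMPLE_CHAPTER, 1, 2))
print(find_verse_text(SAMPLE_CHAPTER, 1, 3))
```

A missing id yields None, which mirrors the contract described for get_verse.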

Book numbers are strings keyed "1" through "66". Pass "40" for Matthew, not 40.
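Callers that hold book numbers as integers may want to coerce them before calling get_verse. The helper below is a hypothetical convenience, not part of NWTEpubParser; the name book_key and the range check are assumptions made for illustration.

```python
def book_key(book) -> str:
    """Coerce an int or numeric string to the string key used by the chapter map."""
    number = int(str(book).strip())
    if not 1 <= number <= 66:
        raise ValueError(f"book number out of range: {book!r}")
    return str(number)


print(book_key(40))     # Matthew, as a string key
print(book_key(" 40 ")) # surrounding whitespace is tolerated
```

Normalizing at the call site keeps the parser itself unchanged, in line with the small public API described above.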

Files & layout

EPUBs live under config/data/bible/ — typically nwt_chk.epub (Chuukese) and nwt_en.epub (English). Not under a top-level data/ directory.
Book number ↔ name mapping: config/bible_books.json.
Reference parsing (text → (book, chapter, verse)): config/scripture_books.json + src/utils/scripture_parser.py. See the scripture-reference-parsing skill.

Backend wiring

The parsers are lazily initialized — they're slow to construct (~1–2s per EPUB) so the app holds a singleton:

Module-level slot + lazy getter in app.py, one pair for Chuukese and one for English.
Verse preview API: GET /api/scripture/preview?ref=... (app.py) — returns CHK + EN side-by-side.
Bible coverage analytics: GET /api/bible/coverage (app.py) — summarises which verses appear how often across the corpus.
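The "module-level slot + lazy getter" pattern can be sketched as follows. This is an assumption-laden illustration rather than the code in app.py: StubParser stands in for NWTEpubParser so the example runs without an EPUB, and the names get_chk_parser, _chk_parser, and _chk_lock are hypothetical. Whether the real getter uses a lock is not stated in this document.

```python
import threading


class StubParser:
    """Stand-in for NWTEpubParser; counts how often it is constructed."""

    constructed = 0

    def __init__(self, path: str):
        type(self).constructed += 1
        self.path = path


_chk_parser = None              # module-level slot, empty until first use
_chk_lock = threading.Lock()    # assumption: guards concurrent first calls


def get_chk_parser() -> StubParser:
    """Return the shared Chuukese parser, constructing it on first use."""
    global _chk_parser
    if _chk_parser is None:
        with _chk_lock:
            if _chk_parser is None:  # re-check after acquiring the lock
                _chk_parser = StubParser("config/data/bible/nwt_chk.epub")
    return _chk_parser


first = get_chk_parser()
second = get_chk_parser()
print(first is second, StubParser.constructed)
```

The slow constructor runs once per process, and every later call returns the same object.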

Anti-scrape robots policy on Bible endpoints: app.py. When you add a new Bible-related endpoint, route it through the same gate.

Building parallel data

The "parallel corpus" is built ad hoc by iterating over book/chapter/verse triples and calling get_verse on both EPUBs. There is no class for it; the recipe lives in scripts/extract_brochure_sentences.py (for brochures) and in analogous inline code under scripts/. If you need a reusable parallel builder, write a new one, but don't put it on NWTEpubParser itself: the parser's public API should stay tiny.

from src.utils.nwt_epub_parser import NWTEpubParser

# Construct each parser once; construction is slow.
chk = NWTEpubParser("config/data/bible/nwt_chk.epub")
en = NWTEpubParser("config/data/bible/nwt_en.epub")

pairs = []
for book_num, book in chk.book_chapter_map.items():
    for ch_num in book["chapters"]:
        # 200 is only an upper bound; the loop stops at the first missing verse.
        for v_num in range(1, 200):
            chk_text = chk.get_verse(book_num, ch_num, v_num)
            if chk_text is None:
                break
            en_text = en.get_verse(book_num, ch_num, v_num)
            if en_text:  # skip verses missing from the English EPUB
                pairs.append({"chk": chk_text, "en": en_text, "ref": f"{book['name']} {ch_num}:{v_num}"})

Feed pairs into AITrainingDataGenerator.export_training_data(...) for the standard jsonl/huggingface/ollama outputs.
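If a reusable builder is wanted, the loop above could be wrapped in a standalone generator that lives outside the parser. The sketch below is one possible shape, not existing project code: iter_parallel_pairs, FakeParser, and the max_gap parameter are hypothetical. Unlike the loop above, it tolerates a short run of missing verse numbers before giving up on a chapter; whether the EPUBs actually contain such gaps is an assumption.

```python
from typing import Dict, Iterator, Optional


def iter_parallel_pairs(chk, en, max_verse: int = 200, max_gap: int = 3) -> Iterator[Dict[str, str]]:
    """Yield aligned verse pairs from two parser-like objects.

    Both objects need get_verse(book_num, chapter, verse); chk also needs
    a book_chapter_map attribute shaped like the one used in the loop above.
    """
    for book_num, book in chk.book_chapter_map.items():
        for ch_num in book["chapters"]:
            misses = 0
            for v_num in range(1, max_verse + 1):
                chk_text = chk.get_verse(book_num, ch_num, v_num)
                if chk_text is None:
                    misses += 1
                    if misses > max_gap:
                        break  # a long run of misses is treated as end of chapter
                    continue
                misses = 0
                en_text = en.get_verse(book_num, ch_num, v_num)
                if en_text:
                    yield {"chk": chk_text, "en": en_text,
                           "ref": f"{book['name']} {ch_num}:{v_num}"}


class FakeParser:
    """In-memory stand-in for NWTEpubParser, for illustration only."""

    def __init__(self, verses, book_chapter_map):
        self.verses = verses
        self.book_chapter_map = book_chapter_map

    def get_verse(self, book_num, chapter, verse) -> Optional[str]:
        return self.verses.get((book_num, chapter, verse))


BOOK_MAP = {"40": {"name": "Matthew", "chapters": {1: "ch1.xhtml"}}}
chk = FakeParser({("40", 1, 1): "chk one", ("40", 1, 3): "chk three"}, BOOK_MAP)
en = FakeParser({("40", 1, 1): "en one", ("40", 1, 3): "en three"}, BOOK_MAP)

pairs = list(iter_parallel_pairs(chk, en))
print([p["ref"] for p in pairs])
```

Setting max_gap=0 reproduces the stop-at-first-None behaviour of the original loop.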

Pitfalls

get_verse returns None when the verse doesn't exist (end of chapter, missing book). Always handle None — don't assume continuous numbering.
Both EPUBs must use the same JW.org schema. Older NWT EPUBs use id="v{N}" instead of id="chapter{N}_verse{M}" — the parser will silently return None for those. If extraction is empty, dump the chapter HTML and check the id format.
Constructing the parser is heavy. Always use the singleton accessors in app.py; never construct NWTEpubParser(...) per request.
The TOC parsing key (href.startswith("biblechapternav")) is brittle — JW.org occasionally changes filenames. If a new EPUB build breaks parsing, inspect book.get_items() filenames first.
Verse text may contain footnote markers (e.g. *, +). Strip them before training data generation; preview UI keeps them.
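Two of the pitfalls above lend themselves to small helpers: checking which id scheme a dumped chapter uses, and stripping footnote markers before generating training data. Both functions below are hypothetical sketches. The regexes assume the ids look exactly like chapter{N}_verse{M} or v{N} with plain digits, and that * and + never occur as genuine verse text.

```python
import re
from collections import Counter

FOOTNOTE_MARKERS = re.compile(r"[*+]")

ID_PATTERNS = {
    "chapter_verse": re.compile(r'id="chapter\d+_verse\d+"'),
    "legacy_v": re.compile(r'id="v\d+"'),
}


def strip_footnote_markers(text: str) -> str:
    """Remove * and + markers and collapse the whitespace they leave behind."""
    return " ".join(FOOTNOTE_MARKERS.sub("", text).split())


def count_id_schemes(chapter_html: str) -> Counter:
    """Count verse ids per scheme in a dumped chapter, to spot older EPUBs."""
    return Counter({name: len(pattern.findall(chapter_html))
                    for name, pattern in ID_PATTERNS.items()})


print(strip_footnote_markers("In the beginning* God+ created"))
print(count_id_schemes('<span id="v1"></span><span id="v2"></span>'))
```

A chapter whose count is zero for chapter_verse and non-zero for legacy_v would explain an empty extraction.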

package.json

"author": "findinfinitelabs"

"repository": "findinfinitelabs/chuuk"
