| name | local-paper-index |
| description | Build a searchable Asta document index from a collection of local PDF files. Use when the user asks to "index these PDFs", "put these PDFs into an Asta index", "index these PDFs using Asta", "build a local paper index", "make these PDFs searchable", or "create an index from these papers". |
| metadata | {"internal":true} |
| allowed-tools | Bash(asta pdf-extraction *) Bash(asta documents *) Bash(python3 *) Bash(bash *) Bash(find *) Bash(ls *) Bash(wc *) Bash(du *) Bash(mv *) Bash(cp *) Bash(cat *) Bash(mkdir *) Read(*) Write(*) Glob(*) |
Local PDF Index Builder
Build a searchable Asta document index from a collection of local PDF files. Each PDF is converted to markdown, split into ~2000-character chunks, and written as documents to an asta-documents YAML index, enabling semantic search across the full text of the collection.
Installation
This skill requires the asta CLI:
PLUGIN_VERSION=0.17.1
if [ "$(asta --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+')" != "$PLUGIN_VERSION" ]; then
  uv tool install --force git+https://github.com/allenai/asta-plugins.git@v$PLUGIN_VERSION
fi
Prerequisites: Python 3.11+ and the uv package manager.
Assets
This skill includes standalone scripts in the assets/ directory:
| Script | Purpose |
|---|---|
| assets/extract-pdfs.sh | Convert PDFs to markdown via asta pdf-extraction remote |
| assets/chunk-and-index.py | Chunk markdown files and write the YAML index directly |
| assets/warm-cache.sh | Run an initial search to build the search cache |
Locate the assets directory relative to this skill file. The scripts are self-contained and can be copied to the working directory or run in place.
Procedure
Step 0: Interview the user for paths and collection name
Before starting, ask the user for four things:
- PDF directory: Where are the PDFs?
- Markdown output directory: Where should extracted markdown go?
- Collection name: A short label for the collection (e.g., my-papers, cs-reading-list).
- Include images: Whether to extract and save images embedded in the PDFs alongside the markdown. Images are useful for papers with figures/diagrams but increase storage. Default: no.
Suggest a directory layout where the PDFs and markdown live as siblings under a common parent, with the index file alongside them. For example, if the PDFs are at /data/papers/pdfs/:
/data/papers/              # DATASET_ROOT: parent of everything
├── pdfs/                  # PDF_DIR (user already has this)
├── markdown/              # MARKDOWN_DIR (suggested: sibling of pdfs/)
│   ├── paper1.md          # without --images: flat .md files
│   ├── paper2.md
│   ├── paper3/            # with --images: per-PDF subdirectories
│   │   ├── paper3.md
│   │   └── img-0.jpeg
│   └── ...
└── index.yaml             # INDEX_PATH (auto-created here)
This layout matters because the index stores relative paths to the markdown files. Keeping everything under one root makes the index portable and git-friendly.
Key rules for the suggestion:
- The markdown directory should be adjacent to the PDF directory, not inside .asta.
- DATASET_ROOT is the parent directory that contains both PDF_DIR and MARKDOWN_DIR.
- The index lives at DATASET_ROOT/index.yaml.
- If the user's PDFs are at /home/user/research/pdfs, suggest /home/user/research/markdown and /home/user/research/index.yaml.
Once the user confirms (or provides their own paths), set these variables:
PDF_DIR="/data/papers/pdfs"
MARKDOWN_DIR="/data/papers/markdown"
DATASET_ROOT="/data/papers"
INDEX_PATH="$DATASET_ROOT/index.yaml"
COLLECTION="my-papers"
IMAGES=false
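The layout rules above can be sketched in Python. This is a minimal illustration, not part of the skill's assets; the helper name `suggest_layout` is hypothetical, and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def suggest_layout(pdf_dir: str) -> dict:
    """Suggest sibling paths for a PDF directory (hypothetical helper)."""
    pdf_path = PurePosixPath(pdf_dir)
    # DATASET_ROOT is the parent that holds pdfs/, markdown/ and index.yaml.
    dataset_root = pdf_path.parent
    return {
        "PDF_DIR": str(pdf_path),
        "MARKDOWN_DIR": str(dataset_root / "markdown"),
        "DATASET_ROOT": str(dataset_root),
        "INDEX_PATH": str(dataset_root / "index.yaml"),
    }


layout = suggest_layout("/home/user/research/pdfs")
print(layout["MARKDOWN_DIR"])  # /home/user/research/markdown
print(layout["INDEX_PATH"])    # /home/user/research/index.yaml
```

The suggestion is only a starting point; the user's own paths take precedence if they provide them.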
Step 1: Discover PDFs and show estimates
PDF_COUNT=$(find "$PDF_DIR" -name "*.pdf" -type f | wc -l)
TOTAL_SIZE=$(find "$PDF_DIR" -name "*.pdf" -type f -exec du -ch {} + | tail -1 | awk '{print $1}')
echo "Found $PDF_COUNT PDFs ($TOTAL_SIZE total)"
Present this estimate to the user before proceeding:
| Metric | Estimate |
|---|---|
| PDFs found | N files |
| Total size on disk | X MB |
| Extraction time | ~2-5 min per 10-page PDF (remote API); faster with olmocr for batches >20 |
| Chunking + indexing | ~1-2 seconds per PDF |
| Index storage | ~2-3x the extracted text size (markdown files + YAML with chunk text) |
| Cache warm-up | 5-30 seconds (one-time, after indexing) |
| Total estimated time | Dominated by extraction: roughly N_papers x 3 min |
Ask the user to confirm before starting, especially for large collections (>20 PDFs).
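For callers who prefer Python over find and du, the discovery step and the rough total from the table can be sketched as follows. The function names are hypothetical, and the 3-minute default simply restates the rule of thumb above rather than a measured figure.

```python
from pathlib import Path


def discover_pdfs(pdf_dir: str) -> tuple[int, int]:
    """Return (number of PDFs, total size in bytes) found recursively under pdf_dir."""
    pdfs = [p for p in Path(pdf_dir).rglob("*.pdf") if p.is_file()]
    return len(pdfs), sum(p.stat().st_size for p in pdfs)


def rough_extraction_minutes(pdf_count: int, minutes_per_pdf: float = 3.0) -> float:
    """Rough total extraction time, assuming ~3 minutes per paper (rule of thumb)."""
    return pdf_count * minutes_per_pdf
```

Calling `discover_pdfs(PDF_DIR)` and passing the count to `rough_extraction_minutes` gives the numbers needed to fill in the first and last rows of the table.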
Step 2: Extract PDFs to markdown
# Default: markdown only
bash /path/to/assets/extract-pdfs.sh "$PDF_DIR" "$MARKDOWN_DIR"
# Alternative: also save embedded images (only if the user opted in)
bash /path/to/assets/extract-pdfs.sh --images "$PDF_DIR" "$MARKDOWN_DIR"
Pass --images only if the user opted in during Step 0. When --images is used, each PDF gets its own subdirectory under MARKDOWN_DIR to avoid image filename collisions across PDFs. The chunking script in Step 3 handles both layouts automatically.
The script:
- Skips PDFs whose markdown already exists (resumable)
- Handles large PDFs (>50 pages) by extracting in 50-page increments
- Reports progress and counts
For large batches (>20 PDFs), asta pdf-extraction olmocr with --workers is significantly faster. See the pdf-extraction skill for details.
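The resumable check and the 50-page increments described above can be illustrated with a short sketch. This is not the code in extract-pdfs.sh. Both function names are hypothetical, pages are assumed to be numbered from 1 with inclusive ranges, and the output file names are assumed to follow the layout shown in Step 0.

```python
from pathlib import Path


def page_ranges(total_pages: int, step: int = 50) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges of at most `step` pages."""
    return [
        (start, min(start + step - 1, total_pages))
        for start in range(1, total_pages + 1, step)
    ]


def already_extracted(pdf: Path, markdown_dir: Path) -> bool:
    """True if markdown exists in the flat layout or the per-PDF (--images) layout."""
    flat = markdown_dir / f"{pdf.stem}.md"
    nested = markdown_dir / pdf.stem / f"{pdf.stem}.md"
    return flat.exists() or nested.exists()


print(page_ranges(120))  # [(1, 50), (51, 100), (101, 120)]
```

Checking both layouts means a re-run skips work regardless of whether the earlier run used --images.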
Step 3: Chunk and build index
uv run --with pyyaml python3 /path/to/assets/chunk-and-index.py "$COLLECTION" "$MARKDOWN_DIR" --index-path "$INDEX_PATH"
The --index-path argument is required. The script:
- Computes paths relative to the index file's directory and stores them in the url field, making the index portable across machines
- Reads each markdown file, splits into ~2000-char chunks at paragraph/sentence boundaries
- Writes all documents to the index YAML in a single pass
- Preserves any existing documents in the index (appends, does not overwrite)
- Skips PDFs already indexed for this collection (safe to re-run)
- Each document gets:
  - Shared PDF metadata: source_pdf, collection (in extra)
  - Per-chunk metadata: chunk_index, total_chunks, chunk_chars, chunk_offset, file_chars (in extra)
  - Tags: <collection-name>, pdf-index
Options:
--chunk-size 2000: adjust the chunk size in characters (default 2000)
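To make the chunking and metadata concrete, here is a minimal sketch of boundary-aware splitting and of the per-chunk records. It is an illustration under stated assumptions, not the logic of chunk-and-index.py: the function names are hypothetical, the boundary heuristics are simplified, chunk_index is assumed to start at 0, and the record layout only mirrors the fields named in this document (url, summary, tags, extra).

```python
def chunk_text(text: str, chunk_size: int = 2000) -> list[tuple[int, str]]:
    """Split text into (offset, chunk) pairs of at most chunk_size characters.

    Prefers to cut after the last paragraph break in the window, then after the
    last sentence end, and cuts mid-sentence only when neither is found.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            cut = window.rfind("\n\n")
            if cut <= 0:
                cut = window.rfind(". ")
            if cut > 0:
                end = start + cut + 2  # keep the 2-char separator with this chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def build_documents(md_relpath: str, source_pdf: str, collection: str,
                    text: str, chunk_size: int = 2000) -> list[dict]:
    """Build one record per chunk (assumed schema, for illustration only)."""
    pieces = chunk_text(text, chunk_size)
    return [
        {
            "url": md_relpath,            # path relative to the index file
            "summary": chunk,             # chunk text
            "tags": [collection, "pdf-index"],
            "extra": {
                "source_pdf": source_pdf,
                "collection": collection,
                "chunk_index": i,         # assumed 0-based
                "total_chunks": len(pieces),
                "chunk_chars": len(chunk),
                "chunk_offset": offset,
                "file_chars": len(text),
            },
        }
        for i, (offset, chunk) in enumerate(pieces)
    ]
```

Because every chunk keeps its separator, concatenating the chunks in order reproduces the original markdown, and chunk_offset points at each chunk's start in the file.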
Step 4: Warm the search cache
bash /path/to/assets/warm-cache.sh "$DATASET_ROOT"
The argument is required:
$DATASET_ROOT: the root directory containing index.yaml
This step is required. The first search after indexing builds the internal BM25 + embedding indexes. Without warming, the user's first real search will be unexpectedly slow.
Step 5: Report results
asta documents --root "$DATASET_ROOT" list --tags="$COLLECTION"
asta documents --root "$DATASET_ROOT" show
Tell the user:
- Number of PDFs processed and chunks created
- Dataset root: the DATASET_ROOT path
- Index location: the INDEX_PATH
- Collection tag for filtering: the chosen collection name
- How to search: asta documents --root "$DATASET_ROOT" search --summary="query" --tags="$COLLECTION"
Searching the Index
After building, search across all indexed PDFs:
asta documents --root "$DATASET_ROOT" search --summary="neural network architecture" --tags="my-papers"
asta documents --root "$DATASET_ROOT" search --summary="attention mechanism" --tags="my-papers" --show-scores
asta documents --root "$DATASET_ROOT" search --extra=".source_pdf contains some-paper"
asta documents --root "$DATASET_ROOT" list --tags="my-papers"
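When searches are launched from a script, the argument list can be assembled programmatically. The sketch below uses only the flags shown in the commands above; the helper name `search_command` is hypothetical, and nothing here is part of the asta CLI itself.

```python
import shlex
from typing import Optional


def search_command(dataset_root: str, query: str,
                   collection: Optional[str] = None,
                   show_scores: bool = False) -> list[str]:
    """Build an `asta documents ... search` argument list (hypothetical helper)."""
    cmd = ["asta", "documents", "--root", dataset_root, "search", f"--summary={query}"]
    if collection:
        cmd.append(f"--tags={collection}")
    if show_scores:
        cmd.append("--show-scores")
    return cmd


cmd = search_command("/data/papers", "attention mechanism", "my-papers", show_scores=True)
print(shlex.join(cmd))  # shell-quoted form, useful for showing the user what will run
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues when the query contains spaces or quotes.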
Storage Estimates
| Collection size | Approx. index size | Approx. markdown size |
|---|---|---|
| 10 PDFs (~10 pp each) | 2-5 MB | 1-3 MB |
| 50 PDFs (~10 pp each) | 10-25 MB | 5-15 MB |
| 100 PDFs (~10 pp each) | 20-50 MB | 10-30 MB |
Total storage is roughly 2-3x the extracted text (markdown files + index YAML with chunk text in the summary field).
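Once extraction has finished, the 2-3x rule of thumb can be applied to the actual markdown size. This is a rough sketch with hypothetical function names; it assumes 1 MB = 1,000,000 bytes and counts only .md files, not extracted images.

```python
from pathlib import Path


def markdown_size_mb(markdown_dir: str) -> float:
    """Total size of all .md files under markdown_dir, in MB (10^6 bytes)."""
    total = sum(p.stat().st_size for p in Path(markdown_dir).rglob("*.md") if p.is_file())
    return total / 1_000_000


def estimate_total_storage_mb(markdown_mb: float) -> tuple[float, float]:
    """Apply the 2-3x rule of thumb: markdown files plus the index YAML."""
    return 2 * markdown_mb, 3 * markdown_mb
```

For example, 10 MB of extracted markdown suggests roughly 20-30 MB of total storage.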
Time Estimates
| Stage | Per PDF | Notes |
|---|---|---|
| Extraction (remote) | 2-5 min / 10 pages | API-bound; 50-page limit per call |
| Extraction (olmocr) | 10-20 sec / page, parallel | Better for >20 PDFs |
| Chunking + indexing | 1-2 seconds | Single YAML write, fast |
| Cache warming | 5-30 seconds total | One-time after indexing |
Important Notes
- Warm the cache. The first asta documents search --summary=... builds the search index. Always run the warm-cache script after indexing.
- Chunk size tradeoff. 2000 chars balances search precision with context. Smaller chunks = more precise hits, less context. Larger chunks = more context, diluted relevance.
- Resumable. Both extraction and indexing skip already-processed files. Safe to re-run after interruption.
- Index is append-only. The chunking script preserves existing documents in the index. To rebuild from scratch, delete index.yaml first.
- PyYAML required. The chunking script needs pyyaml. Install with pip install pyyaml or uv pip install pyyaml if not available.
- Relative paths. The index stores relative paths (e.g., markdown/paper.md) so the dataset is portable. This requires the markdown directory to be under the same directory as the index file.
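The relative-path requirement in the last note can be checked with a few lines of Python. This is a sketch of the idea only; the helper name `url_for` is hypothetical and the chunking script may compute the url field differently.

```python
import os


def url_for(markdown_path: str, index_path: str) -> str:
    """Return markdown_path relative to the directory that holds the index file."""
    index_dir = os.path.dirname(os.path.abspath(index_path))
    rel = os.path.relpath(os.path.abspath(markdown_path), index_dir)
    # A path that climbs out of the index directory would not be portable.
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(f"{markdown_path} is not under {index_dir}")
    return rel.replace(os.sep, "/")


print(url_for("/data/papers/markdown/paper.md", "/data/papers/index.yaml"))
```

Rejecting paths that start with `..` surfaces a misplaced markdown directory before any documents are written.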