| name | blockify-integration |
| description | Process documents with Blockify API to create optimized IdeaBlocks for RAG. Search external ChromaDB knowledge bases with 100k+ blocks. Use when processing documentation, creating knowledge bases, improving AI context retrieval, or when user mentions Blockify, IdeaBlocks, or knowledge distillation. |
Blockify Integration Skill
Why This Exists
Problem: Traditional RAG systems chunk documents by character/token count, losing semantic coherence. A 500-token chunk may split a concept mid-sentence, contain unrelated paragraphs, or bury key facts in noise.
Solution: Blockify is a patented distillation platform that transforms raw text into IdeaBlocks—self-contained semantic knowledge units optimized for AI retrieval.
| Metric | Improvement |
|---|---|
| Enterprise Performance | 78X |
| Vector Search Accuracy | 2.29X |
| Dataset Size Reduction | 40X (to ~2.5%) |
| Token Efficiency | 3.09X |
End-to-End Process Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ BLOCKIFY PIPELINE OVERVIEW │
└─────────────────────────────────────────────────────────────────────────────┘
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Source │ │ Blockify │ │ ChromaDB │ │ Search │
│Documents │────▶│ API │────▶│ Vector │────▶│ Query │
│ .md .txt │ │ (ingest) │ │ Store │ │ Results │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │IdeaBlocks│ │ OpenAI │ │
│ │ XML │ │Embeddings│ │
│ └──────────┘ │ 1536-d │ │
│ └──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ DISTILLATION │ │
│ │ (deduplicate) │ │
│ │ │ │
│ │ raw_ideablocks │ │
│ │ ▼ │ │
│ │ distilled_ │ │
│ │ ideablocks │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────┘
Complete Setup (Step-by-Step)
Prerequisites
Step 1: Create Environment File
cd /path/to/blockify-skill-for-claude-code
cat > .env << 'EOF'
BLOCKIFY_API_KEY=blk_your_key_here
OPENAI_API_KEY=sk-your_key_here
EOF
Step 2: Load Environment Variables
IMPORTANT: You must load these before running any script:
export $(cat .env | grep -v '^#' | grep -v '^$' | xargs)
Or add to your shell profile (~/.zshrc or ~/.bashrc):
export BLOCKIFY_API_KEY="blk_your_key_here"
export OPENAI_API_KEY="sk-your_key_here"
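For scripts that cannot rely on the shell having exported the variables, the same `.env` file can be read from Python. The following is a minimal sketch, not the logic of `setup_check.py`: the helper names `parse_env` and `missing_keys` are hypothetical, and the parsing rules (skip blank lines and `#` comments, strip optional quotes) mirror the `grep` filters in the shell one-liner above.

```python
# Hypothetical helpers; the skill's own scripts may load keys differently.
REQUIRED_KEYS = ("BLOCKIFY_API_KEY", "OPENAI_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in the shell-profile form.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    sample = "# keys\nBLOCKIFY_API_KEY=blk_your_key_here\n\nOPENAI_API_KEY=\n"
    print(missing_keys(parse_env(sample)))
```

To apply the values to the current process, something like `os.environ.update(parse_env(open(".env").read()))` would work before calling any code that reads the keys.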
Step 3: Install Dependencies
cd skills/blockify-integration
python3 scripts/setup_check.py --install
Expected output:
[OK] All packages installed
[OK] API keys configured
[--] ChromaDB not initialized (will create on first ingest)
Step 4: Ingest Documents
python3 scripts/ingest_to_chromadb.py /path/to/document.md
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch
What happens:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Read File │───▶│ Chunk │───▶│ Blockify │───▶│ Parse │
│ │ │ (2000 chr) │ │ API │ │ XML │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ Store │◀───│ Dedupe │◀───│ Generate │◀─────────┘
│ ChromaDB │ │ (by ID) │ │ Embeddings │
└─────────────┘ └─────────────┘ └─────────────┘
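The "Dedupe (by ID)" stage can be illustrated with a short sketch. The Important Technical Notes state that IdeaBlock IDs are SHA256 hashes of name + question + answer, and the troubleshooting table shows IDs starting with `ib_`. Everything else here is an assumption made for illustration: the plain concatenation, the 16-character truncation, and the function names `ideablock_id` and `dedupe_by_id` are not taken from `ingest_to_chromadb.py`.

```python
import hashlib


def ideablock_id(name, question, answer):
    """Derive a content-based ID (assumed format: 'ib_' + 16 hex chars)."""
    payload = (name + question + answer).encode("utf-8")
    return "ib_" + hashlib.sha256(payload).hexdigest()[:16]


def dedupe_by_id(blocks):
    """Keep the first block seen for each ID; identical content collapses."""
    unique = {}
    for block in blocks:
        block_id = ideablock_id(
            block["name"], block["critical_question"], block["trusted_answer"]
        )
        unique.setdefault(block_id, block)
    return unique


if __name__ == "__main__":
    block = {"name": "N", "critical_question": "Q?", "trusted_answer": "A."}
    print(len(dedupe_by_id([block, dict(block)])))
```

Because the ID depends only on content, a block extracted twice from overlapping chunks maps to the same key, which is what avoids a duplicate-ID error when the batch is written to the vector store.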
Step 5: Distill (Deduplicate)
Option A: Docker-based (full service)
cd /path/to/blockify-distillation-service
cp .env.example .env
docker-compose up -d
python3 scripts/run_distillation.py
Option B: Direct API (no Docker required)
python3 scripts/distill_chromadb.py
What happens:
┌─────────────────────────────────────────────────────────────────┐
│ DISTILLATION PROCESS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Pass 1: Within-Document Clustering │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Doc A │ │ Doc B │ │ Doc C │ │
│ │ ┌─┐┌─┐ │ │ ┌─┐┌─┐ │ │ ┌─┐┌─┐ │ (cluster similar │
│ │ └─┘└─┘ │ │ └─┘└─┘ │ │ └─┘└─┘ │ blocks per doc) │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Pass 2: Cross-Document Clustering │
│ ┌──────────────────────────────────┐ │
│ │ Compare representatives across │ (find duplicates │
│ │ all documents for global dedup │ across documents) │
│ └──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Pass 3: Merge via Blockify Distill API │
│ ┌─────────┐ ┌─────────┐ │
│ │ Cluster │───▶│ Merged │ (LLM combines similar blocks) │
│ │ 5 blocks│ │ 1 block │ │
│ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
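Passes 1 and 2 can be sketched as two rounds of threshold clustering over embeddings. This is a simplified illustration under stated assumptions, not the algorithm in `distill_chromadb.py`: it uses greedy assignment to the first sufficiently similar cluster, treats the first member of each cluster as its representative, and borrows the 0.8 threshold from the `--threshold 0.8` example elsewhere in this document. Pass 3 (the API merge) is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_clusters(vectors, threshold):
    """Assign each vector to the first cluster whose representative is close."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def two_pass_clusters(blocks, threshold=0.8):
    """blocks: list of (doc_id, embedding). Returns clusters of block indices."""
    # Pass 1: cluster blocks within each document.
    by_doc = {}
    for idx, (doc_id, _) in enumerate(blocks):
        by_doc.setdefault(doc_id, []).append(idx)
    local = []
    for indices in by_doc.values():
        for members in greedy_clusters([blocks[i][1] for i in indices], threshold):
            local.append([indices[m] for m in members])
    # Pass 2: compare one representative per local cluster across documents.
    reps = [blocks[cluster[0]][1] for cluster in local]
    return [
        [idx for g in group for idx in local[g]]
        for group in greedy_clusters(reps, threshold)
    ]
```

Comparing only representatives in the second pass keeps the cross-document step from being quadratic in the total number of blocks, which is one plausible reason for splitting the work into two passes.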
Step 6: Search
python3 scripts/search_chromadb.py "your query" --collection distilled
python3 scripts/search_chromadb.py "your query" --collection raw
python3 scripts/search_chromadb.py "your query" --entity PRODUCT
python3 scripts/search_chromadb.py "your query" --json
Data Flow Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
SOURCE FILES PROCESSING STORAGE
──────────── ────────── ───────
document1.md ─┐
document2.md ─┼──▶ ingest_to_chromadb.py ──▶ raw_ideablocks (ChromaDB)
document3.md ─┤ │ │
... ─┘ │ │
│ ▼
│ distill_chromadb.py
│ │
▼ ▼
Blockify API distilled_ideablocks
(ingest model) │
│ │
▼ ▼
OpenAI Embeddings ◀──────── search_chromadb.py
(text-embedding- (semantic search)
3-small, 1536d)
COLLECTIONS:
┌────────────────────────────────────────────────────────────────────────────┐
│ raw_ideablocks │ Pre-distillation blocks, may have duplicates │
├────────────────────────────────────────────────────────────────────────────┤
│ distilled_ideablocks │ Production-ready, deduplicated (USE THIS) │
└────────────────────────────────────────────────────────────────────────────┘
Core Concept: IdeaBlocks
An IdeaBlock is a complete, self-contained unit of knowledge that answers exactly one question:
<ideablock>
<name>Title describing this knowledge unit</name>
<critical_question>What specific question does this answer?</critical_question>
<trusted_answer>The validated answer (2-3 sentences, complete).</trusted_answer>
<tags>IMPORTANT, TECHNOLOGY, CATEGORY</tags>
<entity>
<entity_name>PRODUCT_NAME</entity_name>
<entity_type>PRODUCT</entity_type>
</entity>
<keywords>keyword1, keyword2, keyword3</keywords>
</ideablock>
Entity types: PRODUCT, ORGANIZATION, PERSON, TECHNOLOGY, CONCEPT, LOCATION, EVENT
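As an illustration of how this structure can be consumed, here is a minimal sketch that parses one IdeaBlock into a dictionary with Python's standard `xml.etree.ElementTree`. The element names come from the example above; the function name `parse_ideablock` and the output field names are assumptions, and the skill's own parser may behave differently (for example when the API returns several blocks or malformed XML).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ideablock>
<name>Title describing this knowledge unit</name>
<critical_question>What specific question does this answer?</critical_question>
<trusted_answer>The validated answer (2-3 sentences, complete).</trusted_answer>
<tags>IMPORTANT, TECHNOLOGY, CATEGORY</tags>
<entity>
<entity_name>PRODUCT_NAME</entity_name>
<entity_type>PRODUCT</entity_type>
</entity>
<keywords>keyword1, keyword2, keyword3</keywords>
</ideablock>"""


def _text(node, tag):
    """Return the stripped text of a child element, or '' if missing."""
    return (node.findtext(tag) or "").strip()


def _split(value):
    """Split a comma-separated field into a clean list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_ideablock(xml_text):
    """Convert one <ideablock> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": _text(root, "name"),
        "critical_question": _text(root, "critical_question"),
        "trusted_answer": _text(root, "trusted_answer"),
        "tags": _split(_text(root, "tags")),
        "entities": [
            {"name": _text(e, "entity_name"), "type": _text(e, "entity_type")}
            for e in root.findall("entity")
        ],
        "keywords": _split(_text(root, "keywords")),
    }


if __name__ == "__main__":
    print(parse_ideablock(SAMPLE)["entities"])
```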
Model Selection
Is the content ordered/sequential (manual, procedure)?
├─ YES → Use `technical-ingest` (preserves order context)
└─ NO → Is this raw source material?
├─ YES → Use `ingest` (creates new IdeaBlocks)
└─ NO → Are these existing IdeaBlocks with duplicates?
└─ YES → Use `distill` (merges similar blocks)
| Model | Input | Output | Use Case |
|---|---|---|---|
| `ingest` | Raw text | New IdeaBlocks | First-time processing |
| `distill` | IdeaBlocks XML | Merged IdeaBlocks | Deduplication |
| `technical-ingest` | Ordered text + context | Sequenced IdeaBlocks | Manuals, procedures |
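The decision tree above can be written as a small function. This is only a sketch of the selection logic as described; the function name `choose_model` and its boolean parameters are hypothetical, and returning `None` when no branch applies is an assumption, since the tree does not say what to do in that case.

```python
def choose_model(is_sequential, is_raw_source, has_duplicate_blocks):
    """Pick a Blockify model name following the decision tree."""
    if is_sequential:
        # Manuals and procedures: order context must be preserved.
        return "technical-ingest"
    if is_raw_source:
        # Unprocessed text: create new IdeaBlocks.
        return "ingest"
    if has_duplicate_blocks:
        # Existing IdeaBlocks with overlap: merge similar blocks.
        return "distill"
    return None  # Assumption: no model applies.


if __name__ == "__main__":
    print(choose_model(False, True, False))
```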
Script Reference
Scripts Overview
scripts/
├── setup_check.py # Verify environment, install deps
├── ingest_to_chromadb.py # Documents → IdeaBlocks → ChromaDB (parallel)
├── search_chromadb.py # Semantic search with OpenAI embeddings
├── distill_chromadb.py # Deduplication (NO Docker required)
├── run_distillation.py # Deduplication (requires Docker service)
├── run_full_pipeline.py # End-to-end: ingest + distill + benchmark (parallel)
├── run_benchmark.py # Compare IdeaBlocks vs chunking, generate HTML report
├── blockify_ingest.py # Documents → JSON (no ChromaDB)
├── blockify_distill.py # JSON → distilled JSON
└── blockify_search.py # Search JSON files
Note: Ingestion scripts use 5 parallel workers by default. To change this, pass the `--parallel N` flag (short form `-p N`) or set the `BLOCKIFY_PARALLEL_WORKERS` environment variable.
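One way such a setting could be resolved and applied is sketched below. The flag names and the default of 5 come from this document; the precedence (flag over environment variable over default), the helper names `resolve_workers` and `process_all`, and the use of a thread pool are assumptions rather than a description of the actual scripts.

```python
import argparse
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_workers(argv=None, env=None):
    """Return the worker count: flag first, then env var, then 5."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-p", "--parallel", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)
    if args.parallel is not None:
        return max(1, args.parallel)
    return max(1, int(env.get("BLOCKIFY_PARALLEL_WORKERS", "5")))


def process_all(chunks, worker, max_workers):
    """Run `worker` over chunks concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))


if __name__ == "__main__":
    workers = resolve_workers()
    print(process_all(["chunk one", "chunk two"], str.upper, workers))
```

Threads suit this kind of workload because each chunk spends most of its time waiting on an HTTP response rather than using the CPU.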
Detailed Script Usage
setup_check.py
python3 scripts/setup_check.py
python3 scripts/setup_check.py --install
ingest_to_chromadb.py
python3 scripts/ingest_to_chromadb.py input.txt
python3 scripts/ingest_to_chromadb.py docs/ --batch
python3 scripts/ingest_to_chromadb.py docs/ --batch -p 10
python3 scripts/ingest_to_chromadb.py docs/ --batch -s
python3 scripts/ingest_to_chromadb.py input.txt -c distilled
search_chromadb.py
python3 scripts/search_chromadb.py "query"
python3 scripts/search_chromadb.py "query" -c distilled
python3 scripts/search_chromadb.py "query" -e PRODUCT
python3 scripts/search_chromadb.py "query" -n 20
python3 scripts/search_chromadb.py "query" --json
distill_chromadb.py (NO Docker)
python3 scripts/distill_chromadb.py
python3 scripts/distill_chromadb.py --threshold 0.8
python3 scripts/distill_chromadb.py --dry-run
Troubleshooting
Common Errors and Solutions
┌────────────────────────────────────────────────────────────────────────────┐
│ ERROR │ CAUSE │ SOLUTION │
├────────────────────────────────────────────────────────────────────────────┤
│ DuplicateIDError │ Same IdeaBlock │ Script handles this │
│ "found duplicates of: ib_..." │ extracted twice │ automatically now │
├────────────────────────────────────────────────────────────────────────────┤
│ InvalidArgumentError │ Embedding model │ Use search_chromadb │
│ "dimension 1536, got 384" │ mismatch │ (fixed in script) │
├────────────────────────────────────────────────────────────────────────────┤
│ BLOCKIFY_API_KEY not set │ Missing env var │ export $(cat .env │
│ │ │ | grep -v '^#' | │
│ │ │ grep -v '^$' | xargs)│
├────────────────────────────────────────────────────────────────────────────┤
│ 429 Rate Limit │ Too many requests │ Script retries with │
│ │ │ exponential backoff │
├────────────────────────────────────────────────────────────────────────────┤
│ Empty output from API │ max_tokens too low │ Use 8000+ tokens │
│ │ │ (default in scripts) │
├────────────────────────────────────────────────────────────────────────────┤
│ ChromaDB not found │ Not initialized │ Run ingest first │
├────────────────────────────────────────────────────────────────────────────┤
│ Distillation service not │ Docker not running │ Use distill_chromadb │
│ available │ OR no Docker │ .py (no Docker) │
└────────────────────────────────────────────────────────────────────────────┘
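The retry behavior mentioned for 429 responses follows a common pattern that can be sketched as follows. This is a generic illustration, not the scripts' implementation: the `RateLimitError` class, the `with_backoff` helper, the retry count, and the base delay are all hypothetical.

```python
import time


class RateLimitError(Exception):
    """Stand-in for whatever the HTTP client raises on a 429 response."""


def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()`; on rate limiting, wait 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Out of retries: surface the error to the caller.
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_request():
        calls["count"] += 1
        if calls["count"] < 3:
            raise RateLimitError("429 Too Many Requests")
        return "ok"

    # A no-op sleep keeps the demo instant.
    print(with_backoff(flaky_request, sleep=lambda seconds: None))
```

In production code a small random jitter is often added to each delay so that parallel workers do not retry in lockstep.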
Important Technical Notes
- Embedding Model Consistency
  - Ingestion uses `text-embedding-3-small` (OpenAI, 1536 dimensions)
  - Search MUST use the same model
  - The `search_chromadb.py` script handles this automatically
- Duplicate Handling
  - IdeaBlock IDs are SHA256 hashes of `name + question + answer`
  - Identical content = identical ID (by design)
  - `ingest_to_chromadb.py` deduplicates within each batch automatically
- Chunking Strategy
  - 2000 characters per chunk
  - 200 character overlap at sentence boundaries
  - Optimal for Blockify API processing
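The chunking strategy can be approximated with the sketch below. It assumes that "sentence boundary" means a `.`, `!`, or `?` followed by whitespace and that the overlap is a fixed 200 characters carried back from the previous cut; the ingestion script may choose its boundaries differently. The function name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each cut is moved back to the last sentence end inside the window
    when one exists, and each new chunk repeats the final `overlap`
    characters of the previous one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last sentence terminator inside the window.
            matches = list(re.finditer(r"[.!?]\s", text[start:end]))
            if matches and matches[-1].end() > overlap:
                end = start + matches[-1].end()
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # Always advances because end - start > overlap.
    return chunks


if __name__ == "__main__":
    sample = "This is a sentence. " * 500
    print([len(chunk) for chunk in chunk_text(sample)])
```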
Configuration
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| BLOCKIFY_API_KEY | Yes | - | API key from console.blockify.ai |
| OPENAI_API_KEY | Yes | - | API key from platform.openai.com |
| IDEABLOCK_DATA_DIR | No | ./data/ideablocks | Data storage directory |
| DISTILL_SERVICE_URL | No | http://localhost:8315 | Distillation service URL |
| BLOCKIFY_PARALLEL_WORKERS | No | 5 | Default parallel workers for ingestion |
API Settings (Do Not Change)
| Parameter | Value | Reason |
|---|---|---|
| max_tokens | 8000 | Minimum for complete blocks |
| temperature | 0.5 | Calibrated for consistency |
| chunk_size | 2000 chars | Optimal input chunking |
Search Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ SEARCH FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
User Query ────▶ │ OpenAI Embedding│ ────▶ Query Vector (1536-d)
│ text-embedding- │
│ 3-small │
└─────────────────┘
│
▼
┌─────────────────┐
│ ChromaDB Query │
│ (cosine sim) │
└─────────────────┘
│
▼
┌─────────────────┐
│ Top-K Results │
│ (no reranker) │
└─────────────────┘
CURRENT LIMITATIONS:
- Single-stage retrieval only (no reranking)
- No hybrid search (vector only, no BM25)
- No query expansion
POTENTIAL IMPROVEMENTS:
- Add cross-encoder reranker for top-100 → top-10
- Implement hybrid search with BM25
- Add query expansion via LLM
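The single-stage retrieval shown in the search flow amounts to ranking stored vectors by cosine similarity to the query vector and keeping the top K. The sketch below shows that ranking in plain Python on toy two-dimensional vectors, with an optional entity-type filter in the spirit of the `--entity` flag. It does not call ChromaDB or OpenAI; the record layout and the function name `top_k` are assumptions made for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, records, k=10, entity_type=None):
    """Return up to k (id, score) pairs, highest similarity first."""
    candidates = (
        r for r in records
        if entity_type is None or r["entity_type"] == entity_type
    )
    scored = ((r["id"], cosine(query_vector, r["embedding"])) for r in candidates)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    records = [
        {"id": "a", "embedding": [1.0, 0.0], "entity_type": "PRODUCT"},
        {"id": "b", "embedding": [0.0, 1.0], "entity_type": "PERSON"},
        {"id": "c", "embedding": [0.7, 0.7], "entity_type": "PRODUCT"},
    ]
    print(top_k([1.0, 0.1], records, k=2))
```

A reranker, as listed under potential improvements, would take a larger K from this stage and reorder it with a more expensive model before truncating.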
Quick Reference Commands
export $(cat /path/to/.env | grep -v '^#' | grep -v '^$' | xargs)
python3 scripts/setup_check.py
python3 scripts/setup_check.py --install
python3 scripts/ingest_to_chromadb.py document.md
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch --parallel 10
python3 scripts/ingest_to_chromadb.py /path/to/docs/ --batch --sequential
python3 scripts/distill_chromadb.py
python3 scripts/run_distillation.py
python3 scripts/search_chromadb.py "your query"
python3 scripts/search_chromadb.py "your query" --collection distilled
python3 scripts/search_chromadb.py "your query" --entity PRODUCT
python3 scripts/search_chromadb.py "your query" --json --limit 5
python3 scripts/run_benchmark.py
python3 scripts/run_benchmark.py --company "My Company"
python3 scripts/run_benchmark.py --config ./config/benchmark_config.yaml
python3 scripts/run_benchmark.py --init-config
open data/reports/benchmark_report_*.html
Required Execution Workflow (For Autonomous/Skill Use)
IMPORTANT: When running Blockify as a skill or autonomous task, you MUST complete ALL steps below in order. Do not skip any step.
┌─────────────────────────────────────────────────────────────────────────────┐
│ REQUIRED EXECUTION STEPS (IN ORDER) │
└─────────────────────────────────────────────────────────────────────────────┘
Step 1: Environment Setup
└─► Verify API keys configured
└─► Run setup_check.py to confirm dependencies
Step 2: Document Ingestion
└─► Run ingest_to_chromadb.py with --batch for directories
└─► Record: file count, block count, any errors
Step 3: Distillation (Deduplication)
└─► Run distill_chromadb.py (no Docker required)
└─► Record: clusters found, blocks merged, reduction %
Step 4: Search Verification
└─► Run at least 3 different test queries
└─► Verify results are relevant (scores > 0.5)
└─► Test both text and JSON output formats
Step 5: Benchmark (REQUIRED - DO NOT SKIP) ◄── MANDATORY
└─► Run: python3 scripts/run_benchmark.py --company "Company Name"
└─► Record all metrics from output:
- Vector Search Accuracy (X improvement)
- Information Distillation (X reduction)
- Aggregate Performance (X)
- Enterprise Performance (X)
- Token Efficiency (X)
- Projected Annual Savings ($X)
└─► Note the report file path for reference
Step 6: Documentation/Changelog
└─► Create or update CHANGELOG.md in target directory
└─► Include ALL metrics from Steps 2-5
└─► Document any errors or issues encountered
└─► Note any confusing steps for documentation improvement
Why Benchmark is Required
The benchmark compares IdeaBlocks performance against traditional chunking methods. Without running the benchmark:
- You cannot quantify the improvement from using Blockify
- You have no baseline for comparison
- The value proposition cannot be demonstrated
Benchmark Output Metrics Explained
| Metric | What It Measures | Good Value |
|---|---|---|
| Vector Search Accuracy | How much closer IdeaBlocks are to query intent vs chunks | > 2.0X |
| Information Distillation | Word count reduction while preserving meaning | > 1.2X |
| Aggregate Performance | Combined accuracy × distillation improvement | > 3.0X |
| Enterprise Performance | Aggregate × scale factor for enterprise workloads | > 40X |
| Token Efficiency | LLM token savings from using IdeaBlocks | > 3.0X |
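The relationships in the table (aggregate = accuracy × distillation, enterprise = aggregate × scale factor) can be checked with a few lines of arithmetic. In this sketch the thresholds are the "Good Value" column above, while the input numbers and the `scale_factor` value are hypothetical, since the document does not state the scale factor that `run_benchmark.py` uses.

```python
# "Good Value" thresholds from the table above.
THRESHOLDS = {
    "vector_search_accuracy": 2.0,
    "information_distillation": 1.2,
    "aggregate_performance": 3.0,
    "enterprise_performance": 40.0,
    "token_efficiency": 3.0,
}


def summarize(accuracy, distillation, token_efficiency, scale_factor):
    """Return {metric: (value, meets_threshold)} for the five metrics."""
    aggregate = accuracy * distillation
    values = {
        "vector_search_accuracy": accuracy,
        "information_distillation": distillation,
        "aggregate_performance": aggregate,
        "enterprise_performance": aggregate * scale_factor,
        "token_efficiency": token_efficiency,
    }
    return {name: (value, value > THRESHOLDS[name]) for name, value in values.items()}


if __name__ == "__main__":
    # Hypothetical inputs, not measured results.
    for name, (value, ok) in summarize(2.5, 1.4, 3.2, 10.0).items():
        print(f"{name}: {value:.2f}X {'OK' if ok else 'below target'}")
```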
Example Session (Complete Workflow)
cd /path/to/blockify-skill-for-claude-code/skills/blockify-integration
cat > ../../.env << 'EOF'
BLOCKIFY_API_KEY=blk_your_key_here
OPENAI_API_KEY=sk-your_key_here
BLOCKIFY_PARALLEL_WORKERS=5
EOF
export $(cat ../../.env | grep -v '^#' | grep -v '^$' | xargs)
python3 scripts/setup_check.py --install
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch
python3 scripts/ingest_to_chromadb.py /path/to/documents/ --batch --parallel 10
python3 scripts/distill_chromadb.py
python3 scripts/search_chromadb.py "what are the key features?" --collection distilled
python3 scripts/search_chromadb.py "product benefits" --collection distilled
python3 scripts/search_chromadb.py "technical specifications" --collection distilled --json
python3 scripts/run_benchmark.py --company "Your Company Name"
open data/reports/benchmark_report_*.html
python3 scripts/search_chromadb.py "important concepts" --json --limit 20 > results.json
Scale Considerations
| Dataset Size | Recommended Approach | Storage | Search Time |
|---|---|---|---|
| < 1,000 blocks | JSON files | ~10 MB | Instant |
| 1K - 10K blocks | ChromaDB, no distill | ~50 MB | < 100ms |
| 10K - 100K blocks | ChromaDB + distill | ~500 MB | < 100ms |
| 100K+ blocks | ChromaDB + distill + FAISS | ~2 GB | < 50ms |
Distillation time estimates (2,000+ blocks):
- Pass 1 (within-document): ~30 seconds
- Pass 2 (cross-document): ~10-15 minutes
- Pass 3 (API merges): ~1-2 seconds per cluster