name	Comparative Estimation Study
description	Compare AI-generated estimations with actual company PDFs to identify price differences and new items. Use when users need to validate estimations against real company quotes.

Comparative Estimation Study

Compare AI-generated baseline estimations with actual company PDFs to identify pricing differences and detect new items.

When to Use This Skill

Use this skill when the user asks about:

Comparing AI estimations with company quotes
Validating baseline estimations against actual PDFs
Finding price differences between baseline and company data
Identifying new items in company estimates
Reconciling estimation data with supplier quotes

Function Signature

def process_comparative_study(file_urls: List[str], output_path: str) -> str:
    """
    Compare baseline JSON estimation with company PDF quotes

    Matching Rules:
    - Two-layer designation matching: string comparison → semantic similarity (embeddings)
    - Pre-filters: number + unit + quantity must match
    - is_new = False: For all items that exist in baseline JSON
    - is_new = True: ONLY for items in PDF with 'number' not in baseline
    - Prices updated if different or if baseline has null values
    - Section totals automatically recalculated

    Args:
        file_urls: List of URLs (first must be JSON baseline, rest are PDFs)
        output_path: Path to save output JSON

    Returns:
        str: Path to output JSON file with is_new flags (includes UUID in filename)

    Raises:
        ValueError: If no JSON or no PDFs found in URLs
        Exception: For download or processing errors
    """

CLI Usage

# Compare baseline JSON with company PDFs
python comparative_study.py \
  --file-urls https://storage.com/baseline.json https://storage.com/quote1.pdf https://storage.com/quote2.pdf \
  --output-path temp_files/temp_project_123/comparison_result.json

# Note: The script automatically appends a UUID to the output filename
# Example output: comparison_result_a1b2c3d4-5678-90ab-cdef-1234567890ab.json

⏱️ Important: When using this skill programmatically (not CLI), ensure timeout is set to at least 3 minutes to allow sufficient time for PDF extraction and semantic matching operations.

Input Format

File URLs (Required Order)

First URL: Baseline JSON (AI estimation output)
Remaining URLs: Company PDF quotes (1 or more)

Baseline JSON Structure

{
  "projectId": "123",
  "projectName": "Construction Project",
  "pricingLines": [
    {
      "type": "line",
      "number": "1.1",
      "designation": "Installation de chantier",
      "unit": "F",
      "quantity": "1",
      "unitPrice": 24000,
      "totalPrice": 24000
    }
  ]
}

PDF Content

Company estimate documents with pricing tables containing:

Designation (item description)
Unit (measurement unit)
Quantity
Unit price
Total price

Matching Logic

Two-Layer Designation Matching

Pre-filters (must pass first):

Number: Must match or either is null (e.g., "1.1" = "1.1")
Unit: Must match or either is null (e.g., "F" = "F")
Quantity: Must match or either is null (e.g., "1" = "1")

Layer 1 - String Matching (Fast):

Designation matches via string comparison (exact or substring)
Normalized text (lowercase, no accents, no special chars)

Layer 2 - Semantic Matching (Fallback):

If string match fails, use Voyage AI embeddings
Computes cosine similarity between designations
Threshold: 0.8 (80% semantic similarity)
Example: "Installation de chantier" ≈ "Installation chantier principal"

Price Update Logic

Baseline has null, PDF has values: Always update with PDF values
Both have values and differ: Update with PDF values
Both have same values: Keep baseline values unchanged

is_new Flag Rules

is_new = False: All items that exist in baseline JSON (by number)
is_new = True: ONLY items in PDF whose 'number' doesn't exist in baseline
Applies to all types: line, section, titre

Special Features

Null value handling: Null/empty values treated as wildcards in matching
Section totals: Automatically recalculated for type="titre" items
Type flexibility: Matches across different types (section → line)

Output Format

Updated JSON with is_new field added to each pricing line:

{
  "projectId": "123",
  "pricingLines": [
    {
      "type": "line",
      "number": "1.1",
      "designation": "Installation de chantier",
      "unit": "F",
      "quantity": "1",
      "unitPrice": 24000,
      "totalPrice": 24000,
      "is_new": false
    },
    {
      "type": "line",
      "number": "1.2",
      "designation": "Signalisation de chantier",
      "unit": "F",
      "quantity": "1",
      "unitPrice": 5000,
      "totalPrice": 5000,
      "is_new": true
    }
  ]
}

Processing Pipeline

Download Files - Fetches JSON and PDFs from URLs
Load Baseline - Parses baseline JSON estimation
Extract PDFs - Uses LlamaExtract to parse PDF pricing tables
Merge PDF Data - Combines pricing lines from all PDFs
Build Baseline Numbers - Collects all 'number' fields from baseline (all types)
Match & Update - Applies two-layer matching (string → embeddings) and updates prices
Add New Items - Adds PDF items whose numbers don't exist in baseline
Recalculate Totals - Updates section/titre totalPrice based on child lines
Save Output - Writes updated JSON with is_new flags (UUID automatically appended to filename)

Error Handling

No JSON URL ValueError
No PDF URLs ValueError
Download failure Exception with URL details
Invalid file format Logs error and skips
PDF extraction failure Exception

Environment Variables

# Required for PDF extraction
LLAMA_CLOUD_API_KEY=llx-...   # For LlamaExtract PDF parsing

# Required for semantic similarity matching
VOYAGE_API_KEY=pa-...          # For Voyage AI embeddings

Examples

Programmatic Usage

from comparative_study import process_comparative_study

file_urls = [
    "https://storage.com/baseline.json",
    "https://storage.com/company_quote1.pdf",
    "https://storage.com/company_quote2.pdf"
]
output_path = "temp_files/temp_project_456/comparison.json"

result = process_comparative_study(file_urls, output_path)
# Returns: "temp_files/temp_project_456/comparison_<uuid>.json"
# Example: "temp_files/temp_project_456/comparison_a1b2c3d4-5678-90ab-cdef-1234567890ab.json"

Notes

URL Order: JSON must be first, followed by PDFs
Multiple PDFs: All PDF pricing lines are merged before matching
Text Normalization: Removes accents, special chars, common French stop words
Price Tolerance: 1% tolerance for floating-point precision
Temporary Files: Auto-cleaned after processing
Designation Priority: Uses PDF designation when is_new = True

name	Comparative Estimation Study
description	Compare AI-generated estimations with actual company PDFs to identify price differences and new items. Use when users need to validate estimations against real company quotes.