| name | online-directory-create-expert |
| description | Build a defensible online directory in any niche using a 7-step, data-first pipeline optimized for agent speed and context health. |
online-directory-create-expert
Outcome
Turn raw listings into a verified, enriched, and deployable directory dataset using a repeatable pipeline that is fast, scalable, and safe for agent context limits.
Context Health Rules (Non-Negotiable)
- Never paste full CSVs or full crawl dumps into the model. Use samples and aggregates.
- Always work in passes: sample first, scale later.
- Keep a stable schema and only add columns, never rename mid-pipeline.
- Store artifacts on disk; feed only the delta to the model.
- Use explicit batch sizes and record which batch was processed.
- Prefer deterministic scripts (pandas, regex) for cleaning; reserve LLMs for classification and extraction.
Input Packet Standard (What to Send the Model)
Use this structure for every step to keep context consistent.
- Sample rows: 20-50 rows max, chosen for ambiguity or edge cases.
- Aggregate stats: counts, top categories, missing-field rates, duplicate rate.
- Errors from last batch: top 5 failure modes with examples.
- Required output schema: exact fields and allowed values.
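A minimal sketch of assembling this packet with pandas. The column names (name, address, website, category) and the sample size are illustrative assumptions, not fixed requirements; adapt them to your cleaned schema.

```python
import pandas as pd

def build_input_packet(df: pd.DataFrame, sample_size: int = 30) -> dict:
    """Assemble the input packet: a small edge-case sample plus aggregate stats."""
    # Prefer ambiguous rows for the sample: missing website or missing category.
    ambiguous = df[df["website"].isna() | df["category"].isna()]
    sample = (ambiguous if len(ambiguous) >= sample_size else df).head(sample_size)

    stats = {
        "row_count": len(df),
        "top_categories": df["category"].value_counts().head(10).to_dict(),
        "missing_field_rates": df.isna().mean().round(3).to_dict(),
        "duplicate_rate": round(df.duplicated(subset=["name", "address"]).mean(), 3),
    }
    return {"sample_rows": sample.to_dict(orient="records"), "aggregate_stats": stats}
```

Send only this packet to the model, never the full DataFrame.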
When to Use This Skill
Use this skill when you need a directory that wins on data quality, not just listings volume.
- You have raw listings (maps providers, vendor exports, industry lists, registries).
- You need to verify niche relevance at scale.
- You want structured enrichment (inventory, images, amenities, service areas).
Inputs You Need
Required:
- Niche definition and decision drivers (price, features, availability, service radius).
- Keyword lists for verification (positive and negative).
Data sources (choose one or more):
- Raw CSVs of listings (multiple files OK).
- Registry or licensing exports.
- Marketplace or association lists.
If the user does not provide data, you must collect it using available sources before proceeding.
Optional:
- Any crawling/extraction tool that can fetch and parse websites at scale.
- Image scraping pipeline or existing image datasets.
Minimal Input Mode (User-Friendly)
The minimum user input is a country and/or region.
If country/region is not provided, infer from available context or choose a reasonable default and state the assumption.
If other inputs are missing, do not block. Generate them yourself using deep research:
- Propose a short list of niches suitable for the region.
- Select one niche and generate the Niche Brief.
- Identify likely data sources (registries, associations, marketplaces, maps).
- Generate keywords, negatives, and trust signals.
Optional Tooling: Crawl4AI (Suggested, Not Required)
If you want a robust crawler, Crawl4AI is a good default. Use this only if it fits your stack.
Install: pip install crawl4ai (check the Crawl4AI docs for any post-install browser setup step).
Smoke test (minimal):
- Run a single known URL and confirm it returns main content and links.
- Verify rate limiting and retry behavior on a small batch.
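A minimal smoke-test sketch, assuming Crawl4AI's documented AsyncWebCrawler interface (arun returning a result with success, markdown, and links); the URL and the 200-character content threshold are placeholders to adjust.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def smoke_test(url: str) -> bool:
    """Fetch one known URL and confirm main content and links come back."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
    content = str(result.markdown or "")
    has_content = len(content) > 200   # illustrative floor for "main content returned"
    has_links = bool(result.links)
    return bool(result.success) and has_content and has_links

if __name__ == "__main__":
    ok = asyncio.run(smoke_test("https://example.com"))  # replace with a known niche URL
    print("smoke test passed" if ok else "smoke test failed")
```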
Tuning checklist:
- Start with low concurrency (1-3) and increase gradually.
- Set timeouts and backoff to avoid bans.
- Keep crawl depth at 1-2 until verification accuracy is stable.
Proceed to data gathering only after the smoke test passes.
Niche Discovery (Pre-step -1)
If you are starting without a niche, do this before Niche Setup.
- Generate 10-20 candidate niches.
- Score each by data availability, decision drivers, data moat potential, and monetization.
- Pick the top 1-3 for quick validation (Quick mode).
Output: a one-page niche brief (definition, decision drivers, trust signals, data sources, keywords).
Deep research requirement:
- Use first-principles research to justify niche choice and data sources.
- Prefer authoritative sources (official registries, government data, industry associations).
Niche Setup (Agnostic, Minimum Viable)
Define these before any scraping or verification. Keep them short and testable.
- Niche definition: one sentence describing what qualifies and what does not.
- Decision drivers: 3-5 factors users care about (price, quality, speed, coverage, features).
- Positive keywords: 5-15 phrases that must appear in a provider description (website or registry text).
- Negative keywords: 5-15 phrases that indicate non-matches.
- Trust signals: reviews, certifications, licenses, registry presence, or verified contact details.
Trust signals operational order:
- Official registries or licensing proof.
- Verified website with clear service pages.
- Verified phone or email tied to the business name.
- Reviews or third-party marketplace profiles.
Success check: run verification on a 50-200 row sample and reach >80 percent precision on matches.
Placeholder Fill (Auto)
Do not edit this skill manually. Before running any prompt, fill placeholders from the Niche Brief:
- {niche}: use the niche definition.
- {pos_1..pos_3}: top 3 positive keywords from the brief.
- {neg_1..neg_2}: top 2 negative keywords from the brief.
If the brief is missing, generate these from the candidate data and trust signals, then proceed.
Execution Modes (Choose One)
Quick (fast validation):
- 1 sample CSV or registry extract, 50-200 rows.
- Run Steps 2-3 only.
- Goal: confirm niche viability and verification accuracy.
If no dataset is provided, collect a small sample first, then proceed.
Standard (default build):
- Full dataset, all 7 steps.
- Enrich only top verified candidates if budget-limited.
Deep (best quality):
- All 7 steps plus manual review and confidence scoring.
- Add a second verification pass and spot-checks.
Quick mode exit criteria:
- At least 50 verified candidates (or 10 percent of sample, whichever is smaller).
- Precision above 80 percent on a 20-row audit (or 10 percent of the sample, whichever is smaller).
- Cleaned schema matches the contract below.
Core Workflow (7 Steps)
Before starting, pick a niche and list decision drivers users care about.
- Step 1: Collect raw data from sources (scraping, registry exports, marketplace lists).
- Step 2: Clean and consolidate CSVs, removing junk and irrelevant businesses.
- Step 3: Verify sources at scale using a crawler or registry plus LLM prompts.
- Step 4: Enrich inventory data (what the business offers, counts, types).
- Step 5: Enrich images with strict filtering rules and edge-case handling.
- Step 6: Extract amenities and features to power filters and comparison.
- Step 7: Extract service areas with location-aware accuracy rules.
Pre-step 0: Deduplicate and normalize identifiers.
- Create a stable business_id (hash of name + address + phone).
- Merge duplicates across sources and keep a provenance column.
- Normalize name, address, and phone before hashing (lowercase, strip punctuation, standardize suite/unit).
- Tie-break rule: keep the record with the most complete fields; merge the rest into provenance.
- If address/phone are missing, use registry_id or name + region + country as the hash inputs.
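A minimal hashing sketch, assuming pandas; the normalization regexes and fallback field names mirror the rules above but are illustrative, not canonical.

```python
import hashlib
import re
import pandas as pd

def _norm(val) -> str:
    """Lowercase, standardize suite/unit wording, strip punctuation, collapse whitespace."""
    if pd.isna(val):
        return ""
    text = str(val).lower()
    text = re.sub(r"\b(suite|ste|unit)\b\.?", "unit", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def business_id(row: pd.Series) -> str:
    """Stable hash of normalized name + address + phone; falls back to
    registry_id or name + region + country when address and phone are missing."""
    name, address, phone = _norm(row.get("name")), _norm(row.get("address")), _norm(row.get("phone"))
    if address or phone:
        key = f"{name}|{address}|{phone}"
    elif _norm(row.get("registry_id")):
        key = _norm(row.get("registry_id"))
    else:
        key = f"{name}|{_norm(row.get('region'))}|{_norm(row.get('country'))}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]

# Usage: df["business_id"] = df.apply(business_id, axis=1)
```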
Prompt Pack (Cleaned Templates)
Replace placeholders using the Niche Brief.
Step 2: Clean and Consolidate CSVs
You are cleaning and consolidating multiple CSVs of {niche} listings into one clean dataset.
Task:
- Merge all CSVs into a single consolidated CSV.
- Remove any row that matches ANY removal criteria below.
- Normalize columns and values (trim whitespace, consistent casing, standardize phone and URLs).
- Preserve registry_id and parent_id if present.
Removal criteria (remove if ANY applies):
- Missing business name or missing street-level address where applicable (administrative area alone is not enough).
- Permanently closed or temporarily closed.
- Not actually a {niche} business based on name or category.
- Clearly unrelated website with no supporting registry/phone evidence.
- Trust signals exist and indicate low quality or non-operational status.
Output:
- cleaned.csv (stable schema)
- removed_rows.csv (include reason)
Schema:
- cleaned.csv: business_id, name, address, city, region, postal_code, country, phone, website, category, source, parent_id (optional), registry_id (optional)
- removed_rows.csv: business_id, name, reason, source
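A hedged pandas sketch of the merge-and-remove pass. File paths, the status column, and the closure phrase are illustrative assumptions; the niche-specific checks come from the Niche Brief, and business_id and source are assumed to exist from Pre-step 0.

```python
import glob
import pandas as pd

# Merge all raw CSVs into one frame and apply light normalization.
frames = [pd.read_csv(path, dtype=str) for path in glob.glob("raw/*.csv")]
df = pd.concat(frames, ignore_index=True)
df["name"] = df["name"].str.strip()
df["phone"] = df["phone"].str.replace(r"[^\d+]", "", regex=True)

def removal_reason(row: pd.Series) -> str:
    """Return a removal reason, or empty string to keep the row."""
    if pd.isna(row.get("name")) or not str(row.get("name")).strip():
        return "missing_name"
    if pd.isna(row.get("address")):
        return "missing_address"
    if "permanently closed" in str(row.get("status", "")).lower():
        return "closed"
    return ""

df["reason"] = df.apply(removal_reason, axis=1)
df[df["reason"] == ""].drop(columns=["reason"]).to_csv("cleaned/cleaned.csv", index=False)
df[df["reason"] != ""][["business_id", "name", "reason", "source"]].to_csv(
    "cleaned/removed_rows.csv", index=False
)
```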
Step 3: Tiered Verification (Crawler + LLM)
Goal: verify that each candidate is a real {niche} provider.
Tier 1: fast keyword filter (website sources only)
- Crawl homepage + top pages when a website exists.
- Positive keywords: {pos_1}, {pos_2}, {pos_3}
- Negative keywords: {neg_1}, {neg_2}
- If there is any reasonable chance it is a {niche} provider, keep it for Tier 2.
Tier 1 (offline/low-web alternative):
- Use registry or licensing lookup when websites are missing.
- Validate via phone or official contact channels where applicable.
- Record evidence_type as registry or phone.
Tier 2: LLM verification
For each candidate source, answer:
- Is this business actually a {niche} provider? (yes/no)
- What products/services indicate this?
- Extract a 1-2 sentence proof summary.
Apply trust signal order (registry > website > verified contact > reviews).
Output:
- verified_candidates.csv with columns: business_id, is_match, confidence, signals_found, proof, evidence_type, source_url, registry_id (optional)
Schema:
- verified_candidates.csv: business_id, is_match (yes/no), confidence (0-1), signals_found, proof, evidence_type, source_url, registry_id (optional)
Routing:
- confidence >= 0.7 -> accept
- confidence 0.5-0.69 -> manual_review.csv
- confidence < 0.5 -> reject
Trust signal override:
- If registry evidence exists, it overrides conflicting website evidence.
Populate parent_id:
- If a brand has multiple locations, set parent_id to the brand entity id.
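A minimal sketch of the Tier 1 keyword pass and the confidence routing defined above. The keep-for-Tier-2 rule (keep unless only negative signals fire) is one reasonable interpretation, not the only one.

```python
def tier1_keyword_filter(page_text: str, positives: list[str], negatives: list[str]) -> dict:
    """Fast keyword pass: keep any candidate with a plausible positive signal for Tier 2."""
    text = page_text.lower()
    hits = [kw for kw in positives if kw.lower() in text]
    misses = [kw for kw in negatives if kw.lower() in text]
    keep = bool(hits) or not misses   # only-negative pages are dropped before Tier 2
    return {"keep_for_tier2": keep, "signals_found": "|".join(hits)}

def route(confidence: float) -> str:
    """Route Tier 2 results by confidence, per the thresholds above."""
    if confidence >= 0.7:
        return "accept"
    if confidence >= 0.5:
        return "manual_review"
    return "reject"
```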
Step 4: Offerings/Inventory Enrichment
Extract offerings/inventory data from verified sources.
For each business:
- Identify product types and counts.
- Capture any pricing or availability signals.
- Record evidence text and source page URL.
Output:
- inventory.csv with columns: business_id, inventory_items, inventory_counts, price_signals, evidence, source_url
Schema:
- inventory.csv: business_id, inventory_items, inventory_counts, price_signals, evidence, source_url
Note:
- Use inventory.csv even for service-based niches; treat inventory_items as offerings.
Step 5: Image Enrichment
Collect only relevant product images.
Rules:
- Use images from pages already scraped.
- Or use provided datasets when pages do not exist.
- Exclude logos, favicons, and unrelated imagery.
- Only include images that clearly show the product type and relevant features.
- Quality check each image before inclusion.
If images are not relevant for the niche, skip this step and note it in logs.
Output:
- images.csv with columns: business_id, image_url, product_type, confidence, source_url
Schema:
- images.csv: business_id, image_url, product_type, confidence (0-1), source_url
Step 6: Amenities and Features
Extract amenities and features into structured columns.
Rules:
- Capture one feature family per row (lighting, climate control, ADA, generators, etc.); pivot to one column per family when building site filters.
- Separate "core" vs "specialty" features if the niche mixes products.
- Normalize inconsistent wording (e.g., "A/C" -> "air_conditioning").
If features are not meaningful for the niche, skip this step and note it in logs.
Output:
- features.csv with standardized boolean or categorical fields.
Schema:
- features.csv: business_id, feature_family, feature_value, evidence, source_url
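A small normalization sketch for feature tokens; the vocabulary entries and length bounds are illustrative and should be rebuilt from the Niche Brief.

```python
# Illustrative vocabulary map; extend per niche from the Niche Brief.
FEATURE_VOCAB = {
    "a/c": "air_conditioning",
    "air con": "air_conditioning",
    "climate controlled": "climate_control",
    "wheelchair accessible": "ada_accessible",
    "backup generator": "generator",
}

def normalize_feature(raw: str) -> str | None:
    """Map a raw feature phrase to a snake_case token; None means drop as junk."""
    token = raw.strip().lower()
    token = FEATURE_VOCAB.get(token, token.replace(" ", "_"))
    # Drop tokens too short or too long to be a real feature value.
    return token if 2 < len(token) <= 40 else None
```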
Step 7: Service Area Extraction
Extract service areas with location-aware accuracy.
Rules:
- Prefer explicit service area pages.
- If a business HQ is in a specific locality/region, do not assign service areas outside that area unless explicitly stated.
- For multi-location businesses, map service areas to each location separately.
Output:
- service_areas.csv with columns: business_id, service_area, scope, evidence, evidence_type, source_url
Schema:
- service_areas.csv: business_id, service_area, scope (locality/city/district/region/state/province/country), evidence, evidence_type, source_url
Confidence Rubric (Use for All LLM Decisions)
- 0.9-1.0: Explicit evidence on the site matching the niche, with product/service detail.
- 0.7-0.89: Strong signals but missing one key detail.
- 0.5-0.69: Ambiguous; keep only if niche requires broad inclusion.
- < 0.5: Exclude or send to manual review.
Batch Protocol (Resumable)
- Split work into numbered batches: batch_001.csv, batch_002.csv.
- Each batch produces its own output file with the same suffix.
- Maintain a pending.csv with unprocessed business_ids.
- After each batch, append results to the master file and log stats.
Status values:
- pending, processing, done, error
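A resumable-queue sketch, assuming pending.csv lives at logs/pending.csv with the business_id, batch_id, status columns from the schema contracts; the batch size is a parameter to match the tuning guide.

```python
import pandas as pd

def next_batch(pending_path: str = "logs/pending.csv", batch_size: int = 100) -> pd.DataFrame:
    """Pop the next batch of pending business_ids and mark them as processing."""
    pending = pd.read_csv(pending_path, dtype=str)
    batch = pending[pending["status"] == "pending"].head(batch_size).copy()
    pending.loc[batch.index, "status"] = "processing"
    pending.to_csv(pending_path, index=False)
    return batch

def mark_done(business_ids: list[str], pending_path: str = "logs/pending.csv") -> None:
    """Flip processed rows to done so a crashed run can resume where it stopped."""
    pending = pd.read_csv(pending_path, dtype=str)
    pending.loc[pending["business_id"].isin(business_ids), "status"] = "done"
    pending.to_csv(pending_path, index=False)
```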
Schema Contracts (Explicit)
cleaned.csv
- business_id, name, address, city, region, postal_code, country, phone, website, category, source, parent_id (optional), registry_id (optional)
removed_rows.csv
- business_id, name, reason, source
verified_candidates.csv
- business_id, is_match, confidence, signals_found, proof, evidence_type, source_url, registry_id (optional)
inventory.csv
- business_id, inventory_items, inventory_counts, price_signals, evidence, source_url
images.csv
- business_id, image_url, product_type, confidence, source_url
features.csv
- business_id, feature_family, feature_value, evidence, source_url
service_areas.csv
- business_id, service_area, scope, evidence, evidence_type, source_url
manual_review.csv
- business_id, reason, suggested_action, evidence, evidence_type, source_url, registry_id (optional)
pending.csv
- business_id, batch_id, status
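A lightweight contract check; only two artifacts are shown and optional columns are deliberately omitted, so extend the map to the remaining files.

```python
import pandas as pd

# Required columns per artifact; optional columns (parent_id, registry_id) are not enforced.
CONTRACTS = {
    "cleaned.csv": ["business_id", "name", "address", "city", "region", "postal_code",
                    "country", "phone", "website", "category", "source"],
    "verified_candidates.csv": ["business_id", "is_match", "confidence", "signals_found",
                                "proof", "evidence_type", "source_url"],
}

def check_contract(path: str, contract_name: str) -> list[str]:
    """Return the missing required columns (an empty list means the contract holds)."""
    columns = set(pd.read_csv(path, nrows=0).columns)
    return [col for col in CONTRACTS[contract_name] if col not in columns]
```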
Field Types and Constraints (Geo-Agnostic)
- business_id: stable hash string.
- parent_id: optional brand/entity id for multi-location businesses.
- registry_id: optional id from official registries/licensing sources.
- name: string, required.
- address: string; required where structured street addresses exist, free-form allowed for non-structured regions.
- city/region/postal_code/country: optional; use region/country fields where applicable.
- phone: string, preserve original formatting; optionally add normalized_phone.
- website/source_url: valid URL or registry/phone reference string, or empty.
- confidence: float 0-1.
- scope: one of locality, city, district, region, state, province, country.
- evidence_type: one of website, registry, phone, marketplace, manual.
Compound Field Format
- inventory_items: pipe-separated list (item1|item2|item3).
- inventory_counts: JSON object string ({"item":"count"}).
- signals_found: pipe-separated list of keyword hits.
- feature_value: use normalized tokens (snake_case).
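A small serialization sketch for the compound fields; the helper name and example values are illustrative.

```python
import json

def serialize_inventory(items: list[str], counts: dict[str, int]) -> dict:
    """Serialize compound fields: pipe-separated items and a JSON object string for counts."""
    return {
        "inventory_items": "|".join(items),
        "inventory_counts": json.dumps({k: str(v) for k, v in counts.items()}),
    }

# Example:
# serialize_inventory(["unit_10x10", "unit_10x20"], {"unit_10x10": 42, "unit_10x20": 8})
# -> {"inventory_items": "unit_10x10|unit_10x20",
#     "inventory_counts": '{"unit_10x10": "42", "unit_10x20": "8"}'}
```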
Website Quality Rules
- If a website is parked, empty, or redirects to an unrelated domain, mark as no_working_site.
- If only social profiles exist, keep only if the niche allows social-only presence.
- Flag marketplace/aggregator pages and require explicit proof of ownership before accept.
- If no website exists, keep only if the region supports phone-only or registry-only businesses.
Heuristics:
- Parked/empty: "domain for sale", "coming soon", no contact/about/service pages.
- Aggregator: lists many unrelated businesses with no operator branding.
- Redirects: if primary domain redirects to another niche, mark as unrelated.
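A heuristic classification sketch based on the rules above; the phrase list, the 200-character floor, and the core-page check are assumptions to tune per region and niche.

```python
def classify_site(page_text: str, resolved_domain: str, primary_domain: str) -> str:
    """Assign a rough site-quality label from page text and domain resolution."""
    text = page_text.lower()
    parked_phrases = ("domain for sale", "coming soon", "this domain is parked")
    if any(p in text for p in parked_phrases) or len(text) < 200:
        return "no_working_site"
    if resolved_domain and resolved_domain != primary_domain:
        return "redirected"   # inspect manually; may point to an unrelated niche
    core_pages = ("contact", "about", "services")
    if not any(p in text for p in core_pages):
        return "no_working_site"
    return "ok"
```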
Crawl Ethics and Politeness
- Respect robots.txt where required.
- Rate-limit crawling and retry with backoff.
- Avoid scraping personal emails or sensitive personal data.
Offline/Low-Web Regions
- If web presence is sparse, accept registry or phone validation as primary evidence.
- Use evidence_type to record registry/phone/website sources.
Data Privacy Note
- Follow local data protection laws; avoid storing personal data unless explicitly permitted.
Handoff Package (Website Creation Starts After This)
Deliver the following package before moving to website creation:
- Niche brief and decision drivers
- Cleaned dataset (cleaned.csv)
- Verified dataset with evidence (verified_candidates.csv)
- Enrichment files (inventory.csv, images.csv, features.csv, service_areas.csv) or explicit skip notes
- Manual review queue (manual_review.csv)
- Batch logs and changelog notes
Stop Conditions (Per Step)
- Step 2: < 5 percent duplicates remain and removal reasons stabilize.
- Step 3: Verification precision > 80 percent on audit sample.
- Steps 4-7: New information rate drops below 10 percent per batch.
Define "new information rate":
- (new non-empty fields added in batch) / (total fields expected in batch)
Define "removal reasons stabilize":
- Top 3 removal reasons change by less than 5 percent across 2 consecutive batches.
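A sketch of both stop-condition metrics, assuming pandas objects for batch outputs and removal reasons; the enrichment column list and the 5 percent tolerance follow the definitions above.

```python
import pandas as pd

def new_information_rate(batch_output: pd.DataFrame, enrichment_cols: list[str]) -> float:
    """(new non-empty fields added in batch) / (total fields expected in batch)."""
    expected = len(batch_output) * len(enrichment_cols)
    filled = int(batch_output[enrichment_cols].notna().sum().sum())
    return filled / expected if expected else 0.0

def removal_reasons_stable(prev: pd.Series, curr: pd.Series, tolerance: float = 0.05) -> bool:
    """Top-3 removal-reason shares change by less than the tolerance across two batches."""
    top = curr.value_counts(normalize=True).head(3)
    prev_share = prev.value_counts(normalize=True)
    return all(abs(top.get(r, 0) - prev_share.get(r, 0)) < tolerance for r in top.index)
```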
Multi-Location Handling
- Split multi-location businesses into separate records per location before Step 7.
- Do not apply service areas across locations without explicit evidence.
- For franchise networks, keep a parent_id and child location rows.
Agent Performance Tuning
Use these constraints to keep runs fast and stable.
- Batch size: 50-200 businesses per prompt for verification and enrichment.
- Crawl depth: start at 1-2 pages, expand only when confidence < 0.7.
- If not crawling websites, skip crawl depth and rely on registry/phone evidence.
- Always write outputs to disk between passes; never keep full data in memory.
- Use a stable business_id as the join key across all files.
- Maintain a "pending.csv" queue for unprocessed rows to allow resumable runs.
Token Budget Guide (Context Health)
- Per call target: <= 2,500 tokens input, <= 1,000 tokens output.
- If larger context is needed, split into smaller batches and summarize results.
Batch summary template:
- Summary stats (counts, missing rates, duplicates).
- Top 10 errors and examples.
- 20 edge rows with expected outputs.
Minimal Schema (Recommended)
- business_id (stable key)
- parent_id (optional)
- registry_id (optional)
- name
- address
- city
- region
- postal_code
- country
- phone
- website
- category
- verified_match (yes/no)
- confidence
Quality Gates
- Sample cleaned rows to verify removal criteria correctness.
- Sample verified rows to check classification accuracy.
- Manually review image sets for relevance and rights risk.
- Review features columns for junk tokens or malformed entries.
- Validate service areas against the listing's HQ locality/region.
- Track false positives and rerun passes with edge-case fixes.
Verification audit protocol:
- Sample 20 rows (or 10 percent of the verified set, whichever is smaller), split evenly between accepted and rejected rows.
- Count false accepts and false rejects.
- If > 20 percent errors, revise keywords or thresholds.
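A small audit-metrics sketch; the predicted/actual_match field names are illustrative, and acting on the 20 percent error threshold stays with the reviewer.

```python
def audit_metrics(audited: list[dict]) -> dict:
    """Each audited row: {"predicted": "accept" | "reject", "actual_match": bool}.
    Returns false-accept/false-reject counts and the overall error rate."""
    false_accepts = sum(1 for r in audited if r["predicted"] == "accept" and not r["actual_match"])
    false_rejects = sum(1 for r in audited if r["predicted"] == "reject" and r["actual_match"])
    error_rate = (false_accepts + false_rejects) / len(audited) if audited else 0.0
    return {"false_accepts": false_accepts, "false_rejects": false_rejects,
            "error_rate": round(error_rate, 3)}
```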
Common Failure Modes and Fixes
- Too many false positives in verification: tighten negative keywords and add Tier 1 thresholds.
- Missing inventory details: expand crawl depth only for low-confidence candidates.
- Noisy features columns: enforce strict vocab mapping and drop rare tokens.
- Service areas spanning outside declared locality/region: split multi-location businesses first.
- Duplicate listings across sources: hash normalized name+address+phone and merge before Step 2.
- Marketplace listings misclassified: require direct operator identifiers (about page, contact email domain).
Multilingual and Regional Variants
- Maintain keyword lists in local languages and common transliterations.
- If the region lacks street addresses, accept landmark-style addresses and use region/country fields.
Dataset Versioning
- Version outputs per run: v1_cleaned.csv, v2_cleaned.csv, etc.
- Never overwrite previous master files; append a changelog note.
Website Build Output (Optional)
If you are building the directory site:
- Define required filters from features.csv and service_areas.csv.
- Generate SEO pages by region and feature.
- Ensure listings link to verified websites only.
- Localize language, currency, and phone formatting to the target region.
Artifacts and File Layout (Suggested)
- raw/: raw CSV inputs
- cleaned/cleaned.csv
- cleaned/removed_rows.csv
- verify/verified_candidates.csv
- verify/manual_review.csv
- enrich/inventory.csv
- enrich/images.csv
- enrich/features.csv
- enrich/service_areas.csv
- logs/pending.csv
- logs/: batch logs and error reports
Notes and Cautions
- Image scraping carries rights risk; consider stock images or explicit permission.
- SEO timelines are long; this is not a fast-revenue strategy.
- Data quality drives trust; prioritize accuracy over speed.