property-listing-extraction

name	property-listing-extraction
description	Use when extracting facts from saved real-estate listing HTML in raw/, cleaning source-specific listing data, filling missing fields in data/heritage_sales.csv through reviewable candidates, proposing supplemental model features, or preserving evidence/conflict flags for property sale-price analysis.
metadata	{"short-description":"Extract and clean saved listing facts"}

Property Listing Extraction

Use this skill for the property-sales repo when saved listing pages under raw/ need to become auditable sale-update candidates and modeling features.

Core Workflow

Inspect data/heritage_sales.csv, data/sales_research_targets.csv, and analysis/sale-price-eda.md before deciding how a field should be used.

Prefer the repo extractor when available:

python scripts/extract_listing_facts.py

Useful options:

python scripts/extract_listing_facts.py --raw-dir raw --output-dir data
python scripts/extract_listing_facts.py --raw-dir raw --output-dir /tmp/listing-extract-check

Treat outputs as review artifacts, not accepted sale data:
- data/listing_raw_facts.jsonl: source-specific facts with labels and provenance.
- data/listing_sale_update_candidates.csv: candidate rows shaped like heritage_sales.csv.
- data/listing_model_features.csv: supplemental cleaned features and leakage classification.
- data/listing_extraction_issues.csv: conflicts, zero dimensions, swapped coordinates, duplicate matches, and other review flags.

Do not mutate data/heritage_sales.csv directly unless the user explicitly asks to accept candidate values. When accepting values, preserve prior notes and add a short human-review note.
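
One way to accept reviewed values without losing prior notes is sketched below. It is a minimal illustration, not the repo's implementation: the function name accept_candidate, the Notes and Year_built column names, and the note wording are all assumptions. It fills only fields that are currently empty and appends a dated review note after any existing note.

```python
from datetime import date


def accept_candidate(accepted_row, candidate, fields, reviewer, today=None):
    """Copy reviewed candidate values into a row without losing prior notes.

    Column names such as "Notes" are hypothetical; adjust to the real CSV header.
    """
    today = today or date.today().isoformat()
    updated = dict(accepted_row)  # work on a copy; the caller's row is untouched
    changed = []
    for field in fields:
        new_value = candidate.get(field, "")
        # Only fill gaps: an existing accepted value is never overwritten here.
        if new_value != "" and updated.get(field, "") == "":
            updated[field] = new_value
            changed.append(field)
    if changed:
        note = f"{today} {reviewer}: accepted {', '.join(changed)} from listing candidate"
        prior = updated.get("Notes", "").strip()
        updated["Notes"] = f"{prior}; {note}" if prior else note
    return updated
```

Conflicting values (a candidate that disagrees with a non-empty accepted field) are deliberately left alone in this sketch so they can go through manual review instead.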

Extraction Rules

Preserve raw source facts before cleaning. Never replace a source label/value with only a normalized value.
Parse Zolo pages from the listing fact table and listing description; ignore nearby active-listing cards embedded later in the HTML.
Parse HouseSigma pages primarily from JSON-LD. Validate coordinates because some pages may expose latitude/longitude in swapped order.
Parse Wahi pages from JSON-LD plus the __NEXT_DATA__ listing payload. Prefer sold/list dates from the payload and compute DOM from those dates when Wahi's daysOnMarket field disagrees.
Parse Zillow not-for-sale pages from JSON-LD plus __NEXT_DATA__.props.pageProps.componentProps.gdpClientCache. Treat missing/zero Zillow sold/list prices as unknown rather than literal prices.
Parse Houseful pages from the React Flight stream (self.__next_f.push(...)). Use the listing data object for sold/list price, MLS number, building size, parking, lot dimensions, tax, coordinates, and feature groups; resolve $... description references from text records.
Keep duplicate labels as separate values first. Zolo often repeats labels such as Basement, Heating, and Exterior.
Keep source conflicts visible. If structured facts disagree with remarks or existing accepted rows, set manual_review_needed rather than choosing silently.
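
The JSON-LD and coordinate rules above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the logic of scripts/extract_listing_facts.py: the function names, the issue labels, and the default bounding box (a rough box around Canada) are hypothetical and should be replaced with whatever region the sales data actually covers.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_ld(html):
    """Return every parseable JSON-LD payload in the page, in document order."""
    blocks = []
    for match in JSON_LD_RE.finditer(html):
        try:
            blocks.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than guessing at a repair
    return blocks


def check_coordinates(lat, lon, lat_range=(41.0, 84.0), lon_range=(-141.0, -52.0)):
    """Classify a coordinate pair as ok, swapped, or out of range.

    The default ranges are an assumed bounding box, not values from the repo.
    The raw pair is always returned so the source fact is preserved.
    """
    def inside(a, b):
        return lat_range[0] <= a <= lat_range[1] and lon_range[0] <= b <= lon_range[1]

    raw = (lat, lon)
    if inside(lat, lon):
        return {"latitude": lat, "longitude": lon, "issue": None, "raw": raw}
    if inside(lon, lat):
        # The pair only makes sense when swapped: correct it but flag for review.
        return {"latitude": lon, "longitude": lat, "issue": "swapped_coordinates", "raw": raw}
    return {"latitude": None, "longitude": None, "issue": "coordinates_out_of_range", "raw": raw}
```

A plain range check (latitude within ±90) is not enough for this region, because a swapped pair such as (-79.38, 43.65) is still a valid latitude/longitude; that is why the sketch tests against an expected bounding box.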

Cleaning Rules

Parse dates to YYYY-MM-DD, money to integer CAD strings, and dimensions to numeric feet when units are explicit.
Parse ranges as ranges. For example, 1500-2000 sqft becomes min/max/mid plus above_grade_sqft_exact_flag = No; do not put a midpoint into heritage_sales.csv as exact square footage.
Parse bedroom strings such as 2+2 into above/below/total model features while preserving the source text for the core sales candidate.
Convert full/half bathroom breakdowns to fractional bath counts when the accepted sales CSV uses that convention; keep total bathroom counts when the source only exposes a total.
Derive Lot_sqft = frontage * depth only when both values are positive and units are feet. Flag irregular lots and zero dimensions.
Normalize comparable categories without erasing source wording: detached/single-family terms can align, but Fourplex, Duplex, Att/Row/Twnhouse, and income-property claims require review before primary detached modeling.
Extract condition conservatively from remarks (renovated, updated, maintained, needs_work) and keep the supporting description in notes/features.
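
A few of the cleaning rules above, sketched as small functions. Only above_grade_sqft_exact_flag comes from this document; the function names, the other output keys, and the issue labels are assumptions for illustration, and the repo extractor may shape its output differently.

```python
import re


def parse_sqft_range(text):
    """Parse a range such as '1500-2000 sqft'; return None for anything else."""
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if len(numbers) != 2 or numbers[0] >= numbers[1]:
        return None
    low, high = numbers
    return {
        "sqft_min": low,
        "sqft_max": high,
        "sqft_mid": (low + high) / 2,  # model feature only, never exact sqft
        "above_grade_sqft_exact_flag": "No",
        "source_text": text,
    }


def parse_bedrooms(text):
    """Split '2+2' into above/below/total counts and keep the source text."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:\+\s*(\d+))?\s*", text)
    if not match:
        return None
    above = int(match.group(1))
    below = int(match.group(2) or 0)
    return {
        "beds_above": above,
        "beds_below": below,
        "beds_total": above + below,
        "source_text": text,
    }


def derive_lot_sqft(frontage, depth, units):
    """Return (lot_sqft, issue); derive area only from positive dimensions in feet."""
    if units != "feet":
        return None, "lot_units_not_feet"
    if frontage is None or depth is None or frontage <= 0 or depth <= 0:
        return None, "zero_or_missing_lot_dimension"
    return frontage * depth, None
```

For example, parse_sqft_range("1500-2000 sqft") yields a minimum of 1500, a maximum of 2000, and a midpoint of 1750.0 with the exact flag set to No, while derive_lot_sqft(0.0, 120.0, "feet") returns no area and an issue label for review.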

Modeling Rules

Use leakage-aware categories:

primary_model_ok: physical attributes, location, heritage/register attributes, age/condition, lot/sqft/beds/baths/parking, and lagged market timing such as HPI.
diagnostic_only: list date, closed date, original/final list price, DOM, price-change history, and conditional status.
review_only: source confidence, extraction confidence, conflict counts, missingness artifacts, notes, and human-review metadata.
target_derived: sold-price-derived features such as price per square foot, price-to-assessment ratio using sold price, and list-to-sale ratio.
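
The categories above can be applied as an explicit allow-list, as in the sketch below. The column names in LEAKAGE_CLASS are hypothetical examples, not the actual header of data/listing_model_features.csv, and treating unclassified columns as excluded is an assumed design choice.

```python
# Hypothetical column names mapped to the leakage categories described above.
LEAKAGE_CLASS = {
    "lot_sqft": "primary_model_ok",
    "year_built": "primary_model_ok",
    "list_price": "diagnostic_only",
    "days_on_market": "diagnostic_only",
    "extraction_confidence": "review_only",
    "price_per_sqft": "target_derived",
}


def split_model_columns(columns, leakage_class=LEAKAGE_CLASS):
    """Return (keep, unclassified).

    keep: columns explicitly marked primary_model_ok, in input order.
    unclassified: columns with no category, excluded until someone reviews them.
    """
    keep = [c for c in columns if leakage_class.get(c) == "primary_model_ok"]
    unclassified = [c for c in columns if c not in leakage_class]
    return keep, unclassified
```

Excluding anything unclassified keeps a newly added feature out of the primary model until it has been assigned a category.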

Do not pre-impute modeling values globally. Let the sklearn pipeline handle imputation inside cross-validation so distribution information does not leak across folds.
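
To see why this matters, the sketch below reproduces fold-local median imputation in plain Python. It is not the repo's pipeline and does not use sklearn; it only mimics what an imputer placed inside an sklearn Pipeline does during cross-validation, which is to fit on the training fold alone. The function names and the sample values are made up for illustration.

```python
from statistics import median


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0) for i in range(n_folds)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def impute_within_fold(values, train_idx, test_idx):
    """Fill None using the median of observed training-fold values only.

    Raises statistics.StatisticsError if the training fold has no observed values.
    """
    fill = median(values[i] for i in train_idx if values[i] is not None)

    def filled(indices):
        return [values[i] if values[i] is not None else fill for i in indices]

    return filled(train_idx), filled(test_idx), fill


# Made-up above-grade sqft values with two gaps.
sqft = [1000, None, 1200, 5000, None, 5200]
fold_fills = [impute_within_fold(sqft, tr, te)[2] for tr, te in k_fold_indices(len(sqft), 2)]
global_fill = median(v for v in sqft if v is not None)
```

Here fold_fills is [5100.0, 1100.0] while global_fill is 3100.0: a global fill value is computed partly from rows that later serve as held-out data, which is the leak the rule above avoids.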

Review Focus

Prioritize fields currently sparse or high-value in analysis/sale-price-eda.md: above-grade sqft, year built, lot area/dimensions, verified property type, legal/multi-unit status, condition coding, garage/parking, and outside-HCD controls. Use list price and DOM for diagnostics, not the primary prediction model.
