| name | property-listing-extraction |
| description | Use when extracting facts from saved real-estate listing HTML in raw/, cleaning source-specific listing data, filling missing fields in data/heritage_sales.csv through reviewable candidates, proposing supplemental model features, or preserving evidence/conflict flags for property sale-price analysis. |
| metadata | {"short-description":"Extract and clean saved listing facts"} |
Property Listing Extraction
Use this skill for the property-sales repo when saved listing pages under raw/ need to become auditable sale-update candidates and modeling features.
Core Workflow
-
Inspect data/heritage_sales.csv, data/sales_research_targets.csv, and analysis/sale-price-eda.md before deciding how a field should be used.
-
Prefer the repo extractor when available:
python scripts/extract_listing_facts.py
Useful options:
python scripts/extract_listing_facts.py --raw-dir raw --output-dir data
python scripts/extract_listing_facts.py --raw-dir raw --output-dir /tmp/listing-extract-check
-
Treat outputs as review artifacts, not accepted sale data:
data/listing_raw_facts.jsonl: source-specific facts with labels and provenance.
data/listing_sale_update_candidates.csv: candidate rows shaped like heritage_sales.csv.
data/listing_model_features.csv: supplemental cleaned features and leakage classification.
data/listing_extraction_issues.csv: conflicts, zero dimensions, swapped coordinates, duplicate matches, and other review flags.
-
Do not mutate data/heritage_sales.csv directly unless the user explicitly asks to accept candidate values. When accepting values, preserve prior notes and add a short human-review note.
Extraction Rules
- Preserve raw source facts before cleaning. Never replace a source label/value with only a normalized value.
- Parse Zolo pages from the listing fact table and listing description; ignore nearby active-listing cards embedded later in the HTML.
- Parse HouseSigma pages primarily from JSON-LD. Validate coordinates because some pages may expose latitude/longitude in swapped order.
- Parse Wahi pages from JSON-LD plus the
__NEXT_DATA__ listing payload. Prefer sold/list dates from the payload and compute DOM from those dates when Wahi's daysOnMarket field disagrees.
- Parse Zillow not-for-sale pages from JSON-LD plus
__NEXT_DATA__.props.pageProps.componentProps.gdpClientCache. Treat missing/zero Zillow sold/list prices as unknown rather than literal prices.
- Parse Houseful pages from the React Flight stream (
self.__next_f.push(...)). Use the listing data object for sold/list price, MLS number, building size, parking, lot dimensions, tax, coordinates, and feature groups; resolve $... description references from text records.
- Keep duplicate labels as separate values first. Zolo often repeats labels such as
Basement, Heating, and Exterior.
- Keep source conflicts visible. If structured facts disagree with remarks or existing accepted rows, set
manual_review_needed rather than choosing silently.
Cleaning Rules
- Parse dates to
YYYY-MM-DD, money to integer CAD strings, and dimensions to numeric feet when units are explicit.
- Parse ranges as ranges. For example,
1500-2000 sqft becomes min/max/mid plus above_grade_sqft_exact_flag = No; do not put a midpoint into heritage_sales.csv as exact square footage.
- Parse bedroom strings such as
2+2 into above/below/total model features while preserving the source text for the core sales candidate.
- Convert full/half bathroom breakdowns to fractional bath counts when the accepted sales CSV uses that convention; keep total bathroom counts when the source only exposes a total.
- Derive
Lot_sqft = frontage * depth only when both values are positive and units are feet. Flag irregular lots and zero dimensions.
- Normalize comparable categories without erasing source wording: detached/single-family terms can align, but
Fourplex, Duplex, Att/Row/Twnhouse, and income-property claims require review before primary detached modeling.
- Extract condition conservatively from remarks (
renovated, updated, maintained, needs_work) and keep the supporting description in notes/features.
Modeling Rules
Use leakage-aware categories:
primary_model_ok: physical attributes, location, heritage/register attributes, age/condition, lot/sqft/beds/baths/parking, and lagged market timing such as HPI.
diagnostic_only: list date, closed date, original/final list price, DOM, price-change history, conditional status, and list-to-sale ratios.
review_only: source confidence, extraction confidence, conflict counts, missingness artifacts, notes, and human-review metadata.
target_derived: sold-price-derived features such as price per square foot, price-to-assessment ratio using sold price, and list-to-sale ratio.
Do not pre-impute modeling values globally. Let the sklearn pipeline handle imputation inside cross-validation so distribution information does not leak across folds.
Review Focus
Prioritize fields currently sparse or high-value in analysis/sale-price-eda.md: above-grade sqft, year built, lot area/dimensions, verified property type, legal/multi-unit status, condition coding, garage/parking, and outside-HCD controls. Use list price and DOM for diagnostics, not the primary prediction model.