design-scraper
// Design and implement a kent scraper for an appellate court website. Invoked with a URL to explore. Uses Playwright for site reconnaissance, then produces DESIGN.md, models.py, and scraper.py.
| name | design-scraper |
| description | Design and implement a kent scraper for an appellate court website. Invoked with a URL to explore. Uses Playwright for site reconnaissance, then produces DESIGN.md, models.py, and scraper.py. |
| user-invocable | true |
| argument-hint | <url> |
You are designing and implementing a kent framework scraper for an appellate court website. The user provides a URL as the argument to this skill.
See kent-api-reference.md for the kent framework API.
Locate the juriscraper scrapers directory. Filter out the build/ tree —
a stale wheel-build copy may shadow the canonical source path; writing files
into it silently puts them in a build artefact that nobody imports:
find . -path "*/juriscraper/sd/state" -type d 2>/dev/null \
| grep -v '/build/' | head -1
All files go under that directory at {state}/{domain_underscored}/:
- DESIGN.md: site analysis and design decisions
- models.py: Pydantic data models (ScrapedData subclasses)
- scraper.py: scraper implementation (BaseScraper subclass)
- __init__.py: empty package init

Derive {state} from the court's US state (lowercase, underscores: california, new_york).
Derive {domain_underscored} from the hostname:
- Strip a leading www. if present.
- Replace dots and hyphens with underscores.

Examples: appellatecases.courtinfo.ca.gov → appellatecases_courtinfo_ca_gov,
ma-appellatecourts.org → ma_appellatecourts_org,
e-courts.judicial.state.al.us → e_courts_judicial_state_al_us.
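The hostname rule can be sketched as a small helper. This is a minimal illustration, assuming the transformation is exactly "drop a leading www., then turn dots and hyphens into underscores", which is consistent with the three examples above; the function name `domain_underscored` is hypothetical and not part of kent.

```python
from urllib.parse import urlsplit


def domain_underscored(url_or_host: str) -> str:
    """Turn a URL or bare hostname into the directory-name form."""
    # urlsplit only fills .hostname when a scheme or a // prefix is present,
    # so bare hostnames get a // prepended.
    candidate = url_or_host if "//" in url_or_host else f"//{url_or_host}"
    host = (urlsplit(candidate).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Assumption: dots and hyphens are the only characters that need mapping.
    return host.replace(".", "_").replace("-", "_")
```

Passing the full URL the skill was invoked with works as well as a bare hostname, since the path and port are discarded by `urlsplit`.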
If the target directory already exists, read existing files before overwriting.
Navigate to the URL with browser_navigate.
Snapshot the page (browser_snapshot) to see forms, links, layout.
Identify all search forms — note each form's action URL, method (GET/POST), and every field (name, type, required, options for selects).
If browser_snapshot shows no <form> element on a page that obviously
has a search form (common on SPAs), fall back to browser_evaluate:
const inputs = document.querySelectorAll('input, select');
return Array.from(inputs).map(el => ({
name: el.name, type: el.type, value: el.value
}));
The accessibility-tree snapshot sometimes drops forms that lack id /
name / aria-* attributes on the <form> tag itself.
Identify all courts covered. Look for court-selecting parameters in URLs and form fields (e.g. dist=3, court=SC).
Record every court's internal identifier, display name, and any division info.
Check for a calendar / oral arguments section — often has date-based search even when the main case search doesn't. If found, note the URL and search fields.
Try navigating to a prior month — some sites' calendars redirect any
historical URL pattern back to /. If no past-month URL works, the
calendar is a snapshot-only resource and a dateless @entry is the only
option.
Based on Phase 1 findings, identify 1–3 candidate sibling scrapers from the
exemplars table below. Read their scraper.py and models.py for structural
reference — imports, decorator argument styles, helper conventions — before
writing your own. Revisit after Phase 4 if the technical assessment shifts
the picture (e.g. the site turns out to be a JSON SPA you didn't catch in
Phase 1).
| Site shape | Canonical exemplar | Why |
|---|---|---|
| YearlySpeculativeRange (year-partitioned) | georgia/gaappeals_gov | Five @entry per case-letter; year-aware ID formatting |
| SpeculativeRange (continuous integer) | alaska/appellate_records_courts_alaska_gov | Two @entry, no year axis, simplest shape |
| Multi-prefix @entry cluster + shared helper | maryland/casesearch_courts_state_md_us | Five entries (ACM-REG, ACM-ALA, SCM-PET, SCM-MISC, SCM-REG) all delegating to _build_speculative_request |
| JSON API w/ json_content step | washington/acdocportal_courts_wa_gov | Uses @step(json_model=...) for validated JSON parsing |
| Date-search HTML form | new_york/nycourts_gov | DateRange-driven ASP.NET WebForms POST with paginated table results |
| Single-page RSI / Public Access portal | massachusetts/ma_appellatecourts_org | One GET returns full case file inline as <section> blocks |
| Episerver SSR JSON | michigan/courts_michigan_gov | Hits page URL with ?expand=*&currentPageUrl=... for SSR'd page object |
| Newest-sorted listing walk | michigan/courts_michigan_gov | No date filter — walks sortOrder=Newest and stops on window boundary |
| Redirect-based soft-404 | massachusetts/ma_appellatecourts_org | fails_successfully checks response.url, not body text |
| 4xx-as-speculative-miss | maryland/casesearch_courts_state_md_us | API returns HTTP 400 for invalid IDs; no fails_successfully override needed |
Before probing, identify the transport. Watch the network panel during
a manual search. If you see a fetch() returning JSON, target it directly —
don't reverse-engineer the rendered HTML. Look up the canonical sibling in
the exemplars table for either shape. The transport choice directly shapes
the scraper's @step signatures (json_content: dict vs page: PageElement).
- Newest-sorted listing walk: sort by sortOrder=Newest and walk pages until the oldest in-window item < window start.

Probing (party search etc.) is a discovery technique (see "Always probe" below), not a bulk scraping strategy.
- Note the date format the form expects (mm/dd/yyyy, ISO, etc.).

If search returns JSON instead of HTML, the step uses json_content instead of page:
@step()
def parse_search_results(
    self,
    json_content: dict,
    response: Response,
    accumulated_data: dict,
) -> Generator[ScraperYield[MyDocket], None, None]:
    for record in json_content.get("results", []):
        ...
- Set headers={"Accept": "application/json"} on the request.
- Episerver sites: try ?expand=*&currentPageUrl=... (see the patterns appendix).
- resultType= parameter: see the patterns appendix.
- Hidden anti-forgery fields: _token (Laravel), __RequestVerificationToken (ASP.NET Core), csrfmiddlewaretoken (Django). The simplest get-out-of-jail-free path is page.find_form().submit(...), which preserves all hidden fields automatically rather than requiring you to enumerate them.

Search for "smith" to discover:
- Docket number formats, e.g. C000125 (letter + digits), SC-2023-0123 (court-year-seq), 2024-00003 (year-seq).

Search in multiple courts if the site covers more than one, since different courts often use different docket number prefixes.
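While probing, it can help to bucket the docket numbers you observe by shape before deciding on an entry strategy. The sketch below covers only the three example formats named above; the pattern names, the `classify_docket_number` helper and the regexes themselves are illustrative assumptions, and a real site will likely need its own patterns.

```python
from __future__ import annotations

import re

# Hypothetical shape names keyed to the three example formats in the text.
DOCKET_PATTERNS = {
    "letter_digits": re.compile(r"^(?P<prefix>[A-Z])(?P<seq>\d+)$"),
    "court_year_seq": re.compile(r"^(?P<prefix>[A-Z]+)-(?P<year>\d{4})-(?P<seq>\d+)$"),
    "year_seq": re.compile(r"^(?P<year>\d{4})-(?P<seq>\d+)$"),
}


def classify_docket_number(raw: str) -> tuple[str, dict[str, str]] | None:
    """Return (shape name, captured parts) for a docket number, or None."""
    text = raw.strip().upper()
    for shape, pattern in DOCKET_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return shape, match.groupdict()
    return None
```

Numbers that come back as `None` are the interesting ones: they usually point to a court or case type whose prefix you have not catalogued yet.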
If the decision tree lands you on speculative entry, determine:
SpeculativeRange

For continuous integer sequences (e.g. C + up to 6 digits, highest observed C105926), use SpeculativeRange as the entry parameter type and seed with {"number": 105926, "gap": 20}.

YearlySpeculativeRange

For year-partitioned numbers (e.g. 2024-00003):
- Use YearlySpeculativeRange as the entry parameter type.
- Seed each year with {"year": YYYY, "min": N, "soft_max": M, "gap": K}.

[
    {"fetch_case": {"case_id": {"year": 2024, "min": 1, "soft_max": 4000, "gap": 0}}},
    {"fetch_case": {"case_id": {"year": 2025, "min": 1, "soft_max": 1, "gap": 15}}},
]
One @entry per prefix

If a site uses multiple case-number prefixes that each have their own
sequence (e.g. by case type — Maryland's ACM-REG, ACM-ALA, SCM-PET,
SCM-MISC, SCM-REG; Georgia's A, D, E, I, O), declare one @entry per
prefix and share a _build_speculative_request(case_id, prefix_args)
helper. The driver advances each prefix's sequence independently. See
Maryland as the canonical exemplar.
Before clicking through tabs, identify the detail shape. Snapshot a
representative case page. If every field you need (header, parties, docket,
documents) is already on one page in <section> blocks with no AJAX-loaded
sub-resources, you're in single-page-portal territory (RSI / older
Public-Access systems — Massachusetts is the exemplar). Design for one
parse_case_detail step rather than a tab chain. Otherwise, the multi-tab
walkthrough below applies.
For multi-tab cases: click into a case result and visit every available tab or section. For each tab:
| Tab | Key fields |
|---|---|
| Case Summary | Case type, filing date, completion date, caption, division |
| Docket / Register of Actions | Date, description, notes per entry |
| Briefs | Brief type, due date, filed date, party/attorney |
| Disposition | Outcome, date, publication status, author, citation |
| Parties & Attorneys | Names, roles, firms, addresses, phone numbers |
| Trial Court | Court name, county, case number, judge, judgment date |
| Scheduled Actions | Future events, hearing dates |
| Documents | Download links with types, dates, descriptions |
Look for "subscribe to email notifications" or similar links on case pages. If found, record the URL pattern, event types, and registration fields.
Test each endpoint you'll need with curl to build a per-endpoint protection map:
curl -s -o /dev/null -w "%{http_code}\n" "URL"
curl -s "URL" | head -50
Mixed protection is common: listing/search endpoints are often open even when per-record endpoints are gated.
driver_requirements is a scraper-wide ClassVar. The scraper is either
httpx or Playwright — not both. If any endpoint you need requires
Playwright, the whole scraper is Playwright; otherwise pure httpx.
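The all-or-nothing rule can be expressed as a small decision over the per-endpoint protection map. The following is a rough sketch only: the status codes and body markers treated as "gated" are assumptions chosen for illustration, not kent behaviour, and `looks_gated` / `choose_driver` are hypothetical helper names.

```python
from __future__ import annotations

# Hypothetical challenge markers; replace with text the target site actually serves.
CHALLENGE_MARKERS = ("just a moment", "captcha", "access denied")
# Treating these statuses as "gated" is an assumption, not a kent rule.
GATED_STATUSES = frozenset({403, 429, 503})


def looks_gated(status_code: int, body_head: str) -> bool:
    """Guess whether a curl probe hit a bot-protection gate."""
    if status_code in GATED_STATUSES:
        return True
    lowered = body_head.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def choose_driver(probes: dict[str, tuple[int, str]]) -> tuple[str, list[str]]:
    """Return ("playwright" or "httpx", sorted gated endpoint names)."""
    gated = sorted(
        name for name, (status, body) in probes.items() if looks_gated(status, body)
    )
    # One gated endpoint is enough to make the whole scraper Playwright.
    return ("playwright" if gated else "httpx", gated)
```

Feed it the status code and the first lines of the body from each curl probe; the returned list of gated endpoints is worth copying into DESIGN.md under the bot protection notes.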
Bot protection (Cloudflare, Akamai, DataDome, etc.) is what drives the Playwright requirement, not the server framework. ASP.NET, ColdFusion, PHP, and Episerver sites all work fine with httpx when there is no JS challenge gate.
Hit the JSON API directly through the Playwright context — don't scrape the rendered DOM and don't try to call the API via httpx with stolen cookies. Playwright handles the JS challenge and obtains the bot-protection cookie; subsequent API calls go through cleanly. This is faster and more robust than DOM scraping.
If the listing is open over HTTP but per-record detail is hard-gated (e.g.
invisible hCaptcha — see patterns appendix), it's reasonable to ship a v1
that yields listing-only records with status=IN_DEVELOPMENT and a
DESIGN.md gap note, rather than block on captcha integration.
Look up each court in CourtListener's database. Find courts.json by
searching for it:
find ../.. -path "*/courts_db/data/courts.json" 2>/dev/null | head -1
Each entry has:
- id: CourtListener court ID (e.g. calctapp3d)
- name: full court name
- type: appellate, trial, etc.
- level: colr (court of last resort), iac (intermediate appellate)
- parent: parent court ID for sub-courts

Search by state name and court name. Build a mapping: site internal ID → display name → CourtListener court ID.
For courts with divisions that map to a single CourtListener ID (e.g.
CA District 4 Divisions 1-3 all map to calctapp4d), note this in the
mapping.
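A lookup along these lines can speed up the mapping step. It is a sketch under the assumption that courts.json is a JSON array of objects carrying the id and name fields listed above; the helper names are hypothetical, and the grouping function simply makes shared CourtListener IDs (the division case) visible.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_courts(path: str | Path) -> list[dict]:
    """Load courts.json, assumed here to be a JSON array of court objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_courts(courts: list[dict], *terms: str) -> list[dict]:
    """Return courts whose name contains every search term, case-insensitively."""
    needles = [term.lower() for term in terms]
    return [
        court
        for court in courts
        if all(needle in court.get("name", "").lower() for needle in needles)
    ]


def group_by_court_id(site_to_cl: dict[str, str]) -> dict[str, list[str]]:
    """Invert a site-ID -> CourtListener-ID mapping to expose shared IDs."""
    grouped: dict[str, list[str]] = {}
    for site_id, cl_id in site_to_cl.items():
        grouped.setdefault(cl_id, []).append(site_id)
    return grouped
```

Any CourtListener ID that ends up with more than one site ID in the grouped result is a division collapse that should be called out in the DESIGN.md courts table.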
# {Site Name} Scraper Design
## Site Overview
- **Base URL**: {url}
- **Requires Playwright**: {Yes — CloudFlare / No — server-rendered HTML}
- **Transport**: {HTML form / JSON API / single-page portal / Episerver SSR}
## Courts Covered
| Site ID | Display Name | CourtListener ID |
|---------|-------------|-----------------|
| ... | ... | ... |
## Search Capabilities
{Decision-tree result with notes on each available mode}
**Recommended approach**: {date-based / newest-walk / speculative / hybrid}
## Docket Number Formats
{Per court: prefix pattern, sequential component, year component, examples}
## Data Available
### Case Summary
{List every field with its type}
### Docket Entries
{fields}
### Briefs
{fields}
### Disposition
{fields}
### Parties & Attorneys
{fields}
### Trial Court
{fields}
### Documents
{fields}
## Email Notifications
{Available / Not available}
{If available: URL pattern, event types, registration fields}
## Oral Arguments Calendar
{Available / Not available}
{If available: search modes, fields, current-month-only caveat if applicable}
## Bot Protection Notes
{Hidden fields, session tokens, cookie requirements, redirect behavior}
## Known Gaps (if shipping listing-only v1)
{e.g. invisible hCaptcha on per-case detail; listing only currently}
## Scraper Architecture
### Entry Points
{List each @entry function with its type, params, and purpose}
### Step Functions
{Flow: entry → step1 → step2 → ... → ParsedData}
### Models
{List of ScrapedData models to create}
Import ScrapedData from kent.common.data_models. Follow these conventions:
- All models subclass ScrapedData.
- Use plain type annotations: str, date, int, list[X], X | None.
- Default optional fields to None; default lists to [].
- Prefer str | None = None over Optional[str].
- Use date (not datetime) for date fields.
- Define a COURT_IDS dict mapping CourtListener IDs to display names.
(For some scraper shapes this dict is documentation only: speculative
entry methods know their own court IDs without needing to look it up.
Keep it anyway for human reference.)

from datetime import date

from kent.common.data_models import ScrapedData


class {Prefix}DocketEntry(ScrapedData):
    """A single entry from the Register of Actions / Docket tab."""
    date_filed: date | None = None
    description: str
    notes: str | None = None


# Defined before {Prefix}Party, which references it in a field annotation.
class {Prefix}Attorney(ScrapedData):
    """Attorney representation record."""
    name: str
    firm: str | None = None
    address: str | None = None
    phone: str | None = None


class {Prefix}Party(ScrapedData):
    """A party in the case."""
    name: str
    role: str  # e.g., "Plaintiff and Appellant"
    attorneys: list[{Prefix}Attorney] = []


class {Prefix}Document(ScrapedData):
    """A downloadable document from the case."""
    download_url: str
    document_type: str
    date_filed: date | None = None
    description: str | None = None
    local_path: str | None = None


class {Prefix}Docket(ScrapedData):
    """Main output: a complete appellate case docket."""
    # Searchable fields
    docket_id: str
    court_id: str
    date_filed: date | None = None
    case_name: str
    # Case metadata
    case_type: str | None = None
    ...
    # Nested data
    entries: list[{Prefix}DocketEntry] = []
    parties: list[{Prefix}Party] = []
    documents: list[{Prefix}Document] = []
    source_url: str | None = None
If a case-detail page has a "Future Calendar" or "Scheduled Hearings" section
previewing upcoming sittings for that case, model each item as a
{Prefix}DocketEntry instance on the docket — not as a separate
{Prefix}ScheduledHearing or {Prefix}Hearing model. Future-calendar items
are conceptually one more row in the register of actions; splitting them
into a parallel type creates a parallel data path that downstream consumers
have to reconcile for no benefit.
(Per-court calendar pages — a separate page listing all sittings for a
court in a month — are a separate question. If the site has both, the
per-case items go in the docket as DocketEntry; the per-court calendar is
its own decision and may warrant a top-level {Prefix}OralArgument type
with its own entry point.)
Add more models as the site warrants:
- {Prefix}Brief: if the briefs tab has structured columns beyond docket entries
- {Prefix}Disposition: if the disposition has multiple structured fields
- {Prefix}TrialCourtInfo: embedded in the main Docket rather than separate
- {Prefix}OralArgument: if oral arguments are a separate data type with their own entry point (per-court calendar pages, not per-case future calendar)

from __future__ import annotations
import re
from datetime import date, timedelta
from typing import TYPE_CHECKING, ClassVar
from urllib.parse import urljoin
from kent.common.decorators import entry, step
from kent.common.exceptions import TransientException
from kent.common.page_element import PageElement
from kent.common.param_models import DateRange, SpeculativeRange
from kent.data_types import (
    BaseScraper,
    DriverRequirement,
    HttpMethod,
    HTTPRequestParams,
    ParsedData,
    Request,
    Response,
    ScraperStatus,
    SkipDeduplicationCheck,
)
from pyrate_limiter import Duration, Rate

from .models import ... # Import your models

if TYPE_CHECKING:
    from collections.abc import Generator

    from kent.data_types import ScraperYield


class {Name}Scraper(BaseScraper[{MainType}]):
    """Scraper for {Court Name(s)}.

    {Brief description of what's scraped and how.}
    """

    court_ids: ClassVar[set[str]] = {"id1", "id2", ...}
    court_url: ClassVar[str] = "https://..."
    data_types: ClassVar[set[str]] = {"dockets"}  # or {"dockets", "oral_arguments"}
    status: ClassVar[ScraperStatus] = ScraperStatus.IN_DEVELOPMENT
    version: ClassVar[str] = "{YYYY-MM-DD}"
    requires_auth: ClassVar[bool] = False
    rate_limits: ClassVar[list[Rate] | None] = [Rate(1, Duration.SECOND)]
    # Only if Playwright is needed (bot protection, JS SPA):
    # driver_requirements: ClassVar[list[DriverRequirement]] = [
    #     DriverRequirement.JS_EVAL, DriverRequirement.FF_ALIKE,
    # ]
If the scraper yields multiple top-level types, the generic parameter should
be their union: BaseScraper[Docket | OralArgument].
Date-based search available:
@entry({Docket})
def get_dockets(self) -> Generator[Request, None, None]:
    """Fetch dockets using date range from scraper params."""
    date_gte, date_lte = self._get_date_params()
    yield Request(
        request=HTTPRequestParams(method=HttpMethod.GET, url=SEARCH_URL),
        continuation=self.parse_search_page,
        accumulated_data={"date_gte": ..., "date_lte": ...},
    )

@entry({Docket})
def get_dockets_by_date(self, date_range: DateRange) -> Generator[Request, None, None]:
    """Fetch dockets for an explicit date range."""
    ...
Newest-sorted listing walk (no date filter — see Michigan exemplar):
@entry({Docket})
def get_dockets_by_date(self, date_range: DateRange) -> Generator[Request, None, None]:
    """Walk newest-first listing until oldest-on-page < date_range.start."""
    yield Request(
        request=HTTPRequestParams(
            method=HttpMethod.GET,
            url=LISTING_URL,
            params={"sortOrder": "Newest", "page": 1, "pageSize": 100},
        ),
        continuation=self.parse_listing_page,
        accumulated_data={
            "date_gte": date_range.start.isoformat(),
            "date_lte": date_range.end.isoformat(),
            "page": 1,
        },
    )
The step then enqueues the next page only if the oldest in-window item on
the current page is still ≥ date_gte.
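The stop condition is easy to get subtly wrong, so here is a minimal sketch of the per-page decision, separated from any kent types. It assumes the listing really is sorted newest-first and that each row's date has already been parsed; `split_listing_page` is a hypothetical helper name. It compares the oldest item on the page (in-window or not) against date_gte, so pages that are entirely newer than the window are still walked past.

```python
from __future__ import annotations

from datetime import date


def split_listing_page(
    item_dates: list[date], date_gte: date, date_lte: date
) -> tuple[list[date], bool]:
    """Return (dates inside the window, whether to enqueue the next page)."""
    in_window = [d for d in item_dates if date_gte <= d <= date_lte]
    if not item_dates:
        # An empty page means the listing is exhausted.
        return in_window, False
    # Keep walking while even the oldest row is not yet older than the window.
    fetch_next = min(item_dates) >= date_gte
    return in_window, fetch_next
```

Inside the step, the first element decides which rows become follow-up requests and the second decides whether a next-page Request is yielded.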
Speculative entry (one per court): the driver detects speculation via the parameter type — no decorator argument needed.
@entry({Docket})
def fetch_{court_prefix}_docket(self, rid: SpeculativeRange) -> Request:
    """Speculative docket fetcher for {Court Name}."""
    docket_id = f"{PREFIX}{rid.min:06d}"  # Format to match site pattern
    return Request(
        request=HTTPRequestParams(
            method=HttpMethod.POST,
            url=SEARCH_URL,
            data={"query_caseNumber": docket_id, ...},
        ),
        continuation=self.parse_search_results,
        accumulated_data={"court_id": "...", "docket_id": docket_id},
    )
For year-partitioned numbers use YearlySpeculativeRange as the parameter
type (provides .year and .min). Seed params use min, optional
soft_max, should_advance, and gap — see Phase 2's speculative section
for the seed shape and year-rollover responsibility.
For sites with multiple case-type prefixes, see the multi-prefix pattern under Phase 2's speculative entry assessment — Maryland is the canonical exemplar.
Oral arguments (if discovered):
@entry({OralArgument})
def get_oral_arguments_by_date(self, date_range: DateRange) -> Generator[Request, None, None]:
    """Fetch oral arguments for a date range."""
    ...
Each step function:
- Declares only the parameters it needs, chosen from: page (PageElement), response (Response), accumulated_data (dict), json_content (dict / list), text (str), local_filepath (str | None).
- Uses page.query_xpath(), page.find_form(), page.find_links() for HTML parsing. For JSON APIs, takes json_content instead.
- Yields Request for follow-on pages and ParsedData for final output.
- Passes context to later steps via accumulated_data.
- accumulated_data must be JSON-serializable. Use .model_dump(mode="json") for Pydantic models, .isoformat() for dates.

Typical flow for a multi-tab case detail scraper:
entry (search) → parse_search_results → parse_case_summary
                                      → parse_docket_entries
                                      → parse_parties
                                      → parse_disposition
                                      → parse_trial_court
                                      → assemble_docket (yields ParsedData)
Typical flow for a single-page-portal scraper (Massachusetts):
entry (search) → parse_search_results → parse_case_detail (yields ParsedData)
For sites where all tabs are separate pages, chain them via accumulated_data, collecting fields as you go:
@step()
def parse_case_summary(self, page: PageElement, response: Response,
                       accumulated_data: dict) -> Generator[...]:
    # Extract case summary fields
    accumulated_data["case_name"] = ...
    accumulated_data["case_type"] = ...
    # Yield request for next tab
    yield Request(
        request=HTTPRequestParams(method=HttpMethod.GET, url=docket_tab_url),
        continuation=self.parse_docket_entries,
        accumulated_data=accumulated_data,
    )
For the final step, assemble and yield the complete model:
@step()
def assemble_docket(self, accumulated_data: dict) -> Generator[...]:
    docket = {Prefix}Docket(
        docket_id=accumulated_data["docket_id"],
        court_id=accumulated_data["court_id"],
        ...
    )
    yield ParsedData(data=docket)
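Because accumulated_data has to stay JSON-serializable across the whole tab chain, dates collected in early steps are best stored as ISO strings and parsed back only when the final model is assembled. The helpers below are a stdlib-only sketch of that round trip; their names are hypothetical and nothing here is a kent API.

```python
from __future__ import annotations

import json
from datetime import date


def pack_date(value: date | None) -> str | None:
    """Store a date in accumulated_data as an ISO string."""
    return value.isoformat() if value is not None else None


def unpack_date(value: str | None) -> date | None:
    """Recover the date when assembling the final model."""
    return date.fromisoformat(value) if value else None


def check_json_safe(accumulated_data: dict) -> dict:
    """Raise TypeError early if any value would not survive JSON encoding."""
    json.dumps(accumulated_data)
    return accumulated_data
```

Calling `check_json_safe` just before yielding a Request is a cheap way to surface a stray date or model object in the step that introduced it rather than somewhere downstream.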
Run a curl on a known-bad ID to diagnose the shape, then route to the right detector:
curl -s -o /dev/null -w "%{http_code}\n" "URL_WITH_BAD_ID"
curl -sL -o /dev/null -w "%{url_effective}\n" "URL_WITH_BAD_ID"
| Symptom | Detector |
|---|---|
| HTTP 4xx (400 / 404) | No override needed — see below |
| HTTP 200 + sentinel text in body | substring on response.text |
| HTTP 200 + redirect to a different URL | check response.url |
| HTTP 200 + empty results table | row-count on the table xpath |
HTTP 4xx — no override needed. The speculation driver auto-converts 4xx
responses into miss outcomes via SpeculationHTTPFailure
(fails_successfully is only called for 200–299). Don't write a
fails_successfully override for 4xx responses; the gap counter advances
and no error row is emitted. Maryland is the canonical exemplar.
Sentinel text in body:
def fails_successfully(self, response: Response) -> bool:
    return "case not found" not in response.text.lower()
Redirect to a different URL (Massachusetts exemplar — invalid IDs 302
to the search landing):
def fails_successfully(self, response: Response) -> bool:
    """Soft-404 detection: invalid IDs redirect to /docket landing."""
    url = response.url or ""
    if "/docket/" not in url:
        return "/calendar/" in url  # other valid endpoints pass
    return True
Empty results table:
def fails_successfully(self, response: Response) -> bool:
    page = LxmlPageElement.from_response(response)
    rows = page.query_xpath("//table[@id='results']//tr", "rows", min_count=0)
    return len(rows) > 1  # > 1 because header row counts as one
For downloadable documents (opinions, briefs, etc.):
yield Request(
    archive=True,
    request=HTTPRequestParams(method=HttpMethod.GET, url=pdf_url),
    continuation=self.handle_document_download,
    expected_type="pdf",
    accumulated_data={...},
)
Memory caveat for Playwright scrapers. Archive requests stream
chunk-by-chunk to disk under the default streaming archive handler — sync,
async, and persistent drivers all set ArchiveResponse.content = b"" and
never buffer the body. Playwright is the exception: when an archive Request
has no via (a bare URL — the pattern shown above), the body is fetched
through Playwright's APIRequestContext which has no streaming API, so the
whole file is materialized in memory before being re-chunked to the handler.
Via-driven archive downloads (a Playwright download event triggered by a
click on an anchor in the parent page) do stream, because Playwright
itself writes the file to disk and kent re-reads it in 64KB chunks. For
potentially large files (hundreds of MB+) on a Playwright scraper, prefer
triggering the download via a find_links() / find_form()-derived anchor
rather than constructing a bare archive Request from a URL string.
Use deduplication_key on Requests to avoid visiting the same case twice
when overlapping searches produce duplicate results:
yield Request(
    request=HTTPRequestParams(method=HttpMethod.GET, url=case_url),
    continuation=self.parse_case,
    deduplication_key=docket_id,  # same docket_id won't be fetched twice
)
For pagination requests that must always execute, skip dedup:
from kent.data_types import SkipDeduplicationCheck
yield Request(
    request=HTTPRequestParams(method=HttpMethod.GET, url=next_page_url),
    continuation=self.parse_results,
    deduplication_key=SkipDeduplicationCheck(),
)
HTML next-link pagination: follow "Next" links with
page.find_links("//a[contains(text(), 'Next')]", ...).
API offset pagination: track page in accumulated_data, increment,
and yield a new Request until current_page >= total_pages.
Newest-sorted listing walk: (no date filter) walk pages in
sortOrder=Newest, stop when the oldest in-window item on the page is older
than date_gte.
Date-range splitting: some APIs cap results (e.g. 10,000). If a search returns the maximum, split the date range in half and re-search each half.
All pagination requests should use
deduplication_key=SkipDeduplicationCheck().
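The date-range splitting strategy mentioned above can be sketched as a recursive planner. This is an illustration only: `count_results` is a hypothetical callback standing in for "run the search and read the total", and the sketch assumes the API reports a count that equals the cap when results are truncated.

```python
from __future__ import annotations

from collections.abc import Callable
from datetime import date, timedelta


def split_date_range(start: date, end: date) -> list[tuple[date, date]]:
    """Split an inclusive range into two non-overlapping inclusive halves."""
    if start >= end:
        return [(start, end)]  # a single day cannot be split further
    midpoint = start + timedelta(days=(end - start).days // 2)
    return [(start, midpoint), (midpoint + timedelta(days=1), end)]


def plan_searches(
    start: date,
    end: date,
    count_results: Callable[[date, date], int],
    cap: int,
) -> list[tuple[date, date]]:
    """Return windows whose result counts stay under the cap where possible."""
    if start >= end or count_results(start, end) < cap:
        return [(start, end)]
    windows: list[tuple[date, date]] = []
    for lo, hi in split_date_range(start, end):
        windows.extend(plan_searches(lo, hi, count_results, cap))
    return windows
```

In a real scraper the recursion would more naturally live in the step that parses the result count, yielding two new Requests instead of returning a list, but the halving logic is the same.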
If Phase 4 determines Playwright is needed, add to the class:
from kent.data_types import DriverRequirement
driver_requirements: ClassVar[list[DriverRequirement]] = [
    DriverRequirement.JS_EVAL,
    DriverRequirement.FF_ALIKE,
]
For the canonical and current list of values, see
kent-api-reference.md. Existing scrapers use
site-specific values like H11_HEADER_FIXES and FOLLOW_REDIRECTS (Alaska)
in addition to the common JS_EVAL, FF_ALIKE, CHROME_ALIKE,
HCAP_HANDLER, RCAP_HANDLER.
Steps that need to wait for JS rendering should use @step(await_list=[...]):
from kent.data_types import WaitForLoadState, WaitForSelector
@step(await_list=[
    WaitForLoadState("networkidle"),
    WaitForSelector("table.results"),
])
def parse_results(self, page, accumulated_data):
    ...
- Models include *Docket, *DocketEntry, and *Document (for files, if there are any)
- Per-case future-calendar items are modeled as DocketEntry, not as a separate hearing type
- driver_requirements set if Playwright needed (scraper-wide; binary HTTP-vs-Playwright)
- SkipDeduplicationCheck() on next-page requests
- deduplication_key set where overlapping searches may yield duplicates
- Document downloads use archive=True
- accumulated_data values are JSON-serializable
- __init__.py exists in both the scraper's directory and the parent state directory (create one if this is the first scraper for that state)

Symptom-triggered patterns for site shapes that don't fit the default flow.
Each entry leads with a one-line Symptom: so you can skim/grep.
Symptom: SPA fetches a JWT-shaped token (prefixed P1_eyJ…) from
api.hcaptcha.com/getcaptcha/{sitekey} and attaches it to a custom request
header (commonly captchatoken:) on every gated fetch. There is no visible
challenge widget, no div.h-captcha in the DOM, no checkbox to click.
Diagnosis: Look for api.hcaptcha.com/getcaptcha/{sitekey} calls in the
network panel with no visible challenge surface. Direct curl to the gated
endpoint returns {"error":"Captcha validation failed."} or similar.
Recommendation: punt to listing-only v1. kent's HCAP_HANDLER driver
requirement targets visible hCaptcha widgets only — HCaptchaHandler in
kent/driver/interstitials.py looks for div.h-captcha and clicks it.
There is no built-in driver requirement for invisible / execute-mode
hCaptcha as of 2026-05.
A nonnavigating=True request alone won't fix this either: the Playwright
driver dispatches every non-archive request through page.goto(url)
regardless of nonnavigating. When the gate is purely per-fetch-header
(captchatoken JWT, no cookie fallback), navigating the same tab to a real
case page first and then page.goto(detail_api_url) still returns
Captcha validation failed. The token exists for the duration of one SPA
fetch and isn't reachable to a follow-up navigation.
Recommended path: ship listing-only v1 with status=IN_DEVELOPMENT,
list the captcha gap in DESIGN.md, and move on. A real fix requires a kent
affordance for "execute this fetch from inside the page's JS context" (a
ViaPageEvaluate request kind, or PageElement.evaluate() exposed to
steps) that does not exist today. Don't sink hours into a per-scraper
workaround.
Episerver SSR JSON via ?expand=*&currentPageUrl=...

Symptom: The site is Episerver / Optimizely (look for
/api/episerver/v2.0/content/... calls in the network panel). The visible
/api/Foo/Bar endpoints have buggy or unreliable pagination, sort, or
filter behavior — sending page=, pageSize=, or date params seems to be
silently ignored.
Pattern: The page URL itself returns a JSON object when called with
?expand=*&currentPageUrl={routeSegment} and Accept: application/json.
The server returns the entire Episerver page object including a data field
(e.g. caseSearchResults) with the actual records. No captcha, no session.
Why it matters: This SSR variant is often more capable than the visible
APIs — it'll honour pagination/sort/filter params that the dedicated API
ignores. On Michigan,
/api/CaseSearch/AdvancedSearchCaseDetails silently ignores page=,
pageSize=, and every date param tried, while the SSR variant honours
page=1..N and pageSize up to 100.
Recommendation: When you detect Episerver, probe both endpoints before deciding the site can't paginate server-side. Often the SSR variant is what the SPA actually consumes. See Michigan (courts_michigan_gov) as the canonical exemplar.
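For probing the SSR variant, a URL builder along these lines keeps the query string readable. It is a sketch under stated assumptions: the page and pageSize parameter names and the cap of 100 come from the Michigan observation above and may differ elsewhere, the value to pass as the route segment has to be read from the network panel, and `build_ssr_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# The Accept header the pattern calls for.
SSR_HEADERS = {"Accept": "application/json"}
# Observed ceiling on Michigan; treat as an assumption for other sites.
MAX_PAGE_SIZE = 100


def build_ssr_url(page_url: str, route_segment: str, page: int = 1, page_size: int = 100) -> str:
    """Build the page URL with ?expand=*&currentPageUrl=... plus paging params."""
    query = urlencode(
        {
            "expand": "*",
            "currentPageUrl": route_segment,
            "page": page,
            "pageSize": min(page_size, MAX_PAGE_SIZE),
        },
        safe="*/",  # keep * and / literal so the URL matches what the SPA sends
    )
    return f"{page_url}?{query}"
```

Compare the response from this URL with the dedicated API's response for the same page number; if only the SSR variant changes between pages, that is the endpoint to build the scraper on.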
resultType= parallel result containers

Symptom: A single listing endpoint returns multiple result containers
(e.g. caseDetailResults, opinionResults, orderResults) in one JSON
response. A query parameter like resultType=cases | opinions | orders
selects which container is populated.
Why it matters: Sites that publish opinions/orders as separate document indexes often expose them via this parameter on the same listing endpoint that serves cases. These parallel indexes are usually the right entry points for opinion/order scrapers, even when individual cases also expose nested document arrays.
Recommendation: Test resultType=cases, resultType=opinions,
resultType=orders (or the site's equivalents) on the listing endpoint and
inspect each response. If the site exposes parallel indexes, this is
typically the right entry point for an opinions/orders scraper rather than
walking nested document arrays inside cases.
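When testing each resultType value, a quick summary of which containers actually came back populated makes the comparison easier. The sketch below assumes the containers are top-level list-valued keys in the JSON response, as in the example names above; `populated_containers` is a hypothetical helper.

```python
from __future__ import annotations


def populated_containers(payload: dict) -> dict[str, int]:
    """Map each non-empty list-valued top-level key to its item count."""
    return {
        key: len(value)
        for key, value in payload.items()
        if isinstance(value, list) and value
    }
```

If each resultType value fills a different container, the site exposes parallel indexes and each one is a candidate entry point.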