with one click
web-scraping-websocket-scraping
// Capture real-time streaming data from WebSocket connections via Playwright frame interception and curl_cffi direct clients with TLS fingerprinting.
// Capture real-time streaming data from WebSocket connections via Playwright frame interception and curl_cffi direct clients with TLS fingerprinting.
[HINT] Download the complete skill directory including SKILL.md and all related files
| name | web-scraping-websocket-scraping |
| description | Capture real-time streaming data from WebSocket connections via Playwright frame interception and curl_cffi direct clients with TLS fingerprinting. |
| tech_stack | ["playwright"] |
| language | ["python"] |
| capability | ["websocket","data-fetching"] |
| version | Playwright unversioned; curl_cffi unversioned |
| collected_at | "2025-01-01T00:00:00.000Z" |
Source: https://playwright.dev/python/docs/network, https://curl-cffi.readthedocs.io/en/latest/
Scrape real-time data from websites that push updates over persistent WebSocket connections — the protocol that powers live dashboards, trading terminals, chat systems, and streaming APIs. Unlike traditional HTTP polling, WebSocket scraping captures data as it arrives without repeatedly hammering the server.
Two complementary strategies: Playwright interception (passive — attach to an existing browser session, let the browser handle auth/cookies/handshake) and curl_cffi direct client (programmatic — open WebSocket connections with JA3/JA4 TLS fingerprint impersonation when you already know the endpoint and auth tokens).
ws:// or wss:// connections carrying payload dataPrefer this over HTTP polling when: (a) the site opens a WebSocket immediately on page load, (b) data arrives at unpredictable intervals, (c) polling would miss updates between requests.
Attach to a live page and collect every frame the server pushes:
from playwright.sync_api import sync_playwright
messages = []
def on_ws(ws):
print(f"[WS] {ws.url}")
ws.on("framereceived", lambda payload: messages.append(payload))
ws.on("close", lambda _: print("[WS] closed"))
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.on("websocket", on_ws) # fires for every new WebSocket
page.goto("https://target.site/dashboard")
page.wait_for_timeout(30_000) # collect 30s of streaming data
browser.close()
This approach requires zero knowledge of the WebSocket handshake — the browser negotiates it with the site's existing cookies and authentication.
from curl_cffi import requests
# Open WebSocket with Chrome 120 TLS fingerprint
ws = requests.ws_connect(
"wss://api.target.site/stream",
impersonate="chrome120",
headers={"Authorization": "Bearer <token>"}
)
for frame in ws:
print(frame.data) # text or bytes
This is lighter weight (no browser) but requires you to supply auth headers yourself. Pair with SSO/OAuth extraction skills to obtain tokens programmatically.
| Event | Fires when | Payload |
|---|---|---|
page.on("websocket", cb) | Any new WebSocket created by the page | WebSocket instance |
ws.on("framesent", cb) | Page sends a frame | Frame payload (text/bytes) |
ws.on("framereceived", cb) | Server sends a frame | Frame payload (text/bytes) |
ws.on("close", cb) | Connection closes | Close reason |
Access ws.url to identify which endpoint a frame belongs to — critical when a page opens multiple WebSockets.
requests.ws_connect(url, impersonate=..., headers=...) — open with fingerprintcurl_cffi.requests.async_ws_connect)| requests | aiohttp | httpx | curl_cffi | |
|---|---|---|---|---|
| WebSocket | ❌ | ✅ | ❌ | ✅ |
| TLS fingerprinting | ❌ | ❌ | ❌ | ✅ |
| Sync + Async | sync | async | both | both |
page.route() or WebSocket events seem to miss traffic, disable service workers: browser.new_context(service_workers="block").ws.on("close") and reconnect.storage_state after login so future scraper runs skip auth and go straight to WebSocket capture (web-scraping-session-state-persistence).web-scraping-auto-relogin-pattern).ws_connect headers (web-scraping-sso-oauth-extraction).web-scraping-rate-limiting-pacing).web-scraping-playwright-stealth-patchright).