Run any Skill in Manus with one click

$pwd:

web-scraping-csrf-token-handling

Name: Web Scraping Csrf Token Handling
Author: mhbzhy-lost

// Dual-source CSRF token extraction from form hidden inputs and response headers with auto-refresh and injection across Playwright and httpx.

Run Skill in Manus

$ git log --oneline --stat

stars:0

forks:0

updated:May 6, 2026 at 09:03

SKILL.md

readonly

package.json

"author": "mhbzhy-lost"

"repository": "mhbzhy-lost/claude-personal-config"

View GitHub Repository

$ install --globalskills.sh

$ download --local

Run Skill in Manus

[HINT] Download the complete skill directory including SKILL.md and all related files

Run any Skill with one click

name	web-scraping-csrf-token-handling
description	Dual-source CSRF token extraction from form hidden inputs and response headers with auto-refresh and injection across Playwright and httpx.
tech_stack	["http","web"]
language	["python"]
capability	["auth","http-client"]
version	Playwright unversioned, httpx unversioned
collected_at	"2025-01-20T00:00:00.000Z"

CSRF Token Handling

Source: https://playwright.dev/python/docs/network, https://www.python-httpx.org/quickstart/

Purpose

Extract, inject, and auto-refresh CSRF tokens from two sources — HTML form hidden <input> elements and HTTP response headers (X-CSRF-Token, XSRF-TOKEN, X-CSRFToken). Handles the full token lifecycle so scrapers can transparently survive token expiration across Playwright browser sessions and httpx programmatic requests.

When to Use

Scraping sites protected by CSRF middleware (Django, Laravel, Spring Security, Express csurf, Rails)
Automating multi-step form submissions where each form embeds its own CSRF token
Building a session-persistent scraper that must survive token rotation
Bridging a Playwright-based login flow to httpx for high-throughput scraping while maintaining valid CSRF tokens

Basic Usage

Dual-Source Extraction with Playwright

from playwright.sync_api import sync_playwright

csrf_token = None

def capture_csrf(response):
    """Extract CSRF from response headers."""
    global csrf_token
    for name in ['x-csrf-token', 'xsrf-token', 'x-csrftoken']:
        if name in response.headers:
            csrf_token = response.headers[name]
            return

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", capture_csrf)

    # Navigate to form page
    page.goto("https://target-site.com/form")

    # Source 2: extract from hidden input
    csrf_input = page.locator('input[name="csrf_token"]').input_value()
    csrf_token = csrf_token or csrf_input

httpx CSRF Client with Auto-Refresh

import httpx
from bs4 import BeautifulSoup

class CSRFSession:
    def __init__(self):
        self.client = httpx.Client()
        self.csrf_token = None

    def _extract_and_store(self, response):
        for h in ['x-csrf-token', 'xsrf-token', 'x-csrftoken']:
            if h in response.headers:
                self.csrf_token = response.headers[h]
                return
        soup = BeautifulSoup(response.text, 'html.parser')
        meta = soup.find('meta', attrs={'name': 'csrf-token'})
        if meta:
            self.csrf_token = meta['content']
            return
        hidden = soup.find('input', attrs={'name': 'csrf_token'})
        if hidden:
            self.csrf_token = hidden.get('value')

    def request(self, method, url, **kwargs):
        if self.csrf_token and method.upper() in ['POST', 'PUT', 'PATCH', 'DELETE']:
            headers = kwargs.get('headers', {})
            headers['x-csrf-token'] = self.csrf_token
            kwargs['headers'] = headers
        response = self.client.request(method, url, **kwargs)
        self._extract_and_store(response)

        # Auto-refresh on 403/419
        if response.status_code in [403, 419]:
            form_resp = self.client.get(url)
            self._extract_and_store(form_resp)
            if self.csrf_token:
                kwargs.setdefault('headers', {})['x-csrf-token'] = self.csrf_token
            response = self.client.request(method, url, **kwargs)
        return response

Key APIs (Summary)

Playwright page.on("response") — capture CSRF tokens from every HTTP response header
Playwright page.route() — intercept and modify outgoing request headers (inject CSRF)
Playwright page.locator().input_value() — extract CSRF from form hidden inputs
httpx response.headers — case-insensitive header access for token extraction
httpx client.request(method, url, headers=...) — inject CSRF header into outgoing requests
httpx Cookies — manage CSRF cookie when using double-submit cookie pattern
BeautifulSoup soup.find() — extract CSRF from <meta name="csrf-token"> tags

Caveats

Header name variance: Django uses X-CSRFToken, Angular/Laravel use X-XSRF-TOKEN, others use X-CSRF-Token. Always match case-insensitively across multiple variants.
Token scoping: Some sites scope CSRF tokens per-form, not per-session. Re-extract after each page navigation.
Double-submit cookie pattern: Some implementations require both a cookie AND a header with the same value. Keep cookie jar and header synchronized.
Token refresh race condition: When concurrent requests all detect 403 and all re-fetch the token, use a lock or deduplicate the refresh GET.
Playwright route vs. page.on("request"): page.on("request") is read-only — use page.route() to actually modify outgoing headers.
SPA token refresh: Single-page apps may refresh CSRF tokens via XHR responses, not full page loads. Monitor all responses, not just HTML.

Composition Hints

Pair with session-state-persistence to save CSRF tokens as part of the serialized session state alongside cookies and auth tokens.
Pair with auto-relogin-pattern so that when a 403 triggers CSRF token refresh, an expired session also triggers re-login transparently.
Pair with multi-step-form-flow — each form step may carry its own CSRF token; the state machine should re-extract on each transition.
Use rate-limiting-pacing between CSRF refresh retries to avoid tripping WAF rate limits when multiple 403s fire in rapid succession.

name	web-scraping-csrf-token-handling
description	Dual-source CSRF token extraction from form hidden inputs and response headers with auto-refresh and injection across Playwright and httpx.
tech_stack	["http","web"]
language	["python"]
capability	["auth","http-client"]
version	Playwright unversioned, httpx unversioned
collected_at	"2025-01-20T00:00:00.000Z"