| name | playwright-scraper |
| description | Production-proven Playwright web scraping patterns with a selector-first approach and robust error handling. Use when users need to build web scrapers, extract data from websites, automate browser interactions, or ask about Playwright selectors, text extraction (innerText vs textContent), regex patterns for HTML, fallback hierarchies, or scraping best practices. |
Playwright Web Scraper
Production-proven web scraping patterns using Playwright, with a selector-first approach and robust error handling.
Core Principles
1. Selector-First Approach
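Prefer semantic locators (role, text, label) over CSS selectors, which break when markup or styling changes:
// Preferred: semantic locators that mirror what the user sees
page.getByRole('button', { name: 'Submit' });
page.getByText('Welcome');
page.getByLabel('Email');
// Also fine: text selector with a regex (matches prices such as $12.50)
page.locator('text=/\\$\\d+\\.\\d{2}/');
// Avoid where possible: class and ID selectors tied to the current markup
page.locator('.btn-primary');
page.locator('#submit-button');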
2. Page Text Extraction
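Critical difference between textContent and innerText:
// textContent: every text node, including hidden elements and <script>/<style> contents
const rawText = await page.textContent("body");
// innerText: only the text as rendered, with line breaks that follow the layout
const pageText = await page.innerText("body");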
Use case for each:
innerText("body") - Extract visible content for regex matching
textContent(selector) - Get text from specific elements
3. Regex Patterns for Extraction
Handle newlines and whitespace in HTML:
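// Fragile: [^$]* skips any amount of text, so it can capture a price far from the label
const looseMatch = pageText.match(/ADULT[^$]*(\$\d+\.\d{2})/);
// Better: [\s\S] also matches newlines, and {0,10} keeps the price close to the label
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);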
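To make the difference between the two patterns concrete, the sketch below runs both against two hypothetical page texts, one of which has no price on the ADULT row. The sample strings and the helper name `adultPrice` are assumptions made for illustration, not part of any real site or API.

```typescript
// Unbounded: skips any run of non-"$" characters, newlines included.
const UNBOUNDED = /ADULT[^$]*(\$\d+\.\d{2})/;
// Bounded: allows at most 10 characters of any kind between label and price.
const BOUNDED = /ADULT[\s\S]{0,10}(\$\d+\.\d{2})/;

// Hypothetical helper: returns the captured price, or null if nothing matched.
function adultPrice(pageText: string, pattern: RegExp): string | null {
  return pageText.match(pattern)?.[1] ?? null;
}

// Label and price on separate lines, as innerText often renders them.
const normalText = "ADULT\n$18.50\nCHILD\n$12.00";
// ADULT has no price here; the next price on the page belongs to CHILD.
const soldOutText = "ADULT sold out\nCHILD\n$12.00";

const results = {
  normalUnbounded: adultPrice(normalText, UNBOUNDED),   // "$18.50"
  normalBounded: adultPrice(normalText, BOUNDED),       // "$18.50"
  soldOutUnbounded: adultPrice(soldOutText, UNBOUNDED), // "$12.00" (wrong row)
  soldOutBounded: adultPrice(soldOutText, BOUNDED),     // null
};
```

The bounded pattern trades recall for precision: it returns nothing when the price is not near the label, which is usually easier to detect and handle than a silently wrong value.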
Common patterns:
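// Price, e.g. "$12.50" (captures the amount without the dollar sign)
/\$(\d+\.\d{2})/
// Date and time, e.g. "12 Mar 2024, 7:30pm"
/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i
// Screen number, e.g. "Screen 4"
/Screen\s+(\d+)/i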
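As a rough illustration of how these patterns might be applied together, the following sketch parses a block of visible page text into fields. The `TicketFields` shape, the function names, and the sample text are all hypothetical; real page text will need its own patterns.

```typescript
interface TicketFields {
  price: string | null;
  dateTime: string | null;
  screen: string | null;
}

// Returns the first capture group of `pattern` in `text`, or null.
function firstGroup(text: string, pattern: RegExp): string | null {
  return text.match(pattern)?.[1] ?? null;
}

// Applies the three common patterns to one block of visible page text.
function parseTicketText(pageText: string): TicketFields {
  return {
    price: firstGroup(pageText, /\$(\d+\.\d{2})/),
    dateTime: firstGroup(
      pageText,
      /(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i,
    ),
    screen: firstGroup(pageText, /Screen\s+(\d+)/i),
  };
}

// Hypothetical innerText output for a ticket page.
const sampleTicket = parseTicketText("ADULT\n$18.50\nScreen 4\n12 Mar 2024, 7:30pm");
// sampleTicket: { price: "18.50", dateTime: "12 Mar 2024, 7:30pm", screen: "4" }
```

Keeping the parsing in a pure function like this means the patterns can be tested against saved page text without launching a browser.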
4. Fallback Hierarchy
Implement a 4-tier fallback for robustness, trying the most specific strategy first:
async function extractField(page: Page, fieldName: string): Promise<string | null> {
  // Tier 1: semantic locator
  try {
    const value = await page.getByLabel(fieldName).textContent();
    if (value) return value.trim();
  } catch {}
  // Tier 2: attribute selector
  try {
    const value = await page.locator(`[aria-label="${fieldName}"]`).textContent();
    if (value) return value.trim();
  } catch {}
  // Tier 3: regex over the visible page text (escape the field name first)
  const pageText = await page.innerText("body");
  const escaped = fieldName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`${escaped}[\\s\\S]{0,20}([A-Z0-9].+)`, 'i');
  const match = pageText.match(pattern);
  if (match?.[1]) return match[1].trim();
  // Tier 4: nothing matched, so fail gracefully
  return null;
}
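The same idea can be expressed as a small generic helper that tries a list of strategies in order and reports which one succeeded. This is a minimal sketch rather than a definitive implementation; `Strategy`, `Extraction`, and `firstSuccessful` are hypothetical names, and the helper deliberately has no Playwright dependency so that it can be exercised with stubs.

```typescript
// A named extraction attempt; `run` may throw or return null/empty text.
interface Strategy {
  label: string;
  run: () => Promise<string | null>;
}

interface Extraction {
  value: string | null;
  source: string | null; // label of the strategy that succeeded, if any
}

// Tries each strategy in order and returns the first non-empty result.
async function firstSuccessful(strategies: Strategy[]): Promise<Extraction> {
  for (const strategy of strategies) {
    try {
      const text = (await strategy.run())?.trim();
      if (text) return { value: text, source: strategy.label };
    } catch {
      // A failed tier is expected; fall through to the next one.
    }
  }
  return { value: null, source: null };
}
```

With Playwright, each `run` would wrap one tier, for example `() => page.getByLabel(fieldName).textContent({ timeout: 2000 })`. A short per-tier timeout like this is an assumption worth considering, since a missing element otherwise waits for the default timeout before the next tier is tried.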
5. Error Handling Patterns
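// Fail fast on navigation errors, with context about what was being loaded
try {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  throw new Error(`Failed to navigate to ${url}: ${message}`);
}
// Optional wait: continue if the indicator never appears
try {
  await page.waitForSelector('text="Loading complete"', { timeout: 5000 });
} catch {
  // Timed out; proceed with whatever has loaded
}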
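In strict TypeScript the value bound in a `catch` clause is typed `unknown`, so reading `.message` directly does not compile. One possible way to handle that is a small helper, sketched below; `toErrorMessage` and `navigationError` are hypothetical names introduced only for this example.

```typescript
// Converts any thrown value into a readable message.
function toErrorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// Wraps a failure with context while keeping the original message.
function navigationError(url: string, cause: unknown): Error {
  return new Error(`Failed to navigate to ${url}: ${toErrorMessage(cause)}`);
}
```

A `catch (error)` block around `page.goto(url)` could then simply `throw navigationError(url, error)`.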
6. Image Selection Best Practices
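// Too broad: matches every image whose URL contains "movies"
const anyMovieImage = page.locator('img[src*="movies"]').first();
// More specific: narrows the match by URL path
const headerImage = page.locator('img[src*="movies/headers"]').first();
// Structural: targets the image inside the page header or banner landmark
const poster = page.locator('header img, [role="banner"] img').first();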
7. Clean Separation of Concerns
Each scraper method should have a single responsibility:
// Good: each method returns one kind of data
interface ScraperClient {
  scrapeMovies(): Promise<{ movies: Movie[] }>;
  scrapeSession(sessionId: string): Promise<SessionData>;
  scrapePricing(sessionId: string): Promise<PricingData>;
}
// Avoid: one method that mixes session data with movie title and poster
interface ScraperClient {
  scrapeSession(sessionId: string): Promise<{
    session: SessionData;
    movieTitle: string;
    moviePoster: string;
  }>;
}
Composition over mixing concerns:
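// scrapeMovies() resolves to { movies }, so destructure the array
const { movies } = await client.scrapeMovies();
const movie = movies.find(m => m.sessionTimes.includes(sessionId));
if (!movie) throw new Error(`No movie found for session ${sessionId}`);
const session = await client.scrapeSession(sessionId);
const pricing = await client.scrapePricing(sessionId);
// Combine the independent results at the call site
const ticket = {
  movieTitle: movie.title,
  moviePoster: movie.thumbnail,
  sessionDateTime: session.dateTime,
  pricing,
};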
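Because the composition step needs no browser, it can be pulled into a pure function and tested on its own. The sketch below assumes minimal shapes for `Movie`, `SessionData`, and `PricingData`; the real types are not shown in this document and may differ, and `buildTicket` is a hypothetical name.

```typescript
// Hypothetical shapes; the real Movie, SessionData and PricingData may differ.
interface Movie { title: string; thumbnail: string; sessionTimes: string[] }
interface SessionData { dateTime: string }
interface PricingData { adult: string }

interface Ticket {
  movieTitle: string;
  moviePoster: string;
  sessionDateTime: string;
  pricing: PricingData;
}

// Pure composition: combines already-scraped pieces, or returns null
// when no movie lists the given session.
function buildTicket(
  movies: Movie[],
  sessionId: string,
  session: SessionData,
  pricing: PricingData,
): Ticket | null {
  const movie = movies.find(m => m.sessionTimes.some(id => id === sessionId));
  if (!movie) return null;
  return {
    movieTitle: movie.title,
    moviePoster: movie.thumbnail,
    sessionDateTime: session.dateTime,
    pricing,
  };
}
```

Returning `null` rather than throwing is a design choice for this sketch; a caller that treats a missing movie as a hard failure could throw instead.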
Implementation Checklist
When building a scraper, follow this sequence:
Phase 1: Setup
Phase 2: Navigation
Phase 3: Data Extraction
Phase 4: Robustness
Phase 5: Testing
Common Patterns
Browser Setup
import { chromium, type Browser, type Page } from 'playwright';
async function createBrowser(): Promise<Browser> {
  return await chromium.launch({
    headless: true,
  });
}
async function createPage(browser: Browser): Promise<Page> {
  const page = await browser.newPage({
    viewport: { width: 1280, height: 720 },
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  });
  return page;
}
Scraper Client Pattern
export async function createScraperClient() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  return {
    async scrapeData(url: string) {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      const pageText = await page.innerText("body");
      // Records which strategy produced each field
      const selectorsUsed: Record<string, string> = {};
      let field1: string | null = null;
      try {
        // Primary: semantic locator; first() avoids a strict-mode error when several headings exist
        field1 = await page.getByRole('heading').first().textContent();
        selectorsUsed.field1 = "getByRole";
      } catch {
        // Fallback: regex over the visible page text
        const match = pageText.match(/Title:\s*(.+)/i);
        if (match) {
          field1 = match[1];
          selectorsUsed.field1 = "regex";
        }
      }
      return { field1, selectorsUsed };
    },
    async close() {
      await browser.close();
    },
  };
}
CLI Integration
#!/usr/bin/env bun
import { createScraperClient } from './scraper-client.ts';
async function main() {
  const args = process.argv.slice(2);
  const url = args[0];
  if (!url) {
    console.error('Usage: bun run cli.ts <url>');
    process.exit(1);
  }
  const client = await createScraperClient();
  try {
    const result = await client.scrapeData(url);
    console.log(JSON.stringify(result, null, 2));
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    console.error(`Scraping failed: ${message}`);
    // Set the exit code instead of calling process.exit(), which would skip the finally block
    process.exitCode = 1;
  } finally {
    // Always close the browser, even after a failure
    await client.close();
  }
}
main();
Debugging Tips
Chrome DevTools Integration
Use the Chrome DevTools MCP server to inspect the actual page structure.
Logging Selectors
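Always track which selector strategy produced each field:
const selectorsUsed: Record<string, string> = {};
// Record the strategy that succeeded: "getByRole", "regex", "fallback-1", ...
selectorsUsed.fieldName = "getByRole";
// Return the log alongside the scraped data
return { data, selectorsUsed };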
Visual Debugging
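// Capture the page state at a given step
await page.screenshot({ path: 'debug-step-1.png' });
// Highlight the matched element(s) on screen when running a headed browser
await page.locator(selector).highlight();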
Anti-Patterns to Avoid
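❌ Using hypothetical attributes
// Assumes a data-price attribute exists without checking the real page
await page.locator('[data-price]');
❌ Over-relying on CSS classes
// Generated class names like this one can change between builds
await page.locator('.MuiButton-root-xyz');
❌ Ignoring visible vs. hidden text
// textContent also returns hidden text; use innerText for visible content
const text = await page.textContent("body");
❌ Not handling missing data
// Bad: throws when .price never appears
const price = await page.locator('.price').textContent();
// Better: resolves to null when .price is missing
const price = await page.locator('.price').textContent().catch(() => null);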
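Once a lookup can resolve to `null`, the code that consumes it has to accept `null` as well. The following sketch shows one way to do that for a price string; `parsePrice` is a hypothetical helper and the `$` followed by two decimals format is an assumption carried over from the patterns used earlier.

```typescript
// Parses text such as "$18.50" into a number. Returns null when the text
// is missing or does not contain a price in the expected format.
function parsePrice(text: string | null | undefined): number | null {
  if (!text) return null;
  const match = text.match(/\$(\d+\.\d{2})/);
  return match ? Number(match[1]) : null;
}
```

Passing the result of `textContent().catch(() => null)` straight into a function like this keeps the missing-data case explicit all the way to the caller.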
Production Checklist
Before deploying a scraper:
Resources