playwright-scraper

Stars: 5

Forks: 1

Updated: November 30, 2025 at 04:10

Production-proven Playwright web scraping patterns with selector-first approach and robust error handling. Use when users need to build web scrapers, extract data from websites, automate browser interactions, or ask about Playwright selectors, text extraction (innerText vs textContent), regex patterns for HTML, fallback hierarchies, or scraping best practices.

Source

nathanvale

nathanvale/side-quest-marketplace-old

SKILL.md

name	playwright-scraper
description	Production-proven Playwright web scraping patterns with selector-first approach and robust error handling. Use when users need to build web scrapers, extract data from websites, automate browser interactions, or ask about Playwright selectors, text extraction (innerText vs textContent), regex patterns for HTML, fallback hierarchies, or scraping best practices.

Playwright Web Scraper

Production-proven web scraping patterns using Playwright with selector-first approach and robust error handling.

Core Principles

1. Selector-First Approach

Always prefer semantic locators over CSS selectors:

// Locators are created synchronously; await the action (click(), textContent()), not the locator.

// ✅ BEST: Semantic locators (accessible, maintainable)
page.getByRole('button', { name: 'Submit' })
page.getByText('Welcome')
page.getByLabel('Email')

// ⚠️ ACCEPTABLE: Text patterns for dynamic content
page.locator('text=/\\$\\d+\\.\\d{2}/')

// ❌ AVOID: Brittle CSS selectors
page.locator('.btn-primary')
page.locator('#submit-button')

2. Page Text Extraction

Critical difference between textContent and innerText:

// ❌ WRONG: Returns ALL text in the DOM, including hidden elements and <script>/<style> contents
const pageText = await page.textContent("body");

// ✅ CORRECT: Returns only VISIBLE text (what users see)
const pageText = await page.innerText("body");

Use case for each:

innerText("body") - Extract visible content for regex matching
textContent(selector) - Get text from specific elements

3. Regex Patterns for Extraction

Handle newlines and whitespace in HTML:

// ❌ FAILS: . doesn't match newlines (unless the s flag is set)
const match = pageText.match(/ADULT.*(\$\d+\.\d{2})/);

// ✅ WORKS: [\s\S]{0,10} matches any character including newlines
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);

Common patterns:

// Price extraction
/\$(\d+\.\d{2})/

// Date/time
/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i

// Screen number
/Screen\s+(\d+)/i
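
As a rough illustration of how the three patterns above might be applied together, the sketch below parses one block of visible page text. The `ParsedSession` shape, the `parseSessionText` name, and the sample text are assumptions made for this example; they are not part of the original skill.

```typescript
// Hypothetical result shape; field names are illustrative only.
interface ParsedSession {
  price: number | null;
  dateTime: string | null;
  screen: number | null;
}

// Return the first capture group of a match, or null when there is no match.
function firstGroup(match: RegExpMatchArray | null): string | null {
  const raw = match?.[1];
  return raw === undefined ? null : raw;
}

// Apply the price, date/time, and screen patterns to visible page text
// (for example, the string returned by innerText("body")).
function parseSessionText(pageText: string): ParsedSession {
  const price = firstGroup(pageText.match(/\$(\d+\.\d{2})/));
  const dateTime = firstGroup(
    pageText.match(/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i),
  );
  const screen = firstGroup(pageText.match(/Screen\s+(\d+)/i));

  return {
    price: price === null ? null : Number(price),
    dateTime,
    screen: screen === null ? null : Number(screen),
  };
}

// Made-up sample text with the values spread across several lines.
const sampleText = "ADULT\n$24.50\nSun 30 Nov 2025, 7:30pm\nScreen 4";
console.log(parseSessionText(sampleText));
```

Each field falls back to null independently, so one missing value does not discard the others.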

4. Fallback Hierarchy

Implement 4-tier fallback for robustness:

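import type { Page } from 'playwright';

async function extractField(page: Page, fieldName: string): Promise<string | null> {
  // Short timeout so a missing element fails fast instead of waiting for the default action timeout
  const timeout = 2000;

  // Tier 1: Primary semantic selector
  try {
    const value = await page.getByLabel(fieldName).textContent({ timeout });
    if (value) return value.trim();
  } catch {}

  // Tier 2: Alternative selectors
  try {
    const value = await page.locator(`[aria-label="${fieldName}"]`).textContent({ timeout });
    if (value) return value.trim();
  } catch {}

  // Tier 3: Text pattern matching
  const pageText = await page.innerText("body");
  // Escape regex metacharacters in the field name. The lazy {0,20}? stops at the first
  // candidate value; a greedy {0,20} would skip ahead and capture only the tail of it.
  const escaped = fieldName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`${escaped}[\\s\\S]{0,20}?([A-Z0-9].*)`, 'i');
  const match = pageText.match(pattern);
  if (match?.[1]) return match[1].trim();

  // Tier 4: Return null (caller handles missing data)
  return null;
}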
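
One way to keep the tiers from turning into repeated try/catch blocks is a small runner that tries strategies in order and reports which one produced the value. This is a minimal sketch, not the skill's own implementation: `Strategy`, `TierResult`, and `firstSuccessful` are hypothetical names, and the demo strategies are stand-ins for real Playwright calls.

```typescript
// One extraction attempt: resolves to a string, or null when nothing was found.
type Strategy = { name: string; run: () => Promise<string | null> };

interface TierResult {
  value: string | null;
  tier: string | null; // name of the strategy that produced the value
}

// Try each strategy in order. A thrown error or an empty result moves on to the next tier.
async function firstSuccessful(strategies: Strategy[]): Promise<TierResult> {
  for (const { name, run } of strategies) {
    try {
      const value = (await run())?.trim();
      if (value) return { value, tier: name };
    } catch {
      // Deliberately ignored: the next tier is the recovery path.
    }
  }
  return { value: null, tier: null };
}

// Stand-in strategies. In a scraper, each run() would wrap a Playwright call such as
// page.getByLabel(name).textContent({ timeout: 2000 }).
const demoStrategies: Strategy[] = [
  { name: "label", run: async () => { throw new Error("not found"); } },
  { name: "aria-label", run: async () => "   " },
  { name: "regex", run: async () => " $24.50 " },
];

firstSuccessful(demoStrategies).then((result) => console.log(result));
```

The returned `tier` can be stored directly in a `selectorsUsed` record, so the fallback logic and the selector tracking share one code path.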

5. Error Handling Patterns

// ✅ GOOD: Try-catch that rethrows with context
try {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading .message
  const reason = error instanceof Error ? error.message : String(error);
  throw new Error(`Failed to navigate to ${url}: ${reason}`);
}

// ✅ GOOD: Bounded wait for an optional element
try {
  await page.waitForSelector('text="Loading complete"', { timeout: 5000 });
} catch {
  // Continue anyway - loading indicator is optional
}
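
The two cases above (a required step that should fail loudly, and an optional step that may be skipped) can be captured in two small helpers. This is a sketch under assumptions: `errorMessage`, `required`, and `optional` are hypothetical names introduced here, not Playwright APIs.

```typescript
// Normalise an unknown thrown value into a readable message.
function errorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error);
}

// Run a required step; on failure, rethrow with context about what was attempted.
async function required<T>(what: string, step: () => Promise<T>): Promise<T> {
  try {
    return await step();
  } catch (error) {
    throw new Error(`Failed to ${what}: ${errorMessage(error)}`);
  }
}

// Run an optional step; on failure, return the fallback instead of throwing.
async function optional<T>(step: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await step();
  } catch {
    return fallback;
  }
}
```

In a scraper this might read `await required('navigate to the page', () => page.goto(url))` for navigation, and `await optional(() => page.waitForSelector(sel, { timeout: 5000 }), null)` for an optional loading indicator.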

6. Image Selection Best Practices

// first() returns a Locator, not a Promise, so no await is needed here.

// ❌ WRONG: Grabs first matching image (could be from carousel/ads)
const poster = page.locator('img[src*="movies"]').first();

// ✅ CORRECT: Target specific hero/header image
const poster = page.locator('img[src*="movies/headers"]').first();

// ✅ BETTER: Use semantic structure
const poster = page.locator('header img, [role="banner"] img').first();

7. Clean Separation of Concerns

Each scraper method should have a single responsibility:

// ✅ GOOD: Each method scrapes ONE resource type
interface ScraperClient {
  scrapeMovies(): Promise<{ movies: Movie[] }>;
  scrapeSession(sessionId: string): Promise<SessionData>;
  scrapePricing(sessionId: string): Promise<PricingData>;
}

// ❌ BAD: Session method returns movie data (violates SRP)
interface ScraperClient {
  scrapeSession(sessionId: string): Promise<{
    session: SessionData;
    movieTitle: string;  // ❌ Cross-concern
    moviePoster: string; // ❌ Cross-concern
  }>;
}

Composition over mixing concerns:

// ✅ Compose data from multiple focused scrapes
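const { movies } = await client.scrapeMovies(); // scrapeMovies() resolves to { movies: Movie[] }
const movie = movies.find(m => m.sessionTimes.includes(sessionId));
if (!movie) throw new Error(`No movie found for session ${sessionId}`);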
const session = await client.scrapeSession(sessionId);
const pricing = await client.scrapePricing(sessionId);

// Build composite response
const ticket = {
  movieTitle: movie.title,        // From movies scrape
  moviePoster: movie.thumbnail,   // From movies scrape
  sessionDateTime: session.dateTime, // From session scrape
  pricing: pricing,               // From pricing scrape
};
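
Because the composition step needs no browser, it can live in a pure function that is tested with plain objects. The sketch below assumes minimal shapes for `Movie`, `SessionData`, and `PricingData`; the original does not define them, so the fields shown are hypothetical, as is the `buildTicket` name.

```typescript
// Hypothetical shapes: only the fields used below are modelled.
interface Movie {
  title: string;
  thumbnail: string;
  sessionTimes: string[]; // session identifiers, as used in the lookup above
}
interface SessionData { dateTime: string }
interface PricingData { adult: number }

interface Ticket {
  movieTitle: string;
  moviePoster: string;
  sessionDateTime: string;
  pricing: PricingData;
}

// Pure composition: combines the results of three focused scrapes.
// Returns null when no movie lists the given session.
function buildTicket(
  movies: Movie[],
  sessionId: string,
  session: SessionData,
  pricing: PricingData,
): Ticket | null {
  const movie = movies.find((m) => m.sessionTimes.includes(sessionId));
  if (!movie) return null;
  return {
    movieTitle: movie.title,
    moviePoster: movie.thumbnail,
    sessionDateTime: session.dateTime,
    pricing,
  };
}
```

Keeping this step free of Playwright calls means a missing movie surfaces as an explicit null rather than as a property access on undefined.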

Implementation Checklist

When building a scraper, follow this sequence:

Phase 1: Setup

Install Playwright: bun add playwright
Create browser instance with headless option
Set user agent and viewport for realistic browsing

Phase 2: Navigation

Navigate to target URL
Wait for page load (prefer domcontentloaded; Playwright's docs discourage relying on networkidle)
Handle any cookie banners / popups

Phase 3: Data Extraction

Use innerText("body") for visible page text
Extract data with semantic selectors first
Add fallback selectors for each field
Use regex patterns for dynamic content
Validate extracted data format
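
The last step of this phase, validating the extracted format, could look something like the sketch below. The `RawRecord` fields and the accepted formats are assumptions chosen to match the price and date/time patterns shown earlier; a real scraper would validate whatever fields it extracts.

```typescript
// Hypothetical raw output of one scrape, before any type conversion.
interface RawRecord {
  title: string | null;
  price: string | null;
  dateTime: string | null;
}

type ValidationResult = { ok: true } | { ok: false; problems: string[] };

const PRICE_FORMAT = /^\$\d+\.\d{2}$/;
const DATE_TIME_FORMAT = /^\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m$/i;

// Collect every problem instead of stopping at the first, so one log line
// shows everything that drifted on the page.
function validateRecord(record: RawRecord): ValidationResult {
  const problems: string[] = [];

  if (!record.title || record.title.trim().length === 0) {
    problems.push("title is missing");
  }
  if (!record.price || !PRICE_FORMAT.test(record.price.trim())) {
    problems.push("price is not in $0.00 format");
  }
  if (!record.dateTime || !DATE_TIME_FORMAT.test(record.dateTime.trim())) {
    problems.push("dateTime is not in 'D Mon YYYY, H:MMam' format");
  }

  return problems.length === 0 ? { ok: true } : { ok: false, problems };
}
```

A failed validation is a useful signal that a fallback tier matched the wrong text, which is easier to catch here than downstream.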

Phase 4: Robustness

Add error handling with clear messages
Implement timeout protection
Track which selectors worked (selectorsUsed)
Test against actual page HTML

Phase 5: Testing

Test with valid data
Test with missing fields (use fallbacks)
Test with network errors
Verify no data leaks between scrapes

Common Patterns

Browser Setup

import { chromium, type Browser, type Page } from 'playwright';

async function createBrowser(): Promise<Browser> {
  return await chromium.launch({
    headless: true, // Set false for debugging
  });
}

async function createPage(browser: Browser): Promise<Page> {
  const page = await browser.newPage({
    viewport: { width: 1280, height: 720 },
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  });
  return page;
}

Scraper Client Pattern

import { chromium } from 'playwright';

export async function createScraperClient() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  return {
    async scrapeData(url: string) {
      await page.goto(url, { waitUntil: 'domcontentloaded' });

      const pageText = await page.innerText("body");
      const selectorsUsed: Record<string, string> = {};

      // Extract fields with fallbacks
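      let field1: string | null = null;
      try {
        // Bound the wait so a missing heading falls through to the regex quickly
        field1 = await page.getByRole('heading').textContent({ timeout: 2000 });
        selectorsUsed.field1 = "getByRole";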
      } catch {
        const match = pageText.match(/Title:\s*(.+)/i);
        if (match) {
          field1 = match[1];
          selectorsUsed.field1 = "regex";
        }
      }

      return { field1, selectorsUsed };
    },

    async close() {
      await browser.close();
    },
  };
}
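
Since the client owns a browser, every caller has to remember to call close(). A small wrapper can make that automatic. This is a sketch, with `Closable` and `withClient` as hypothetical names; it works with any object that exposes an async close(), including the client returned above.

```typescript
// Anything that holds a resource and can release it.
interface Closable {
  close(): Promise<void>;
}

// Create a client, run the callback, and always close the client afterwards,
// whether the callback resolves or throws.
async function withClient<C extends Closable, T>(
  create: () => Promise<C>,
  use: (client: C) => Promise<T>,
): Promise<T> {
  const client = await create();
  try {
    return await use(client);
  } finally {
    await client.close();
  }
}
```

With the client above, usage might look like `const result = await withClient(createScraperClient, (c) => c.scrapeData(url));`, which keeps the try/finally in one place.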

CLI Integration

#!/usr/bin/env bun

import { createScraperClient } from './scraper-client.ts';

async function main() {
  const args = process.argv.slice(2);
  const url = args[0];

  if (!url) {
    console.error('Usage: bun run cli.ts <url>');
    process.exit(1);
  }

  const client = await createScraperClient();

  try {
    const result = await client.scrapeData(url);
    console.log(JSON.stringify(result, null, 2));
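  } catch (error) {
    // `error` is `unknown` in strict TypeScript; narrow it before reading .message
    const reason = error instanceof Error ? error.message : String(error);
    console.error(`Scraping failed: ${reason}`);
    // Set the exit code instead of calling process.exit(), which would skip the finally block
    process.exitCode = 1;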
  } finally {
    await client.close();
  }
}

main();

Debugging Tips

Chrome DevTools Integration

Use the Chrome DevTools MCP server to inspect actual page structure:

// In your conversation with Claude:
// "Use Chrome DevTools to inspect the pricing page"
// Claude will use: take_snapshot, evaluate_script, etc.

Logging Selectors

Always track which selectors worked:

const selectorsUsed: Record<string, string> = {};

// After each extraction, record the strategy that succeeded,
// e.g. "getByRole", "regex", or "fallback-1"
selectorsUsed.fieldName = "getByRole";

// Return in response for debugging
return { data, selectorsUsed };
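
Once each scrape returns its selectorsUsed record, the records can be aggregated to spot drift, for example a field that has quietly started to rely on its regex fallback. The sketch below is one possible aggregation; `summarizeSelectors` is a hypothetical helper and the sample records are made up.

```typescript
type SelectorsUsed = Record<string, string>;
type SelectorSummary = Record<string, Record<string, number>>;

// Count, per field, how often each strategy produced the value.
function summarizeSelectors(runs: SelectorsUsed[]): SelectorSummary {
  const summary: SelectorSummary = {};
  for (const run of runs) {
    for (const field of Object.keys(run)) {
      const strategy = run[field];
      if (strategy === undefined) continue;
      const counts = summary[field] ?? (summary[field] = {});
      counts[strategy] = (counts[strategy] ?? 0) + 1;
    }
  }
  return summary;
}

// Made-up records from three scrapes.
const sampleRuns: SelectorsUsed[] = [
  { title: "getByRole", price: "regex" },
  { title: "getByRole", price: "getByLabel" },
  { title: "regex" },
];
console.log(summarizeSelectors(sampleRuns));
```

A rising share of fallback strategies for a field is usually an early hint that the primary selector needs attention.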

Visual Debugging

// Take screenshot at key points
await page.screenshot({ path: 'debug-step-1.png' });

// Highlight element before extraction
await page.locator(selector).highlight();

Anti-Patterns to Avoid

❌ Using hypothetical attributes

// DON'T assume data attributes exist
await page.locator('[data-price]'); // Might not exist!

❌ Over-relying on CSS classes

// DON'T use implementation-specific classes
await page.locator('.MuiButton-root-xyz'); // Will break when CSS changes

❌ Ignoring visible vs. hidden text

// DON'T use textContent for regex extraction
const text = await page.textContent("body"); // Includes hidden elements and <script>/<style> contents!

❌ Not handling missing data

// DON'T assume data exists
const price = await page.locator('.price').textContent(); // Might throw!

// DO convert a failed lookup into null so the caller can handle missing data
const price = await page.locator('.price').textContent().catch(() => null);

Production Checklist

Before deploying a scraper:

All selectors have fallbacks
Error messages are clear and actionable
Browser closes properly (use try/finally)
No hardcoded delays (use waitForSelector)
Respects rate limits / politeness delays
Tracks which selectors worked for debugging
Tests pass with missing/malformed data
No cross-concern data mixing
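
For the politeness-delay item, one possible approach is to space requests by a minimum interval rather than sleeping for a fixed time after every call. This sketch assumes sequential use (it does not serialize concurrent calls), and `delayBeforeNext`, `sleep`, and `polite` are hypothetical names introduced for illustration.

```typescript
// Milliseconds to wait so that consecutive requests start at least
// minIntervalMs apart. Pure function: the clock is passed in.
function delayBeforeNext(
  lastRequestAt: number | null,
  now: number,
  minIntervalMs: number,
): number {
  if (lastRequestAt === null) return 0;
  return Math.max(0, lastRequestAt + minIntervalMs - now);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wrap an async request function so that sequential calls are spaced out.
function polite<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  minIntervalMs: number,
): (...args: A) => Promise<R> {
  let lastRequestAt: number | null = null;
  return async (...args: A): Promise<R> => {
    const wait = delayBeforeNext(lastRequestAt, Date.now(), minIntervalMs);
    if (wait > 0) await sleep(wait);
    lastRequestAt = Date.now();
    return fn(...args);
  };
}
```

This differs from the hardcoded delays the checklist warns against: the wait is about load on the target site, not about guessing when an element has rendered. A wrapped scrape might be created with `const politeScrape = polite((url: string) => client.scrapeData(url), 1000);`.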

Resources

Playwright Selectors: https://playwright.dev/docs/selectors
Playwright Best Practices: https://playwright.dev/docs/best-practices
Chrome DevTools MCP: Use for live page inspection