Run any Skill in Manus with one click

cf-crawl

Crawl entire websites using Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables.

Run Skill in Manus

Overview

Install command

npx skills add https://github.com/davila7/claude-code-templates --skill cf-crawl

Copy and paste this command into Claude Code to install the skill

Source

davila7/claude-code-templates

Stars27,748

Forks2,864

UpdatedMarch 13, 2026 at 01:22

SKILL.md

readonly

More from this repository

same repository

pdf-fill-studio

davila7/claude-code-templates

Fill any PDF locally and place each value precisely in a visual editor. Use when the user wants to fill out a PDF form, enter data into a PDF, complete a tax/insurance/bank form, or position text on a flat/scanned PDF. Handles flat (field-less) PDFs, per-character (comb) fields, and native AcroForm fields; leaves the signature blank for the user to sign.

2026-05-2927.7k

android-cicd

davila7/claude-code-templates

Automated Android CI/CD pipeline to Google Play — supports TWA, React Native, Flutter, and native Android. Run npx android-cicd to set up keystore generation, GitHub Secrets, and a multi-stage workflow (internal/alpha/beta/production) with auto-bump versionCode.

2026-05-2327.7k

bleu

davila7/claude-code-templates

Use this skill whenever a developer wants to turn an idea into a complete, production-ready, end-to-end system plan BEFORE writing any code. Trigger on 'plan this system', 'design the architecture for', 'help me blueprint', 'deep plan for X', 'break this idea into components', 'expand into action points', 'full implementation plan', or when the user pastes a project idea wanting architecture, components, pipelines, and file-level execution mapped out. Casual phrasing also triggers: 'help me think this through end-to-end', 'plan before coding'. Also covers living-workspace patterns: self-improving knowledge bases, reflection loops with auditor agents, four-agent teams, schema-as-code, wiki health scoring. **Resume triggers**: 'where did we leave off', 'continue this plan', 'resume my blueprint' - rehydrates state from disk via SESSION.md/NEXT.md/decisions/. Web research is mandatory every invocation.

2026-05-2327.7k

building-blog

davila7/claude-code-templates

Use when adding a blog to a Next.js + Sanity site, building a blog section from scratch, integrating Sanity CMS for editorial content, or setting up an SEO-optimized article surface. Triggers on phrases like 'add a blog', 'build the blog', 'create blog section', 'set up blog with Sanity', 'integrate Sanity CMS', 'blog architecture', or any new-blog scoping conversation on a Next.js project.

2026-05-2327.7k

session-handoff

davila7/claude-code-templates

Creates comprehensive handoff documents for seamless AI agent session transfers. Triggered when: (1) user requests handoff/memory/context save, (2) context window approaches capacity, (3) major task milestone completed, (4) work session ending, (5) user says 'save state', 'create handoff', 'I need to pause', 'context is getting full', (6) resuming work with 'load handoff', 'resume from', 'continue where we left off'. Proactively suggests handoffs after substantial work (multiple file edits, complex debugging, architecture decisions). Solves long-running agent context exhaustion by enabling fresh agents to continue with zero ambiguity.

2026-05-2327.7k

docx-official

davila7/claude-code-templates

Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of 'Word doc', 'word document', '.docx', or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a 'report', 'memo', 'letter', 'template', or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation.

2026-05-2327.7k

Source

davila7

davila7/claude-code-templates

View GitHub Repository View Creator Repositories

Install command

Download

Run Skill in Manus

Useful forSOC

Software DevelopersComputer and Mathematical Occupations15-1252L4

name	cf-crawl
description	Crawl entire websites using Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables.

Cloudflare Website Crawler

You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use.

Prerequisites

The user must have:

A Cloudflare account with Browser Rendering enabled
CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN available (see below)

Workflow

When the user asks to crawl a website, follow this exact workflow:

Step 1: Load Credentials

Look for CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in this order:

Current environment variables - Check if already exported in the shell
Project .env file - Read .env in the current working directory and extract the values
Project .env.local file - Read .env.local in the current working directory
Home directory .env - Read ~/.env as a last resort

To load from a .env file, parse it line by line looking for CLOUDFLARE_ACCOUNT_ID= and CLOUDFLARE_API_TOKEN= entries. Use this bash approach:

# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
  for envfile in .env .env.local "$HOME/.env"; do
    if [ -f "$envfile" ]; then
      eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
    fi
  done
fi

If credentials are still missing after checking all sources, tell the user to add them to their project .env file:

CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token

The API token needs "Browser Rendering - Edit" permission. Create one at Cloudflare Dashboard > API Tokens.

Step 2: Validate Credentials

Verify both variables are set and non-empty before proceeding.

Step 3: Initiate Crawl

Send a POST request to start the crawl job. Choose parameters based on user needs:

curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "options": {
      "excludePatterns": ["**/changelog/**", "**/api-reference/**"]
    }
  }'

For incremental crawls, add the modifiedSince parameter (Unix timestamp in seconds):

curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "modifiedSince": <UNIX_TIMESTAMP>
  }'

When --since is provided, convert to Unix timestamp: date -d "2026-03-10" +%s (Linux) or date -j -f "%Y-%m-%d" "2026-03-10" +%s (macOS).

The response returns a job ID:

{"success": true, "result": "job-uuid-here"}

Step 4: Poll for Completion

Poll the job status every 5 seconds until it completes:

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"

Possible job statuses:

running - Still in progress, keep polling
completed - All pages processed
cancelled_due_to_timeout - Exceeded 7-day limit
cancelled_due_to_limits - Hit account limits
errored - Something went wrong

Step 5: Retrieve Results

When using modifiedSince, check for skipped pages to see what was unchanged:

# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

Fetch all completed records using pagination (cursor-based):

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

If there are more records, use the cursor value from the response:

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

Step 6: Save Results

Save each page's markdown content to a local directory. Use a script like:

# Create output directory
mkdir -p .crawl-output

# Fetch and save all pages
python3 -c "
import json, os, re, sys, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0

while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        url += f'&cursor={cursor}'

    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    records = data.get('result', {}).get('records', [])
    if not records:
        break

    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1

    cursor = data.get('result', {}).get('cursor')
    if cursor is None:
        break

print(f'Saved {total_saved} pages to {outdir}/')
"

Parameter Reference

Core Parameters

Parameter	Type	Default	Description
`url`	string	(required)	Starting URL to crawl
`limit`	number	10	Max pages to crawl (up to 100,000)
`depth`	number	100,000	Max link depth from starting URL
`formats`	array	["html"]	Output formats: `html`, `markdown`, `json`
`render`	boolean	true	`true` = headless browser, `false` = fast HTML fetch
`source`	string	"all"	Page discovery: `all`, `sitemaps`, `links`
`maxAge`	number	86400	Cache validity in seconds (max 604800)
`modifiedSince`	number	-	Unix timestamp; only crawl pages modified after this time

Options Object

Parameter	Type	Default	Description
`includePatterns`	array	[]	Wildcard patterns to include (`` and `*`)
`excludePatterns`	array	[]	Wildcard patterns to exclude (higher priority)
`includeSubdomains`	boolean	false	Follow links to subdomains
`includeExternalLinks`	boolean	false	Follow external links

Advanced Parameters

Parameter	Type	Description
`jsonOptions`	object	AI-powered structured extraction (prompt, response_format)
`authenticate`	object	HTTP basic auth (username, password)
`setExtraHTTPHeaders`	object	Custom headers for requests
`rejectResourceTypes`	array	Skip: image, media, font, stylesheet
`userAgent`	string	Custom user agent string
`cookies`	array	Custom cookies for requests

Usage Examples

Crawl documentation site (most common)

/cf-crawl https://docs.example.com --limit 50

Crawls up to 50 pages, saves as markdown.

Crawl with filters

/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"

Incremental crawl (diff detection)

/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10

Only crawls pages modified since the given date. Skipped pages appear with status=skipped in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.

Fast crawl without JavaScript rendering

/cf-crawl https://docs.example.com --no-render --limit 200

Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content.

Crawl and merge into single file

/cf-crawl https://docs.example.com --limit 50 --merge

Merges all pages into a single markdown file for easy context loading.

Argument Parsing

When invoked as /cf-crawl, parse the arguments as follows:

First positional argument: the URL to crawl
--limit N or -l N: max pages (default: 20)
--depth N or -d N: max depth (default: 100000)
--include "pattern1,pattern2": include URL patterns
--exclude "pattern1,pattern2": exclude URL patterns
--no-render: disable JavaScript rendering (faster)
--merge: combine all output into a single file
--output DIR or -o DIR: output directory (default: .crawl-output)
--source sitemaps|links|all: page discovery method (default: all)
--since DATE: only crawl pages modified since DATE (ISO date like 2026-03-10 or Unix timestamp). Converts to Unix timestamp for the modifiedSince API parameter

If no URL is provided, ask the user for the target URL.

Important Notes

The /crawl endpoint respects robots.txt directives including crawl-delay
Blocked URLs appear with "status": "disallowed" in results
Free plan: 10 minutes of browser time per day
Job results are available for 14 days after completion
Max job runtime: 7 days
Response page size limit: 10 MB per page
Use render: false for static sites to save browser time
Pattern wildcards: * matches any character except /, ** matches including /