| name | baoyu-url-to-markdown |
| description | Fetch any URL and convert to markdown using Chrome CDP. Saves the rendered HTML snapshot alongside the markdown, uses an upgraded Defuddle pipeline with better web-component handling and YouTube transcript extraction, and automatically falls back to the pre-Defuddle HTML-to-Markdown pipeline when needed. If local browser capture fails entirely, it can fall back to the hosted defuddle.md API. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown. |
| version | 1.59.0 |
| metadata | {"openclaw":{"homepage":"https://github.com/JimLiu/baoyu-skills#baoyu-url-to-markdown","requires":{"anyBins":["bun","npx"]}}} |
URL to Markdown
Fetches any URL via Chrome CDP, saves the rendered HTML snapshot, and converts it to clean markdown.
Script Directory
Important: All scripts are located in the scripts/ subdirectory of this skill.
Agent Execution Instructions:
- Determine this SKILL.md file's directory path as
{baseDir}
- Script path =
{baseDir}/scripts/<script-name>.ts
- Resolve
${BUN_X} runtime: if bun installed โ bun; if npx available โ npx -y bun; else suggest installing bun
- Replace all
{baseDir} and ${BUN_X} in this document with actual values
Script Reference:
| Script | Purpose |
|---|
scripts/main.ts | CLI entry point for URL fetching |
scripts/html-to-markdown.ts | Markdown conversion entry point and converter selection |
scripts/parsers/index.ts | Unified parser entry: dispatches URL-specific rules before generic converters |
scripts/parsers/types.ts | Unified parser interface shared by all rule files |
scripts/parsers/rules/*.ts | One file per URL rule, for example X status and X article |
scripts/defuddle-converter.ts | Defuddle-based conversion |
scripts/legacy-converter.ts | Pre-Defuddle legacy extraction and markdown conversion |
scripts/markdown-conversion-shared.ts | Shared metadata parsing and markdown document helpers |
Preferences (EXTEND.md)
Check EXTEND.md existence (priority order):
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"
# PowerShell (Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโ
โ Path โ Location โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ .baoyu-skills/baoyu-url-to-markdown/EXTEND.md โ Project directory โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโค
โ $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md โ User home โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Result โ Action โ
โโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Found โ Read, parse, apply settings โ
โโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Not found โ MUST run first-time setup (see below) โ do NOT silently create defaults โ
โโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
EXTEND.md Supports: Download media by default | Default output directory | Default capture mode | Timeout settings
First-Time Setup (BLOCKING)
CRITICAL: When EXTEND.md is not found, you MUST use AskUserQuestion to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation โ do NOT proceed with any conversion until setup is complete.
Use AskUserQuestion with ALL questions in ONE call:
Question 1 โ header: "Media", question: "How to handle images and videos in pages?"
- "Ask each time (Recommended)" โ After saving markdown, ask whether to download media
- "Always download" โ Always download media to local imgs/ and videos/ directories
- "Never download" โ Keep original remote URLs in markdown
Question 2 โ header: "Output", question: "Default output directory?"
- "url-to-markdown (Recommended)" โ Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)
Question 3 โ header: "Save", question: "Where to save preferences?"
- "User (Recommended)" โ ~/.baoyu-skills/ (all projects)
- "Project" โ .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
Supported Keys
| Key | Default | Values | Description |
|---|
download_media | ask | ask / 1 / 0 | ask = prompt each time, 1 = always download, 0 = never |
default_output_dir | empty | path or empty | Default output directory (empty = ./url-to-markdown/) |
EXTEND.md โ CLI mapping:
| EXTEND.md key | CLI argument | Notes |
|---|
download_media: 1 | --download-media | |
default_output_dir: ./posts/ | --output-dir ./posts/ | Directory path. Do NOT pass to -o (which expects a file path) |
Value priority:
- CLI arguments (
--download-media, -o, --output-dir)
- EXTEND.md
- Skill defaults
Features
- Chrome CDP for full JavaScript rendering
- Browser strategy fallback: default headless first, then visible Chrome on technical failure
- URL-specific parser layer for sites that need custom HTML rules before generic extraction
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling
-captured.html file
- Clean markdown output with metadata
- Upgraded Defuddle-first markdown conversion with automatic fallback to the pre-Defuddle extractor from git history
- X/Twitter pages can use HTML-specific parsing for Tweets and Articles, which improves title/body/media extraction on
x.com / twitter.com
archive.ph / related archive mirrors can restore the original URL from input[name=q] and prefer #CONTENT before falling back to the page body
- Materializes shadow DOM content before conversion so web-component pages survive serialization better
- YouTube pages can include transcript/caption text in the markdown when YouTube exposes a caption track
- If local browser capture fails completely, can fall back to
defuddle.md/<url> and still save markdown
- Handles login-required pages via wait mode
- Download images and videos to local directories
Usage
${BUN_X} {baseDir}/scripts/main.ts <url>
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headless
${BUN_X} {baseDir}/scripts/main.ts <url> --browser headed
${BUN_X} {baseDir}/scripts/main.ts <url> --wait
${BUN_X} {baseDir}/scripts/main.ts <url> -o output.md
${BUN_X} {baseDir}/scripts/main.ts <url> --output-dir ./posts/
${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
Options
| Option | Description |
|---|
<url> | URL to fetch |
-o <path> | Output file path โ must be a file path, not directory (default: auto-generated) |
--output-dir <dir> | Base output directory โ auto-generates {dir}/{domain}/{slug}.md (default: ./url-to-markdown/) |
--wait | Wait for user signal before capturing |
--browser <mode> | Browser strategy: auto (default), headless, or headed |
--headless | Shortcut for --browser headless |
--headed | Shortcut for --browser headed |
--timeout <ms> | Page load timeout (default: 30000) |
--download-media | Download image/video assets to local imgs/ and videos/, and rewrite markdown links to local relative paths |
Capture Modes
| Mode | Behavior | Use When |
|---|
| Auto (default) | Try headless first, then retry in visible Chrome if needed | Public pages, static content, unknown pages |
Wait (--wait) | User signals when ready | Login-required, lazy loading, paywalls |
Wait mode workflow:
- Run with
--wait โ script outputs "Press Enter when ready"
- Ask user to confirm page is ready
- Send newline to stdin to trigger capture
Default browser fallback:
- Auto mode starts with headless Chrome and captures on network idle
- If headless capture fails technically, retry with visible Chrome
- If a shared Chrome session for this profile already exists, reuse it instead of launching a new browser
- The script does not hard-code login or paywall detection; the agent must inspect the captured markdown or HTML and decide whether to rerun with
--browser headed --wait
Agent Quality Gate
CRITICAL: The agent must treat headless capture as provisional. Some sites render differently in headless mode and can silently return an error shell, partially hydrated page, or low-quality extraction without causing the CLI to fail.
After every run that used --browser auto or --browser headless, the agent MUST inspect the saved markdown first, and inspect the saved -captured.html when the markdown looks suspicious.
Quality checks the agent must perform
- Confirm the markdown title matches the target page, not a generic site shell
- Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
- Watch for obvious failure signs such as:
Application error
This page could not be found
- login, signup, subscribe, or verification shells
- extremely short markdown for a page that should be long-form
- raw framework payloads or mostly boilerplate content
- If the result is low quality, incomplete, or clearly wrong, do not accept the run as successful just because the CLI exited with code 0
Recovery workflow the agent must follow
- First run with default
auto unless there is already a clear reason to use wait mode
- Review markdown quality immediately after the run
- If the content is low quality, rerun locally with visible Chrome:
--browser headed for ordinary rendering issues
--browser headed --wait when the page may need login, anti-bot interaction, cookie acceptance, or extra hydration time
- If
--wait is used, tell the user exactly what to do:
- if login is required, ask them to sign in
- if the page needs time to hydrate, ask them to wait until the full content is visible
- once ready, ask them to press Enter so capture can continue
- Only fall back to hosted
defuddle.md after the local browser strategies have failed or are clearly lower fidelity
Output Format
Each run saves two files side by side:
- Markdown: YAML front matter with
url, title, description, author, published, optional coverImage, and captured_at, followed by converted markdown content
- HTML snapshot:
*-captured.html, containing the rendered page HTML captured from Chrome
When Defuddle or page metadata provides a language hint, the markdown front matter also includes language.
The HTML snapshot is saved before any markdown media localization, so it stays a faithful capture of the page DOM used for conversion.
If the hosted defuddle.md API fallback is used, markdown is still saved, but there is no local -captured.html snapshot for that run.
Output Directory
Default: url-to-markdown/<domain>/<slug>.md
With --output-dir ./posts/: ./posts/<domain>/<slug>.md
HTML snapshot path uses the same basename:
-
url-to-markdown/<domain>/<slug>-captured.html
-
./posts/<domain>/<slug>-captured.html
-
<slug>: From page title or URL path (kebab-case, 2-6 words)
-
Conflict resolution: Append timestamp <slug>-YYYYMMDD-HHMMSS.md
When --download-media is enabled:
- Images are saved to
imgs/ next to the markdown file
- Videos are saved to
videos/ next to the markdown file
- Markdown media links are rewritten to local relative paths
Conversion Fallback
Conversion order:
- Try the URL-specific parser layer first when a site rule matches
- If no specialized parser matches, try Defuddle
- For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
- If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
- If the agent determines the captured result is a login screen, verification screen, or paywall shell, rerun locally with
--browser headed --wait and ask the user to complete access before capture
- If the entire local browser capture flow still fails before markdown can be produced, try the hosted
https://defuddle.md/<url> API and save its markdown output directly
- The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
CLI output will show:
Converter: parser:... when a URL-specific parser succeeded
Converter: defuddle when Defuddle succeeds
Converter: legacy:... plus Fallback used: ... when fallback was needed
Converter: defuddle-api when local browser capture failed and the hosted API was used instead
Media Download Workflow
Based on download_media setting in EXTEND.md:
| Setting | Behavior |
|---|
1 (always) | Run script with --download-media flag |
0 (never) | Run script without --download-media flag |
ask (default) | Follow the ask-each-time flow below |
Ask-Each-Time Flow
- Run script without
--download-media โ markdown saved
- Check saved markdown for remote media URLs (
https:// in image/video links)
- If no remote media found โ done, no prompt needed
- If remote media found โ use
AskUserQuestion:
- header: "Media", question: "Download N images/videos to local files?"
- "Yes" โ Download to local directories
- "No" โ Keep remote URLs
- If user confirms โ run script again with
--download-media (overwrites markdown with localized links)
Environment Variables
| Variable | Description |
|---|
URL_CHROME_PATH | Custom Chrome executable path |
URL_DATA_DIR | Custom data directory |
URL_CHROME_PROFILE_DIR | Custom Chrome profile directory |
Troubleshooting: Chrome not found โ set URL_CHROME_PATH. Timeout โ increase --timeout. Complex pages โ try --wait mode. If markdown quality is poor, inspect the saved -captured.html and check whether the run logged a legacy fallback.
YouTube Notes
- The upgraded Defuddle path uses async extractors, so YouTube pages can include transcript text directly in the markdown body.
- Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled, restricted playback, or blocked regional access may still produce description-only output.
- If the page needs time to finish loading descriptions, chapters, or player metadata, prefer
--wait and capture after the watch page is fully hydrated.
Hosted API Fallback
- The hosted fallback endpoint is
https://defuddle.md/<url>. In shell form: curl https://defuddle.md/stephango.com
- Use it only when the local Chrome/CDP capture path fails outright. The local path still has higher fidelity because it can save the captured HTML and handle authenticated pages.
- The hosted API already returns Markdown with YAML frontmatter, so save that response as-is and then apply the normal media-localization step if requested.
Extension Support
Custom configurations via EXTEND.md. See Preferences section for paths and supported options.