web-extract
Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.
| name | web-extract |
| description | Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed. |
| version | 1.0.0 |
| metadata | {"echo":{"tags":["Web","Extract","Scraping","Content","URL"]}} |
Extract readable text or Markdown from any URL. Uses trafilatura, a Python content-extraction library that handles news articles, blogs, and documentation pages reliably.
pip install trafilatura httpx
import trafilatura
# Fetch the page, then extract the main text (both calls return None on failure)
downloaded = trafilatura.fetch_url("https://example.com/article")
text = trafilatura.extract(downloaded)
print(text)
# With more options
downloaded = trafilatura.fetch_url("https://example.com/article")
result = trafilatura.extract(downloaded, output_format="markdown", include_links=True)
python3 scripts/extract_url.py "https://example.com/article"
python3 scripts/extract_url.py "https://example.com" --format markdown --links
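The contents of `scripts/extract_url.py` are not shown here, so the following is only a hypothetical sketch of how the two flags used above (`--format` and `--links`) could map onto `trafilatura.extract()` keyword arguments. The function names `build_parser`, `extract_kwargs`, and `main`, and the `txt` default, are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parse the flags used in the commands above (hypothetical reimplementation)."""
    parser = argparse.ArgumentParser(description="Extract clean text from a URL")
    parser.add_argument("url", help="page to fetch")
    parser.add_argument("--format", default="txt", choices=["txt", "markdown"],
                        help="output format handed to trafilatura.extract")
    parser.add_argument("--links", action="store_true", help="preserve hyperlinks")
    return parser


def extract_kwargs(args: argparse.Namespace) -> dict:
    """Translate parsed flags into keyword arguments for trafilatura.extract()."""
    return {"output_format": args.format, "include_links": args.links}


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    import trafilatura  # imported lazily so flag parsing works without it installed

    downloaded = trafilatura.fetch_url(args.url)
    if downloaded is None:
        print(f"could not fetch {args.url}")
        return 1
    text = trafilatura.extract(downloaded, **extract_kwargs(args))
    if text is None:
        print("no extractable content found")
        return 1
    print(text)
    return 0
```

Calling `main(["https://example.com", "--format", "markdown", "--links"])` would mirror the second command above, assuming trafilatura is installed and the page is reachable.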
| Parameter | Effect |
|---|---|
| `output_format="markdown"` | Markdown with headers |
| `include_links=True` | Preserve hyperlinks |
| `include_images=True` | Include image references |
| `include_tables=True` | Preserve table structure |
| `favor_recall=True` | Extract more (less precision) |
For pages where trafilatura struggles:
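# Requires: pip install readability-lxml
import httpx
from readability import Document

url = "https://example.com/article"
resp = httpx.get(url, follow_redirects=True, timeout=15)
resp.raise_for_status()  # stop early on 4xx/5xx responses
doc = Document(resp.text)
title = doc.title()
content = doc.summary()  # HTML, needs html2text for markdown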
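One way to combine the primary extractor with the fallback is to try each in order and keep the first result that looks usable. The sketch below is an illustration under assumptions: the helper name `first_usable` and the 200-character threshold are made up for this example and are not part of trafilatura or readability.

```python
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns text, or None when it finds nothing.
Extractor = Callable[[str], Optional[str]]


def first_usable(html: str, extractors: Iterable[Extractor],
                 min_chars: int = 200) -> Optional[str]:
    """Return the first extractor result with at least min_chars of text."""
    for extract in extractors:
        try:
            text = extract(html)
        except Exception:
            # A failing extractor should not block the remaining fallbacks.
            continue
        if text and len(text.strip()) >= min_chars:
            return text
    return None
```

In practice the list could be `[trafilatura.extract, lambda h: Document(h).summary()]`, so the readability fallback only runs when trafilatura returns nothing or very little.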
For SPAs or JS-rendered content, use playwright (optional):
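# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

url = "https://example.com/article"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Wait until network activity settles so client-side content has rendered
    page.goto(url, wait_until="networkidle")
    html = page.content()
    browser.close()
# Then pass html to trafilatura.extract()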
Be respectful: add 1-2 second delays between requests to the same domain.
Set a User-Agent: pass a custom config via trafilatura.fetch_url(url, config=config).
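The per-domain delay recommended above can be enforced with a small helper that remembers when each host was last contacted. This is a minimal sketch using only the standard library; the class name `DomainThrottle` and the 1.5 second default are assumptions chosen to sit inside the suggested 1-2 second range.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Sleep as needed so requests to one host are at least min_delay apart."""

    def __init__(self, min_delay=1.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = {}        # host -> time of the most recent request

    def wait(self, url):
        """Block until the URL's host may be contacted; return seconds slept."""
        host = urlparse(url).netloc.lower()
        last = self._last.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last[host] = self._clock()
        return waited
```

Calling `throttle.wait(url)` immediately before each `trafilatura.fetch_url(url)` spaces out requests to the same host while leaving requests to different hosts undelayed.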