web-extract

Stars: 526

Forks: 19

Updated: 2026-06-13 14:21

Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.

Source: fuyuxiang/echo-agent (GitHub repository)

Web Extract

Extract readable text or Markdown from any URL. Uses trafilatura, a Python content extraction library that handles news articles, blogs, and documentation pages reliably.

Quick Usage

pip install trafilatura httpx

import trafilatura

# Fetch the page, then extract its main content as plain text
downloaded = trafilatura.fetch_url("https://example.com/article")  # None if the download fails
text = trafilatura.extract(downloaded) if downloaded else None
print(text)

# With more options: Markdown output with hyperlinks preserved
result = trafilatura.extract(downloaded, output_format="markdown", include_links=True)
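
Both steps can return None (a failed download, or a page with no extractable content), so a small wrapper helps keep callers simple. The following is a minimal sketch, not part of the skill itself: the function name `extract_markdown` and its optional `fetch` and `extract` parameters are assumptions introduced here so the logic can be exercised with stand-ins instead of real network calls.

```python
from typing import Callable, Optional


def extract_markdown(
    url: str,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[..., Optional[str]]] = None,
) -> Optional[str]:
    """Return the page's main content as Markdown, or None on failure.

    `fetch` and `extract` are hypothetical injection points; by default the
    real trafilatura functions are used.
    """
    if fetch is None or extract is None:
        import trafilatura  # imported lazily so stubs work without the library

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    downloaded = fetch(url)
    if downloaded is None:  # download failed, nothing to extract
        return None
    # extract() may itself return None when no main content is found
    return extract(downloaded, output_format="markdown", include_links=True)
```

Calling `extract_markdown("https://example.com/article")` with no extra arguments uses trafilatura directly; the caller only needs to check the result for None.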

Helper script

python3 scripts/extract_url.py "https://example.com/article"
python3 scripts/extract_url.py "https://example.com" --format markdown --links
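
The contents of scripts/extract_url.py are not shown here. As a rough illustration only, a script accepting the same URL argument and the `--format` and `--links` flags could be structured as below; the function names, the accepted format choices, the default format, and the error messages are all assumptions, not the repository's actual code.

```python
import argparse
import sys
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser mirroring the flags used above."""
    parser = argparse.ArgumentParser(description="Extract readable text from a URL.")
    parser.add_argument("url", help="page to fetch and extract")
    parser.add_argument("--format", choices=["txt", "markdown"], default="txt",
                        help="output format passed to trafilatura.extract")
    parser.add_argument("--links", action="store_true", help="preserve hyperlinks")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    import trafilatura  # lazy import: parsing works even without the dependency

    downloaded = trafilatura.fetch_url(args.url)
    if downloaded is None:
        print(f"Failed to download {args.url}", file=sys.stderr)
        return 1
    text = trafilatura.extract(downloaded, output_format=args.format,
                               include_links=args.links)
    if text is None:
        print("No extractable content found", file=sys.stderr)
        return 1
    print(text)
    return 0

# In a standalone script, finish with:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```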

Options

Parameter	Effect
`output_format="markdown"`	Return Markdown (headings preserved) instead of plain text
`include_links=True`	Preserve hyperlinks
`include_images=True`	Include image references
`include_tables=True`	Preserve table structure (enabled by default)
`favor_recall=True`	Extract more text at the cost of precision
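
These parameters are keyword arguments to `trafilatura.extract` and can be combined freely. One way to keep combinations consistent is to build them in a single place; the sketch below assumes two hypothetical presets (a default and a "thorough" one), and the helper names are illustrative rather than part of the skill.

```python
from typing import Any, Dict, Optional


def extract_options(thorough: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for trafilatura.extract (hypothetical presets)."""
    options: Dict[str, Any] = {
        "output_format": "markdown",
        "include_links": True,
        "include_tables": True,
    }
    if thorough:
        # Trade precision for coverage and keep image references as well.
        options.update(include_images=True, favor_recall=True)
    return options


def extract_with_preset(html: str, thorough: bool = False) -> Optional[str]:
    """Run trafilatura.extract on already-downloaded HTML with a preset."""
    import trafilatura  # lazy import: only needed when extraction actually runs

    return trafilatura.extract(html, **extract_options(thorough))
```

A reasonable pattern is to try the default preset first and retry with `thorough=True` only when the result is empty or suspiciously short.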

Fallback: httpx + readability

For pages where trafilatura struggles:

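# Requires: pip install readability-lxml (provides the `readability` module)
import httpx
from readability import Document

url = "https://example.com/article"
resp = httpx.get(url, follow_redirects=True, timeout=15)
resp.raise_for_status()  # fail early on 4xx/5xx responses
doc = Document(resp.text)
title = doc.title()
content = doc.summary()  # cleaned HTML; convert with html2text if Markdown is needed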
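
Because `doc.summary()` returns HTML, one more conversion step is needed before the result is usable as text. html2text is the tool named above for Markdown; when plain text is enough, a dependency-free alternative is the standard library's `html.parser`. The sketch below is that swapped-in approach, not html2text: it drops all formatting and produces plain text only, and the class and function names are illustrative.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:  # ignore script and style contents
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Convert an HTML fragment to plain text, one line per block element."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(collector.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

With the fallback above, `html_to_text(content)` would turn the readability summary into plain text.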

JavaScript-heavy sites

For SPAs or JS-rendered content, use playwright (optional):

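# Requires: pip install playwright && playwright install chromium
import trafilatura
from playwright.sync_api import sync_playwright

url = "https://example.com/app"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")  # wait until network activity settles
    html = page.content()  # fully rendered DOM as HTML
    browser.close()

# Pass the rendered HTML to trafilatura for extraction
text = trafilatura.extract(html, output_format="markdown")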

Rate Limits

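Be respectful: wait 1 to 2 seconds between requests to the same domain, and identify your client with a User-Agent header. To set one, pass a custom configuration object to trafilatura.fetch_url(url, config=config); the trafilatura settings file exposes a USER_AGENTS entry for this purpose.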
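
The per-domain delay can be enforced with a small helper that remembers when each domain was last requested. This is a minimal sketch under stated assumptions: the class name `DomainThrottle`, the 1.5 second default, and the injectable `clock` and `sleep` parameters (present only so the behavior can be checked without real waiting) are all illustrative choices, not part of trafilatura or of the skill.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(
        self,
        min_delay: float = 1.5,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting `url`; return the seconds slept."""
        domain = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(domain)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record when the request is expected to go out (after any pause).
        self._last_request[domain] = now + pause
        return pause
```

Typical use is to create one throttle per crawl and call `throttle.wait(url)` immediately before each `trafilatura.fetch_url(url)`; requests to different domains are not delayed against each other.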

name	web-extract
description	Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.
version	1.0.0
metadata	{"echo":{"tags":["Web","Extract","Scraping","Content","URL"]}}
