web-extract

Stars: 526

Forks: 19

Last updated: June 13, 2026 at 14:21

Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.

Source

fuyuxiang

fuyuxiang/echo-agent

Related occupations (based on the SOC occupational classification): Software Developers, Computer and Mathematical Occupations · SOC 15-1252

SKILL.md

name	web-extract
description	Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.
version	1.0.0
metadata	{"echo":{"tags":["Web","Extract","Scraping","Content","URL"]}}

Web Extract

Extract readable text or Markdown from any URL. Uses trafilatura, a Python content-extraction library that reliably handles news articles, blogs, and documentation pages.

Quick Usage

pip install trafilatura httpx

import trafilatura

# Download the page; fetch_url returns None if the request fails
downloaded = trafilatura.fetch_url("https://example.com/article")

if downloaded is not None:
    # Plain-text extraction with default settings
    text = trafilatura.extract(downloaded)
    print(text)

    # With more options
    result = trafilatura.extract(downloaded, output_format="markdown", include_links=True)

Helper script

python3 scripts/extract_url.py "https://example.com/article"
python3 scripts/extract_url.py "https://example.com" --format markdown --links
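
The contents of scripts/extract_url.py are not shown here, so the following is only a rough sketch of how a script with these two flags might map them onto trafilatura.extract(). The function names build_parser, extract_kwargs, and main, and the "txt" default, are assumptions made for illustration, not the repository's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Extract clean text from a URL")
    parser.add_argument("url", help="page to fetch and extract")
    parser.add_argument("--format", default="txt", choices=["txt", "markdown"],
                        help="output format passed to trafilatura (default assumed)")
    parser.add_argument("--links", action="store_true", help="preserve hyperlinks")
    return parser


def extract_kwargs(args: argparse.Namespace) -> dict:
    """Translate parsed flags into keyword arguments for trafilatura.extract()."""
    return {"output_format": args.format, "include_links": args.links}


def main(argv=None) -> int:
    """Parse arguments, download the page, and print the extracted text."""
    args = build_parser().parse_args(argv)
    import trafilatura  # imported lazily so the parsing helpers work without it

    downloaded = trafilatura.fetch_url(args.url)
    if downloaded is None:
        print(f"Could not download {args.url}")
        return 1
    print(trafilatura.extract(downloaded, **extract_kwargs(args)))
    return 0
```

Calling main(["https://example.com", "--format", "markdown", "--links"]) would then behave like the second command above, under the stated assumptions.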

Options

Parameter	Effect
`output_format="markdown"`	Markdown with headers
`include_links=True`	Preserve hyperlinks
`include_images=True`	Include image references
`include_tables=True`	Preserve table structure
`favor_recall=True`	Extract more (less precision)

Fallback: httpx + readability

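For pages where trafilatura struggles, fetch with httpx and extract with readability (install it with pip install readability-lxml):

import httpx
from readability import Document

url = "https://example.com/article"
resp = httpx.get(url, follow_redirects=True, timeout=15)
resp.raise_for_status()  # stop early on 4xx/5xx responses
doc = Document(resp.text)
title = doc.title()
content = doc.summary()  # cleaned HTML; convert with html2text if Markdown is needed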
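
One way to wire the primary extractor and the fallback together is a small chain that tries each extractor in turn and accepts the first result that looks substantial. This is a minimal sketch: extract_with_fallback and the min_chars threshold of 200 are hypothetical choices for illustration, not part of trafilatura or readability.

```python
from typing import Callable, Iterable, Optional

# An extractor takes raw HTML and returns text, or None when it finds nothing.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, extractors: Iterable[Extractor],
                          min_chars: int = 200) -> Optional[str]:
    """Return the first extractor result with at least min_chars characters.

    Extractors that raise or return too little text are skipped.
    """
    for extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            continue  # a failing extractor should not stop the chain
        if text and len(text.strip()) >= min_chars:
            return text
    return None


# Example wiring (requires trafilatura and readability-lxml to be installed):
#   import trafilatura
#   from readability import Document
#   text = extract_with_fallback(
#       html,
#       [trafilatura.extract, lambda h: Document(h).summary()],
#   )
```

The length threshold is a crude proxy for "extraction worked"; a page with a genuinely short body would need a lower min_chars.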

JavaScript-heavy sites

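For SPAs or JS-rendered content, render the page with playwright first (optional; requires pip install playwright and playwright install chromium):

import trafilatura
from playwright.sync_api import sync_playwright

url = "https://example.com/app"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")  # wait for network activity to settle
    html = page.content()
    browser.close()

# Pass the rendered HTML to trafilatura
text = trafilatura.extract(html)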

Rate Limits

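Be respectful: add a 1-2 second delay between requests to the same domain. To set a User-Agent, pass a custom config to trafilatura.fetch_url(url, config=config); the config object can be created with trafilatura.settings.use_config(), and its USER_AGENTS option in the DEFAULT section holds the User-Agent string.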
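
The per-domain delay can be sketched with the standard library alone. DomainThrottle below is a hypothetical helper, not part of trafilatura; the 1.5 second default is an assumed midpoint of the 1-2 second range suggested above, and the clock and sleep parameters exist only so the behaviour can be tested without real waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay: float = 1.5, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if needed before requesting url; return the seconds slept."""
        domain = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last_request.get(domain)
        pause = 0.0
        if last is not None:
            # Remaining part of the delay window since the previous request
            pause = max(0.0, self.delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        self._last_request[domain] = now + pause
        return pause
```

In use, a single throttle would be shared across a crawl and throttle.wait(url) called immediately before each trafilatura.fetch_url(url); requests to different domains are not delayed against each other.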
