web-extract
Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed.
| name | web-extract |
| description | Extract clean text content from any URL. Uses trafilatura for high-quality extraction, no API key needed. |
| version | 1.0.0 |
| metadata | {"echo":{"tags":["Web","Extract","Scraping","Content","URL"]}} |
Extract readable text or Markdown from any URL. Uses trafilatura, a Python library for main-content extraction that handles news articles, blogs, and documentation pages reliably.
```shell
pip install trafilatura httpx
```
```python
import trafilatura

# Fetch the page first (fetch_url returns None on failure), then extract the main text
downloaded = trafilatura.fetch_url("https://example.com/article")
text = trafilatura.extract(downloaded)
print(text)

# With more options: Markdown output that keeps hyperlinks
result = trafilatura.extract(downloaded, output_format="markdown", include_links=True)
```
```shell
python3 scripts/extract_url.py "https://example.com/article"
python3 scripts/extract_url.py "https://example.com" --format markdown --links
```
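The contents of `scripts/extract_url.py` are not shown on this page. As a rough sketch of what a script accepting the same flags could look like, the following wires `--format` and `--links` to `trafilatura.extract()`. The function names, the `txt` default, and the error messages are assumptions made for illustration, not the actual script.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="Extract clean text from a URL.")
    parser.add_argument("url", help="page to fetch and extract")
    parser.add_argument("--format", default="txt", choices=["txt", "markdown"],
                        help="output format passed to trafilatura.extract()")
    parser.add_argument("--links", action="store_true",
                        help="keep hyperlinks in the output")
    return parser


def main(argv=None) -> int:
    """Hypothetical entry point; returns a process exit code."""
    args = build_parser().parse_args(argv)

    # Imported lazily so the parser can be built without trafilatura installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(args.url)  # None if the download failed
    if downloaded is None:
        print(f"could not fetch {args.url}", file=sys.stderr)
        return 1

    text = trafilatura.extract(downloaded, output_format=args.format,
                               include_links=args.links)
    if not text:
        print("no main content found", file=sys.stderr)
        return 1

    print(text)
    return 0
```

To use this as a script, call `sys.exit(main())` under an `if __name__ == "__main__":` guard at the bottom of the file.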
| Parameter | Effect |
|---|---|
| `output_format="markdown"` | Markdown with headers |
| `include_links=True` | Preserve hyperlinks |
| `include_images=True` | Include image references |
| `include_tables=True` | Preserve table structure |
| `favor_recall=True` | Extract more text, at the cost of precision |
For pages where trafilatura struggles:
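```python
import httpx
from readability import Document  # provided by the readability-lxml package

url = "https://example.com/article"
resp = httpx.get(url, follow_redirects=True, timeout=15)
doc = Document(resp.text)
title = doc.title()
content = doc.summary()  # cleaned HTML; convert with html2text if Markdown is needed
```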
For SPAs or JS-rendered content, use playwright (optional):
```python
from playwright.sync_api import sync_playwright

url = "https://example.com/article"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Wait until network activity settles so client-side rendering can finish
    page.goto(url, wait_until="networkidle")
    html = page.content()
    browser.close()
# Then pass html to trafilatura.extract()
```
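The fallback order described above (trafilatura first, then another backend when it struggles) can be sketched as a small chain that tries each extractor in turn and keeps the first usable result. This is a minimal sketch: the helper names and the 200-character threshold for "usable" are assumptions for illustration, not part of any of the libraries involved.

```python
from typing import Callable, Iterable, Optional

# An extractor takes a URL and returns extracted text, or None if it found nothing.
Extractor = Callable[[str], Optional[str]]


def first_usable(url: str, extractors: Iterable[Extractor],
                 min_chars: int = 200) -> Optional[str]:
    """Try each extractor in order; return the first result with enough text."""
    for extract in extractors:
        try:
            text = extract(url)
        except Exception:
            # A failing backend (missing package, timeout, parse error)
            # should not stop the chain; move on to the next one.
            continue
        if text and len(text.strip()) >= min_chars:
            return text
    return None


def via_trafilatura(url: str) -> Optional[str]:
    import trafilatura  # lazy import: only needed if this backend is used
    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None


def via_readability(url: str) -> Optional[str]:
    import httpx
    from readability import Document  # readability-lxml package
    resp = httpx.get(url, follow_redirects=True, timeout=15)
    return Document(resp.text).summary()  # note: HTML, not plain text
```

A call such as `first_usable(url, [via_trafilatura, via_readability])` returns `None` when every backend fails, which is the point at which a rendered-browser pass would be worth trying.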
Be respectful: add 1-2 second delays between requests to the same domain.
Set a User-Agent: build a custom config from trafilatura's default settings with the `USER_AGENTS` value changed, then pass it as `trafilatura.fetch_url(url, config=config)`.
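One way to apply the per-domain delay is a small throttle that remembers when each host was last requested. The sketch below uses only the standard library. The class name, the 1.5-second default, and the injectable `clock` and `sleep` parameters are assumptions for illustration.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_delay: float = 1.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_hit = {}      # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if this host was hit too recently; return the seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when the request is actually allowed to go out.
        self._last_hit[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` immediately before each `trafilatura.fetch_url(url)` keeps requests to one domain spaced out while leaving requests to other domains undelayed.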