ghostfetch
Stealthy web fetcher that bypasses anti-bot protections. Fetches content from sites like X.com and converts to clean Markdown for AI agents.
| Field | Value |
|---|---|
| name | ghostfetch |
| description | Stealthy web fetcher that bypasses anti-bot protections. Fetches content from sites like X.com and converts to clean Markdown for AI agents. |
| version | 1.0.0 |
| author | iArsalanshah |
| tags | ["web-scraping","stealth","markdown","browser-automation","anti-bot-bypass"] |
Fetch web content from sites that block AI agents. Uses a stealthy headless browser with advanced fingerprinting to bypass anti-bot protections and returns clean Markdown.
GhostFetch must be running as a service. Start it with:
# Option 1: If installed via pip
ghostfetch serve
# Option 2: Docker
docker run -p 8000:8000 iarsalanshah/ghostfetch
Use the /fetch/sync endpoint for simple, blocking requests:
curl "http://localhost:8000/fetch/sync?url=https://example.com"
import requests


def ghostfetch(url: str, timeout: float = 120.0) -> dict:
    """
    Fetch content from a URL using GhostFetch.

    Returns:
        dict with 'metadata' and 'markdown' keys
    """
    response = requests.post(
        "http://localhost:8000/fetch/sync",
        json={"url": url, "timeout": timeout},
        # Give the HTTP client slightly longer than the server-side timeout
        timeout=timeout + 10,
    )
    # Raise on 4xx/5xx responses instead of returning an error body
    response.raise_for_status()
    return response.json()


# Example
result = ghostfetch("https://x.com/user/status/123")
print(result["markdown"])
from ghostfetch import fetch
result = fetch("https://x.com/user/status/123")
print(result["metadata"]["title"])
print(result["markdown"])
{
  "metadata": {
    "title": "Page Title",
    "author": "Author Name",
    "publish_date": "2024-01-15",
    "images": ["https://example.com/image.jpg"]
  },
  "markdown": "# Page Title\n\nPage content in clean Markdown..."
}
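The response shape above can be wrapped in a small typed container so callers do not index nested dictionaries directly. This is a minimal sketch: `FetchResult` and `parse_result` are hypothetical helper names, not part of GhostFetch, and the sketch assumes any metadata field may be absent on a given page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FetchResult:
    """Typed view over a GhostFetch response (hypothetical helper)."""
    markdown: str
    title: Optional[str] = None
    author: Optional[str] = None
    publish_date: Optional[str] = None
    images: List[str] = field(default_factory=list)


def parse_result(payload: Dict[str, Any]) -> FetchResult:
    # Assumption: metadata keys can be missing, so each one is read with .get()
    metadata = payload.get("metadata") or {}
    return FetchResult(
        markdown=payload.get("markdown", ""),
        title=metadata.get("title"),
        author=metadata.get("author"),
        publish_date=metadata.get("publish_date"),
        images=list(metadata.get("images") or []),
    )
```

With this in place, `parse_result(response.json()).title` reads the title without a `KeyError` when a page has no metadata.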
POST /fetch/sync: synchronous fetch that blocks until the content is ready.
Request:
{
  "url": "https://example.com",
  "context_id": "optional-session-id",
  "timeout": 120
}
Response: See Response Format above.
GET /fetch/sync takes the same parameters as the POST form, passed in the query string:
GET /fetch/sync?url=https://example.com&timeout=60
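When the target URL carries its own query string, it needs percent-encoding before it goes into the `url` parameter, otherwise its `&` and `=` characters are read as GhostFetch parameters. A minimal sketch using only the standard library follows; `build_sync_url` is a hypothetical helper, and `BASE_URL` is taken from the localhost examples in this document.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # service address used in the examples here


def build_sync_url(target: str, timeout: Optional[int] = None) -> str:
    """Build a GET /fetch/sync URL with the target safely percent-encoded."""
    params = {"url": target}
    if timeout is not None:
        params["timeout"] = timeout
    # urlencode escapes ':', '/', '?', '&' and '=' inside the target URL
    return f"{BASE_URL}/fetch/sync?{urlencode(params)}"


print(build_sync_url("https://example.com/a?b=1&c=2", timeout=60))
```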
Asynchronous fetch: returns a job ID immediately; poll the job for results.
Request:
{
  "url": "https://example.com",
  "callback_url": "https://your-webhook.com/callback",
  "github_issue": 42
}
Response:
{
  "job_id": "abc123",
  "url": "https://example.com",
  "status": "queued"
}
Check job status and get results.
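Polling an async job might look roughly like the loop below. The status endpoint path and the terminal status names are not given in this document, so the sketch takes a caller-supplied `get_status` function and treats only `"queued"` (the one status shown above) as pending by default; `poll_job` and all its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Iterable


def poll_job(
    get_status: Callable[[], Dict[str, Any]],
    pending: Iterable[str] = ("queued",),
    interval: float = 2.0,
    max_wait: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call get_status() until the job leaves a pending state or max_wait passes."""
    # Assumption: extend `pending` with any other in-progress status the service reports
    pending_set = set(pending)
    deadline = clock() + max_wait
    while True:
        job = get_status()
        if job.get("status") not in pending_set:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still pending after {max_wait} seconds")
        sleep(interval)
```

Passing `get_status` in, rather than hard-coding a request, keeps the loop independent of the actual status route and makes it easy to exercise with a fake.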
Health check endpoint.
Set via environment variables when running the service:
| Variable | Default | Description |
|---|---|---|
| SYNC_TIMEOUT_DEFAULT | 120 | Default timeout for sync requests (seconds) |
| MAX_SYNC_TIMEOUT | 300 | Maximum allowed timeout |
| MAX_CONCURRENT_BROWSERS | 2 | Concurrent browser contexts |
| MIN_DOMAIN_DELAY | 10 | Seconds between requests to same domain |
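A client that shares the service's environment can read the same two timeout variables to choose a request timeout that stays within the configured maximum. This is a sketch only: `effective_sync_timeout` is a hypothetical client-side helper, and clamping to `MAX_SYNC_TIMEOUT` is an assumption, since this document does not say whether the service clamps or rejects larger values.

```python
import os
from typing import Mapping, Optional


def effective_sync_timeout(
    requested: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Pick a sync timeout in seconds, using the documented defaults as fallbacks."""
    default = int(env.get("SYNC_TIMEOUT_DEFAULT", "120"))
    maximum = int(env.get("MAX_SYNC_TIMEOUT", "300"))
    chosen = default if requested is None else requested
    # Assumption: clamp client-side rather than send a value above the maximum
    return min(chosen, maximum)
```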
| Status Code | Meaning |
|---|---|
| 200 | Success |
| 400 | Invalid request (non-retryable error) |
| 502 | Fetch failed (retryable) |
| 504 | Request timeout |
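The status table suggests a simple retry policy: retry on 502, fail fast on 400. A minimal sketch follows, assuming a caller-supplied `send` function that performs one request and returns `(status_code, body)`; `fetch_with_retry` is a hypothetical helper. Because the table does not mark 504 as retryable, the sketch does not retry it.

```python
import time
from typing import Any, Callable, Dict, Tuple

RETRYABLE = {502}  # the only code the table marks as retryable


def fetch_with_retry(
    send: Callable[[], Tuple[int, Dict[str, Any]]],
    attempts: int = 3,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Retry a fetch on retryable status codes with linearly growing delays."""
    for attempt in range(1, attempts + 1):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == attempts:
            raise RuntimeError(f"fetch failed with HTTP {status}")
        sleep(backoff * attempt)  # wait longer before each further attempt
    raise ValueError("attempts must be at least 1")
```

The delays here are in addition to the service's own per-domain delay, so they can be kept short.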
Use context_id for multi-step workflows - Sessions are persisted per context, maintaining cookies between requests.
Respect rate limits - GhostFetch has built-in domain delays. Don't bypass these.
Check metadata first - The structured metadata often has what you need without parsing Markdown.
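The `context_id` tip can be illustrated by building the request bodies for a multi-step workflow so that every step carries the same session identifier. This is a sketch under stated assumptions: `session_requests` is a hypothetical helper, the generated ID format is arbitrary, and the example URLs are placeholders.

```python
import uuid
from typing import Dict, List, Optional


def session_requests(urls: List[str], context_id: Optional[str] = None) -> List[Dict[str, str]]:
    """Build /fetch/sync request bodies that all share one context_id."""
    # Assumption: any unique string is acceptable as a context_id
    shared_id = context_id or f"session-{uuid.uuid4().hex[:8]}"
    return [{"url": url, "context_id": shared_id} for url in urls]


bodies = session_requests(
    ["https://example.com/login", "https://example.com/account"],
    context_id="demo-session",
)
```

Each body can then be sent in order to `POST /fetch/sync`, so cookies set by the first page are available to the second.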
browser - General browser automation
web_fetch - Simple HTTP fetching (for non-protected sites)