Name: Web Scraping
Author: ginlix-ai

// Web scraping with Scrapling: MCP tool wrappers for quick fetching, plus direct Python API for advanced scraping with selectors, sessions, and spiders

Updated: 2026-04-22 10:17

Repository: ginlix-ai/LangAlpha

name	web-scraping
description	Web scraping with Scrapling: MCP tool wrappers for quick fetching, plus direct Python API for advanced scraping with selectors, sessions, and spiders
license	MIT

Web Scraping with Scrapling

Overview

Two ways to scrape in the sandbox:

MCP tool wrappers (recommended for simple fetches) — call get(), fetch(), stealthy_fetch() directly. Synchronous, returns dicts.
Direct Python API (for advanced use) — import Scrapling classes for selectors, sessions, spiders. Async, returns Page objects.

MCP Tool Wrappers (via Python)

Auto-registered as top-level functions in the sandbox. No imports needed. Synchronous — no await.

Quick fetches can run inline via ExecuteCode. For spiders, multi-URL crawls, or anything you'll iterate on, write the scraper to work/<task_name>/scraper.py and run it via Bash — edit-and-rerun beats resubmitting code.

Basic Usage

# Fast HTTP fetch → markdown
result = get(url="https://example.com", extraction_type="markdown")
print(result["status"])      # 200
print(result["url"])         # "https://example.com"
print(result["content"][0])  # markdown string (first element of list)

# Browser fetch for JS-rendered pages
result = fetch(url="https://spa-site.com", extraction_type="markdown", network_idle=True)

# Anti-bot bypass (Cloudflare, etc.)
result = stealthy_fetch(url="https://protected-site.com", extraction_type="markdown", solve_cloudflare=True)

Response Format

All MCP tools return a dict (not a Page object):

{
    "status": 200,
    "url": "https://example.com",
    "content": ["<markdown or html text>", ""]  # list, use [0] for content
}

No .css(), .xpath(), .find_all() methods — use BeautifulSoup to parse if needed
No .body, .headers, .cookies — only status, url, content
content is always a list; the actual text is content[0]
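
Because every wrapper returns the same three-key dict, the status check and the `content[0]` lookup can live in one small guard. The following is a minimal sketch: `first_content` is a hypothetical helper name (not part of the sandbox or Scrapling), and treating any non-2xx status as a failure is an assumption.

```python
def first_content(result: dict) -> str:
    """Return the text payload of an MCP-wrapper response dict.

    Hypothetical helper: raises on a non-2xx status or an empty content list.
    """
    status = result.get("status")
    if status is None or not 200 <= status < 300:
        raise RuntimeError(f"fetch failed: status={status} url={result.get('url')}")
    content = result.get("content") or []
    if not content or not content[0]:
        raise ValueError(f"empty content for {result.get('url')}")
    return content[0]


# Hand-built dict in the documented shape (no network call involved).
sample = {"status": 200, "url": "https://example.com", "content": ["# Example Domain", ""]}
text = first_content(sample)
```

In the sandbox the same helper could wrap a real call, for example `first_content(get(url=..., extraction_type="markdown"))`.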

CSS Selector with MCP Tools

The css_selector param returns raw HTML of matched elements, not parsed text:

# Returns HTML of matched elements — must parse manually
result = get(url="https://example.com", css_selector="h1", extraction_type="HTML")
html_fragment = result["content"][0]

# Parse with BeautifulSoup if you need text/attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_fragment, "html.parser")
titles = [h1.get_text() for h1 in soup.find_all("h1")]
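
If BeautifulSoup is not installed, the standard library's `html.parser` can extract the same text. This is a sketch of that swapped-in technique, not something the skill prescribes: `TagTextCollector` is a hypothetical class and the HTML fragment is made up for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every top-level occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.texts = []     # one accumulated string per matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.texts.append("")  # start a new entry for this element
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data  # includes text of nested child tags


# Made-up fragment standing in for result["content"][0].
html_fragment = "<h1>First <em>title</em></h1><h1>Second title</h1>"
collector = TagTextCollector("h1")
collector.feed(html_fragment)
titles = [t.strip() for t in collector.texts]
```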

Available Tools

| Function | Use case | Key params |
|---|---|---|
| `get(url, ...)` | Static pages, APIs | `impersonate`, `stealthy_headers`, `timeout` (seconds) |
| `fetch(url, ...)` | JS-rendered SPAs | `headless`, `network_idle`, `wait_selector`, `disable_resources`, `timeout` (ms) |
| `stealthy_fetch(url, ...)` | Anti-bot sites | All `fetch` params + `solve_cloudflare`, `hide_canvas` |
| `bulk_get(urls, ...)` | Parallel HTTP | `urls: list[str]`, same params as `get` |
| `bulk_fetch(urls, ...)` | Parallel browser | `urls: list[str]`, same params as `fetch` |
| `bulk_stealthy_fetch(urls, ...)` | Parallel stealth | `urls: list[str]`, same params as `stealthy_fetch` |
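
Note that `timeout` is in seconds for `get` but in milliseconds for `fetch`. A small helper can keep calling code in one unit. This is a sketch only: `timeout_kwargs` is a hypothetical name, and it assumes `stealthy_fetch` and the bulk variants use the same unit as the tool whose params they inherit.

```python
# Tools assumed to take `timeout` in milliseconds (fetch and everything that
# inherits its params); the remaining tools (get, bulk_get) take seconds.
MILLISECOND_TOOLS = {"fetch", "stealthy_fetch", "bulk_fetch", "bulk_stealthy_fetch"}


def timeout_kwargs(tool_name: str, seconds: float) -> dict:
    """Return a {'timeout': ...} kwarg in the unit the named tool expects."""
    value = seconds * 1000 if tool_name in MILLISECOND_TOOLS else seconds
    return {"timeout": value}


get_args = timeout_kwargs("get", 30)      # {'timeout': 30}
fetch_args = timeout_kwargs("fetch", 30)  # {'timeout': 30000}
```

The result can be splatted into a call, for example `fetch(url=..., **timeout_kwargs("fetch", 30))`.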

Common Parameters

| Param | Default | Notes |
|---|---|---|
| `extraction_type` | `"markdown"` | `"markdown"`, `"HTML"`, or `"text"` |
| `css_selector` | `None` | Returns raw HTML of matched elements |
| `main_content_only` | `True` | Extract `<body>` only |
| `proxy` | `None` | Proxy URL |

Direct Python API (Advanced)

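For selectors, sessions, spiders, or when you need the full Page object. Requires imports. The fetcher examples are async, so they must run inside a coroutine or a context that allows top-level await; the session and spider examples use synchronous entry points.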

Fetcher (Fast HTTP — Tier 1)

from scrapling.fetchers import AsyncFetcher

page = await AsyncFetcher.get("https://example.com", stealthy_headers=True)
print(page.status)       # 200
print(page.body)         # Raw bytes
print(page.headers)      # Response headers

# CSS selectors (Scrapy-style pseudo-elements)
titles = page.css("h1::text").getall()
links = page.css("a::attr(href)").getall()

# XPath
items = page.xpath("//div[@class='item']/text()").getall()

# BeautifulSoup-style
divs = page.find_all("div", class_="content")
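
In a standalone script, `await` has to run inside a coroutine driven by `asyncio.run`. The sketch below shows that wrapping and fetches several URLs concurrently with `asyncio.gather`. `fetch_statuses` is a hypothetical helper, and `fake_get` is a stand-in for `AsyncFetcher.get` so the example runs without network access; it assumes only that the returned page exposes `.status`.

```python
import asyncio
from types import SimpleNamespace


async def fetch_statuses(urls, fetch_one):
    """Fetch all URLs concurrently and map each URL to its response status."""
    pages = await asyncio.gather(*(fetch_one(url) for url in urls))
    return {url: page.status for url, page in zip(urls, pages)}


async def fake_get(url):
    """Stand-in for AsyncFetcher.get: returns an object with a .status attribute."""
    await asyncio.sleep(0)  # yield to the event loop like a real I/O call would
    return SimpleNamespace(status=200, url=url)


statuses = asyncio.run(
    fetch_statuses(["https://example.com/a", "https://example.com/b"], fake_get)
)
```

With Scrapling available, passing `AsyncFetcher.get` as `fetch_one` would be the intended substitution.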

DynamicFetcher (Browser — Tier 2)

from scrapling.fetchers import DynamicFetcher

page = await DynamicFetcher.async_fetch(
    "https://spa-website.com",
    headless=True,
    network_idle=True,
    disable_resources=True,
    timeout=30000,
    wait_selector=".data-table",
)
rows = page.css("table.data-table tr")
for row in rows:
    cells = row.css("td::text").getall()
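
The loop above reads each row's cells but does not keep them. One way to collect them is to pair every non-empty row with a list of column names. This is a sketch with made-up data: `rows_to_records` is a hypothetical helper, and the column names and values are illustrative only.

```python
def rows_to_records(header, cell_rows):
    """Pair each non-empty row of cell text with the column names.

    Rows without <td> text (such as a header row built from <th>) are skipped.
    """
    return [
        dict(zip(header, (cell.strip() for cell in cells)))
        for cells in cell_rows
        if cells
    ]


# Made-up values shaped like the output of row.css("td::text").getall() per row.
header = ["Ticker", "Price"]
cell_rows = [[], ["AAA ", "10.5"], ["BBB", " 7.25"]]
records = rows_to_records(header, cell_rows)
```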

StealthyFetcher (Anti-Bot — Tier 3)

from scrapling.fetchers import StealthyFetcher

page = await StealthyFetcher.async_fetch(
    "https://protected-site.com",
    headless=True,
    solve_cloudflare=True,
    network_idle=True,
)

Sessions (Persistent Connections)

from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    login_page = session.post("https://site.com/login", data={...})
    dashboard = session.get("https://site.com/dashboard")
    data = dashboard.css(".user-data::text").getall()

Spider (Multi-Page Crawl)

from scrapling.spiders import Spider, Request, Response

class PriceScraper(Spider):
    name = "prices"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        for product in response.css(".product"):
            yield {
                "name": product.css(".name::text").get(),
                "price": product.css(".price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield Request(next_page)

spider = PriceScraper()
result = spider.start()
result.items.to_json("results/prices.json")
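
Pagination links are often relative (`?page=2`, `/products/page/2`). The source does not say whether `Request` resolves relative URLs, so a defensive option is to resolve the link against the current page URL first. A minimal sketch using the standard library; `next_page_url` is a hypothetical helper.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a pagination href against the page it was found on.

    Returns None when there is no next link (href is None or empty).
    """
    if not href:
        return None
    return urljoin(current_url, href)


resolved = next_page_url("https://example.com/products?page=1", "?page=2")
absolute = next_page_url("https://example.com/products", "https://example.com/products/page/3")
missing = next_page_url("https://example.com/products", None)
```

Inside `parse`, this could be used as `yield Request(next_page_url(response.url, next_page))`, assuming the response object exposes the page address as `response.url`.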

Converting HTML to Markdown

import html2text

converter = html2text.HTML2Text()
converter.body_width = 0  # No line wrapping
# html_string is any HTML text, e.g. result["content"][0] from get(..., extraction_type="HTML")
markdown = converter.handle(html_string)

When to Use Which

| Need | Use |
|---|---|
| Quick page content as markdown | MCP `get()` or `fetch()` |
| Extract specific elements (CSS/XPath) | Direct Python API with selectors |
| Login + scrape authenticated pages | Direct Python API with sessions |
| Crawl many pages with pagination | Direct Python API with Spider |
| Bypass Cloudflare | MCP `stealthy_fetch()` or direct `StealthyFetcher` |
| Save results to file | Direct Python API (spider `.to_json()`) |
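
When it is unclear up front which tier a site needs, one option is to start with the cheapest tool and escalate only on failure. The sketch below is an assumption-laden illustration, not a documented workflow: `fetch_with_escalation` is a hypothetical helper, the "good result" rule (status 200 and non-empty `content[0]`) is an assumption, and the two stubs stand in for the sandbox's `get()` and `fetch()` so the example runs offline.

```python
def fetch_with_escalation(url, tiers, min_length=1):
    """Try each (name, fetch_fn) tier in order; return (name, result) for the first good one.

    A result counts as good when status is 200 and content[0] has at least
    `min_length` characters. Raises RuntimeError if every tier fails.
    """
    failures = []
    for name, fetch_fn in tiers:
        try:
            result = fetch_fn(url=url, extraction_type="markdown")
        except Exception as exc:  # a tier may raise instead of returning a status
            failures.append(f"{name}: {exc!r}")
            continue
        content = result.get("content") or [""]
        if result.get("status") == 200 and len(content[0]) >= min_length:
            return name, result
        failures.append(f"{name}: status={result.get('status')}")
    raise RuntimeError(f"all tiers failed for {url}: {failures}")


# Stubs standing in for the sandbox's get() and fetch().
def stub_get(url, extraction_type):
    return {"status": 403, "url": url, "content": [""]}


def stub_fetch(url, extraction_type):
    return {"status": 200, "url": url, "content": ["# Rendered page", ""]}


tier_used, page = fetch_with_escalation(
    "https://example.com", [("get", stub_get), ("fetch", stub_fetch)]
)
```

In the sandbox, the tier list would presumably be `[("get", get), ("fetch", fetch), ("stealthy_fetch", stealthy_fetch)]`, with tool-specific options such as `solve_cloudflare=True` bound beforehand via `functools.partial`.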
