integrate

Embed this repo's scraping primitives into your own autonomous agent. Use this when you want to call the scraping engine programmatically - not via CLI - and drive the generate/execute/repair loop from your own code with your own LLM.

Install command

npx skills add https://github.com/heldernoid/scrapping --skill integrate

Copy this command and paste it into Claude Code to install the skill.

Source: heldernoid/scrapping (0 stars, 0 forks)

Updated: 11 April 2026, 15:43

More from this repository (heldernoid/scrapping)

autonomous-scrape (2026-04-11)
Run the full AI-driven scraping loop: the LLM generates a config, executes it, validates the results, and repairs the config if extraction fails. Use this when you want to scrape a site with zero human configuration.

generate-config (2026-04-11)
Use an LLM to automatically generate a scraper config from a URL. Use this when you have a target URL and a list of fields to extract but no existing config. The LLM inspects the page structure and writes the JSON config for you.

mcp-scraping (2026-04-11)
Expose the scraping toolkit as MCP tools so any MCP-compatible AI agent can autonomously discover, configure, and execute web scrapes. Use this to connect Claude Desktop, Claude Code, or any MCP client to the scraping engine.

scrape-url (2026-04-11)
Scrape structured data from a URL using a declarative JSON config. Use this when you have a scraper config (or can load one) and want to extract data from a website into structured JSON records.


Minimal Working Example

import asyncio
import sys

sys.path.insert(0, "/path/to/scrapping/agents")

from config_generator import generate_config
from autonomous_scraper import run

# 1. LLM generates config from page structure
config = generate_config("https://target-site.com/listings")

# 2. Execute the scrape
results = asyncio.run(run("https://target-site.com/listings", config, max_items=50))
print(results)  # list of dicts, one per record

Full API Reference

`agents/llm.py`

| Function | Signature | Returns |
|---|---|---|
| chat | (messages: list[dict], temperature=0.2, max_tokens=4096) | str - assistant reply |
| get_backend | () | "openrouter" or "ollama" |
| get_model | () | model name string |
| get_client | () | OpenAI client instance |

`agents/config_generator.py`

| Function | Signature | Returns |
|---|---|---|
| summarise_html | (html: str, max_elements=60) | compact HTML structure string for LLM |
| generate_config | (url: str) | dict - full scraper config |

`agents/autonomous_scraper.py`

| Function | Signature | Returns |
|---|---|---|
| fetch_static | async (url: str, headers=None) | str - HTML |
| fetch_playwright | async (url: str) | str - HTML after JS execution |
| select | (soup, selector: str) | list of BS4 tags |
| extract_field | (soup, name: str, cfg: dict) | (name, value) tuple |
| scrape_detail_page | (html: str, config: dict) | dict - one record |
| generate_urls | (source: dict) | list[str] - paginated URLs |
| validate_results | (results: list[dict], config: dict) | str - failure reason or "" |
| run | async (url: str, config: dict, max_items=20) | list[dict] - all records |

Plug In Your Own LLM

import agents.llm as llm


def my_chat(messages, temperature=0.2, max_tokens=4096):
    # call your own LLM here; my_llm_client is a placeholder for your client
    return my_llm_client.complete(messages)


llm.chat = my_chat  # monkey-patch before importing config_generator
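If your own model client takes a single prompt string rather than a message list, a small adapter can give it the same call shape as `chat`. The following is a minimal sketch under that assumption. `make_chat` and `echo_backend` are hypothetical names used only for illustration and are not part of the repository.

```python
from typing import Callable


def make_chat(complete: Callable[[str], str]) -> Callable[..., str]:
    """Wrap a prompt-in/text-out callable so it matches the chat() signature."""

    def chat(messages: list[dict], temperature: float = 0.2, max_tokens: int = 4096) -> str:
        # Flatten the message list into one prompt. A real client may accept
        # the list directly and honour temperature / max_tokens; here they
        # are accepted only for signature compatibility.
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return str(complete(prompt))

    return chat


def echo_backend(prompt: str) -> str:
    # Stand-in backend for demonstration only (hypothetical).
    return prompt.upper()


my_chat = make_chat(echo_backend)
reply = my_chat([{"role": "user", "content": "hi"}])
```

The resulting `my_chat` could then be assigned to `llm.chat` in the same way as the hand-written replacement.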

Build Your Own Repair Loop

import asyncio
import json

from config_generator import generate_config, summarise_html
from autonomous_scraper import run, validate_results, fetch_static
from llm import chat


async def scrape_with_repair(url: str, max_retries: int = 3):
    config = generate_config(url)
    results = []  # defined even if max_retries is 0

    for attempt in range(max_retries):
        results = await run(url, config, max_items=20)
        failure = validate_results(results, config)
        if not failure:
            return results  # success
        if attempt == max_retries - 1:
            break  # no attempts left, so a repaired config would never run

        # Ask LLM to repair the config
        html = await fetch_static(url)
        structure = summarise_html(html)
        repair_prompt = f"""
The scraper config failed: {failure}

Page structure:
{structure}

Current config:
{json.dumps(config, indent=2)}

Return a corrected config JSON only.
"""
        response = chat([{"role": "user", "content": repair_prompt}])
        config = json.loads(response)  # raises if the reply is not bare JSON

    return results  # return whatever we have after max retries


results = asyncio.run(scrape_with_repair("https://target-site.com/listings"))
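A repair loop parses the LLM reply with `json.loads`, which fails if the model wraps the config in a Markdown fence or adds commentary around it. The sketch below shows one way to make that step more tolerant. `extract_json` is a hypothetical helper, not a function from the repository, and it assumes the reply contains exactly one top-level JSON object.

```python
import json
import re

# Built at runtime so this example contains no literal Markdown fence.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating fences and chatter."""
    # Prefer the contents of a fenced block if the reply has one.
    fenced = FENCED_BLOCK.search(reply)
    candidate = fenced.group(1) if fenced else reply
    # Then narrow to the outermost {...} span.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start : end + 1])
```

With this helper, `config = extract_json(response)` could stand in for the bare `json.loads(response)` call. A `ValueError` (which `json.JSONDecodeError` subclasses) then signals that the attempt should be retried or abandoned.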

Embed the MCP Server in Your Agent System

# Start as subprocess from your agent
import subprocess

proc = subprocess.Popen(
    ["python", "/path/to/scrapping/agents/mcp_server.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
# Then connect your MCP client to proc.stdin / proc.stdout
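If you are not using an MCP client library, the stdio transport can be driven by hand: each JSON-RPC 2.0 message is written as a single newline-terminated line. The sketch below only builds and parses such lines. It assumes the server in this repository follows the standard MCP stdio transport, which the source does not confirm. The protocol version string and the `clientInfo` values are illustrative assumptions.

```python
import json


def frame_message(payload: dict) -> bytes:
    """Serialise one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the line stays unbroken.
    return (json.dumps(payload, separators=(",", ":")) + "\n").encode("utf-8")


def initialize_request(request_id: int = 1) -> dict:
    """Build an MCP initialize request (field values are assumptions)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # assumed; match your server
            "capabilities": {},
            "clientInfo": {"name": "my-agent", "version": "0.1"},  # hypothetical
        },
    }


def read_message(line: bytes) -> dict:
    """Decode one line read from the server's stdout."""
    return json.loads(line.decode("utf-8"))


# Usage against the subprocess started above (not executed here):
#   proc.stdin.write(frame_message(initialize_request()))
#   proc.stdin.flush()
#   reply = read_message(proc.stdout.readline())
```

In practice an existing MCP client library is likely the safer choice, since it also handles the follow-up `initialized` notification and tool discovery.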

Config Schema Reference

{
  "render_mode": "static | playwright | auto",
  "sources": [
    {
      "url_template": "https://site.com/page/{n}",
      "pagination": {"start": 1, "step": 1, "max_pages": 10}
    }
  ],
  "listing": {
    "link_selector": "a.item-link",
    "link_prefix": "https://site.com"
  },
  "fields": {
    "field_name": {
      "selector": "css-selector",
      "retrieve": "plaintext | html | attr:NAME | regexp",
      "pattern": "regex with one capture group (regexp only)",
      "multiple": false
    }
  }
}
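To make the `sources` entry concrete, here is a sketch of how a `url_template` with `pagination` might expand into page URLs. `expand_urls` is a hypothetical stand-in written for illustration and is not the repository's `generate_urls`. It assumes `max_pages` is a count of pages rather than an upper bound on `{n}`, and that a source without `pagination` yields its template unchanged.

```python
def expand_urls(source: dict) -> list[str]:
    """Expand a source's url_template into one URL per page."""
    template = source["url_template"]
    pagination = source.get("pagination")
    if not pagination:
        return [template]
    start = pagination.get("start", 1)
    step = pagination.get("step", 1)
    max_pages = pagination.get("max_pages", 1)
    # Page i (0-based) gets the value start + i * step substituted for {n}.
    return [template.replace("{n}", str(start + i * step)) for i in range(max_pages)]


source = {
    "url_template": "https://site.com/page/{n}",
    "pagination": {"start": 1, "step": 1, "max_pages": 3},
}
urls = expand_urls(source)
```

Under these assumptions the example yields the URLs for pages 1, 2 and 3. Check the actual behaviour of `generate_urls` before relying on the same semantics.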


integrate

What This Skill Does

Minimal Working Example

Full API Reference

`agents/llm.py`

`agents/config_generator.py`

`agents/autonomous_scraper.py`

Plug In Your Own LLM

Build Your Own Repair Loop

Embed the MCP Server in Your Agent System

Config Schema Reference


name	integrate
description	Embed this repo's scraping primitives into your own autonomous agent. Use this when you want to call the scraping engine programmatically - not via CLI - and drive the generate/execute/repair loop from your own code with your own LLM.