html2md

Updated June 18, 2026, 12:57

Use when converting a web page (URL) or a saved .html/.htm/.mhtml/.webarchive into clean Markdown — a web-clipper for Obsidian notes and a universal HTML→Markdown step for agent workflows. Triggers include "html to markdown", "url to markdown", "web page to obsidian", "webarchive to markdown", "mhtml to markdown", "scrape page to notes", "clip this article".

Source

MatrixFounder

MatrixFounder/Universal-skills

name	html2md
description	Use when converting a web page (URL) or a saved .html/.htm/.mhtml/.webarchive into clean Markdown — a web-clipper for Obsidian notes and a universal HTML→Markdown step for agent workflows. Triggers include "html to markdown", "url to markdown", "web page to obsidian", "webarchive to markdown", "mhtml to markdown", "scrape page to notes", "clip this article".
tier	2
version	1
license	LicenseRef-Proprietary

html2md skill

Purpose: Convert a web URL or a downloaded .html/.htm/.mhtml/.webarchive into clean Markdown — with YAML frontmatter and a shared _attachments/ folder — for two consumers: (1) an Obsidian web-clipper (self-contained note), and (2) a universal HTML→Markdown step any agent workflow can call.

1. Red Flags (Anti-Rationalization)

"I'll just paste the HTML and convert it in my head" → WRONG. The script reuses the docx-mastered turndown core (GFM tables, rowspan→flat grid) and the pdf-mastered cleaner (reader-mode, SPA-chrome strip); reimplementing in prose regresses on every edge case.
"I'll fetch the page with curl and strip tags with regex" → WRONG. Use the script — it has SSRF protection, dual-output, sha1-deduped attachments.

2. Capabilities

URL → Markdown (--engine lite|chrome|auto|jina): httpx+trafilatura lite fetch (also yields title/date/author) with retry + backoff + 429/Retry-After and a 403 → browser-UA escalation (honest UA by default); auto-fallback to headless Chrome for JS/SPA pages; or --engine jina (Jina Reader — server-side JS render, no local browser; sends the URL to an external service). --rate-limit throttles fetches.
Site-specific clean-source endpoints (proactive, auto/lite): Wikipedia /wiki/<Title> → the Parsoid REST page/html endpoint (the canonical page is chrome-only and strips to empty); arXiv /abs/ or /pdf/<id> → the full-text /html/<id> rendering (PDF-only papers return an actionable "use the pdf skill" hint); HackerNoon /<slug> → /lite/<slug>.
Empty-extraction guard: a substantial source page that converts to a near-empty body is a typed EmptyExtraction (exit 11) — never a silent exit 0 with an empty note.
Archive → Markdown: Safari .webarchive + Chrome .mhtml (subframe-aware) + plain .html/.htm, fully offline.
Obsidian emit: YAML frontmatter; --download-images → _attachments/ (sha1-dedup, relative links); dual-output (<slug>.md + <slug>.reader.md).
Agent step: --stdout (Markdown to stdout) + --json-errors envelope.
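
The site-specific rewrites listed above can be pictured as a small URL-mapping function. This is a minimal sketch, not the script's actual code: the function name `clean_source_url` is hypothetical, and the Wikipedia REST path (`/api/rest_v1/page/html/<Title>`) is an assumption about which endpoint "Parsoid REST page/html" refers to.

```python
import re
from urllib.parse import urlsplit


def clean_source_url(url: str) -> str:
    """Return a cleaner endpoint for known sites, else the URL unchanged.

    Hypothetical helper; the real script may map these sites differently.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path

    # Wikipedia article -> REST HTML endpoint (path is an assumption).
    if host.endswith(".wikipedia.org") and path.startswith("/wiki/"):
        # Titles containing "/" must be encoded so they stay one path segment.
        title = path[len("/wiki/"):].replace("/", "%2F")
        return f"https://{host}/api/rest_v1/page/html/{title}"

    # arXiv abstract or PDF link -> full-text HTML rendering.
    if host in ("arxiv.org", "www.arxiv.org"):
        match = re.match(r"^/(?:abs|pdf)/(.+?)(?:\.pdf)?$", path)
        if match:
            return f"https://arxiv.org/html/{match.group(1)}"

    # HackerNoon article -> lite rendering (single-segment slugs only).
    if host in ("hackernoon.com", "www.hackernoon.com"):
        slug = path.strip("/")
        if slug and "/" not in slug:
            return f"https://hackernoon.com/lite/{slug}"

    return url
```

Keeping the mapping in one pure function makes it easy to test without network access; the fetch layer can then apply it before the first request.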

3. Execution Mode

Mode: script-first.
Why this mode: HTML→Markdown is a deterministic, edge-case-heavy pipeline (fetch → clean → turndown → emit) reusing hardened docx/pdf code. Inline agent conversion regresses on tables, SPA chrome, encodings, and image handling, and has no SSRF protection.

4. Script Contract

Command:
- python3 scripts/html2md.py INPUT [OUTPUT_DIR] [--engine lite|chrome|auto|jina] [--reader-mode|--no-reader] [--download-images|--no-download-images] [--attachments-dir _attachments] [--archive-frame main|N|all|auto] [--max-bytes N] [--max-images N] [--retries N] [--rate-limit REQS_PER_SEC] [--stdout] [--json-errors]
INPUT: a http(s) URL, or a local .html/.htm/.mhtml/.mht/.webarchive.
OUTPUT_DIR: directory to write <slug>.md (+ <slug>.reader.md by default) and _attachments/ into. Omit → defaults to ./tmp/html2md_out/ (created on demand, in the working directory). --stdout switches to stdout mode, which emits YAML frontmatter + whole-page Markdown (not the reader-extracted text) and skips the reader variant and image files.
Defaults: --engine auto, dual-output ON (--no-reader to suppress), --download-images ON (--no-download-images keeps remote URLs), attachments dir _attachments, --archive-frame main.
Outputs: <slug>.md + <slug>.reader.md + _attachments/<sha1>.<ext>; or Markdown on stdout. <slug> is derived from the input filename / URL path (deterministic); the human title lives in frontmatter.
Failure semantics / exit codes:
- 0 ok
- 1 BadInput/ConvertFailed/internal
- 2 usage
- 3 EngineNotInstalled (Chrome requested, Playwright absent)
- 6 SelfOverwriteRefused
- 10 FetchFailed (unreachable / blocked / over --max-bytes; details.kind ∈ bot_blocked/auth_required/not_found/rate_limited/server_error/unreachable/pdf/binary/arxiv_no_html)
- 11 EmptyExtraction (substantial source → near-empty body)
--json-errors emits {v:1, error, code, type?, details?} on stderr.
Idempotency: same input → same output filenames + deduped attachments. URL fetches reflect live content (not idempotent across server changes).
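
The idempotency claim for attachments follows from content-addressed filenames: the same bytes always hash to the same `<sha1>.<ext>` name. A minimal sketch of that idea, assuming a hypothetical helper `store_attachment` (the script's real function names and layout are not shown in this document):

```python
import hashlib
from pathlib import Path


def store_attachment(data: bytes, ext: str, attachments_dir: Path) -> Path:
    """Write image bytes to <sha1>.<ext>; identical bytes map to one file.

    Hypothetical helper illustrating sha1-dedup, not the script's own code.
    """
    attachments_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha1(data).hexdigest()
    target = attachments_dir / f"{digest}.{ext.lstrip('.')}"
    # Skip the write when the same content was already stored.
    if not target.exists():
        target.write_bytes(data)
    return target
```

Because the name depends only on the bytes, re-running a conversion or clipping two pages that share an image leaves a single file in `_attachments/`.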

5. Safety Boundaries

Allowed scope: only the input + the named OUTPUT_DIR (and its _attachments/). Never writes elsewhere.
Image reads are confined: a malicious <img src="../../etc/passwd"> / file:///… / absolute path is refused — local image reads are confined to the input's base dir (CWE-22/73 guard).
SSRF protection (lite path): every fetch hop (initial + redirects) is refused if it resolves to a loopback / private / link-local / cloud-metadata (169.254.169.254) address; body is streamed with a --max-bytes abort; --max-images bounds remote fetches; non-http(s) top-level INPUT is treated as a local path, never fetched.
Honest-scope residuals: DNS-rebinding (resolve-then-connect TOCTOU) and the opt-in Chrome engine are NOT SSRF-hardened — run untrusted conversions in an egress-restricted sandbox. --engine jina sends the target URL to the external r.jina.ai service (it fetches server-side) — opt-in only, never part of auto; do not use it for sensitive/internal URLs. See references/html-to-markdown.md.
No global installs: deps live in scripts/.venv + scripts/node_modules.
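
Two of the guards above, the per-hop address check and the image-path confinement, can be sketched with the standard library alone. The function names are hypothetical and this is only an illustration under those assumptions; as the residuals note, a resolve-then-connect check like this does not close the DNS-rebinding window.

```python
import ipaddress
import socket
from pathlib import Path


def is_public_ip(addr: str) -> bool:
    """True only for addresses outside loopback/private/link-local ranges."""
    ip = ipaddress.ip_address(addr)
    return not (ip.is_loopback or ip.is_private or ip.is_link_local
                or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def host_is_fetchable(host: str, port: int = 443) -> bool:
    """Resolve the host and require every resolved address to be public."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)


def is_within(base_dir: Path, candidate: str) -> bool:
    """Confine a local image reference to base_dir (Python 3.9+)."""
    if "://" in candidate:
        # Scheme-qualified references such as file:///... are never local reads.
        return False
    resolved = (base_dir / candidate).resolve()
    return resolved.is_relative_to(base_dir.resolve())
```

A fetcher would call `host_is_fetchable` for the initial URL and again for each redirect target, and `is_within` before opening any image path found in a local HTML file.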

6. Validation Evidence

Local verification:
- bash scripts/install.sh — creates .venv (httpx, trafilatura), node_modules (turndown, turndown-plugin-gfm). --with-chrome adds Playwright Chromium.
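- python3 scripts/html2md.py examples/sample.html /tmp/h2m && test -s /tmp/h2m/sample.md (offline file → dual Markdown + frontmatter; test -s takes a single file, so name one output rather than globbing *.md, which matches both dual-output files).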
- ./scripts/.venv/bin/python -m unittest discover -s scripts/html2md/tests and -s scripts/tests — full unit + E2E suite (file/archive/url mocked + real tmp/ fixtures when present).
- bash scripts/tests/test_e2e.sh — runs the suite + the diff -q replication gate.
CI signal: python3 .claude/skills/skill-creator/scripts/validate_skill.py skills/html2md — exits 0.

7. Instructions

7.1 Clip a live URL into an Obsidian vault

python3 scripts/html2md.py https://example.com/article ./MyVault/Clips/

Produces article.md (whole) + article.reader.md (reader-extracted) + deduped _attachments/. Use --engine chrome (after install.sh --with-chrome) for JS/SPA pages.

7.2 Convert a saved archive offline

python3 scripts/html2md.py ./saved.webarchive ./out/ --archive-frame main
python3 scripts/html2md.py ./thread.mhtml ./out/ --archive-frame all

7.3 Use as a universal agent step

python3 scripts/html2md.py ./page.html --stdout --no-download-images --no-reader --json-errors

Whole-page Markdown on stdout; failures as a single-line JSON envelope.
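
An agent workflow might wrap this call as follows. This is a sketch under assumptions: the wrapper names are hypothetical, and the envelope is assumed to be strict JSON on the last stderr line, which the contract above describes only as a single-line `{v:1, error, code, type?, details?}` object.

```python
import json
import subprocess
from typing import Optional


def parse_error_envelope(stderr: str) -> Optional[dict]:
    """Return the last stderr line that parses as an error envelope, if any."""
    for line in reversed(stderr.strip().splitlines()):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "error" in obj and "code" in obj:
            return obj
    return None


def convert_to_markdown(source: str, script: str = "scripts/html2md.py") -> str:
    """Run html2md in stdout mode and return the Markdown, or raise on failure."""
    proc = subprocess.run(
        ["python3", script, source,
         "--stdout", "--no-download-images", "--no-reader", "--json-errors"],
        capture_output=True,
        text=True,
    )
    if proc.returncode == 0:
        return proc.stdout
    envelope = parse_error_envelope(proc.stderr) or {}
    raise RuntimeError(
        f"html2md exit {proc.returncode}: "
        f"{envelope.get('type')} {envelope.get('error')}"
    )
```

A caller can branch on `proc.returncode` before raising, for example retrying with `--engine chrome` on exit 11 or reporting `details.kind` on exit 10.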

8. Architecture & Replication (for maintainers)

html2md is the repo's first two-master skill (CLAUDE.md §2). It carries byte-identical replicas, enforced by a diff -q gate; do not edit them here:

web_clean/{archives,reader_mode,preprocess,dom_utils,normalize_css}.py — MASTER = pdf.
html2md_core.js — MASTER = docx.
_errors.py, _venv_bootstrap.py — MASTER = docx (4→5-skill).

The pdf render.py/chrome_engine.py/package __init__.py (weasyprint/playwright carriers) are never replicated; web_clean/__init__.py is an html2md-owned thin facade. See scripts/.AGENTS.md.
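
The byte-identical requirement can also be checked from Python. The sketch below mirrors what a `diff -q` gate does, using `filecmp`; the function name and the (master, replica) path pairs a caller would pass are hypothetical, since the actual gate lives in the repo's test script.

```python
import filecmp
from pathlib import Path
from typing import Iterable, List, Tuple


def drifted_replicas(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return replica paths that are missing or differ byte-for-byte from their master."""
    drifted = []
    for master, replica in pairs:
        same = (
            Path(master).is_file()
            and Path(replica).is_file()
            # shallow=False forces a content comparison, like diff -q.
            and filecmp.cmp(master, replica, shallow=False)
        )
        if not same:
            drifted.append(replica)
    return drifted
```

An empty result means every replica still matches its master; any listed path should be re-copied from the master rather than edited in place.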

9. License

10. Resources

references/html-to-markdown.md — decision tree (URL/archive/file; reader vs whole; lite vs chrome) + honest scope.
examples/basic-usage.md — copy-paste examples.