Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification.
allowed-tools: Read, Bash, Grep, Glob, Edit, Write
Research Archival
Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.
Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
FIRST - TodoWrite Task Templates
MANDATORY: Select and load the appropriate template before any archival work.
Template A - Full Archival (scrape + save + issue)
1. Identity preflight — verify GH_ACCOUNT or resolve via curl /user
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
4. Survey labels — gh label list, reuse existing, max 3-6
5. Create GitHub Issue — use --body with heredoc or --body-file
6. Update frontmatter — add github_issue_url and github_issue_number
7. Post canonical backlink comment on Issue
Template B - Save Only (no issue)
1. Identity preflight (still required for consistency)
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
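The save step can be sketched roughly as follows. The helper names (`slugify`, `archive_filename`, `write_archive`), the slug rules, and the frontmatter keys `source_url`, `source_type`, and `date` are assumptions for illustration; the source only fixes the `YYYY-MM-DD-{slug}-{source_type}.md` filename pattern.

```shell
# Hypothetical helpers: slug rules and frontmatter keys are assumptions.
slugify() {
  # Lowercase, collapse runs of non-alphanumerics into "-", trim edge dashes.
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

archive_filename() {
  # Usage: archive_filename <YYYY-MM-DD> <title> <source_type>
  printf '%s-%s-%s.md\n' "$1" "$(slugify "$2")" "$3"
}

write_archive() {
  # Usage: write_archive <path> <source_url> <source_type> <date> < scraped.md
  # Writes YAML frontmatter, then copies the scraped content from stdin.
  local path="$1" url="$2" type="$3" date="$4"
  {
    printf '%s\n' '---'
    printf 'source_url: "%s"\n' "$url"
    printf 'source_type: %s\n' "$type"
    printf 'date: %s\n' "$date"
    printf '%s\n\n' '---'
    cat
  } > "$path"
}
```

For example, `archive_filename 2025-01-15 "Attention Is All You Need!" chatgpt` yields a name that follows the pattern above, and the scraped markdown can be piped into `write_archive`.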
Template C - Issue Only (file already exists)
1. Identity preflight
2. Read existing file frontmatter
3. Survey labels — gh label list, reuse existing, max 3-6
4. Create GitHub Issue — use --body with heredoc or --body-file
5. Update file frontmatter with issue cross-reference
6. Post canonical backlink comment on Issue
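The issue steps shared by Templates A and C might look like the sketch below. It assumes the archived file itself is used as the issue body and that the backlink comment is a single line naming the file; neither format is specified here, and the function names are hypothetical. The `gh` subcommands used (`label list`, `issue create --body-file`, `issue comment`) are standard GitHub CLI commands.

```shell
issue_number_from_url() {
  # gh issue create prints the new issue URL; the number is its last path segment.
  printf '%s\n' "${1##*/}"
}

add_issue_to_frontmatter() {
  # Usage: add_issue_to_frontmatter <file> <issue_url>
  # Inserts the two cross-reference keys just before the closing '---'.
  local file="$1" url="$2" num tmp
  num=$(issue_number_from_url "$url")
  tmp=$(mktemp)
  awk -v url="$url" -v num="$num" '
    /^---$/ {
      fences++
      if (fences == 2) {
        print "github_issue_url: " url
        print "github_issue_number: " num
      }
    }
    { print }
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

create_issue_for_archive() {
  # Usage: create_issue_for_archive <file> <title> <comma-separated-labels>
  # Run only after the identity preflight has passed.
  local file="$1" title="$2" labels="$3" url
  gh label list --limit 100   # survey existing labels before choosing any
  url=$(gh issue create --title "$title" --body-file "$file" --label "$labels")
  add_issue_to_frontmatter "$file" "$url"
  # Assumed backlink format; adjust to the canonical one used in the repo.
  gh issue comment "$url" --body "Archived at: $file"
}
```

Keeping the frontmatter update in its own function means it can be rerun by hand if the issue was created but the file edit failed.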
Identity Preflight (MANDATORY — Step 0)
MUST execute before any gh write command. Non-negotiable.
The gh-repo-identity-guard.mjs PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.
Resolution Order
1. Fast path: GH_ACCOUNT env var (set by mise per-directory)
2. Token filename: scan ~/.claude/.secrets/gh-token-* for single base match
3. API call: curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user
BLOCK if mismatch — display diagnostic and do NOT continue to any gh write operation.
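A minimal sketch of the resolution order, under stated assumptions: the account name is taken to be the filename suffix after `gh-token-`, the expected account is passed in by the caller, and the `login` field is pulled from the API response with `sed` rather than a JSON parser. The function names are hypothetical.

```shell
resolve_gh_account() {
  # Usage: resolve_gh_account [secrets_dir]
  local dir="${1:-$HOME/.claude/.secrets}" login
  # 1. Fast path: GH_ACCOUNT is already set.
  if [ -n "${GH_ACCOUNT:-}" ]; then
    printf '%s\n' "$GH_ACCOUNT"; return 0
  fi
  # 2. Token filename: accept only an unambiguous single match.
  set -- "$dir"/gh-token-*
  if [ "$#" -eq 1 ] && [ -f "$1" ]; then
    printf '%s\n' "${1##*/gh-token-}"; return 0
  fi
  # 3. API call: ask GitHub who the token belongs to.
  [ -n "${GH_TOKEN:-}" ] || return 1
  login=$(curl -s -H "Authorization: token $GH_TOKEN" https://api.github.com/user \
    | sed -n 's/.*"login": *"\([^"]*\)".*/\1/p' | head -n1)
  [ -n "$login" ] && printf '%s\n' "$login"
}

identity_preflight() {
  # Usage: identity_preflight <expected_account> [secrets_dir]
  # Returns non-zero (BLOCK) on any failure or mismatch.
  local expected="$1" actual
  if ! actual=$(resolve_gh_account "${2:-}"); then
    echo "BLOCKED: could not resolve GitHub identity" >&2; return 1
  fi
  if [ "$actual" != "$expected" ]; then
    echo "BLOCKED: authenticated as '$actual', expected '$expected'" >&2; return 1
  fi
}
```

A caller would gate every write on it, for example `identity_preflight "$REPO_OWNER" || exit 1`, where `REPO_OWNER` is a placeholder for however the expected account is known.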
Scraping Workflow
Route scrape requests based on URL pattern. See url-routing.md for full details.
Decision Tree
URL contains chatgpt.com/share/
→ Jina Reader (https://r.jina.ai/{URL})
→ Use curl, not WebFetch (WebFetch summarizes the page instead of returning the raw content)
URL contains gemini.google.com/share/
→ Firecrawl (JS-heavy SPA)
→ Preflight: ping -c1 -W2 littleblack
URL contains claude.ai/artifacts/ or is a static web page
→ Jina Reader (https://r.jina.ai/{URL})
→ Use WebFetch or curl
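The decision tree above can be expressed as a small routing function. This is a sketch only; `route_url`, `jina_reader_url`, and `scrape_via_jina` are hypothetical names, and url-routing.md remains the authoritative source for the rules.

```shell
route_url() {
  # Prints the scraper backend for a URL, following the decision tree.
  case "$1" in
    *gemini.google.com/share/*) echo "firecrawl" ;;  # JS-heavy SPA
    *chatgpt.com/share/*)       echo "jina" ;;       # fetch with curl, not WebFetch
    *)                          echo "jina" ;;       # claude.ai artifacts, static pages
  esac
}

jina_reader_url() {
  # Jina Reader takes the target URL appended to its base URL.
  printf 'https://r.jina.ai/%s\n' "$1"
}

scrape_via_jina() {
  # Raw fetch through Jina Reader; the 120 s timeout is an assumed value.
  curl -s --max-time 120 "$(jina_reader_url "$1")"
}
```

When `route_url` prints `firecrawl`, run the connectivity preflight before scraping.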
Firecrawl Scrape (with Health Check + Auto-Revival)
CRITICAL: Firecrawl containers can show "Up" in docker ps while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.
/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail

# Step 1: Check Tailscale connectivity (littleblack primary, ZeroTier legacy at 172.25.236.1)
if ! ping -c1 -W2 littleblack >/dev/null 2>&1; then
  echo "ERROR: Firecrawl host unreachable. Check Tailscale: tailscale status"
  exit 1
fi

# Step 2: Deep health check. Test the actual API response, not just container status.
# Port 3003 (wrapper) may accept TCP but return empty if Firecrawl API (3002) is dead inside.
# curl -w prints the status code itself (000 on connection failure), so -f and an
# "|| echo 000" fallback are avoided: together they would append a second code to the
# output and the comparisons below would never match.
HTTP_CODE=$(ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || true)
HTTP_CODE="${HTTP_CODE:-000}"  # empty output means ssh itself failed

if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
  echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."

  # Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
  ssh littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -i "stalled\|error\|exit" || true'

  # Step 2b: Restart the critical containers
  ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
  echo "Containers restarted. Waiting 20s for API to initialize..."
  sleep 20

  # Step 2c: Verify recovery
  HTTP_CODE=$(ssh littleblack 'curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST http://localhost:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || true)
  HTTP_CODE="${HTTP_CODE:-000}"

  if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
    echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
    echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
    echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
    exit 1
  fi
  echo "Firecrawl recovered successfully."
fi

# Step 3: Scrape via wrapper
CONTENT=$(curl -s --max-time 120 "http://littleblack:3003/scrape?url=${URL}&name=${SLUG}")
if [ -z "$CONTENT" ]; then
  echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
  exit 1
fi
echo "$CONTENT"
SCRAPE_EOF
Known Failure Mode: Container "Up" But Processes Dead
Symptom: docker ps shows containers with status "Up 4 days" but curl localhost:3002 returns connection reset.
Root cause: Firecrawl worker exhausts RAM/CPU (observed: cpuUsage=0.998, memoryUsage=0.858). Internal Node.js processes exit but Docker container stays alive because the entrypoint shell is still running.
Did the command succeed? — If not, fix the instruction or error table that caused the failure.
Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.