Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification.
allowed-tools: Read, Bash, Grep, Glob, Edit, Write
Research Archival
Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.
Self-Evolving Skill: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
FIRST - TodoWrite Task Templates
MANDATORY: Select and load the appropriate template before any archival work.
Template A - Full Archival (scrape + save + issue)
1. Identity preflight — verify GH_ACCOUNT or resolve via curl /user
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
4. Survey labels — gh label list, reuse existing, max 3-6
5. Create GitHub Issue — use --body with heredoc or --body-file
6. Update frontmatter — add github_issue_url and github_issue_number
7. Post canonical backlink comment on Issue
Template B - Save Only (no issue)
1. Identity preflight (still required for consistency)
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
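The save step shared by Templates A and B can be sketched as below. Only the filename pattern YYYY-MM-DD-{slug}-{source_type}.md comes from the templates; the helper names (slugify, archive_filename, write_archive) and the frontmatter keys (source_url, source_type, scraped_at) are illustrative assumptions, not the skill's actual schema.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: lowercase the title and collapse non-alphanumerics to hyphens.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Build the archive filename: YYYY-MM-DD-{slug}-{source_type}.md
archive_filename() {
  local date="$1" title="$2" source_type="$3"
  printf '%s-%s-%s.md\n' "$date" "$(slugify "$title")" "$source_type"
}

# Write YAML frontmatter followed by the scraped body.
# The keys below are assumed for illustration only.
write_archive() {
  local path="$1" url="$2" source_type="$3" scraped_at="$4" body="$5"
  {
    printf -- '---\n'
    printf 'source_url: "%s"\n' "$url"
    printf 'source_type: %s\n' "$source_type"
    printf 'scraped_at: %s\n' "$scraped_at"
    printf -- '---\n\n'
    printf '%s\n' "$body"
  } > "$path"
}
```

A call such as `write_archive "$(archive_filename "$(date +%F)" "$TITLE" chatgpt)" "$URL" chatgpt "$(date +%F)" "$CONTENT"` would tie the pieces together, with TITLE, URL and CONTENT supplied by the scrape step.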
Template C - Issue Only (file already exists)
1. Identity preflight
2. Read existing file frontmatter
3. Survey labels — gh label list, reuse existing, max 3-6
4. Create GitHub Issue — use --body with heredoc or --body-file
5. Update file frontmatter with issue cross-reference
6. Post canonical backlink comment on Issue
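The issue-creation and cross-reference steps might look like the sketch below. The gh subcommands and flags used (gh label list, gh issue create --body-file, gh issue comment) are standard GitHub CLI; the function names, the separate body file, and the wording of the backlink comment are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -eu

# gh issue create prints the new issue URL; the number is its last path segment.
issue_number_from_url() {
  printf '%s\n' "${1##*/}"
}

# Insert the two cross-reference keys just before the closing frontmatter fence.
add_issue_crossref() {
  local file="$1" url="$2" number="$3" tmp
  tmp="$(mktemp)"
  awk -v url="$url" -v num="$number" '
    /^---$/ {
      fences++
      if (fences == 2) {
        print "github_issue_url: " url
        print "github_issue_number: " num
      }
    }
    { print }
  ' "$file" > "$tmp"
  cat "$tmp" > "$file"   # overwrite in place so file permissions are preserved
  rm -f "$tmp"
}

# Not executed here: needs an authenticated gh and a passed identity preflight.
create_issue_for_archive() {
  local file="$1" body_file="$2" title="$3" labels="$4" url number
  gh label list --limit 100 >&2   # survey existing labels before choosing any
  url="$(gh issue create --title "$title" --body-file "$body_file" --label "$labels")"
  number="$(issue_number_from_url "$url")"
  add_issue_crossref "$file" "$url" "$number"
  # Assumed backlink wording; the skill's canonical text is not shown in this section.
  gh issue comment "$number" --body "Archived at: $file"
}
```

Passing the body through --body-file rather than an inline string avoids shell-quoting problems with long scraped content.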
Identity Preflight (MANDATORY — Step 0)
MUST execute before any gh write command. Non-negotiable.
The gh-repo-identity-guard.mjs PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.
Resolution Order
1. Fast path: GH_ACCOUNT env var (set by mise per-directory)
2. Token filename: scan ~/.claude/.secrets/gh-token-* for a single base match
3. API call: curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user
BLOCK if mismatch — display diagnostic and do NOT continue to any gh write operation.
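A minimal sketch of the resolution order and the blocking check follows. It assumes that "single base match" means exactly one gh-token-* file exists and that the text after the gh-token- prefix is the account name; the SECRETS_DIR override and both function names are hypothetical and exist only to make the sketch testable.

```shell
#!/usr/bin/env bash
set -eu

# Resolve the active GitHub account: env var, then token filename, then API.
resolve_gh_account() {
  local secrets_dir="${SECRETS_DIR:-$HOME/.claude/.secrets}" name

  # 1. Fast path: GH_ACCOUNT is already set for this directory.
  if [ -n "${GH_ACCOUNT:-}" ]; then
    printf '%s\n' "$GH_ACCOUNT"
    return 0
  fi

  # 2. Token filename: accept only when exactly one gh-token-* file exists (assumed rule).
  set -- "$secrets_dir"/gh-token-*
  if [ "$#" -eq 1 ] && [ -e "$1" ]; then
    name="${1##*/}"
    printf '%s\n' "${name#gh-token-}"
    return 0
  fi

  # 3. API call: ask GitHub which login the token belongs to.
  if [ -n "${GH_TOKEN:-}" ]; then
    curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user \
      | sed -n 's/^ *"login": *"\([^"]*\)".*/\1/p' | head -n1
    return 0
  fi
  return 1
}

# Return non-zero (block) when the resolved account is not the expected one.
assert_identity() {
  local expected="$1" actual
  actual="$(resolve_gh_account)" || actual=""
  if [ "$actual" != "$expected" ]; then
    echo "BLOCKED: expected '$expected' but resolved '${actual:-<none>}'" >&2
    return 1
  fi
}
```

A caller would run `assert_identity "$EXPECTED_OWNER" || exit 1` before any gh write command, where EXPECTED_OWNER is whatever account the target repository requires.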
Scraping Workflow
Route scrape requests based on URL pattern. See url-routing.md for full details.
Decision Tree
URL contains chatgpt.com/share/
→ Jina Reader (https://r.jina.ai/{URL})
→ Use curl (not WebFetch — it summarizes instead of returning raw)
URL contains gemini.google.com/share/
→ Firecrawl (JS-heavy SPA)
→ Preflight: ping -c1 -W2 littleblack
URL contains claude.ai/artifacts/ or is a static web page
→ Jina Reader (https://r.jina.ai/{URL})
→ Use WebFetch or curl
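The decision tree maps naturally onto a case statement. In this sketch the returned labels (jina-curl, firecrawl, jina) and the function names are hypothetical; the URL patterns and the r.jina.ai prefix are taken from the tree above.

```shell
#!/usr/bin/env bash
set -eu

# Pick a scraper for a URL, following the decision tree.
route_url() {
  case "$1" in
    *chatgpt.com/share/*)       echo "jina-curl" ;;  # Jina Reader, fetched with curl only
    *gemini.google.com/share/*) echo "firecrawl" ;;  # JS-heavy SPA, needs Firecrawl
    *)                          echo "jina" ;;       # claude.ai artifacts and static pages
  esac
}

# Build the Jina Reader URL for a target page.
jina_url() {
  printf 'https://r.jina.ai/%s\n' "$1"
}
```

For a jina-curl or jina result, `curl -s "$(jina_url "$URL")"` fetches the raw markdown; a firecrawl result should first pass the host preflight (ping -c1 -W2 littleblack).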
Firecrawl Scrape (with Health Check + Auto-Revival)
CRITICAL: Firecrawl containers can show "Up" in docker ps while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.
/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail
# URL and SLUG must be set in the environment; set -u aborts if either is missing.

# Step 1: Check Tailscale connectivity (littleblack primary, ZeroTier legacy at 172.25.236.1)
if ! ping -c1 -W2 littleblack >/dev/null 2>&1; then
  echo "ERROR: Firecrawl host unreachable. Check Tailscale: tailscale status"
  exit 1
fi

# Step 2: Deep health check: test the actual API response, not just container status.
# Port 3003 (wrapper) may accept TCP but return empty if Firecrawl API (3002) is dead inside.
# curl -w already prints the status code (000 on connection failure) even when it exits
# non-zero, so fall back to 000 only when nothing was printed (e.g. ssh itself failed).
HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
  -X POST http://localhost:3002/v1/scrape \
  -H "Content-Type: application/json" \
  -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || true)
HTTP_CODE="${HTTP_CODE:-000}"

if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
  echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."

  # Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
  ssh littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -i "stalled\|error\|exit" || true'

  # Step 2b: Restart the critical containers
  ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
  echo "Containers restarted. Waiting 20s for API to initialize..."
  sleep 20

  # Step 2c: Verify recovery
  HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
    -X POST http://localhost:3002/v1/scrape \
    -H "Content-Type: application/json" \
    -d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || true)
  HTTP_CODE="${HTTP_CODE:-000}"

  if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
    echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
    echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
    echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
    exit 1
  fi
  echo "Firecrawl recovered successfully."
fi

# Step 3: Scrape via wrapper
CONTENT=$(curl -s --max-time 120 "http://littleblack:3003/scrape?url=${URL}&name=${SLUG}")
if [ -z "$CONTENT" ]; then
  echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
  exit 1
fi
echo "$CONTENT"
SCRAPE_EOF
Known Failure Mode: Container "Up" But Processes Dead
Symptom: docker ps shows containers with status "Up 4 days" but curl localhost:3002 returns connection reset.
Root cause: Firecrawl worker exhausts RAM/CPU (observed: cpuUsage=0.998, memoryUsage=0.858). Internal Node.js processes exit but Docker container stays alive because the entrypoint shell is still running.
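To tell this failure mode apart from an ordinary outage, the container status and the API probe result can be compared side by side. The sketch below is illustrative: the function name and the labels zombie, down and healthy are hypothetical, and the unhealthy codes (000, 502, 503) are the ones the health check treats as failures.

```shell
#!/usr/bin/env bash
set -eu

# Classify Firecrawl from the docker status string and the probe's HTTP code.
classify_firecrawl() {
  local docker_status="$1" http_code="$2"
  case "$http_code" in
    000|502|503)
      case "$docker_status" in
        Up*) echo "zombie" ;;  # container alive, internal processes dead
        *)   echo "down" ;;    # container itself is not running
      esac
      ;;
    *) echo "healthy" ;;
  esac
}
```

The status string could come from `ssh littleblack "docker ps --filter name=firecrawl-api-1 --format '{{.Status}}'"`. A zombie result is the case where restarting the containers, rather than waiting, is the appropriate response.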
Did the command succeed? — If not, fix the instruction or error table that caused the failure.
Did parameters or output change? — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
Was a workaround needed? — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.