sweep-ai-safety

Updated: June 10, 2026, 16:15

Sweep recent AI safety research from curated sources (Anthropic alignment science / red team, OpenAI, GDM, Apollo, Redwood, METR, FAR AI, Truthful AI, alphaxiv, arXiv) and surface items matching tracked topic terms (inoculation prompting, reward hacking, exploration hacking, metagaming, eval gaming, OOCR, scheming, alignment faking, sandbagging, etc.). Use when asked to "sweep AI safety", "what's new in alignment", "any recent papers on X", "weekly safety digest", or for staying current on AI safety literature.

Source: yulonglin/dotfiles (creator: yulonglin)

Sweep AI Safety Research

A hybrid skill: a curated source registry + topic glossary Claude can consult during research, and a Python script that fetches feeds and produces a dated markdown digest of the last 7 days.

When to invoke

- User asks: "what's new in AI safety", "sweep alignment research", "any recent papers on X", "weekly safety digest", "what did Anthropic/Apollo/Redwood post recently"
- Before research planning, to check whether a question is already addressed by recent work
- When a user mentions an unfamiliar term (inoculation prompting, OOCR, exploration hacking, metagaming): consult terms.md
- Periodically: schedule via /loop 7d /sweep-ai-safety or a cron routine

Quick start

# Default: last 7 days, all sources, markdown to stdout
uv run ~/.claude/skills/sweep-ai-safety/sweep.py

# Save to a dated file
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output digest.md

# Wider window
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --since 30d

# Filter by topic term (matches in title/summary)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --term "reward hacking"

# Single source
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --source anthropic-alignment

# arXiv keyword search (uses term registry by default)
uv run ~/.claude/skills/sweep-ai-safety/sweep.py --arxiv-only
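
The `--since` values above use a `<number>d` form. The following is a minimal sketch of how such a window could be parsed and applied, assuming only day-suffixed values (the only form shown here). `parse_since` and `is_recent` are hypothetical names for illustration, not functions confirmed to exist in sweep.py.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_since(spec: str) -> timedelta:
    """Convert a window like '7d' or '30d' into a timedelta."""
    match = re.fullmatch(r"(\d+)d", spec.strip())
    if match is None:
        raise ValueError(f"unsupported window: {spec!r} (expected e.g. '7d')")
    return timedelta(days=int(match.group(1)))


def is_recent(published: datetime, spec: str, now: Optional[datetime] = None) -> bool:
    """True if `published` falls inside the window that ends at `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now - parse_since(spec) <= published <= now
```

Passing `now` explicitly keeps the check deterministic, which helps when re-running a sweep for a past date.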

Architecture

sweep-ai-safety/
├── SKILL.md          # this file
├── sources.yaml      # source registry (orgs, blogs, feed URLs)
├── terms.md          # topic-term glossary with short definitions
└── sweep.py          # fetcher (PEP 723, uv run directly)

Source registry (`sources.yaml`)

One entry per source. Fields:

- key: anthropic-alignment        # short id used in --source
  org: Anthropic
  name: Alignment Science blog
  url: https://alignment.anthropic.com/
  rss: https://alignment.anthropic.com/rss.xml   # null if no feed
  lane: rss | scrape               # rss != null <=> lane: rss
  kind: blog | papers | aggregator | researcher
  verified: false                  # flip to true after first successful fetch
  notes: ...

The script does NOT trust rss: blindly — on first run, it reports which feeds resolved and which need manual URL correction. Edit sources.yaml to fix.
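
The schema comment states one invariant: `rss != null <=> lane: rss`. A small check along these lines could catch entries that violate it before a sweep runs. This is a sketch under assumptions: entries are already loaded as plain dicts, `notes` is treated as optional, and `validate_source` is a hypothetical helper, not part of sweep.py.

```python
from typing import Any

# Field names and allowed values are taken from the schema above;
# treating every field except `notes` as required is an assumption.
REQUIRED_FIELDS = ("key", "org", "name", "url", "rss", "lane", "kind", "verified")
LANES = {"rss", "scrape"}
KINDS = {"blog", "papers", "aggregator", "researcher"}


def validate_source(entry: dict[str, Any]) -> list[str]:
    """Return the problems found in one registry entry (empty list if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        return problems
    if entry["lane"] not in LANES:
        problems.append(f"unknown lane: {entry['lane']!r}")
    if entry["kind"] not in KINDS:
        problems.append(f"unknown kind: {entry['kind']!r}")
    # Invariant from the schema comment: rss != null <=> lane: rss
    has_feed = entry["rss"] is not None
    if has_feed != (entry["lane"] == "rss"):
        problems.append("lane does not match rss (rss != null <=> lane: rss)")
    return problems
```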

Two-lane reality (RSS vs scrape)

Most curated orgs do not expose a working RSS feed. Live-verified state:

- RSS lane (have a real feed, fetched by the feed reader): openai-blog, openai-alignment, metr, redwood. These are the only sources that produce items from a plain sweep.py run.
- Scrape lane (rss: null, no usable feed; fetched via WebFetch on the landing url): anthropic-alignment, anthropic-redteam, anthropic-research, openai-safety, apollo, transluce, deepmind-safety, far-ai, truthful-ai, owain-evans, alphaxiv-safety. Anthropic (both blogs) and Apollo are SPAs/404: their /rss.xml serves HTML, not XML.
- arXiv (arxiv-terms): scrape-lane by the rss-null convention, but fetched by the script's dedicated arXiv export-API path, not the generic WebFetch scrape step.

A scrape-lane source returning 0 items is EXPECTED from a plain sweep.py run — the script only reads RSS feeds. Those sources surface only when the scrape step runs (WebFetch the url). Treat "0 items from a scrape-lane source" as "not fetched", not "nothing published". The scrape-lane fetch implementation lands in a later phase; until then, scrape-lane sources are checked manually via WebFetch (see reference-mode workflow).
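
The "not fetched" versus "nothing published" distinction can be made explicit when reading per-source counts. The sketch below assumes a plain sweep.py run and a dict of item counts keyed by source; `summarize_counts` is a hypothetical helper, and the key sets simply mirror the lists above.

```python
# Sources the script itself fetches on a plain run, per the lists above.
RSS_LANE = {"openai-blog", "openai-alignment", "metr", "redwood"}
FETCHED_BY_SCRIPT = RSS_LANE | {"arxiv-terms"}  # arXiv uses its own API path


def summarize_counts(counts: dict[str, int]) -> dict[str, str]:
    """Label each source so an empty scrape-lane result is not read as silence."""
    status = {}
    for key, count in counts.items():
        if count > 0:
            status[key] = f"{count} items"
        elif key in FETCHED_BY_SCRIPT:
            # Fetched, but empty: either nothing was published or the feed broke.
            status[key] = "0 items (verify the feed before relying on this)"
        else:
            # Scrape lane: a plain run never fetched this source at all.
            status[key] = "not fetched (scrape lane)"
    return status
```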

Zotero (dedup + sink)

Zotero is not a fetch source. It's the dedup reference (what's already been collected) and the storage sink (where surfaced items land). It's handled by the downstream pipeline, not by the sweep fetch itself — sweep.py neither reads from nor writes to Zotero.

Topic glossary (`terms.md`)

A flat list of tracked terms with one-sentence definitions and (where useful) seminal paper anchors. Used:

- By Claude as a reference when a user mentions a term: read the entry and the linked paper
- By sweep.py for cross-reference: items mentioning any registered term are tagged in the digest

Add a new term: append to terms.md. Optionally add it to the arxiv_search_terms: list in sources.yaml so the script searches arXiv for it.
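
Term tagging with spelling variants can be sketched as a case-insensitive regex match per term. The alias table here is hypothetical (the real entries live in terms.md), and `compile_terms` / `tag_item` are illustrative names rather than the script's actual functions.

```python
import re

# Hypothetical alias table for illustration only.
TERM_ALIASES = {
    "OOCR": ["OOCR", "out-of-context reasoning"],
    "reward hacking": ["reward hacking"],
}


def variant_pattern(variant: str) -> str:
    """Match the variant's words joined by any run of hyphens or whitespace."""
    words = re.split(r"[-\s]+", variant.strip())
    return r"[-\s]+".join(re.escape(word) for word in words)


def compile_terms(aliases: dict[str, list[str]]) -> dict[str, re.Pattern]:
    """Build one case-insensitive, word-bounded pattern per canonical term."""
    compiled = {}
    for term, variants in aliases.items():
        alternation = "|".join(variant_pattern(v) for v in variants)
        compiled[term] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return compiled


def tag_item(text: str, compiled: dict[str, re.Pattern]) -> list[str]:
    """Return the canonical terms whose pattern appears in a title or summary."""
    return [term for term, pattern in compiled.items() if pattern.search(text)]
```

Because hyphens and spaces are interchangeable in the pattern, "out-of-context reasoning" and "out of context reasoning" need only one alias entry.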

Reference-mode workflow (no script)

When the user mentions a term or asks about recent work without wanting a full sweep:

1. Open terms.md. Does the term have a glossary entry? Read it
2. Open sources.yaml and identify the likely source(s) (e.g., scheming → Apollo; sandbagging → Anthropic alignment / METR; inoculation prompting → Truthful AI / Owain Evans)
3. Use WebSearch or WebFetch on the relevant source's blog / publications page
4. If a paper title is mentioned, look it up via the arXiv API:
   - curl -sL "http://export.arxiv.org/api/query?search_query=ti:%22<title%20words>%22&max_results=5" (Atom XML)
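
The same title lookup can be sketched in Python using only the standard library. The endpoint and the `search_query` / `max_results` parameters come from the curl command above; `title_query_url` and `parse_entries` are hypothetical helper names, and the parser reads only a few Atom fields.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the response


def title_query_url(title: str, max_results: int = 5) -> str:
    """Build a quoted title search URL (ti:"...") for the arXiv API."""
    query = urllib.parse.urlencode(
        {"search_query": f'ti:"{title}"', "max_results": max_results},
        quote_via=urllib.parse.quote,  # encode spaces as %20 rather than +
    )
    return f"{ARXIV_API}?{query}"


def parse_entries(atom_xml: str) -> list[dict[str, str]]:
    """Extract id, title, and published date from an Atom response body."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.iter(f"{ATOM}entry"):
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default="").strip(),
            # Titles may be wrapped across lines; collapse the whitespace.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "published": entry.findtext(f"{ATOM}published", default="").strip(),
        })
    return entries
```

To fetch for real, pass `title_query_url(...)` to `urllib.request.urlopen` and decode the body before calling `parse_entries`; keep requests spaced out to stay within arXiv's rate limits.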

Sweep-mode workflow (script)

1. Run uv run ~/.claude/skills/sweep-ai-safety/sweep.py --output "$HOME/scratch/safety-digest-$(utc_date).md"
2. Review the digest. Flag any failed fetches; they indicate URLs that drifted
3. Update sources.yaml if URLs need correction; mark verified: true for ones that worked
4. Read the high-signal items (matched-term or known-author hits) in full via WebFetch

Failure modes & gotchas

| Issue | Why | Fix |
| --- | --- | --- |
| Most sources return 0 items | RSS URL drifted or site has no feed | Open the org's blog page, find the actual feed URL, update `sources.yaml`. If no feed exists, the org has to be checked manually via WebFetch |
| arXiv requests get throttled | arXiv rate-limits at ~5 req/3s; sticky penalty 30-60s if exceeded | Script already batches arXiv term searches. If still throttled, wait 60s |
| Same paper appears under multiple sources | A paper can be on arXiv + an org blog + alphaxiv | Script dedupes by arXiv ID and by normalized title (case + punctuation collapsed). Subtitles or site-specific suffixes will still slip through; flag duplicates manually |
| Term doesn't match because of variant spelling | "OOCR" vs "out-of-context reasoning" vs "out of context reasoning" | Add aliases to the term entry in `terms.md` and the regex in `sources.yaml` |
| Sandbox blocks external HTTP | Most non-allowlisted hosts return connection error from Claude Code's sandbox | Run with `dangerouslyDisableSandbox: true`, or run from a normal shell |
| Item is in the right time window but old content | Some blogs republish/redate posts | Cross-check the canonical URL date; trust arXiv `submittedDate` over blog dates |
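
The dedup rule in the table (arXiv ID first, then normalized title with case and punctuation collapsed) can be sketched as follows. This is an illustration under assumptions, not the script's actual code: items are dicts with hypothetical `title` and `url` keys, and the arXiv ID is read from the URL.

```python
import re
import string
from typing import Optional

# New-style arXiv IDs such as 2401.01234, with any version suffix ignored.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(title.lower().translate(_STRIP_PUNCT).split())


def arxiv_id(url: str) -> Optional[str]:
    """Return the arXiv ID embedded in a URL, or None if there is none."""
    match = ARXIV_ID.search(url)
    return match.group(1) if match else None


def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each arXiv ID or normalized title."""
    seen_ids, seen_titles, unique = set(), set(), []
    for item in items:
        paper_id = arxiv_id(item.get("url", ""))
        title = normalize_title(item.get("title", ""))
        if (paper_id and paper_id in seen_ids) or (title and title in seen_titles):
            continue
        if paper_id:
            seen_ids.add(paper_id)
        if title:
            seen_titles.add(title)
        unique.append(item)
    return unique
```

A title carrying a site-specific suffix normalizes to a different string, which is why such duplicates still need manual flagging.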

Verification policy (research integrity)

This skill surfaces candidates. Always verify before acting on a finding:

- Don't cite a paper from the digest without WebFetching the actual source
- Don't claim a term is "from paper X" without checking: the script's glossary is a starting point, not ground truth
- If the digest says "no items from <source> in last 7d", that means either nothing was published OR the feed isn't working; distinguish the two before relying on the absence

Adapting

- Add a source: append to sources.yaml, set verified: false, run sweep, update if it works
- Add a tracked term: append to terms.md with a one-line definition; optionally add to sources.yaml arxiv_search_terms: for arXiv inclusion
- Different cadence: pass --since 14d / --since 30d; or schedule via /loop 14d /sweep-ai-safety (or as a routine via /schedule)
- JSON output for programmatic use: pass --json (emits one item per line, NDJSON)
- Conference papers: the weekly sweep relies on arXiv catching venue cross-posts. Tag an item with its venue when the arXiv comment field names one (e.g. "Accepted at NeurIPS 2025"). A separate episodic --conference <venue> roundup mode (scraping accepted-paper lists plus safety workshops like SoLaR) is planned for when proceedings drop. NeurIPS/ICLR/ICML land ~3x/year, not weekly, so it doesn't belong in the weekly cadence.
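
For the `--json` mode mentioned above, a consumer only needs to parse one JSON object per line. The sketch below groups items by source; the `source` field name is an assumption for illustration, since the item schema is not documented here, and `group_by_source` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_by_source(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Parse NDJSON lines and bucket the items by their (assumed) source key."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        item = json.loads(line)
        grouped[item.get("source", "unknown")].append(item)
    return dict(grouped)
```

Passing `sys.stdin` as `lines` lets this read directly from a pipe, for example the output of a sweep.py run with `--json`.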
