# content-moderation-patterns
// Content moderation with Claude: pre-filter vs LLM-classify, categories, thresholds, HITL. Triggers: moderation, safety filter, policy enforcement, content classifier.
| field | value |
|---|---|
| name | content-moderation-patterns |
| description | Content moderation with Claude: pre-filter vs LLM-classify, categories, thresholds, HITL. Triggers: moderation, safety filter, policy enforcement, content classifier. |
| effort | medium |
| user-invocable | false |
| allowed-tools | Read |
Two-stage pattern that balances cost, latency, and quality: cheap deterministic filters first, then LLM classification only on survivors.
```
[ input ]
    │
    ▼
[ pre-filter ] ── (regex, allow/block lists, length check) ──► reject early
    │
    ▼
[ LLM classifier ] ── (Haiku, structured output) ──► categories + confidence
    │
    ▼
[ decision router ]
    ├── high confidence + policy violation → reject
    ├── high confidence + clean → pass
    └── low confidence or edge categories → human review queue
```
Catch the obvious cases before paying an LLM call:
```python
import re

BANNED_PATTERNS = [
    re.compile(r"\b(banned_term_1|banned_term_2)\b", re.I),
    re.compile(r"\bhttps?://(?!allowed-domain\.com)", re.I),  # external links
]

def pre_filter(text: str) -> tuple[bool, str]:
    if len(text) > 10_000:
        return False, "too_long"
    for pat in BANNED_PATTERNS:
        if pat.search(text):
            return False, f"banned_pattern:{pat.pattern}"
    return True, "pass"
```
Roughly 40-70% of spammy input should die here. Log counts by rule so you can tune.
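A cheap way to get that logging, sketched with a stdlib `Counter` keyed on the reason string `pre_filter` returns (the reason values below are illustrative):

```python
from collections import Counter

# Hypothetical tally: count rejections by the reason string pre_filter returns.
rule_hits = Counter()

def record(reason: str) -> None:
    if reason != "pass":
        rule_hits[reason] += 1

for reason in ["too_long", "banned_pattern:spam", "pass", "too_long"]:
    record(reason)

# rule_hits.most_common() now ranks rules by how often they fire
```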
Use the smallest capable model. Haiku is usually right for moderation.
```python
import anthropic

client = anthropic.Anthropic()

CATEGORIES = ["harassment", "self_harm", "spam", "off_topic", "pii", "clean"]

def extract_tool_result(response) -> dict:
    # The forced tool call's input is the structured classification.
    return next(b.input for b in response.content if b.type == "tool_use")

def classify(text: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        tools=[{
            "name": "moderate",
            "description": "Classify content against policy",
            "input_schema": {
                "type": "object",
                "properties": {
                    "categories": {
                        "type": "array",
                        "items": {"type": "string", "enum": CATEGORIES}
                    },
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
                    "reasoning": {"type": "string", "maxLength": 200}
                },
                "required": ["categories", "confidence", "reasoning"]
            }
        }],
        tool_choice={"type": "tool", "name": "moderate"},
        system=POLICY_DESCRIPTION,  # cached — stable across requests
        messages=[{"role": "user", "content": text}]
    )
    return extract_tool_result(response)
```
Cache the policy description — it's the same on every call. See prompt-caching-patterns.
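With the Anthropic Messages API, caching is opt-in per block: mark the stable policy text with `cache_control` so repeated calls reuse the cached prefix. A minimal sketch, assuming `POLICY_DESCRIPTION` holds your real policy text:

```python
POLICY_DESCRIPTION = "Reject harassment, self-harm content, spam, and PII leaks."  # stand-in

# System prompt as a list of content blocks, with the stable policy marked cacheable.
cached_system = [{
    "type": "text",
    "text": POLICY_DESCRIPTION,
    "cache_control": {"type": "ephemeral"},
}]

# Then pass it through: client.messages.create(..., system=cached_system, ...)
```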
Category design tips:

- Include a `clean` category — easier than trying to define "not bad".
- `harassment` and `hate_speech` should be merged, or clearly separated by specific criteria in the policy doc.
- Include an `unclear` / `needs_review` category — it gives the model a graceful escape hatch instead of forcing a wrong label.

Route on categories plus confidence:

```python
BLOCK_CATEGORIES = {"harassment", "self_harm", "pii"}  # set per your policy

def route(classification: dict) -> str:
    conf = classification["confidence"]
    cats = set(classification["categories"])
    if "clean" in cats and conf > 0.8:
        return "pass"
    if cats & BLOCK_CATEGORIES and conf > 0.85:
        return "reject"
    if cats & BLOCK_CATEGORIES:
        return "human_review"
    return "human_review"  # default to review on ambiguity
```
Thresholds belong in config, not code — they change as you learn.
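One lightweight shape for that, a sketch with illustrative defaults (the names here are hypothetical, not from any config library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModerationThresholds:
    pass_confidence: float = 0.8    # above this + "clean" -> pass
    reject_confidence: float = 0.85  # above this + block category -> reject

# Load real values from a config file or env in production; defaults are illustrative.
THRESHOLDS = ModerationThresholds()
```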
`human_review` cases go to a queue with the model's reasoning attached. Build a golden set of ~500 examples per category with ground truth, and track per-category accuracy against it as thresholds and the policy evolve.
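A small harness for that tracking might look like this (hypothetical helper, not a fixed API): it computes the fraction of each category's golden examples the classifier labels correctly.

```python
from collections import defaultdict

def per_category_recall(examples, classify_fn):
    """examples: (text, expected_category) pairs; classify_fn returns one category.
    Returns {category: fraction of that category's examples labeled correctly}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text, expected in examples:
        totals[expected] += 1
        if classify_fn(text) == expected:
            hits[expected] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Usage with a stub classifier standing in for the real LLM call:
golden = [("buy now!!!", "spam"), ("hello", "clean"), ("cheap pills", "spam")]
stub = lambda t: "spam" if "!" in t or "pills" in t else "clean"
scores = per_category_recall(golden, stub)
```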
| Anti-pattern | Why it bites | Fix |
|---|---|---|
| One huge prompt asking "is this okay?" | Unstable answers, no tracking | Structured categories + confidence |
| Using Opus for moderation | 10x cost, no accuracy gain for this task | Haiku is fine |
| Hiding policy in user message | Policy gets mixed with input | Policy in system prompt, cached |
| Binary block/allow only | No signal for edge cases | Add review queue |
| No audit trail | Can't improve | Log every decision with full classification |
- security-patterns — input validation at system boundaries
- prompt-caching-patterns — cache the policy doc
- model-routing-patterns — when to escalate from Haiku to Sonnet