| name | anonymizer |
| description | Use when the user wants to anonymize a text dataset, redact PII, de-identify free-text data, or rewrite text to remove sensitive or inferable identifying information. Produces a runnable Python script that calls the NeMo Anonymizer pipeline (detection → replace or rewrite). |
| argument-hint | ["describe the dataset and how you want it anonymized"] |
Before You Start
Do not explore the workspace first. The workflow's data-inspection step shows you what you need.
Goal
Anonymize a text dataset using NeMo Anonymizer in the way the user describes:
$ARGUMENTS
The output is a single runnable Python script that builds an AnonymizerConfig, previews results on a few rows, inspects failures and quality metrics, and (on user approval) runs the full pipeline. The script is the durable artifact — the user keeps it for re-runs, version control, and production.
Workflow
Read workflows/interactive.md and follow it. Anonymization is high-stakes — there is no autopilot mode. Even when the user says "you decide" or "be opinionated", you still ask the minimum questions needed to choose risk_tolerance and phrase privacy_goal, because those are fundamentally the user's calls based on their regulatory and business context.
Rules
- Always preview before running the full pipeline. Preview is cheap; a full run can be expensive and slow.
- If
result.failed_records is non-empty after preview, fix that before tweaking strategy. Dropped rows are a model/provider/infra problem (rate limits, auth, etc.), not a config problem. Strategy knobs won't help. See docs/troubleshooting.md "Did the run actually complete cleanly?".
- Ask the user which mode. Briefly describe both: Replace detects entities and replaces each in place (faster, cheaper, keeps shape); Rewrite transforms the full text to also remove inferable identifiers (more expensive, may restructure). Use the data shape as a hint — free-text with implicit identifiers (clinical notes, biographies, depositions) leans Rewrite; structured records / log lines lean Replace — but the user picks.
- For cross-record consistency (same value → same replacement everywhere), use
Hash, not Substitute. Substitute is consistent within a row only.
- In Replace mode, default to
Substitute if the user hasn't specified a strategy. It's the most general-purpose choice and matches the bulk of production usage.
Annotate is for inspection, not production. Its output keeps the original entity text and is not privacy-safe. Use it during iteration to confirm detection is working, then switch.
- Always set
AnonymizerInput.data_summary, even briefly. It is the single cheapest quality lever and it improves both detection and rewrite.
- Never claim privacy guarantees. Anonymizer is best-effort. Outputs may need human review depending on
risk_tolerance. Tell the user this when you finalize.
Usage Tips and Common Pitfalls
Detect.entity_labels=None (the default) is permissive — the augmenter LLM may invent labels not in DEFAULT_ENTITY_LABELS. Setting an explicit list switches to strict mode where only the listed labels are detected. To add domain labels, extend the default, don't replace it: entity_labels=[*DEFAULT_ENTITY_LABELS, "clinical_facility", ...] (DEFAULT_ENTITY_LABELS is a tuple, so unpack it into a list). Match the snake_case convention of DEFAULT_ENTITY_LABELS.
- GLiNER is zero-shot — entity labels are natural-language concept names (e.g.
"clinical_facility", "internal_project_codename"), not codes or enum values. Any concept you can name in English is a label GLiNER can detect.
Rewrite.instructions is a dead field today — it exists on the model but the rewrite engine never reads it. Do not use it. Put rewriter guidance in privacy_goal.protect / privacy_goal.preserve instead.
risk_tolerance only applies to Rewrite mode, not Replace.
PrivacyGoal.protect and .preserve must each be 10–1000 chars and at least 3 words. Be specific (categories, named identifiers, structural facets); avoid generic phrasing like "preserve meaning".
- Validator pool is the only model role with built-in load-spreading. Set
entity_validator: [a, b, c] in models.yaml if rate limits drop rows. Other roles (rewriter, evaluator, etc.) are single-alias.
Reference Docs
The agent should consult these as it goes — do not try to enumerate field reference inline:
docs/concepts/choosing-a-strategy.md — decision tree for picking mode (Replace vs Rewrite), strategy (Substitute / Redact / Annotate / Hash), risk tolerance, privacy goal phrasing, and detection knobs.
docs/troubleshooting.md — symptom-first guide for diagnosing dropped rows, leakage, low utility, and pipeline failures. Read section by section as symptoms appear, not all at once.
docs/concepts/detection.md — detection internals (GLiNER threshold semantics, label catalogue, augmenter/validator behavior).
docs/concepts/models.md — model role configuration, validator pools.
Troubleshooting
Environment-level issues only. Quality and pipeline issues are in docs/troubleshooting.md.
anonymizer not installed: Tell the user nemo-anonymizer is not in this Python environment (requires Python ≥ 3.11). Ask if they want you to install it (pip install nemo-anonymizer) or do it themselves. Do not install without permission.
- Model aliases not configured: Anonymizer can't run without
models.yaml and providers.yaml. Tell the user to set these up — see docs/concepts/models.md. If they don't have a config yet, point them at src/anonymizer/config/default_model_configs/ for the shipped defaults.
- LLM calls failing at preview: Usually an auth issue (missing or invalid API key), a network problem, or a wrong base URL. See
docs/troubleshooting.md "Validation passed but preview errors at LLM call".
Output Template
Write a Python script to the current directory. Name it after the dataset (e.g. anonymize_clinical_notes.py, anonymize_support_logs.py). Use this template — fill in the TODO markers, drop unused sections.
"""Anonymize <dataset> using NeMo Anonymizer.
Generated by the anonymizer agent skill.
Usage:
python <this_script>.py # preview on 5 rows (fast, cheap)
python <this_script>.py --full # run on the full dataset
"""
from __future__ import annotations
import argparse
import sys
from anonymizer import (
Anonymizer,
AnonymizerConfig,
AnonymizerInput,
DEFAULT_ENTITY_LABELS,
Detect,
Substitute, Redact, Annotate, Hash,
Rewrite, PrivacyGoal,
)
def build_config() -> tuple[AnonymizerInput, AnonymizerConfig]:
"""Single source of truth for what we anonymize and how."""
data = AnonymizerInput(
source="TODO: path to .csv / .parquet / .jsonl",
text_column="TODO: name of the text column",
data_summary="TODO: one-line description of the data (domain, genre, anything non-obvious)",
)
detect = Detect(
gliner_threshold=0.3,
)
config = AnonymizerConfig(
detect=detect,
rewrite=Rewrite(
privacy_goal=PrivacyGoal(
protect="TODO: what must not appear in the output, even by inference",
preserve="TODO: what must be kept so the rewritten text is still useful",
),
risk_tolerance="low",
strict_entity_protection=False,
max_repair_iterations=3,
),
)
return data, config
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--full", action="store_true", help="Run on full dataset (default: preview 5 rows)")
parser.add_argument("--num-records", type=int, default=5, help="Rows to preview (ignored with --full)")
args = parser.parse_args()
anonymizer = Anonymizer()
data, config = build_config()
if args.full:
result = anonymizer.run(config=config, data=data)
out_path = "output.parquet"
result.dataframe.to_parquet(out_path)
print(f"Wrote {len(result.dataframe)} rows to {out_path}")
else:
result = anonymizer.preview(config=config, data=data, num_records=args.num_records)
print(f"Previewed {len(result.dataframe)} rows.")
result.trace_dataframe.to_parquet("preview.parquet")
print("Saved: preview.parquet (load with pd.read_parquet)")
if result.failed_records:
print(f"\n⚠️ {len(result.failed_records)} record(s) failed:")
for fr in result.failed_records[:3]:
print(f" - record_id={fr.record_id} step={fr.step} reason={fr.reason}")
print("\nFix dropped rows before tweaking strategy. See docs/troubleshooting.md.")
sys.exit(1)
if config.rewrite is not None:
df = result.dataframe
print(f"\nleakage_mass: mean={df['leakage_mass'].mean():.3f} max={df['leakage_mass'].max():.3f}")
print(f"utility_score: mean={df['utility_score'].mean():.3f} min={df['utility_score'].min():.3f}")
print(f"flagged for review: {int(df['needs_human_review'].sum())} / {len(df)}")
if __name__ == "__main__":
main()