| name | phi_detection |
| description | Scan repository for Protected Health Information (PHI) using HIPAA Safe Harbor patterns. Ensures evaluation data remains synthetic-only.
|
| version | 1.0.0 |
| author | ScribeGoat2 Team |
| license | MIT |
| safety_level | critical |
PHI Detection Skill
Purpose
Ensure no Protected Health Information (PHI) enters the evaluation pipeline.
Enforces ScribeGoat2's "synthetic only" data policy for HIPAA compliance.
When to Use
- Before committing new scenario files
- CI/CD pre-merge validation
- Periodic repository audits
- Before sharing evaluation data externally
Triggers
- "scan for PHI"
- "check for protected health information"
- "validate data is synthetic"
- "run PHI detection"
Tools
python scripts/detect_phi.py --strict
python scripts/detect_phi.py --path bloom_medical_eval/scenarios/
python scripts/detect_phi.py --verbose
Prerequisites
- Python 3.11+
- No external dependencies (uses stdlib only)
Input Schema
path:
type: path
default: "."
description: Directory or file to scan
strict:
type: boolean
default: false
description: Fail on warnings (provenance metadata)
verbose:
type: boolean
default: false
description: Show all matched patterns
Output Schema
status: enum
phi_detected: boolean
matches:
- file: string
pattern: string
severity: enum
examples: [string]
count: integer
files_scanned: integer
excluded_directories: [string]
PHI Patterns Detected
| Pattern | Severity | Example |
|---|
| SSN | HIGH | 123-45-6789 |
| Medical Record Number | HIGH | MRN: 12345678 |
| Full Date of Birth | HIGH | DOB: 01/15/1985 |
| Phone Number | MEDIUM | 555-123-4567 |
| Personal Email | MEDIUM | john.doe@gmail.com |
| Street Address | MEDIUM | 123 Main Street |
| Patient Full Name | HIGH | Patient: John Smith |
Whitelist Patterns
The following patterns are not flagged (legitimate use cases):
- Example domains (
example.com)
- Fake phone numbers (
555-xxxx)
- Toll-free numbers (
800-xxx-xxxx, 888-xxx-xxxx, etc.)
- Crisis hotlines (
988)
- Medical abbreviations (
PT, ST elevation)
Excluded Directories
These directories contain evaluation artifacts and are excluded:
results* - Evaluation outputs
reports/ - Generated reports
experiments/ - Experimental data
.private/ - Private test data
scripts/ - Source code
docs/ - Documentation
Success Criteria
| Metric | Threshold | Severity |
|---|
phi_detected | false | 🔴 Critical |
high_severity_matches | 0 | 🔴 Critical |
medium_severity_matches | 0 | 🟠 High |
Safety Gates
- gate: no_phi_detected
metric: phi_detected
operator: "=="
threshold: false
action: block_merge
severity: critical
- gate: no_high_severity
metric: high_severity_matches
operator: "=="
threshold: 0
action: block_merge
severity: critical
HIPAA Safe Harbor Compliance
This skill implements detection of HIPAA's 18 Safe Harbor identifiers:
- Names ✅
- Geographic data ✅
- Dates (except year) ✅
- Phone numbers ✅
- Fax numbers ✅
- Email addresses ✅
- Social Security numbers ✅
- Medical record numbers ✅
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- Web URLs
- IP addresses
- Biometric identifiers
- Full-face photographs
- Any other unique identifying number
Related Skills
bloom_integrity_verification - Verify data integrity after PHI check
crisis_persistence_eval - Requires PHI-clean scenarios
Documentation