June 1, 2026 · 5 min read
Sanitize: Stop PII from Reaching Your LLM
Your agent is about to send a customer's SSN, medical record, and home address to Claude. Aira catches it in under 5 milliseconds.
The Problem
Every AI application handles sensitive data. Patient records in healthcare. Financial details in banking. Employee PII in HR tools. The data flows through prompts, embeddings, and tool calls — often reaching third-party LLM APIs with no filtering.
GDPR says you need a legal basis to process personal data. HIPAA says PHI must be de-identified before sharing. Your customers ask “does our data reach your AI provider?” and you say “we have a DPA.” That's not enough.
Four File Types, One API
Aira Sanitize handles everything your agents process:
- Text — Prompts, chat messages, structured data. Scanned in-process, under 5ms.
- Images — Screenshots, photos, scanned documents. OCR + pixel-level redaction.
- PDFs — Contracts, reports, medical records. In-place black-box redaction via PyMuPDF.
- DICOM — Medical imaging. PS3.15 Annex E metadata scrubbing + burned-in text redaction. Conformance tags included.
43 Detection Patterns
Three layers of detection, running in parallel:
Regex patterns (27 PII + credentials + prompt injection):
- US SSN, IBANs (with mod-97 checksum validation), credit cards (Luhn-checked), international phone numbers, US passports
- AWS keys, GitHub/GitLab tokens, Slack tokens, Stripe keys, JWTs, PEM private keys, Azure credentials
- Prompt injection markers: role switches, jailbreak attempts, exfiltration patterns
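To make the checksum step concrete, here is a minimal sketch of how a regex hit can be confirmed before it counts as a finding. It is not Aira's implementation: the function names and the candidate regex are assumptions made for illustration, and only the Luhn and mod-97 arithmetic follows the published algorithms.

```python
import re

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` has a plausible shape and passes ISO 7064 mod-97."""
    s = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


def find_card_numbers(text: str) -> list:
    """Return digit-only card candidates from `text` that pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group(0))
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The point of the checksum is precision: a 16-digit order number matches the candidate regex but almost always fails Luhn, so it never becomes a finding.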
NER (Microsoft Presidio + spaCy):
- Person names, locations, organizations, dates, phone numbers, email addresses
- Catches what regex misses — “Dr. Schmidt at the Charité” has no SSN but still contains PHI
Healthcare patterns (16 additional):
- MRN (4 variants), NPI (Luhn-checked with 80840 prefix), DEA numbers (checksum-validated)
- ICD-10 codes, drug names, health plan beneficiary IDs, biometric identifiers
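The two healthcare checksums mentioned above can be sketched in a few lines. This is an illustration under assumptions, not Aira's code: the function names are hypothetical and the DEA format check is simplified to two leading characters plus seven digits.

```python
import re


def _luhn_valid(digits: str) -> bool:
    """Luhn checksum over a string that contains only digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            # Double every second digit from the right; subtract 9 if it overflows.
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """A 10-digit NPI is valid if Luhn passes once the 80840 prefix is prepended."""
    if not re.fullmatch(r"\d{10}", npi):
        return False
    return _luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """Simplified DEA check: two leading characters, then seven digits.

    The last digit must equal the final digit of
    (d1 + d3 + d5) + 2 * (d2 + d4 + d6).
    """
    m = re.fullmatch(r"[A-Za-z][A-Za-z9](\d{7})", dea)
    if not m:
        return False
    d = [int(c) for c in m.group(1)]
    total = d[0] + d[2] + d[4] + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]
```

For example, `npi_valid("1234567893")` and `dea_valid("AB1234563")` both pass, while changing the final digit of either makes the check fail.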
Four Modes
- Flag — Detect and report, don't modify. For monitoring.
- Block — If critical PII found, reject the request entirely. Data never reaches the LLM.
- Redact — Replace sensitive spans with [REDACTED]. For text. Black-box for images/PDFs/DICOM.
- Tokenize — Replace with reversible tokens (PERSON_001, SSN_001). The LLM sees tokens, you map back to originals after.
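The difference between the four modes is easiest to see side by side. The toy below is a sketch, not Aira's engine: it detects only US SSNs with one regex, and `apply_mode`, `detokenize`, `Result`, and `BlockedError` are hypothetical names invented for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
TOKEN_RE = re.compile(r"\b[A-Z]+_\d{3}\b")


class BlockedError(Exception):
    """Raised in block mode when a finding is present."""


@dataclass
class Result:
    clean: Optional[str]
    findings: list
    mapping: dict = field(default_factory=dict)  # token -> original value


def apply_mode(text: str, mode: str) -> Result:
    findings = SSN_RE.findall(text)
    if mode == "flag":
        # Report only; the text passes through unchanged.
        return Result(clean=text, findings=findings)
    if mode == "block":
        if findings:
            raise BlockedError(f"{len(findings)} finding(s); request rejected")
        return Result(clean=text, findings=findings)
    if mode == "redact":
        return Result(clean=SSN_RE.sub("[REDACTED]", text), findings=findings)
    if mode == "tokenize":
        mapping = {}
        by_value = {}

        def repl(match):
            value = match.group(0)
            if value not in by_value:
                # The same value always maps to the same token.
                token = f"SSN_{len(by_value) + 1:03d}"
                by_value[value] = token
                mapping[token] = value
            return by_value[value]

        return Result(clean=SSN_RE.sub(repl, text), findings=findings, mapping=mapping)
    raise ValueError(f"unknown mode: {mode}")


def detokenize(text: str, mapping: dict) -> str:
    """Swap tokens in an LLM response back to the original values."""
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

In this sketch the mapping never leaves your side: only `clean` would be sent to the model, and `detokenize` runs on the response after it comes back.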
AI-Assisted Detection
Optional second pass using any LLM (33 models across 10 providers, or bring your own). The AI catches what rules miss: implied conditions (“the diabetic patient on the third floor”), OCR errors in scanned documents, and context-dependent PII that regex can't understand.
Every Scan Gets a Receipt
Every sanitize operation, text or file, produces an Ed25519-signed receipt with an RFC 3161 timestamp. The receipt commits to the input hash, output hash, findings count, mode, and policy: tamper-evident proof that you scanned before sending.
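As a rough picture of what "commits to" means, the sketch below builds the bytes that such a receipt could sign. Everything specific here is an assumption: the field names, the choice of SHA-256, and the canonical JSON encoding are illustrative, and the Ed25519 signature and RFC 3161 timestamp steps are left out entirely.

```python
import hashlib
import json


def receipt_payload(input_bytes: bytes, output_bytes: bytes,
                    findings_count: int, mode: str, policy: str) -> bytes:
    """Build a deterministic byte string covering the fields a receipt commits to.

    Hypothetical layout: SHA-256 digests plus scan metadata, serialized as
    JSON with sorted keys so the same scan always yields the same bytes.
    """
    body = {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
        "findings_count": findings_count,
        "mode": mode,
        "policy": policy,
    }
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

Because the payload contains only hashes and metadata, a receipt built this way could be stored or shared with an auditor without exposing the original content. Changing either the input or the output changes the payload, so a signature over it would no longer verify.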
Integration
from aira import Aira

aira = Aira(api_key="aira_live_...")

# Text
result = aira.sanitize(
    content="Patient Maria Schmidt, SSN 234-56-7890",
    mode="redact",
    policy="hipaa",
)
# result.clean = "Patient [REDACTED], SSN [REDACTED]"

# File
with open("scan.dcm", "rb") as f:
    result = aira.sanitize_file(
        file=f,
        mode="redact",
        policy="hipaa",
        include_pixel_redaction=True,
    )
# result.download_url → de-identified DICOM

Or use the Gateway: every LLM call is automatically scanned on input and output. No code changes.