June 1, 2026 · 5 min read

Sanitize: Stop PII from Reaching Your LLM

Your agent is about to send a customer's SSN, medical record, and home address to Claude. Aira catches it in under 5 milliseconds.

The Problem

Every AI application handles sensitive data. Patient records in healthcare. Financial details in banking. Employee PII in HR tools. The data flows through prompts, embeddings, and tool calls — often reaching third-party LLM APIs with no filtering.

GDPR says you need a legal basis to process personal data. HIPAA says PHI must be de-identified before you share it with anyone not covered by a business associate agreement. Your customers ask “does our data reach your AI provider?” and you say “we have a DPA.” That's not enough.

Four Content Types, One API

Aira Sanitize handles everything your agents process:

  • Text — Prompts, chat messages, structured data. Scanned in-process, under 5ms.
  • Images — Screenshots, photos, scanned documents. OCR + pixel-level redaction.
  • PDFs — Contracts, reports, medical records. In-place black-box redaction via PyMuPDF.
  • DICOM — Medical imaging. PS3.15 Annex E metadata scrubbing + burned-in text redaction. Conformance tags included.

43 Detection Patterns

Three layers of detection, running in parallel:

Regex patterns (27, covering PII, credentials, and prompt injection):

  • US SSN, IBANs (with mod-97 checksum validation), credit cards (Luhn-checked), international phone numbers, US passports
  • AWS keys, GitHub/GitLab tokens, Slack tokens, Stripe keys, JWTs, PEM private keys, Azure credentials
  • Prompt injection markers: role switches, jailbreak attempts, exfiltration patterns
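To make the checksum step concrete, here is a minimal sketch of Luhn and IBAN mod-97 validation in plain Python. It is not Aira's implementation: the function names and the candidate regex are assumptions made for illustration. The point is that a regex only proposes candidates, and the checksum throws out digit runs that merely look like a card number or an IBAN.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check."""
    s = re.sub(r"\s+", "", iban).upper()
    # Country code, two check digits, then 11 to 30 alphanumerics.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


# Hypothetical candidate pattern: 13 to 19 digits with optional separators.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_card_numbers(text: str) -> list[str]:
    """Keep only the regex candidates that also pass the Luhn check."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text)
            if luhn_valid(m.group())]
```

With this sketch, `find_card_numbers` keeps the well-known test number `4111 1111 1111 1111` and drops a 16-digit order number that fails Luhn, which is how checksum validation cuts false positives.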

NER (Microsoft Presidio + spaCy):

  • Person names, locations, organizations, dates, phone numbers, email addresses
  • Catches what regex misses — “Dr. Schmidt at the Charité” has no SSN but still contains PHI

Healthcare patterns (16 additional):

  • MRN (4 variants), NPI (Luhn-checked with 80840 prefix), DEA numbers (checksum-validated)
  • ICD-10 codes, drug names, health plan beneficiary IDs, biometric identifiers
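The two healthcare checksums above are simple enough to sketch. For an NPI, the Luhn check runs over the 10-digit number with the 80840 prefix prepended. For a DEA number, the last digit of (1st + 3rd + 5th digits) plus twice (2nd + 4th + 6th digits) must equal the 7th digit. The code below is an illustration under simplifying assumptions, not Aira's code: function names are made up, and the DEA pattern does not restrict the leading letter to valid registrant types.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Luhn checksum over a string that contains only digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            # Doubling, then subtracting 9 when the result exceeds 9.
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0


def npi_valid(npi: str) -> bool:
    """A 10-digit NPI is valid if '80840' + NPI passes Luhn."""
    if not re.fullmatch(r"\d{10}", npi):
        return False
    return luhn_valid("80840" + npi)


def dea_valid(dea: str) -> bool:
    """Two leading characters, then seven digits with a check digit."""
    # Simplified shape: any letter, then a letter or 9, then 7 digits.
    m = re.fullmatch(r"[A-Z][A-Z9](\d{7})", dea.upper())
    if not m:
        return False
    d = [int(ch) for ch in m.group(1)]
    checksum = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return checksum % 10 == d[6]
```

For example, `npi_valid("1234567893")` and `dea_valid("AB1234563")` both return `True`, while changing the final digit of either makes the check fail.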

Four Modes

  • Flag — Detect and report, don't modify. For monitoring.
  • Block — If critical PII found, reject the request entirely. Data never reaches the LLM.
  • Redact — Replace sensitive spans with [REDACTED]. For text. Black-box for images/PDFs/DICOM.
  • Tokenize — Replace with reversible tokens (PERSON_001, SSN_001). The LLM sees tokens, you map back to originals after.
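The four modes are easiest to compare side by side. The sketch below applies each one to a list of already-detected spans. Everything in it is hypothetical: the `Finding` class, `apply_mode`, and `detokenize` are illustrative names, not part of the Aira SDK, and the code assumes findings do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    start: int            # index of the first character of the span
    end: int              # index one past the last character
    label: str            # e.g. "PERSON" or "SSN"
    critical: bool = False


def apply_mode(text, findings, mode):
    """Return (clean_text, token_map); clean_text is None when blocked."""
    if mode == "flag":
        return text, {}                      # report only, leave text as-is
    if mode == "block":
        blocked = any(f.critical for f in findings)
        return (None if blocked else text), {}
    if mode not in ("redact", "tokenize"):
        raise ValueError(f"unknown mode: {mode!r}")

    counters, token_map, pieces, cursor = {}, {}, [], 0
    for f in sorted(findings, key=lambda f: f.start):
        pieces.append(text[cursor:f.start])  # untouched text before the span
        if mode == "redact":
            pieces.append("[REDACTED]")
        else:
            # Number tokens per label: PERSON_001, PERSON_002, SSN_001, ...
            counters[f.label] = counters.get(f.label, 0) + 1
            token = f"{f.label}_{counters[f.label]:03d}"
            token_map[token] = text[f.start:f.end]
            pieces.append(token)
        cursor = f.end
    pieces.append(text[cursor:])
    return "".join(pieces), token_map


def detokenize(text, token_map):
    """Map tokens in an LLM response back to the original values."""
    # Longest tokens first, so a short token never clobbers a longer one.
    for token in sorted(token_map, key=len, reverse=True):
        text = text.replace(token, token_map[token])
    return text
```

In tokenize mode the token map never leaves your side: the LLM only sees strings like `PERSON_001`, and `detokenize` restores the originals in its response.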

AI-Assisted Detection

Optional second pass using any LLM (33 models across 10 providers, or bring your own). The AI catches what rules miss: implied conditions (“the diabetic patient on the third floor”), OCR errors in scanned documents, and context-dependent PII that regex can't understand.

Every Scan Gets a Receipt

Every sanitize operation, text or file, produces an Ed25519-signed receipt with an RFC 3161 timestamp. The receipt commits to the input hash, output hash, findings count, mode, and policy. The result is tamper-evident proof that you scanned before sending.
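To see what such a receipt commits to, here is a rough sketch of building the payload and reducing it to a single digest. The field names, the choice of SHA-256, and the canonical JSON encoding are all assumptions for illustration; the actual receipt format may differ. The Ed25519 signature and the RFC 3161 timestamp are left out because they need a crypto library and a timestamp authority. In a real flow they would be applied over a digest like the one computed here.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()


def build_receipt_payload(original: str, clean: str, findings_count: int,
                          mode: str, policy: str) -> dict:
    """Collect the committed fields (hypothetical field names)."""
    return {
        "input_hash": sha256_hex(original.encode("utf-8")),
        "output_hash": sha256_hex(clean.encode("utf-8")),
        "findings_count": findings_count,
        "mode": mode,
        "policy": policy,
    }


def canonical_bytes(payload: dict) -> bytes:
    """Serialize deterministically so equal payloads give equal bytes."""
    # Sorted keys and fixed separators remove ordering and spacing variance.
    return json.dumps(payload, sort_keys=True,
                      separators=(",", ":")).encode("utf-8")


def payload_digest(payload: dict) -> str:
    """The value a signature or timestamp token would be computed over."""
    return sha256_hex(canonical_bytes(payload))
```

Because only hashes go into the payload, the receipt can be stored or shared without exposing the scanned content, and any later change to the input, output, or policy produces a different digest.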

Integration

from aira import Aira

aira = Aira(api_key="aira_live_...")

# Text
result = aira.sanitize(
    content="Patient Maria Schmidt, SSN 234-56-7890",
    mode="redact",
    policy="hipaa",
)
# result.clean = "Patient [REDACTED], SSN [REDACTED]"

# File
with open("scan.dcm", "rb") as f:
    result = aira.sanitize_file(
        file=f,
        mode="redact",
        policy="hipaa",
        include_pixel_redaction=True,
    )
# result.download_url → de-identified DICOM

Or use the Gateway — every LLM call is automatically scanned on input and output. No code changes.