June 10, 2026 · 6 min read

Zero-Downtime AI Governance: How We Deploy Aira

The number one objection to governance infrastructure is latency. "We can't add another hop to our AI pipeline." Fair concern. Here's how Aira adds governance with about 95 microseconds of overhead for rules-mode policies and deploys updates with zero downtime: no request dropped, no connection reset, no maintenance window.

The Latency Myth

Governance platforms have a reputation for being slow. Policy engines that add 200ms to every request. Compliance checks that time out under load. Approval workflows that block pipelines for minutes.

This reputation is deserved — for platforms that treat governance as an afterthought. If you bolt on policy evaluation as an HTTP middleware that calls out to an external rules engine, you're adding a network round-trip to every decision. That's 5-50ms depending on proximity, plus serialization overhead, plus the evaluation itself.

Aira takes a different approach. The governance layer is colocated with the API. Policy evaluation for rules-mode policies happens in-process, in memory, with no network call. The entire authorize flow (policy matching, rules evaluation, receipt generation, and Ed25519 signing) typically completes in under 100 microseconds for rules-mode policies.

# Latency breakdown for a rules-mode authorize() call:
#
# Policy matching:        12μs  (hash lookup on action_type + agent_id)
# Rules evaluation:       18μs  (condition tree traversal)
# Receipt generation:     15μs  (SHA-256 hash + payload assembly)
# Ed25519 signing:        48μs  (HSM signing via PKCS#11)
# Response serialization:  2μs  (orjson)
# ──────────────────────────────
# Total:                  95μs
#
# For context: a single DNS lookup is ~1-50ms.
# Aira's governance overhead is invisible.
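The first two rows of the breakdown (a hash lookup followed by a condition tree traversal) can be sketched in a few lines of Python. This is a minimal illustration, not Aira's engine: the `leaf`, `all_of`, `any_of`, and `Policy` names, the context fields, and the deny-by-default behavior are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical condition-tree representation; Aira's real policy format
# is not shown in this post.
Condition = Callable[[Dict[str, Any]], bool]


def leaf(field: str, op: str, value: Any) -> Condition:
    """Compile one comparison into a closure evaluated against the request context."""
    ops = {
        "eq": lambda a, b: a == b,
        "lt": lambda a, b: a < b,
        "gt": lambda a, b: a > b,
    }
    compare = ops[op]
    # A missing field fails the condition instead of raising.
    return lambda ctx: field in ctx and compare(ctx[field], value)


def all_of(*children: Condition) -> Condition:
    return lambda ctx: all(child(ctx) for child in children)


def any_of(*children: Condition) -> Condition:
    return lambda ctx: any(child(ctx) for child in children)


@dataclass(frozen=True)
class Policy:
    name: str
    allow_if: Condition


# Policies are keyed by (action_type, agent_id), so matching is one dict lookup.
POLICIES: Dict[Tuple[str, str], Policy] = {
    ("trade_execution", "trading-agent"): Policy(
        name="small-equity-trades",
        allow_if=all_of(
            leaf("quantity", "lt", 1000),
            any_of(leaf("symbol", "eq", "AAPL"), leaf("symbol", "eq", "MSFT")),
        ),
    ),
}


def authorize(action_type: str, agent_id: str, ctx: Dict[str, Any]) -> bool:
    policy = POLICIES.get((action_type, agent_id))
    if policy is None:
        return False  # assumption: deny when no policy matches
    return policy.allow_if(ctx)
```

Because the tree is built once when the policy is registered, each request only pays for the dict lookup and a handful of closure calls, with no I/O on the path.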

AI-mode and consensus-mode policies are slower because they involve LLM calls. But the LLM latency is inherent to the evaluation, not to Aira's infrastructure. And these modes are reserved for high-stakes decisions where 1-3 seconds of evaluation time is acceptable precisely because the decision is consequential.

Architecture: Why It's Fast

Three design decisions keep Aira's overhead near zero:

1. In-process policy engine: Rules-mode policies are compiled into an in-memory condition tree when created or updated. No database query, no network call, no external engine. The evaluation is a tree traversal in the same process handling the request.
2. Async receipt minting: Ed25519 signing is synchronous (48μs), but RFC 3161 timestamping and Merkle tree insertion happen asynchronously after the response is sent. The client gets the receipt ID immediately; the timestamp token is attached within seconds.
3. Connection pooling: PostgreSQL and Redis connections are pooled per worker. No connection setup overhead per request. Write-ahead logging to Redis means receipt persistence doesn't block the response path.
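The async receipt minting pattern can be sketched with `asyncio`: sign synchronously, return the receipt, and let a background task attach the timestamp later. This is an illustration under stated assumptions, not Aira's code. HMAC-SHA256 stands in for the HSM-backed Ed25519 signature so the example stays within the standard library, the timestamp authority call is simulated with a short sleep, and `mint_receipt`, `attach_timestamp`, and `RECEIPTS` are hypothetical names.

```python
import asyncio
import hashlib
import hmac
import json
import uuid
from typing import Dict, Set, Tuple

SIGNING_KEY = b"demo-key"  # stand-in; the post describes Ed25519 keys held in an HSM
RECEIPTS: Dict[str, dict] = {}  # stand-in for durable receipt storage
_background: Set[asyncio.Task] = set()  # keeps references so tasks are not garbage collected


def sign(payload: bytes) -> str:
    # HMAC-SHA256 is a stdlib substitute for the Ed25519 signature, used only for this sketch.
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


async def attach_timestamp(receipt_id: str) -> None:
    await asyncio.sleep(0.01)  # simulates the RFC 3161 timestamp authority round-trip
    RECEIPTS[receipt_id]["timestamp_token"] = "tsa-token-placeholder"


async def mint_receipt(decision: dict) -> dict:
    payload = json.dumps(decision, sort_keys=True).encode()
    receipt_id = str(uuid.uuid4())
    receipt = {
        "receipt_id": receipt_id,
        "payload_hash": hashlib.sha256(payload).hexdigest(),
        "signature": sign(payload),  # synchronous: on the response path
        "timestamp_token": None,     # filled in later, off the response path
    }
    RECEIPTS[receipt_id] = receipt
    task = asyncio.create_task(attach_timestamp(receipt_id))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return dict(receipt)  # snapshot the client sees immediately


async def demo() -> Tuple[dict, dict]:
    returned = await mint_receipt({"action": "trade_execution", "allowed": True})
    await asyncio.gather(*_background)  # wait for the deferred work to finish
    return returned, RECEIPTS[returned["receipt_id"]]
```

Running `asyncio.run(demo())` returns the snapshot handed to the client, which has no timestamp token yet, alongside the stored receipt after the background task has attached one.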

Docker-Rollout: Blue-Green Without Kubernetes

Aira runs on dedicated infrastructure, not Kubernetes. We made this choice deliberately. Kubernetes adds operational complexity that doesn't pay for itself at our scale. Instead, we use docker-rollout, a Docker CLI plugin for Docker Compose services that implements blue-green deployments with zero downtime.

# Deploy flow:
#
# 1. Pull new image
docker compose pull api
#
# 2. docker-rollout scales up new containers alongside old ones
docker rollout api
#
# What happens under the hood:
# a) New containers start (api_new_1, api_new_2)
# b) Traefik health checks confirm new containers are ready
# c) Traefik shifts traffic to new containers
# d) Old containers receive SIGTERM
# e) Old containers drain existing connections (30s grace period)
# f) Old containers stop
#
# Total: ~15 seconds. Zero dropped requests.

The key insight is that docker-rollout doesn't stop old containers until new ones are healthy and receiving traffic. There's always at least one healthy container serving requests. No maintenance window. No "please try again in 5 minutes."
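The ordering guarantee can be modeled as a small simulation. This is not how docker-rollout is implemented; it is a toy model of the sequence described above, with a hypothetical `Container` record and an `is_healthy` callback standing in for real health checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Container:
    name: str
    healthy: bool = False
    serving: bool = False
    running: bool = True


def rollout(
    old: List[Container],
    new_names: List[str],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[Container], List[str]]:
    """Return the containers serving traffic after the rollout, plus an event log."""
    events: List[str] = []
    new = [Container(name) for name in new_names]
    events.append("new containers started alongside old ones")

    # Old containers are untouched until every new container reports healthy.
    for c in new:
        c.healthy = is_healthy(c.name)
    if not all(c.healthy for c in new):
        for c in new:
            c.running = False
        events.append("aborted: old containers keep serving")
        return old, events

    for c in new:
        c.serving = True
    events.append("traffic shifted to new containers")

    # Only now are the old containers drained and stopped.
    for c in old:
        c.serving = False
        c.running = False
    events.append("old containers drained and stopped")
    return new, events
```

In both branches the returned list contains at least one container that is serving, which is the property that makes the deploy safe to run at any time.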

Traefik: Health-Aware Routing

Traefik is our reverse proxy and load balancer. It's the front door for every API request. Its role in zero-downtime deploys is critical: it decides which containers receive traffic based on health checks, not container age.

# Container health check (Docker Compose syntax); Traefik routes only to containers reported healthy:
#
# healthcheck:
#   test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
#   interval: 5s
#   timeout: 3s
#   retries: 3
#   start_period: 10s
#
# During a deploy:
# - Old containers: health check passes → continues receiving traffic
# - New containers: health check fails during startup → no traffic
# - New containers: health check passes → starts receiving traffic
# - Old containers: SIGTERM received → health check starts failing
# - Old containers: existing connections drain → container stops
#
# Traefik never sends traffic to an unhealthy container.

The health endpoint isn't a simple "return 200." It verifies database connectivity, Redis connectivity, and HSM availability. If the new container can't reach the signing HSM, it fails the health check and never receives traffic. The old containers keep running. The deploy rolls back automatically.
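A deep health check of this kind can be sketched as a function that runs one probe per dependency and reports 503 if any of them fails. The probe names in the usage note are hypothetical, and the real endpoint's response format is not shown in this post.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def health(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe; return (HTTP status, per-dependency result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a probe that raises counts as a failed dependency
        results[name] = "ok" if ok else "fail"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

A service would pass in its own probes, for example `health({"database": ping_database, "redis": ping_redis, "hsm": hsm_available})`, where those three callables are placeholders for real connectivity checks. Returning the per-dependency map alongside the status makes a failed deploy easier to diagnose from the health check log.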

Connection Draining: No Request Left Behind

The hardest part of zero-downtime deployment isn't starting new containers — it's stopping old ones without dropping in-flight requests. A client might be mid-way through an authorize call when the old container receives SIGTERM.

Aira handles this with graceful shutdown:

1. SIGTERM received: the container stops accepting new connections
2. In-flight requests complete: existing requests are allowed to finish (up to 30 seconds)
3. Background tasks flush: pending receipt timestamps and Merkle insertions are flushed to durable storage
4. Connections close: database and Redis connections are cleanly closed
5. Container exits: clean exit code 0

The 30-second grace period is generous. The p99 request latency for rules-mode policies is 2ms. Even consensus-mode policies with 3 LLM calls complete in under 10 seconds. The grace period exists to handle edge cases — slow LLM responses, network hiccups, large batch operations.
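The shutdown sequence can be sketched with `asyncio`. This is a simplified model rather than Aira's server code: `GracefulWorker` is a hypothetical class, the flush and close steps are represented only by log entries, and a real service would trigger `shutdown()` from a SIGTERM handler (for example via `loop.add_signal_handler` on Unix).

```python
import asyncio
from typing import Any, Callable, Coroutine, List, Set, Tuple


class GracefulWorker:
    def __init__(self, grace_seconds: float = 30.0) -> None:
        self.grace_seconds = grace_seconds
        self.accepting = True
        self.in_flight: Set[asyncio.Task] = set()
        self.log: List[str] = []

    def submit(self, handler: Callable[[], Coroutine[Any, Any, None]]) -> bool:
        """Start handling a request; refuse it once shutdown has begun."""
        if not self.accepting:
            return False
        task = asyncio.create_task(handler())
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return True

    async def shutdown(self) -> int:
        # Step 1: stop accepting new work.
        self.accepting = False
        self.log.append("stopped accepting")
        # Step 2: let in-flight requests finish, bounded by the grace period.
        if self.in_flight:
            _, pending = await asyncio.wait(
                set(self.in_flight), timeout=self.grace_seconds
            )
            for task in pending:
                task.cancel()  # grace period exhausted
        self.log.append("in-flight drained")
        # Steps 3 and 4 would flush background work and close pools here.
        self.log.append("background flushed")
        self.log.append("connections closed")
        # Step 5: report a clean exit code.
        return 0


async def demo() -> Tuple[int, List[str], bool]:
    worker = GracefulWorker(grace_seconds=1.0)
    finished: List[str] = []

    async def request() -> None:
        await asyncio.sleep(0.05)
        finished.append("done")

    worker.submit(request)
    exit_code = await worker.shutdown()
    rejected = not worker.submit(request)  # new work is refused after shutdown
    return exit_code, finished, rejected
```

In `demo()`, the request submitted before shutdown still runs to completion, while the one submitted afterwards is refused.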

Benchmarks: Real Numbers

We run continuous load tests against production-equivalent infrastructure. Here are the numbers:

Operation	p50	p95	p99
authorize() — rules mode	82μs	110μs	2.1ms
authorize() — AI mode	1.2s	2.8s	4.1s
authorize() — consensus (3 models)	2.1s	4.5s	7.2s
notarize()	65μs	95μs	1.8ms
verify()	45μs	72μs	1.2ms

Rules-mode authorize and notarize are sub-millisecond at p95. The governance overhead is less than a single DNS resolution. AI and consensus modes are slower because they involve LLM inference, but those latencies come from the LLM providers, not Aira.
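For readers who want to reproduce this style of measurement on their own code path, here is a minimal sketch of collecting latency samples and computing percentiles with the standard library. It is not the load-testing harness behind the table above; `measure` and `percentiles` are illustrative helpers, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure(fn: Callable[[], object], runs: int = 1000) -> List[float]:
    """Time fn() repeatedly and return per-call latency in microseconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)  # ns to μs
    return samples


def percentiles(samples_us: List[float]) -> Dict[str, float]:
    # quantiles(n=100) returns 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(samples_us, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# A distribution with a small slow tail: 98 fast calls and 2 slow ones.
tail = percentiles([80.0] * 98 + [2000.0] * 2)
```

The `tail` example shows how a handful of slow calls can leave p50 and p95 untouched while moving p99 by more than an order of magnitude, the same shape as the rules-mode rows in the table.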

Deploy Frequency: Ship Fast, Stay Governed

Zero-downtime deploys aren't just about reliability — they're about velocity. When deploys are safe, you deploy more often. When you deploy more often, changes are smaller. When changes are smaller, failures are easier to diagnose and roll back.

Aira deploys to production multiple times per day. Each deploy is a blue-green rollout that takes ~15 seconds with zero dropped requests. We don't have maintenance windows. We don't send "scheduled downtime" emails. We don't batch changes into weekly releases.

This matters for governance infrastructure specifically because governance must be always-on. If the governance layer goes down, your agents either stop working (safe but disruptive) or bypass governance (fast but uncontrolled). Neither is acceptable. Zero-downtime deploys mean the governance layer is never unavailable.

What This Means for You

Adding Aira to your AI pipeline doesn't mean adding latency. Rules-mode policies typically evaluate in under 100 microseconds (82μs at p50, 110μs at p95). Deploys happen without downtime. The governance layer is always available.

The tradeoff between speed and safety is a false dichotomy. With the right architecture, governance is invisible to your users and your latency budget.

# 95μs of governance overhead. Zero downtime.
# Install the SDK first (shell command): pip install aira-sdk

from aira import Aira

aira = Aira(api_key="aira_live_xxx")

# Rules-mode authorize: ~95μs
auth = aira.authorize(
    action_type="trade_execution",
    details="Buy 100 AAPL at market",
    agent_id="trading-agent",
    model_id="gpt-5.2",
)

# ... execute trade ...

# Notarize: ~65μs
receipt = aira.notarize(
    action_uuid=auth.action_uuid,
    outcome="completed",
    outcome_details="Filled 100 AAPL at $227.40",
)

# Total governance overhead: <200μs
# Your users won't notice. Your auditors will.
