June 10, 2026 · 8 min read
Open-Source Models for AI Governance: A Practical Guide
Not every organization can send governance decisions to a cloud API. Healthcare providers with HIPAA constraints, defense contractors with ITAR restrictions, European banks with data residency requirements — they all need governance that runs on their infrastructure, with their models. Aira's BYOM (Bring Your Own Model) feature makes this possible.
Why Self-Hosted Models for Governance
When an AI governance platform evaluates a decision, it processes the full action context: the agent's input, the decision details, the policy being evaluated, and potentially sensitive metadata. For a lending decision, that includes income, credit scores, and loan amounts. For a medical triage, it includes patient data. For a trade execution, it includes portfolio positions.
Sending this data to a cloud LLM provider creates three problems:
- Data residency — EU organizations subject to GDPR must ensure personal data stays within the EU. Most cloud LLM providers process requests in the US. Even providers with EU regions may route data through US infrastructure for training or debugging.
- Regulatory constraints — HIPAA, ITAR, CJIS, and other frameworks restrict where certain data can be processed. Cloud LLM APIs typically don't meet these requirements without expensive BAAs and custom configurations.
- Vendor dependency — relying on a cloud provider for governance means your governance layer is subject to their uptime, their pricing changes, their terms of service, and their model deprecation decisions.
Self-hosted open-source models solve all three. The data never leaves your infrastructure. You control the model, the hardware, and the availability. No vendor can change the model underneath you.
Verified Open-Source Models
Aira maintains a registry of verified open-source models that have been tested for governance workloads. Verification means the model has been evaluated against Aira's governance benchmark suite: policy evaluation accuracy, verdict consistency, reasoning quality, and tool calling reliability.
| Model | Parameters | License | Min. GPU | Governance score |
|---|---|---|---|---|
| Llama 4 Maverick | 400B (17B active) | Llama 4 | 1x H100 80GB | 92/100 |
| Llama 4 Scout | 109B (17B active) | Llama 4 | 1x A100 80GB | 88/100 |
| Gemma 4 27B | 27B | Gemma | 1x A100 40GB | 86/100 |
| Qwen3-72B | 72B | Apache 2.0 | 1x H100 80GB | 90/100 |
| Qwen3-32B | 32B | Apache 2.0 | 1x A100 40GB | 85/100 |
| DeepSeek-R2-70B | 70B | MIT | 1x H100 80GB | 87/100 |
The governance score measures policy evaluation accuracy (does the model correctly apply the policy to the action?), verdict consistency (does it give the same verdict for the same input?), reasoning quality (does it explain its decision clearly?), and tool calling reliability (can it use structured output formats?).
All verified models score above 85/100 on governance workloads. For comparison, leading cloud models (Claude Sonnet 4.6, GPT-5.2) score 94-96/100. The gap is narrowing with each model generation.
Inference Backends: vLLM, Ollama, TGI
Aira doesn't run inference itself. It connects to your existing inference backend via the OpenAI-compatible API format. This means any backend that exposes a /v1/chat/completions endpoint works with Aira.
vLLM (Recommended for Production)
vLLM is the gold standard for production LLM inference. It supports continuous batching, PagedAttention, tensor parallelism, and speculative decoding. For governance workloads with bursty traffic patterns (many authorize calls in quick succession), vLLM's batching is essential.
# Start vLLM with Qwen3-72B
vllm serve Qwen/Qwen3-72B \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--port 8001
# Register in Aira dashboard:
# Provider: custom
# Base URL: http://gpu-server:8001/v1
# Model ID: Qwen/Qwen3-72B
# API key: (optional, if you configured vLLM auth)
#
# Aira sends standard OpenAI-format requests to your vLLM server.
# No data leaves your network.Ollama (Recommended for Development)
Ollama is the fastest path from zero to running inference. One binary, one command, and you have a model running locally. Ideal for development, testing, and small-scale deployments.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Qwen3-32B
ollama pull qwen3:32b
ollama serve # Starts on port 11434
# Register in Aira dashboard:
# Provider: custom
# Base URL: http://localhost:11434/v1
# Model ID: qwen3:32b
#
# Ollama exposes an OpenAI-compatible API out of the box.
# Perfect for local development and testing governance policies.TGI (Text Generation Inference)
Hugging Face's TGI is a production-grade inference server with built-in support for quantization, grammar-constrained decoding, and speculative decoding. It's the default inference backend on Hugging Face's Inference Endpoints.
# Start TGI with Gemma 4 27B
docker run --gpus all \
-p 8002:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id google/gemma-4-27b-it \
--max-input-tokens 4096 \
--max-total-tokens 8192
# Register in Aira dashboard:
# Provider: custom
# Base URL: http://gpu-server:8002/v1
# Model ID: google/gemma-4-27b-itTool Calling Support
Aira's policy engine uses structured output — specifically, tool calling — to extract verdicts and reasoning from model responses. The model receives a tool definition for governance_verdict and must call it with a structured response containing the verdict (APPROVE, DENY, or REVIEW) and reasoning.
Not all open-source models support tool calling reliably. The verified models listed above have all been tested for tool calling accuracy. Key requirements:
- Structured output format — the model must return valid JSON matching the tool schema
- Tool selection — when given a tool definition, the model must call it rather than responding with free text
- Schema adherence — the model must respect required fields and enum constraints (e.g., verdict must be one of APPROVE, DENY, REVIEW)
# Aira sends this tool definition to the evaluating model:
#
# tools: [{
# "type": "function",
# "function": {
# "name": "governance_verdict",
# "description": "Evaluate the action against the policy and return a verdict.",
# "parameters": {
# "type": "object",
# "properties": {
# "verdict": {
# "type": "string",
# "enum": ["APPROVE", "DENY", "REVIEW"]
# },
# "reasoning": {
# "type": "string",
# "description": "Explain why you reached this verdict."
# },
# "confidence": {
# "type": "number",
# "minimum": 0,
# "maximum": 1
# }
# },
# "required": ["verdict", "reasoning", "confidence"]
# }
# }
# }]
#
# Model response (tool call):
# {
# "verdict": "DENY",
# "reasoning": "Loan amount exceeds 2x annual income threshold...",
# "confidence": 0.94
# }
#
# If the model returns free text instead of a tool call,
# Aira falls back to regex extraction. But tool calling
# is more reliable and recommended.vLLM supports tool calling via the --enable-auto-tool-choice flag. Ollama supports it natively for compatible models. TGI supports it via grammar-constrained decoding. All three backends produce consistent, schema-valid responses with the verified models.
Self-Hosted Data Sovereignty
When you use BYOM with a self-hosted inference backend, the data flow is entirely within your infrastructure:
- Your agent calls
aira.authorize()with the action context - Aira's policy engine evaluates matching policies
- For AI-mode or consensus-mode policies, Aira routes the evaluation to your self-hosted model
- Your model processes the request on your hardware
- The model's response returns to Aira
- Aira mints a cryptographic receipt and returns the authorization result
At no point does the action context leave your network for model evaluation. The only data that passes through Aira's infrastructure is the receipt metadata (action type, verdict, receipt ID) — not the sensitive action details. And even this can be minimized by configuring Aira in "receipt-only" mode, where action details are hashed but not stored.
# Data sovereignty configuration in the Aira dashboard:
#
# Organization Settings → Data Handling
#
# ┌─────────────────────────────────────────┐
# │ Action detail storage: [Hash only] │ ← Only SHA-256 hash stored
# │ Model routing: [Self-hosted only] │ ← No cloud model calls
# │ Receipt storage: [Your infrastructure] │ ← Receipts stored in your DB
# │ Inference endpoint: [Internal URL] │ ← Never leaves your network
# └─────────────────────────────────────────┘
#
# With these settings:
# - Sensitive data never leaves your infrastructure
# - Model inference happens on your GPUs
# - Receipts are stored in your database
# - Aira only sees hashed action metadata
# - Full EU AI Act compliance maintainedHybrid Configuration: Cloud + Self-Hosted
Most organizations don't go fully self-hosted from day one. Aira supports hybrid configurations where some policies use cloud models and others use self-hosted models. This is particularly useful for consensus, where you want provider diversity.
# Consensus policy with hybrid model configuration:
#
# Policy: "High-value financial decisions require 3-model consensus"
#
# Model 1: Claude Sonnet 4.6 (Anthropic cloud)
# → Used for: general reasoning, policy interpretation
# → Data handling: action details redacted, only policy + verdict sent
#
# Model 2: GPT-5.2 (OpenAI cloud)
# → Used for: structured analysis, numerical evaluation
# → Data handling: action details redacted
#
# Model 3: Qwen3-72B (self-hosted vLLM)
# → Used for: independent evaluation with full action context
# → Data handling: full context, never leaves your network
#
# Aira redacts sensitive fields before sending to cloud models.
# The self-hosted model receives the complete, unredacted context.
# All three verdicts are combined in the consensus score.This hybrid approach gives you provider diversity for consensus (three different models from three different sources) while keeping the most sensitive data on your own infrastructure. The cloud models evaluate a redacted version of the action; the self-hosted model evaluates the full version.
Hardware Requirements
Governance workloads are bursty but not sustained. An authorize call sends a short prompt (action context + policy) and expects a short response (verdict + reasoning). Input tokens are typically 500-2,000; output tokens are 100-300. This is much less demanding than general-purpose chat or document processing.
- Qwen3-32B / Gemma 4 27B — 1x A100 40GB or 1x L40S 48GB. ~50 requests/second with vLLM continuous batching. Sufficient for most governance workloads.
- Qwen3-72B / DeepSeek-R2-70B — 1x H100 80GB or 2x A100 80GB with tensor parallelism. ~30 requests/second. For high-stakes evaluations requiring larger models.
- Llama 4 Maverick — 1x H100 80GB (17B active parameters with MoE). ~40 requests/second. Best governance score among open-source models.
For development and testing, Ollama runs quantized versions of these models on consumer hardware. Qwen3-32B at Q4_K_M quantization runs on an M3 Max MacBook with 64GB RAM, producing governance verdicts in 2-3 seconds.
Getting Started
# 1. Set up your inference backend
ollama pull qwen3:32b && ollama serve
# 2. Install the Aira SDK
pip install aira-sdk
# 3. Register your model in the dashboard
# Provider: custom
# Base URL: http://localhost:11434/v1
# Model ID: qwen3:32b
# 4. Create a policy that uses your model
# Policy mode: AI or Consensus
# Model: qwen3:32b (your self-hosted model)
# 5. Govern your agent's actions
from aira import Aira
aira = Aira(api_key="aira_live_xxx")
auth = aira.authorize(
action_type="patient_triage",
details="Triage patient P-1102. Symptoms: chest pain, shortness of breath.",
agent_id="triage-agent",
model_id="qwen3:32b", # Your self-hosted model
)
# Policy evaluation runs on YOUR hardware.
# Patient data never leaves YOUR infrastructure.
# Cryptographic receipt proves governance happened.Your models. Your hardware. Your data. Full governance with full sovereignty. Open-source models make it possible to comply with the EU AI Act, HIPAA, and data residency requirements without sending a single byte to a cloud LLM provider.