Production Deployment Guide¶

Architecture¶

graph TD
    subgraph "Your Application"
        APP["FastAPI / Flask / Django"]
    end

    subgraph "Director-AI Guardrail"
        GUARD["guard() / CoherenceScorer"]
        NLI["NLI Model<br/>(ONNX / PyTorch / CPU)"]
        CACHE["ScoreCache<br/>(LRU + TTL)"]
        STREAM["StreamingKernel<br/>(opt-in contradiction halt)"]
    end

    subgraph "Knowledge Base"
        KB["VectorGroundTruthStore"]
        CHROMA["ChromaDB / Qdrant /<br/>FAISS / Pinecone"]
    end

    subgraph "LLM Providers"
        OAI["OpenAI"]
        ANT["Anthropic"]
        LOCAL["Ollama / vLLM"]
    end

    subgraph "Observability"
        OTEL["OpenTelemetry"]
        PROM["Prometheus /v1/metrics"]
        AUDIT["AuditLog (SQLite / Postgres)"]
    end

    APP --> GUARD
    GUARD --> NLI
    GUARD --> CACHE
    GUARD --> STREAM
    GUARD --> KB
    KB --> CHROMA
    OAI --> APP
    ANT --> APP
    LOCAL --> APP
    GUARD --> OTEL
    GUARD --> PROM
    GUARD --> AUDIT

    style GUARD fill:#512da8,color:#fff
    style NLI fill:#1565c0,color:#fff
    style KB fill:#00695c,color:#fff
    style CACHE fill:#ff8f00,color:#fff

Recommended Setup¶

For a production scaffold with authenticated service defaults, tenant routing, audit stores, signed knowledge writes, and Prometheus-ready metrics:

director-ai quickstart --profile production
cd director_guard

Fill .env with the deployment API-key-to-tenant map, proxy API keys, upstream LLM URL, CORS origins, and KB signing key, then run:

docker compose up

For the bundled Prometheus profile, write the same API key to secrets/director-api-key and start:

docker compose --profile monitoring up

Before promotion, generate the local conformal-routing, trajectory rollback, multimodal temporal, federated privacy, auto-redteam defence, formal symbolic, and sustained-load hardening packets and attach them to the release evidence:

PYTHONPATH=src python -m benchmarks.conformal_routing_evidence
PYTHONPATH=src python -m benchmarks.trajectory_rollback_evidence
PYTHONPATH=src python -m benchmarks.multimodal_temporal_evidence
PYTHONPATH=src python -m benchmarks.federated_privacy_evidence
PYTHONPATH=src python -m benchmarks.auto_redteam_defence_evidence
PYTHONPATH=src python -m benchmarks.formal_symbolic_evidence
PYTHONPATH=src python -m benchmarks.sustained_load_evidence
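The seven commands above can be scripted so that a failing packet stops promotion early. The following is a minimal sketch, assuming it is run from the repository root with the benchmark modules importable under src; the helper names evidence_commands and run_all are hypothetical and not part of Director-AI.

```python
import os
import subprocess
import sys

# Module names copied from the commands above.
EVIDENCE_MODULES = [
    "benchmarks.conformal_routing_evidence",
    "benchmarks.trajectory_rollback_evidence",
    "benchmarks.multimodal_temporal_evidence",
    "benchmarks.federated_privacy_evidence",
    "benchmarks.auto_redteam_defence_evidence",
    "benchmarks.formal_symbolic_evidence",
    "benchmarks.sustained_load_evidence",
]


def evidence_commands(python=sys.executable):
    """Return one argv list per evidence packet generator."""
    return [[python, "-m", module] for module in EVIDENCE_MODULES]


def run_all(cwd="."):
    """Run every generator in order; raise CalledProcessError on the first failure."""
    env = dict(os.environ, PYTHONPATH="src")
    for argv in evidence_commands():
        subprocess.run(argv, cwd=cwd, env=env, check=True)
```

Calling run_all() from a release script gives the same result as running the commands by hand, with a non-zero exit surfacing as an exception.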

Customer Model Factory promotion also requires the formal-symbolic packet URI, external Lean proof URI, actual Z3 release packet URI, domain contracts URI, code-contract packet URI, verification flags, operator sign-off, and evidence hash in the release gate.

pip install "director-ai[nli,vector,embeddings,openai]"

from director_ai import guard, CoherenceScorer, VectorGroundTruthStore, ChromaBackend

# Production vector store with persistent storage
backend = ChromaBackend(
    collection_name="prod_facts",
    persist_directory="/data/chroma",
    embedding_model="BAAI/bge-large-en-v1.5",
)
store = VectorGroundTruthStore(backend=backend)

# Ingest your knowledge base (your_documents is a placeholder for your own corpus)
store.ingest(your_documents)

# Create scorer with caching and NLI
scorer = CoherenceScorer(
    threshold=0.6,
    soft_limit=0.7,
    use_nli=True,
    nli_model="lytang/MiniCheck-DeBERTa-L",
    ground_truth_store=store,
    cache_size=2048,
    cache_ttl=600,
    nli_quantize_8bit=True,
    nli_device="cuda",
)

# Optional: enable injection detection on every review()
scorer.enable_injection_detection(
    injection_threshold=0.7,
    baseline_divergence=0.4,
)

Scaling¶

Horizontal: Multiple Workers¶

uvicorn director_ai.server:app --workers 4 --host 0.0.0.0 --port 8080

Each worker gets its own scorer instance. The NLI model is loaded once per worker via lru_cache.

Rate limiting across workers and instances

Rate limiting is in-memory and per worker process unless a shared backend is configured. With --workers N (or several instances behind a load balancer) and no shared store, the effective limit is multiplied by the number of processes. Set redis_url so the server's SlowAPI limiter uses Redis and enforces a single global limit; the server logs a warning at startup when server_workers > 1 and no redis_url is set. The standalone RateLimitMiddleware also accepts redis_url or an injected shared store. For multi-instance enforcement, use one of these options: the main server's Redis-backed limiter, the standalone middleware with a shared store, or a reverse proxy/API gateway.
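To make the multiplication concrete, the arithmetic can be sketched as below. The function name and the 60-requests-per-minute figure are illustrative assumptions, not Director-AI defaults, and the result is an upper bound that assumes the load balancer can spread one client's requests across every process.

```python
def effective_limit(per_process_limit: int, workers: int,
                    instances: int = 1, shared_store: bool = False) -> int:
    """Upper bound on requests a client can make per window across all processes."""
    if shared_store:
        # A shared backend (for example Redis) enforces one global counter.
        return per_process_limit
    # Without a shared store every process keeps its own counter.
    return per_process_limit * workers * instances


# Illustrative numbers only: 60 requests/minute configured per process.
print(effective_limit(60, workers=4))                     # 240
print(effective_limit(60, workers=4, instances=3))        # 720
print(effective_limit(60, workers=4, shared_store=True))  # 60
```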

Request-size and NLI-cost limits

The REST server rejects oversized text before sanitizer, redaction, batch workers, or NLI scoring. Single /v1/process prompts are capped at 100,000 characters, single review responses at 500,000 characters, /v1/batch prompt text at 1,000,000 aggregate characters, and /v1/batch response text at 2,000,000 aggregate characters.
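Clients can avoid a round trip by checking the same caps before sending. This is a minimal client-side sketch using the limits quoted above; the function names are hypothetical, it assumes the server counts characters the way Python's len() does, and it checks only the aggregate caps for batches because the text does not say whether per-item caps also apply there.

```python
# Caps copied from the paragraph above.
MAX_PROMPT_CHARS = 100_000
MAX_RESPONSE_CHARS = 500_000
MAX_BATCH_PROMPT_CHARS = 1_000_000
MAX_BATCH_RESPONSE_CHARS = 2_000_000


def single_violations(prompt: str, response: str = "") -> list[str]:
    """Return a list of cap violations for one prompt/response pair."""
    problems = []
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt has {len(prompt)} chars, cap is {MAX_PROMPT_CHARS}")
    if len(response) > MAX_RESPONSE_CHARS:
        problems.append(f"response has {len(response)} chars, cap is {MAX_RESPONSE_CHARS}")
    return problems


def batch_violations(pairs) -> list[str]:
    """Return aggregate cap violations for an iterable of (prompt, response) tuples."""
    pairs = list(pairs)
    prompt_total = sum(len(prompt) for prompt, _ in pairs)
    response_total = sum(len(response) for _, response in pairs)
    problems = []
    if prompt_total > MAX_BATCH_PROMPT_CHARS:
        problems.append(f"batch prompts total {prompt_total} chars, cap is {MAX_BATCH_PROMPT_CHARS}")
    if response_total > MAX_BATCH_RESPONSE_CHARS:
        problems.append(f"batch responses total {response_total} chars, cap is {MAX_BATCH_RESPONSE_CHARS}")
    return problems
```

An empty list means the payload fits under the documented caps; the server remains the authority on what it accepts.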

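For multi-worker GPU sharing, load the model once and share it via torch multiprocessing. The scorer configuration below pins the model to a single device and combines half precision with 8-bit quantization to keep per-model VRAM low: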

scorer = CoherenceScorer(
    use_nli=True,
    nli_device="cuda:0",
    nli_torch_dtype="float16",
    nli_quantize_8bit=True,
)

8-bit quantization reduces VRAM from ~1.5GB to ~400MB per model.
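When each worker holds its own copy of the model, as described under Multiple Workers, the saving scales with the worker count. A rough sketch of that arithmetic, using the approximate figures above and ignoring activation memory and framework overhead:

```python
# Approximate per-model VRAM from the sentence above, in MB.
FULL_PRECISION_MB = 1500
QUANTIZED_8BIT_MB = 400


def total_model_vram_mb(workers: int, quantized: bool) -> int:
    """Estimate VRAM used by model weights when every worker loads its own copy."""
    per_model = QUANTIZED_8BIT_MB if quantized else FULL_PRECISION_MB
    return workers * per_model


print(total_model_vram_mb(4, quantized=False))  # 6000
print(total_model_vram_mb(4, quantized=True))   # 1600
```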

Edge And Mobile Release Evidence¶

Browser, Worker, mobile, and embedded deployments use the WASM halt kernel plus a host-owned scorer. Before release, generate:

PYTHONPATH=src python -m benchmarks.edge_mobile_evidence

The packet must show ready_for_release=true for customer release. If it only shows ready_for_local_trial=true, the source contracts and docs are present but release evidence is still missing for local WASM artefacts, quantised model artefacts, browser/Web Worker smoke, and mobile or embedded-device smoke.
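A release script can turn those two flags into a gate. The sketch below assumes the packet is a JSON object with ready_for_release and ready_for_local_trial as top-level booleans; the actual packet layout is not shown here, so treat the field placement and the function name as assumptions.

```python
import json


def release_status(packet: dict) -> str:
    """Classify an edge/mobile evidence packet by its readiness flags."""
    if packet.get("ready_for_release") is True:
        return "release"
    if packet.get("ready_for_local_trial") is True:
        # Contracts and docs are present, release evidence is still missing.
        return "local-trial-only"
    return "blocked"


# Hypothetical packet content for illustration.
packet = json.loads('{"ready_for_local_trial": true, "ready_for_release": false}')
print(release_status(packet))  # local-trial-only
```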

Score Caching¶

Enable caching to reduce NLI inference by 60-80% in streaming workloads:

scorer = CoherenceScorer(
    cache_size=4096,
    cache_ttl=300,
)

# Monitor cache performance
print(f"Hit rate: {scorer.cache.hit_rate:.1%}")
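For intuition about what cache_size and cache_ttl control, here is a small stand-alone LRU + TTL cache with a hit_rate property. It is an illustrative sketch, not Director-AI's ScoreCache; the class name and the injectable clock are assumptions made so the behaviour is easy to test.

```python
import time
from collections import OrderedDict


class TinyScoreCache:
    """Least-recently-used cache whose entries also expire after a fixed TTL."""

    def __init__(self, max_size, ttl_seconds, clock=time.monotonic):
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, score)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        self._entries.pop(key, None)  # drop an expired entry, if any
        self.misses += 1
        return None

    def put(self, key, score):
        self._entries[key] = (self._clock() + self._ttl, score)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A larger size keeps more distinct pairs resident, while a shorter TTL bounds how long a stale score can be served after the knowledge base changes.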

Resource Sizing¶

Workload	CPU	RAM	GPU	Latency (measured)
Heuristic only	1 core	256MB	None	<0.1 ms
ONNX GPU batch	2 cores	2GB	1.2GB VRAM	14.6 ms/pair
PyTorch GPU batch	2 cores	2GB	1.2GB VRAM	19 ms/pair
ONNX GPU sequential	2 cores	2GB	1.2GB VRAM	65 ms/pair
PyTorch GPU sequential	2 cores	2GB	1.2GB VRAM	197 ms/pair
ONNX CPU batch	4 cores	2GB	None	383 ms/pair
MiniCheck (GPU)	2 cores	1GB	400MB VRAM	~60 ms
+ bge-large embeddings	+1 core	+500MB	+200MB VRAM	+5 ms
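The measured latencies translate into a rough per-worker throughput ceiling. The sketch below uses the figures from the table and assumes one pair in flight at a time, no cache hits, and linear scaling across workers, so it is a starting estimate rather than a capacity guarantee; the function names are hypothetical.

```python
import math

# Latencies in ms/pair, copied from the table above.
MEASURED_MS_PER_PAIR = {
    "ONNX GPU batch": 14.6,
    "PyTorch GPU batch": 19.0,
    "ONNX GPU sequential": 65.0,
    "PyTorch GPU sequential": 197.0,
    "ONNX CPU batch": 383.0,
}


def pairs_per_second(ms_per_pair: float) -> float:
    """Throughput ceiling for one worker processing pairs back to back."""
    return 1000.0 / ms_per_pair


def workers_needed(target_pairs_per_second: float, ms_per_pair: float) -> int:
    """Workers required to reach a target rate, assuming linear scaling."""
    return math.ceil(target_pairs_per_second * ms_per_pair / 1000.0)


for workload, latency in MEASURED_MS_PER_PAIR.items():
    print(f"{workload}: {pairs_per_second(latency):.1f} pairs/s per worker")
```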

Monitoring¶

Prometheus Metrics¶

from director_ai.core.metrics import metrics

# Built-in metrics exposed at /v1/metrics and /v1/metrics/prometheus
print(metrics.prometheus_format())

When API keys are configured, /v1/metrics/prometheus requires authentication by default. Prometheus-compatible scrapers can use:

Authorization: Bearer <api-key>
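For an ad hoc check outside Prometheus, the same header can be sent with the Python standard library. This is a minimal sketch; the base URL and key are placeholders, and the helper names are hypothetical.

```python
import urllib.request


def build_scrape_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the Prometheus metrics endpoint."""
    url = base_url.rstrip("/") + "/v1/metrics/prometheus"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def scrape(base_url: str, api_key: str, timeout: float = 5.0) -> str:
    """Fetch the metrics text; raises urllib.error.HTTPError on 401/403."""
    request = build_scrape_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, scrape("http://localhost:8080", api_key) returns the exposition text when the key is accepted.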

Available metrics:

Metric	Type	Description
`director_ai_reviews_total`	counter	Total review requests
`director_ai_reviews_approved`	counter	Approved reviews
`director_ai_reviews_rejected`	counter	Rejected reviews
`director_ai_halts_total`	counter	Safety kernel halts (by reason)
`director_ai_coherence_score`	histogram	Score distribution
`director_ai_review_duration_seconds`	histogram	Review latency
`director_ai_active_requests`	gauge	In-flight requests
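The counters above are enough to derive a rejection rate for dashboards or alerts. The sketch below parses Prometheus text exposition output with the standard library; the sample values and the reason label values are made up for illustration, and the parser assumes samples carry no trailing timestamps.

```python
def parse_samples(text: str) -> dict:
    """Sum sample values per metric name, collapsing label sets."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        samples[name] = samples.get(name, 0.0) + float(value)
    return samples


def rejection_rate(samples: dict) -> float:
    """Rejected reviews as a fraction of all reviews (0.0 when there are none)."""
    total = samples.get("director_ai_reviews_total", 0.0)
    rejected = samples.get("director_ai_reviews_rejected", 0.0)
    return rejected / total if total else 0.0


# Hypothetical scrape output for illustration only.
SAMPLE = """\
# TYPE director_ai_reviews_total counter
director_ai_reviews_total 200
director_ai_reviews_approved 170
director_ai_reviews_rejected 30
director_ai_halts_total{reason="contradiction"} 4
director_ai_halts_total{reason="timeout"} 1
"""

print(rejection_rate(parse_samples(SAMPLE)))
```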

Health Check¶

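# Assumes `app` is the FastAPI-style application and `scorer` is the
# CoherenceScorer created during startup.
@app.get("/health")
def health():
    return {
        "status": "ok",
        # bool() keeps this a strict true/false; the bare `and` would return
        # the falsy object itself (for example None) when no NLI backend is set.
        "nli_loaded": bool(scorer._nli and scorer._nli.model_available),
        "cache_hit_rate": scorer.cache.hit_rate if scorer.cache else None,
        "cache_size": scorer.cache.size if scorer.cache else 0,
    }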

Continuous Batching (ReviewQueue)¶

For API servers under concurrent load, enable server-level request accumulation. The queue collects incoming /v1/review requests and flushes them as a single review_batch() call, reducing GPU kernel launches from 2*N for N requests to 2 per flush window (when NLI is available).

from director_ai.core.config import DirectorConfig
from director_ai.server import create_app

config = DirectorConfig(
    review_queue_enabled=True,
    review_queue_max_batch=32,
    review_queue_flush_timeout_ms=10.0,
)
app = create_app(config)

Environment variable override:

DIRECTOR_REVIEW_QUEUE_ENABLED=1 \
DIRECTOR_REVIEW_QUEUE_MAX_BATCH=64 \
DIRECTOR_REVIEW_QUEUE_FLUSH_TIMEOUT_MS=20 \
uvicorn director_ai.server:app --host 0.0.0.0 --port 8080

Session-bound requests (with session_id) bypass the queue automatically. Tenant groups inside one flush are dispatched concurrently. This preserves tenant-specific scorer calls while avoiding a response-time signal that scales linearly with the number of tenants represented in the flush window.
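The accumulate-then-flush pattern behind the queue can be illustrated with plain asyncio. This is a simplified sketch, not the ReviewQueue implementation: the class name is hypothetical, the batch function runs synchronously on the event loop, and session bypass and tenant grouping are left out.

```python
import asyncio


class TinyReviewQueue:
    """Collect concurrent requests and resolve them from one batch call."""

    def __init__(self, batch_fn, max_batch=32, flush_timeout_ms=10.0):
        self._batch_fn = batch_fn  # callable: list of items -> list of results
        self._max_batch = max_batch
        self._timeout = flush_timeout_ms / 1000.0
        self._pending = []  # (item, future) pairs awaiting the next flush
        self._timer = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((item, future))
        if len(self._pending) >= self._max_batch:
            self._flush()  # batch is full: flush immediately
        elif self._timer is None:
            # First item in a new window: flush when the timeout expires.
            self._timer = loop.call_later(self._timeout, self._flush)
        return await future

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        try:
            results = self._batch_fn([item for item, _ in batch])
        except Exception as exc:
            # Propagate a batch failure to every waiting caller.
            for _, future in batch:
                if not future.done():
                    future.set_exception(exc)
            return
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)
```

With max_batch=32 and a 10 ms timeout, 32 concurrent callers trigger one batch call instead of 32 separate ones, which is where the 2*N to 2 reduction in kernel launches comes from. A lone request waits at most one flush timeout before it is scored.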

Throughput Comparison¶

Mode	GPU Kernels	Per-Pair Latency	Use Case
`review()` serial	2 per pair	14.6 ms (ONNX GPU)	Single requests
`review_batch()` coalesced	2 total	~14.6 ms amortised	Batch API (`/v1/batch`)
ReviewQueue (10ms flush)	~2 per window	10-20 ms p95	High-concurrency API

Security¶

Store API keys in environment variables, not config files
Use DirectorConfig._REDACTED_FIELDS for safe serialization
Enable InputSanitizer to filter prompt injection attempts
Audit all rejections via AuditLogger
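The first two points can be sketched together: load secrets from the environment, and mask them whenever the configuration is serialized. The example below is a generic illustration, not DirectorConfig itself; the class, the field names, and the EXAMPLE_* variable names are all hypothetical.

```python
import os
from dataclasses import asdict, dataclass

# Hypothetical set of field names that must never appear in logs or dumps.
REDACTED_FIELDS = frozenset({"api_key", "kb_signing_key"})


@dataclass
class ServiceSettings:
    """Stand-in settings object used only for this illustration."""

    upstream_url: str
    api_key: str
    kb_signing_key: str

    @classmethod
    def from_env(cls, env=None):
        """Read settings from environment variables (names are placeholders)."""
        env = os.environ if env is None else env
        return cls(
            upstream_url=env["EXAMPLE_UPSTREAM_URL"],
            api_key=env["EXAMPLE_API_KEY"],
            kb_signing_key=env["EXAMPLE_KB_SIGNING_KEY"],
        )

    def safe_dict(self) -> dict:
        """Serialize with secret values masked."""
        return {
            name: "***" if name in REDACTED_FIELDS and value else value
            for name, value in asdict(self).items()
        }
```

Logging settings.safe_dict() instead of the raw object keeps keys out of log files and crash reports.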

Experimental Hooks¶

Keep experimental hook packages disabled for live deployments unless they run behind a separately isolated service boundary with operator review. This includes meta_guard, self_evolving, continual_adversarial, and adjacent research surfaces that can adapt prompts, policies, adversarial examples, or runtime decisions.

For production use, route these hooks through a non-serving evaluation lane first. Promote a hook only after replay tests, tenant-safe audit logging, rollback instructions, and human review are in place for the deployment.

The nightly live red-team workflow runs both current jailbreak/evasion data and property contract gates for experimental namespace loading, cross-language serialisation, zk attestation, and cyber-physical halt contracts. Keep those property gates green before promoting experimental hooks or changing generated protocol boundaries.

Production Checklist¶

Before going live, verify each item: