# Docker Deployment
## Build locally
Pre-built images are not yet published to a registry. Build from the included Dockerfiles:
```shell
# CPU-only (heuristic scoring, ~200 MB)
docker build -t director-ai .
docker run -p 8080:8080 director-ai

# GPU-enabled (ONNX CUDA, FactCG model baked in, ~5 GB)
docker build -f Dockerfile.gpu -t director-ai:gpu .
docker run --gpus all -p 8080:8080 director-ai:gpu
```
## Docker Compose

```shell
docker compose up                  # CPU
docker compose --profile gpu up    # GPU (NVIDIA)
docker compose --profile full up   # CPU + ChromaDB
```
The GPU service requires the NVIDIA Container Toolkit:
```yaml
# docker-compose.yml (gpu profile)
director-api-gpu:
  build:
    context: .
    dockerfile: Dockerfile.gpu
  profiles: ["gpu"]
  ports:
    - "8080:8080"
  environment:
    - DIRECTOR_USE_NLI=true
    - DIRECTOR_ONNX_PATH=/app/models/onnx
    - DIRECTOR_METRICS_ENABLED=true
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```
## GPU image (Dockerfile.gpu)

Three-stage build: install deps → export ONNX model → runtime.

- **Stage 1 (builder):** `pip install .[server,nli]` plus `onnxruntime-gpu` and CUDA `torch`
- **Stage 2 (model-build):** runs `export_onnx()` to produce `/models/onnx/`
- **Stage 3 (runtime):** copies artifacts from the earlier stages, sets environment variables, and defines a healthcheck
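The staged layout could be sketched roughly as follows. This is an illustration of the pattern, not the actual `Dockerfile.gpu`: the base images, the site-packages path, and the `director_ai.nli.export_onnx` import are assumptions.

```dockerfile
# Stage 1 — builder: package with server + NLI extras, GPU ONNX runtime, CUDA torch
FROM python:3.12-slim AS builder
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir ".[server,nli]" onnxruntime-gpu torch

# Stage 2 — model-build: export the NLI model to ONNX at build time
FROM builder AS model-build
RUN python -c "from director_ai.nli import export_onnx; export_onnx('/models/onnx')"

# Stage 3 — runtime: copy installed packages and the exported model, no build tools
FROM python:3.12-slim AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=model-build /models/onnx /app/models/onnx
ENV DIRECTOR_USE_NLI=true DIRECTOR_ONNX_PATH=/app/models/onnx ORT_ENABLE_ALL=1
EXPOSE 8080
CMD ["uvicorn", "director_ai.server:app", "--host", "0.0.0.0", "--port", "8080"]
```

Keeping the export in its own stage means a model change rebuilds only stages 2–3 while the dependency layer stays cached.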
The ONNX model is exported at build time (deterministic startup, no HuggingFace downloads at runtime). Environment defaults:
- `DIRECTOR_USE_NLI=true`
- `DIRECTOR_ONNX_PATH=/app/models/onnx`
- `ORT_ENABLE_ALL=1` (graph optimization)
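Any of these defaults can be flipped without rebuilding, e.g. via a standard Compose override file. A minimal sketch (the service name matches the gpu profile above; the file name and value are illustrative):

```yaml
# docker-compose.override.yml — run the GPU image with NLI scoring disabled
# for a quick smoke test; Compose merges this over docker-compose.yml
services:
  director-api-gpu:
    environment:
      - DIRECTOR_USE_NLI=false
```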
## Multi-replica GPU scaling
Each replica loads its own ONNX session. At 0.9 ms/pair on an Ada-generation GPU, three replicas handle roughly 3,000 pairs/second.
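The capacity figure is simple arithmetic on the per-pair latency; a quick sketch using the numbers above:

```python
def throughput_pairs_per_sec(latency_ms_per_pair: float, replicas: int) -> float:
    """Aggregate throughput: each replica serves 1000 / latency pairs per second."""
    return replicas * 1000.0 / latency_ms_per_pair

# 3 replicas at 0.9 ms/pair -> ~3,333 pairs/s, i.e. the "~3,000 pairs/second" above
print(round(throughput_pairs_per_sec(0.9, 3)))  # → 3333
```

Note that with `count: 1` in the device reservation, each replica claims its own GPU; scaling replicas beyond the number of physical GPUs requires sharing a device between sessions.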
## Verify

```shell
curl localhost:8080/v1/health
# {"status":"ok","nli_loaded":true,"model":"FactCG-DeBERTa-v3-Large"}

curl -X POST localhost:8080/v1/review \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"What is 2+2?","response":"4."}'
```
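Because the model loads at startup, the first health check can lag container start. A small readiness poll is a sketch built only from the sample response above; the endpoint and field names come from it, the helper names are ours:

```python
import json
import time
import urllib.request

def healthy(payload: bytes) -> bool:
    """True when the /v1/health payload reports OK with the NLI model loaded."""
    d = json.loads(payload)
    return d.get("status") == "ok" and d.get("nli_loaded", False)

def wait_ready(url: str = "http://localhost:8080/v1/health", timeout: float = 120.0) -> bool:
    """Poll the health endpoint until it reports ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if healthy(resp.read()):
                    return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(2)
    return False
```

Useful as a deploy gate: only route traffic (or start the POST smoke test) once `wait_ready()` returns `True`.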
## CPU-only image (Dockerfile)
For development or heuristic-only scoring:
```dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir "director-ai[server]"
EXPOSE 8080
CMD ["uvicorn", "director_ai.server:app", "--host", "0.0.0.0", "--port", "8080"]
```
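If you want the CPU container to report health the way the GPU image does, a `HEALTHCHECK` against the endpoint from the Verify section can be appended. The interval values here are illustrative, not from the project:

```dockerfile
# Optional addition to the Dockerfile above: probe the documented health endpoint
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/v1/health')" || exit 1
```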
## Resource requirements

| Image | Size | RAM | GPU VRAM | Latency |
|---|---|---|---|---|
| CPU (`latest`) | ~200 MB | 256 MB | — | <0.1 ms (heuristic) |
| GPU (`gpu`) | ~5 GB | 2 GB | 1.2 GB | 14.6 ms/pair (ONNX CUDA) |