# Quantization Guide

## 8-bit quantization (bitsandbytes)
Director-AI supports 8-bit quantization via bitsandbytes for PyTorch
backends. This roughly halves VRAM usage with minimal accuracy loss (see the trade-off table below).
```python
from director_ai import CoherenceScorer

scorer = CoherenceScorer(
    use_nli=True,            # enable the NLI backend
    nli_quantize_8bit=True,  # load NLI weights in 8-bit via bitsandbytes
    nli_device="cuda",       # bitsandbytes 8-bit requires a CUDA device
)
```
Requires: `pip install director-ai[quantize]` (installs bitsandbytes).
## Performance trade-offs

Measured on a GTX 1060 6GB with a 16-pair batch, using FactCG-DeBERTa-v3-Large. Values marked with ~ are estimates rather than direct measurements (the GTX 1060 auto-skips FP16; see below):
| Backend | Precision | VRAM | Latency/pair | Bal. Acc |
|---|---|---|---|---|
| ONNX GPU (FP32) | FP32 | 1.2 GB | 14.6 ms | 75.8% |
| PyTorch GPU (FP32) | FP32 | 1.4 GB | 19.0 ms | 75.8% |
| PyTorch GPU (FP16) | FP16 | 0.7 GB | ~10 ms | 75.8% |
| PyTorch GPU (INT8) | INT8 | ~0.4 GB | ~8 ms | ~75.1% |
> **Note:** ONNX Runtime does not use bitsandbytes. For ONNX quantization, use
> optimum static quantization or ONNX Runtime's built-in quantizer.
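As one offline option, ONNX Runtime's built-in dynamic quantizer converts weights to INT8 in a single call. A minimal sketch, with hypothetical model paths:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Paths are placeholders; point them at the exported NLI model.
quantize_dynamic(
    model_input="model_dir/model.onnx",
    model_output="model_dir/model.int8.onnx",
    weight_type=QuantType.QInt8,  # quantize weights to signed INT8
)
```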
## FP16 (half precision)

No extra dependencies are required. FP16 is supported on GPUs with compute capability >= 7.0 (Volta and newer). On older cards such as the GTX 1060 (compute 6.1), the scorer auto-skips FP16 and falls back to FP32.
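The fallback decision can be reproduced with standard PyTorch APIs. A minimal sketch of such a capability check (not necessarily the scorer's exact logic):

```python
import torch

def fp16_supported(device: str = "cuda") -> bool:
    """True if the device supports FP16 (compute capability >= 7.0)."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(torch.device(device))
    # A GTX 1060 reports (6, 1), so it would fall back to FP32 here.
    return (major, minor) >= (7, 0)
```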
## When to quantize

- INT8: VRAM-constrained GPU deployments where a small accuracy drop (~0.7 points balanced accuracy) is acceptable; bitsandbytes 8-bit requires CUDA.
- FP16: Default for GPU deployments. Free speedup on Volta+.
- ONNX GPU: Fastest path overall. Prefer this over PyTorch quantization when `onnxruntime-gpu` is available (a quick check is sketched below).
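To verify that the GPU build of ONNX Runtime is installed, one option is to probe the available execution providers; `CUDAExecutionProvider` is only exposed by the `onnxruntime-gpu` wheel:

```python
import onnxruntime as ort

# The CPU-only onnxruntime wheel lists only CPUExecutionProvider;
# onnxruntime-gpu additionally exposes CUDAExecutionProvider.
if "CUDAExecutionProvider" in ort.get_available_providers():
    print("onnxruntime-gpu detected: use the ONNX GPU backend")
else:
    print("CPU-only build: consider PyTorch FP16/INT8 instead")
```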
## TensorRT (experimental)

For sub-10 ms inference on Ada/Hopper GPUs. Requires libnvinfer (the NVIDIA
TensorRT runtime). The scorer auto-detects TensorRT availability and builds
an engine cache in `<onnx_path>/trt_cache/`.
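For reference, this kind of engine caching can be expressed with ONNX Runtime's TensorRT execution provider. A minimal sketch assuming a hypothetical model path; this is not necessarily the scorer's internal setup:

```python
import onnxruntime as ort

# Cache compiled TensorRT engines so later runs skip the slow engine build.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "model_dir/trt_cache",  # mirrors <onnx_path>/trt_cache/
    }),
    "CUDAExecutionProvider",  # fallback for ops TensorRT cannot handle
]
session = ort.InferenceSession("model_dir/model.onnx", providers=providers)
```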