Quantization Guide

8-bit quantization (bitsandbytes)

Director-AI supports 8-bit quantization via bitsandbytes for PyTorch backends. This reduces the NLI model's VRAM footprint to roughly a third of FP32 (see the table below) with minimal accuracy loss.

from director_ai import CoherenceScorer

scorer = CoherenceScorer(
    use_nli=True,
    nli_quantize_8bit=True,
    nli_device="cuda",
)

Requires: pip install director-ai[quantize] (installs bitsandbytes).
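Under the hood, nli_quantize_8bit presumably maps to the standard transformers + bitsandbytes loading path. A minimal sketch, assuming the NLI checkpoint is a Hugging Face sequence-classification model (the model id below is a placeholder):

from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Placeholder model id; substitute the actual FactCG-DeBERTa-v3-Large checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "FactCG-DeBERTa-v3-Large",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # bitsandbytes needs a CUDA device for 8-bit weights
)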

Performance trade-offs

Measured on GTX 1060 6GB, 16-pair batch, FactCG-DeBERTa-v3-Large:

Backend      Precision  VRAM     Latency/pair  Bal. Acc
ONNX GPU     FP32       1.2 GB   14.6 ms       75.8%
PyTorch GPU  FP32       1.4 GB   19.0 ms       75.8%
PyTorch GPU  FP16       0.7 GB   ~10 ms        75.8%
PyTorch GPU  INT8       ~0.4 GB  ~8 ms         ~75.1%

Note

ONNX Runtime does not use bitsandbytes. For ONNX quantization, use Hugging Face Optimum's static quantization or ONNX Runtime's built-in quantizer, as sketched below.
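For example, ONNX Runtime's built-in dynamic quantizer can produce an INT8 model offline. The file paths here are illustrative:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported NLI model's weights to INT8 offline.
quantize_dynamic(
    model_input="nli_model.onnx",
    model_output="nli_model.int8.onnx",
    weight_type=QuantType.QInt8,
)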

FP16 (half precision)

No extra dependencies. Supported on GPUs with compute capability >= 7.0 (Volta and newer):

scorer = CoherenceScorer(
    use_nli=True,
    nli_torch_dtype="float16",
    nli_device="cuda",
)

On a GTX 1060 (compute capability 6.1), the scorer automatically skips FP16 and falls back to FP32.
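To apply the same guard explicitly, you can check the device's compute capability before requesting FP16. A minimal sketch, assuming nli_torch_dtype also accepts "float32":

import torch

from director_ai import CoherenceScorer

# FP16 needs compute capability >= 7.0 (Volta and newer).
major, minor = torch.cuda.get_device_capability()
dtype = "float16" if (major, minor) >= (7, 0) else "float32"

scorer = CoherenceScorer(
    use_nli=True,
    nli_torch_dtype=dtype,
    nli_device="cuda",
)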

When to quantize

  • INT8: CPU-only deployments where per-pair latency above 200 ms is acceptable but memory is constrained (see the selection sketch after this list).
  • FP16: Default for GPU deployments. Free speedup on Volta+.
  • ONNX GPU: Fastest full-precision path (PyTorch FP16/INT8 trade some precision for lower latency). Prefer it over PyTorch quantization when onnxruntime-gpu is available and full accuracy matters.
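Putting that guidance together, a hypothetical helper might pick the backend at startup. This is only a sketch: it assumes nli_quantize_8bit is honored on CPU (e.g. via a dynamic-quantization fallback) and reuses the parameters shown above:

import torch

from director_ai import CoherenceScorer

def make_scorer() -> CoherenceScorer:
    # Hypothetical helper following the guidance above; not part of the library.
    if not torch.cuda.is_available():
        # CPU-only: INT8 keeps the memory footprint down at the cost of latency.
        return CoherenceScorer(use_nli=True, nli_quantize_8bit=True, nli_device="cpu")
    # GPU: request FP16; the scorer falls back to FP32 on pre-Volta hardware.
    return CoherenceScorer(use_nli=True, nli_torch_dtype="float16", nli_device="cuda")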

TensorRT (experimental)

For sub-10 ms inference on Ada/Hopper GPUs:

export DIRECTOR_ENABLE_TRT=1

Requires libnvinfer (NVIDIA TensorRT runtime). The scorer auto-detects TensorRT availability and builds an engine cache in <onnx_path>/trt_cache/.
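If you need to reproduce this setup manually, ONNX Runtime's TensorRT execution provider exposes the same engine-cache behavior. A sketch of what the scorer presumably configures (model path and cache location are illustrative):

import onnxruntime as ort

# TensorRT EP with an on-disk engine cache; CUDA EP as fallback when TRT is unavailable.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "trt_cache",
    }),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("nli_model.onnx", providers=providers)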