Streaming Overhead¶
Director-AI's StreamingKernel scores every token by default. For
latency-critical paths you can reduce scoring frequency with cadence
parameters.
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
score_every_n |
int |
1 | Score every N-th token. Skipped tokens reuse the last score. |
adaptive |
bool |
False | Automatically adjust cadence based on coherence. |
max_cadence |
int |
8 | Upper bound for adaptive cadence ramp-up. |
These parameters are available on both StreamingKernel and
AsyncStreamingKernel.
Overhead by Cadence¶
Measured with 200 tokens, heuristic scorer (no NLI), 5 iterations:
| Cadence | Tokens/s | Wall (ms) | Overhead | Callbacks |
|---|---|---|---|---|
| none | ~150K | ~1 | — | 0 |
| 1 | ~4.5K | ~44 | +3200% | 200 |
| 4 | ~16K | ~13 | +870% | 50 |
| 8 | ~28K | ~7 | +440% | 25 |
| adaptive | ~18K | ~11 | +730% | ~38 |
Numbers are CPU-only (heuristic word-overlap scorer). NLI-backed
scoring re-runs the full DeBERTa pipeline on the accumulated text
per callback (~15-50 ms on GPU). Cadence reduction via
score_every_n or adaptive=True is critical for NLI production
deployments.
Domain Recommendations¶
| Domain | Cadence | Rationale |
|---|---|---|
| Medical / Legal | score_every_n=1 |
Every token matters. False negatives are costly. |
| General chat | score_every_n=4 |
Good balance of safety and throughput. |
| Latency-critical | score_every_n=8 |
Minimal overhead. Rely on window/trend for catch-up. |
| Mixed workload | adaptive=True |
Ramps up when coherent, snaps back on any dip. |
Adaptive Cadence¶
When adaptive=True, the kernel adjusts scoring frequency at runtime:
- Ramp up: if the sliding window average is above
soft_limitand the current cadence is belowmax_cadence, increment cadence by 1. - Snap back: if any scored token falls below
soft_limit, reset cadence to 1 (score every token).
This gives near-full coverage during degradation while reducing overhead during stable generation.
Backend Selection¶
The scorer backend determines per-callback cost. Choose based on your latency budget and accuracy requirements.
| Backend | Per-pair (GPU) | Per-pair (CPU) | Accuracy | Best for |
|---|---|---|---|---|
heuristic |
<0.1 ms | <0.1 ms | ~55% | Prototyping, edge devices |
deberta (PyTorch) |
19 ms (batch) / 197 ms (seq) | N/A | 75.8% | Standard GPU production |
onnx |
14.6 ms (batch) / 65 ms (seq) | 383 ms | 75.8% | Fastest GPU path |
lite |
<1 ms | <1 ms | ~60% | High-throughput, low-accuracy OK |
hybrid |
200-500 ms | 500-2000 ms | ~78% est. | Max accuracy, summarisation |
Combining cadence with backends¶
The total per-token overhead is roughly backend_latency / score_every_n:
| Combination | Per-token overhead | Use case |
|---|---|---|
| ONNX GPU + cadence=1 | ~14.6 ms | Medical/legal — every token scored |
| ONNX GPU + cadence=4 | ~3.7 ms | General chat |
| ONNX GPU + cadence=8 | ~1.8 ms | Latency-critical streaming |
| Heuristic + cadence=1 | <0.1 ms | Offline/CPU-only environments |
| Hybrid + cadence=4 | ~50-125 ms | Summarisation with high catch rate |
When to use hybrid mode¶
scorer_backend="hybrid" runs NLI first, then falls back to LLM-as-judge
when NLI confidence is in the grey zone (0.40-0.60). Trade-offs:
- Latency: 10-30x slower than pure NLI (adds an LLM API call)
- Accuracy: Strongest on summarisation (AggreFact-CNN, ExpertQA)
- Privacy: LLM judge sends truncated text (500 chars) to external API
- Cost: Each grey-zone call consumes LLM tokens
Recommended: use hybrid only for high-stakes summarisation pipelines
where the 200-500 ms per-pair overhead is acceptable.
Reproducing¶
Outputs a JSON file at benchmarks/results/streaming_overhead.json
and a table to stdout.