Streaming Overhead¶

Director-AI's StreamingKernel scores every token by default. For latency-critical paths you can reduce scoring frequency with cadence parameters.

Parameters¶

Parameter	Type	Default	Description
`score_every_n`	`int`	1	Score every N-th token. Skipped tokens reuse the last score.
`adaptive`	`bool`	False	Automatically adjust cadence based on coherence.
`max_cadence`	`int`	8	Upper bound for adaptive cadence ramp-up.

These parameters are available on both StreamingKernel and AsyncStreamingKernel.

Overhead by Cadence¶

Measured with 200 tokens, heuristic scorer (no NLI), 5 iterations:

Cadence	Tokens/s	Wall (ms)	Overhead	Callbacks
none	~150K	~1	—	0
1	~4.5K	~44	+3200%	200
4	~16K	~13	+870%	50
8	~28K	~7	+440%	25
adaptive	~18K	~11	+730%	~38

Numbers are CPU-only (heuristic word-overlap scorer). NLI-backed scoring re-runs the full DeBERTa pipeline on the accumulated text per callback (~15-50 ms on GPU). Cadence reduction via score_every_n or adaptive=True is critical for NLI production deployments.

Domain Recommendations¶

Domain	Cadence	Rationale
Medical / Legal	`score_every_n=1`	Every token matters. False negatives are costly.
General chat	`score_every_n=4`	Good balance of safety and throughput.
Latency-critical	`score_every_n=8`	Minimal overhead. Rely on window/trend for catch-up.
Mixed workload	`adaptive=True`	Ramps up when coherent, snaps back on any dip.

Adaptive Cadence¶

When adaptive=True, the kernel adjusts scoring frequency at runtime:

Ramp up: if the sliding window average is above soft_limit and the current cadence is below max_cadence, increment cadence by 1.
Snap back: if any scored token falls below soft_limit, reset cadence to 1 (score every token).

This gives near-full coverage during degradation while reducing overhead during stable generation.

Backend Selection¶

The scorer backend determines per-callback cost. Choose based on your latency budget and accuracy requirements.

Backend	Per-pair (GPU)	Per-pair (CPU)	Accuracy	Best for
`heuristic`	<0.1 ms	<0.1 ms	~55%	Prototyping, edge devices
`deberta` (PyTorch)	19 ms (batch) / 197 ms (seq)	N/A	75.8%	Standard GPU production
`onnx`	14.6 ms (batch) / 65 ms (seq)	383 ms	75.8%	Fastest GPU path
`lite`	<1 ms	<1 ms	~60%	High-throughput, low-accuracy OK
`hybrid`	200-500 ms	500-2000 ms	~78% est.	Max accuracy, summarisation

Combining cadence with backends¶

The total per-token overhead is roughly backend_latency / score_every_n:

Combination	Per-token overhead	Use case
ONNX GPU + cadence=1	~14.6 ms	Medical/legal — every token scored
ONNX GPU + cadence=4	~3.7 ms	General chat
ONNX GPU + cadence=8	~1.8 ms	Latency-critical streaming
Heuristic + cadence=1	<0.1 ms	Offline/CPU-only environments
Hybrid + cadence=4	~50-125 ms	Summarisation with high catch rate

When to use hybrid mode¶

scorer_backend="hybrid" runs NLI first, then falls back to LLM-as-judge when NLI confidence is in the grey zone (0.40-0.60). Trade-offs:

Latency: 10-30x slower than pure NLI (adds an LLM API call)
Accuracy: Strongest on summarisation (AggreFact-CNN, ExpertQA)
Privacy: LLM judge sends truncated text (500 chars) to external API
Cost: Each grey-zone call consumes LLM tokens

Recommended: use hybrid only for high-stakes summarisation pipelines where the 200-500 ms per-pair overhead is acceptable.

Reproducing¶

python -m benchmarks.streaming_overhead_bench

Outputs a JSON file at benchmarks/results/streaming_overhead.json and a table to stdout.