Threshold Tuning Guide¶
When to Use Heuristic vs NLI¶
| Mode | Latency | Accuracy | Best For |
|---|---|---|---|
Heuristic only (use_nli=False) |
< 0.1 ms | Moderate — catches obvious off-topic responses | High-throughput, cost-sensitive, or when KB coverage is strong |
NLI (use_nli=True) |
15-200 ms (GPU/CPU) | High — catches subtle contradictions | Medical, legal, finance, or any domain where factual precision matters |
Chunked NLI (score_chunked) |
30-400 ms | Highest — catches localized hallucinations | Long responses where a single hallucinated sentence hides in correct text |
Hybrid (scorer_backend="hybrid") |
200-500 ms | ~78% est. | High-stakes pipelines, dialogue tasks where extra precision is needed |
Rule of thumb: start with heuristic for development. Switch to NLI for production if your domain has high factual stakes. Summarisation FPR at 10.5% (v3.5, bidirectional NLI + baseline calibration). CoherenceScorer produces scores in [0.25, 0.55] — domain profiles use threshold=0.30 based on measured PubMedQA and FinanceBench results (2026-03-20). Tune on your own data.
For per-backend latency numbers and cadence combinations, see Streaming Overhead.
Score Components¶
The coherence score is a weighted combination:
Where:
h_logical— NLI-derived logical divergence (0 = entailed, 1 = contradicted)h_factual— NLI-derived factual divergence from KB retrievalw_logic,w_fact— configurable weights (default 0.6, 0.4; must sum to 1.0)
Adjusting weights:
- High
w_logic→ penalizes logical contradictions more (good for reasoning tasks) - High
w_fact→ penalizes factual divergence from KB more (good for RAG pipelines) - Both low → relies primarily on heuristic word overlap
Running a Threshold Sweep¶
This scores all samples once, then evaluates catch rate / FPR at thresholds from 0.30 to 0.80:
Threshold Catch FPR Prec F1
0.30 89.2% 45.1% 66.3% 76.1%
0.35 82.4% 32.0% 72.0% 76.9%
0.40 74.1% 21.3% 77.8% 75.9%
0.45 65.2% 14.5% 81.8% 72.5%
0.50 55.0% 8.7% 86.3% 67.2%
...
Pick the threshold where F1 is maximized for your risk tolerance.
Decision Matrix¶
| Symptom | Action |
|---|---|
| High false-positive rate (correct responses rejected) | Lower coherence_threshold by 0.05-0.10 |
| Missing hallucinations (low catch rate) | Raise coherence_threshold by 0.05-0.10 |
| Good catch rate but noisy warnings | Raise soft_limit closer to threshold |
| Streaming halts too aggressively | Increase window_size or lower trend_threshold |
| Streaming misses gradual degradation | Decrease window_size or raise trend_threshold |
| NLI scores all cluster near 0.5 | Check KB coverage — scorer needs grounding facts to differentiate |
Domain-Specific Presets¶
Medical¶
scorer = CoherenceScorer(
threshold=0.6,
soft_limit=0.7,
use_nli=True,
ground_truth_store=medical_kb,
)
See Medical Cookbook and examples/medical_guard.py.
Customer Support¶
scorer = CoherenceScorer(
threshold=0.5,
soft_limit=0.6,
use_nli=True,
ground_truth_store=support_kb,
)
See Customer Support Cookbook and examples/customer_support_guard.py.
Finance¶
See Finance Cookbook.
Streaming Threshold Tuning¶
For streaming workloads, you tune 4 parameters:
| Parameter | Default | Effect of Raising | Effect of Lowering |
|---|---|---|---|
hard_limit |
0.5 | More immediate halts | Tolerates brief dips |
window_threshold |
0.55 | Stricter sustained quality | Allows temporary degradation |
trend_threshold |
0.15 | More sensitive to coherence drops | Ignores gradual decline |
window_size |
10 | Smooths noise (less reactive) | Faster response to changes |
Use streaming_debug=True to inspect per-token scores and identify which mechanism triggers. See Streaming Halt.
Grid-Search Example¶
Iterate candidate thresholds on a labeled dataset and pick the one that maximizes F1:
from director_ai.core import CoherenceScorer, GroundTruthStore
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
scorer = CoherenceScorer(threshold=t, ground_truth_store=store, use_nli=True)
tp = fp = tn = fn = 0
for prompt, response, is_hallucinated in labeled_data:
approved, _ = scorer.review(prompt, response)
if is_hallucinated and not approved:
tp += 1
elif is_hallucinated and approved:
fn += 1
elif not is_hallucinated and not approved:
fp += 1
else:
tn += 1
precision = tp / (tp + fp) if (tp + fp) else 0
recall = tp / (tp + fn) if (tp + fn) else 0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
print(f" threshold={t:.1f} P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
Or use the CLI: director-ai bench --dataset e2e for an automated sweep.
Domain Recommendation Table¶
| Domain | Threshold | Rationale |
|---|---|---|
| Medical | 0.70 | Patient safety demands low false-negative rate |
| Legal | 0.65 | Regulatory compliance; moderate tolerance |
| Finance | 0.60 | Quantitative claims must be grounded |
| Customer support | 0.50 | Balanced; some creative latitude acceptable |
| Creative | 0.40 | Permissive; hallucination is less harmful |
Pitfalls¶
- Threshold too high (> 0.75): correct responses get rejected (false positives). Users see "hallucination detected" on accurate text. Reduce threshold or improve KB coverage.
- Threshold too low (< 0.35): hallucinations pass through. The guardrail becomes decorative. Raise threshold or enable NLI.
- Empty KB: without ground truth facts, factual divergence defaults to 0.5 (neutral). The scorer relies entirely on logical divergence. Always populate your
GroundTruthStore. - Short responses: NLI models need sufficient text to make meaningful entailment judgments. Responses under 5 words may produce unreliable scores.