Threshold Tuning Guide¶
Decision Flowchart¶
```mermaid
flowchart TD
    START["Choose scoring mode"] --> Q1{"Factual precision<br/>required?"}
    Q1 -->|"No (dev/prototype)"| HEUR["Heuristic<br/>use_nli=False<br/><0.1ms, moderate accuracy"]
    Q1 -->|"Yes"| Q2{"GPU available?"}
    Q2 -->|"No"| CPU["ONNX CPU batch<br/>383ms/pair, 75.6% BA"]
    Q2 -->|"Yes"| Q3{"Long documents?"}
    Q3 -->|"No"| NLI["NLI (ONNX GPU)<br/>14.6ms/pair, 75.6% BA"]
    Q3 -->|"Yes"| CHUNK["Chunked NLI<br/>30-400ms, catches localized hallucinations"]
    Q2 -->|"Yes + high stakes"| Q4{"Extra precision needed?"}
    Q4 -->|"Yes"| HYBRID["Hybrid (NLI + LLM judge)<br/>200-500ms, ~78% est."]
    HEUR --> TUNE["Tune threshold on your data"]
    CPU --> TUNE
    NLI --> TUNE
    CHUNK --> TUNE
    HYBRID --> TUNE
    style HEUR fill:#616161,color:#fff
    style NLI fill:#2e7d32,color:#fff
    style CHUNK fill:#1565c0,color:#fff
    style HYBRID fill:#ff8f00,color:#fff
    style CPU fill:#4a148c,color:#fff
```
When to Use Heuristic vs NLI¶
| Mode | Latency | Accuracy | Best For |
|---|---|---|---|
| Heuristic only (`use_nli=False`) | < 0.1 ms | Moderate — catches obvious off-topic responses | High-throughput, cost-sensitive, or when KB coverage is strong |
| NLI (`use_nli=True`) | 15-200 ms (GPU/CPU) | High — catches subtle contradictions | Medical, legal, finance, or any domain where factual precision matters |
| Chunked NLI (`score_chunked`) | 30-400 ms | Highest — catches localized hallucinations | Long responses where a single hallucinated sentence hides in correct text |
| Hybrid (`scorer_backend="hybrid"`) | 200-500 ms | ~78% est. | High-stakes pipelines, dialogue tasks where extra precision is needed |
Rule of thumb: start with the heuristic scorer during development, and switch to NLI in production if your domain has high factual stakes. Summarisation FPR is 10.5% (v3.5, bidirectional NLI + baseline calibration). NLI-only scoring (without a KB) produces scores in [0.25, 0.55] and has a high FPR on domain tasks (100% on PubMedQA and FinanceBench). Domain profiles require KB grounding or customer-specific calibration; the threshold alone is not sufficient. Tune on your own data.
For per-backend latency numbers and cadence combinations, see Streaming Overhead.
Score Components¶
The coherence score is a weighted combination of two divergence components:

`score = 1 - (w_logic * h_logical + w_fact * h_factual)`

Where:

- `h_logical` — NLI-derived logical divergence (0 = entailed, 1 = contradicted)
- `h_factual` — NLI-derived factual divergence from KB retrieval
- `w_logic`, `w_fact` — configurable weights (default 0.6 and 0.4; must sum to 1.0)
Adjusting weights:
- High `w_logic` → penalizes logical contradictions more (good for reasoning tasks)
- High `w_fact` → penalizes factual divergence from the KB more (good for RAG pipelines)
- Both low → relies primarily on heuristic word overlap
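As a concrete sketch of the combination, assuming the score is one minus the weighted divergence (the library's exact formula may differ):

```python
def coherence_score(h_logical: float, h_factual: float,
                    w_logic: float = 0.6, w_fact: float = 0.4) -> float:
    """Weighted coherence: 1 minus the combined divergence (sketch)."""
    assert abs(w_logic + w_fact - 1.0) < 1e-9, "weights must sum to 1.0"
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)

# Fully entailed, fully KB-grounded response -> perfect coherence
print(coherence_score(0.0, 0.0))  # → 1.0
# Contradicted by NLI but consistent with the KB
print(coherence_score(1.0, 0.0))  # → 0.4 with the default weights
```

Raising `w_logic` makes the contradicted-but-grounded case score lower, which is the behaviour described in the bullets above.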
Running a Threshold Sweep¶
This scores all samples once, then evaluates catch rate / FPR at thresholds from 0.30 to 0.80:
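One way to implement such a sweep is a small helper over precomputed per-sample coherence scores (a sketch; `scores` and `labels` are assumed inputs, with `labels` marking known hallucinations):

```python
def sweep(scores, labels, thresholds):
    """Evaluate catch rate, FPR, precision and F1 at each candidate threshold.

    scores: per-sample coherence scores (higher = more coherent)
    labels: True where the sample is a known hallucination
    A sample is flagged when its score falls below the threshold.
    """
    rows = []
    for t in thresholds:
        tp = sum(s < t for s, bad in zip(scores, labels) if bad)      # caught
        fn = sum(s >= t for s, bad in zip(scores, labels) if bad)     # missed
        fp = sum(s < t for s, bad in zip(scores, labels) if not bad)  # rejected good
        neg = labels.count(False)
        catch = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / neg if neg else 0.0
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = 2 * prec * catch / (prec + catch) if (prec + catch) else 0.0
        rows.append((t, catch, fpr, prec, f1))
    return rows
```

Scoring every sample once and sweeping thresholds afterwards avoids re-running the scorer per threshold.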
```text
Threshold  Catch   FPR    Prec   F1
0.30       89.2%   45.1%  66.3%  76.1%
0.35       82.4%   32.0%  72.0%  76.9%
0.40       74.1%   21.3%  77.8%  75.9%
0.45       65.2%   14.5%  81.8%  72.5%
0.50       55.0%    8.7%  86.3%  67.2%
...
```
Pick the threshold where F1 is maximized for your risk tolerance.
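Applied to the example output above, a one-liner picks the F1-maximizing row:

```python
# (threshold, F1) pairs taken from the sweep output above
rows = [(0.30, 0.761), (0.35, 0.769), (0.40, 0.759), (0.45, 0.725), (0.50, 0.672)]

best_threshold, best_f1 = max(rows, key=lambda r: r[1])
print(best_threshold)  # → 0.35
```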
Decision Matrix¶
| Symptom | Action |
|---|---|
| High false-positive rate (correct responses rejected) | Lower `coherence_threshold` by 0.05-0.10 |
| Missing hallucinations (low catch rate) | Raise `coherence_threshold` by 0.05-0.10 |
| Good catch rate but noisy warnings | Raise `soft_limit` closer to threshold |
| Streaming halts too aggressively | Increase `window_size` or lower `trend_threshold` |
| Streaming misses gradual degradation | Decrease `window_size` or raise `trend_threshold` |
| NLI scores all cluster near 0.5 | Check KB coverage — scorer needs grounding facts to differentiate |
Domain-Specific Presets¶
Medical¶
```python
scorer = CoherenceScorer(
    threshold=0.6,
    soft_limit=0.7,
    use_nli=True,
    ground_truth_store=medical_kb,
)
```
See Medical Cookbook and examples/medical_guard.py.
Customer Support¶
```python
scorer = CoherenceScorer(
    threshold=0.5,
    soft_limit=0.6,
    use_nli=True,
    ground_truth_store=support_kb,
)
```
See Customer Support Cookbook and examples/customer_support_guard.py.
Finance¶
See Finance Cookbook.
Streaming Threshold Tuning¶
For streaming workloads, you tune four parameters:

| Parameter | Default | Effect of Raising | Effect of Lowering |
|---|---|---|---|
| `hard_limit` | 0.5 | More immediate halts | Tolerates brief dips |
| `window_threshold` | 0.55 | Stricter sustained quality | Allows temporary degradation |
| `trend_threshold` | 0.15 | More sensitive to coherence drops | Ignores gradual decline |
| `window_size` | 10 | Smooths noise (less reactive) | Faster response to changes |
Use `streaming_debug=True` to inspect per-token scores and identify which mechanism triggers. See Streaming Halt.
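The interplay of these parameters can be illustrated with a simplified halt check (illustrative of the mechanisms only, not the library's actual implementation):

```python
from collections import deque

def should_halt(history: deque, score: float, hard_limit: float = 0.5,
                window_threshold: float = 0.55, trend_threshold: float = 0.15,
                window_size: int = 10):
    """Return the name of the triggering mechanism, or None to continue."""
    history.append(score)
    if score < hard_limit:
        return "hard_limit"            # immediate halt on a hard floor breach
    window = list(history)[-window_size:]
    if len(window) == window_size:
        if sum(window) / window_size < window_threshold:
            return "window_threshold"  # sustained low quality across the window
        if window[0] - window[-1] > trend_threshold:
            return "trend_threshold"   # steady decline within the window
    return None
```

With a larger `window_size` the average smooths over brief dips, matching the "less reactive" behaviour in the table above.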
Grid-Search Example¶
Iterate candidate thresholds on a labeled dataset and pick the one that maximizes F1:
```python
from director_ai.core import CoherenceScorer, GroundTruthStore

# `store` is a populated GroundTruthStore; `labeled_data` yields
# (prompt, response, is_hallucinated) triples.
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
    scorer = CoherenceScorer(threshold=t, ground_truth_store=store, use_nli=True)
    tp = fp = tn = fn = 0
    for prompt, response, is_hallucinated in labeled_data:
        approved, _ = scorer.review(prompt, response)
        if is_hallucinated and not approved:
            tp += 1  # hallucination caught
        elif is_hallucinated and approved:
            fn += 1  # hallucination missed
        elif not is_hallucinated and not approved:
            fp += 1  # correct response rejected
        else:
            tn += 1  # correct response approved
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    print(f"  threshold={t:.1f}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```
Alternatively, the tuner can write a ready-to-use profile overlay from a labelled JSONL file.
The output includes `coherence_threshold`, `hard_limit`, `soft_limit`, `w_logic`, `w_fact`, and an extra block with the tuning metrics and base profile. Load it with `DirectorConfig.from_yaml(...)` or merge it over `DirectorConfig.from_profile("medical")`.
The tuner also prints and embeds a confidence report. It records:
- the selected threshold and weight pair
- the balanced-accuracy margin over the next-best candidate
- class balance and the selected confusion matrix
- false-positive and false-negative trade-offs for the top candidate thresholds
- boundary examples with the score that would flip each decision
Treat a low confidence level as a warning to add more labelled examples,
rebalance the dataset, or inspect the boundary examples before enforcing the
overlay in production.
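The boundary-example idea can be sketched with a small helper (hypothetical sample structure; the tuner's actual report format may differ):

```python
def boundary_examples(samples, threshold, k=3):
    """Return the k samples whose scores sit closest to the decision boundary.

    Each sample is a (text, score) pair; scores just above or below the
    threshold are the ones a small threshold change would flip.
    """
    return sorted(samples, key=lambda s: abs(s[1] - threshold))[:k]

samples = [("ok answer", 0.72), ("borderline", 0.58), ("bad answer", 0.31)]
print(boundary_examples(samples, threshold=0.6, k=1))  # → [('borderline', 0.58)]
```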
Use `director-ai bench --dataset e2e` when you need the built-in regression benchmark sweep instead of a customer-specific overlay.
Domain Recommendation Table¶
| Domain | Threshold | Rationale |
|---|---|---|
| Medical | 0.30 | NLI-only FPR=100% at this threshold. KB grounding required. |
| Legal | 0.30 | Not yet validated (no benchmark artifact). Aligned with medical/finance. |
| Finance | 0.30 | NLI-only FPR=100% at this threshold. KB grounding required. |
| Customer support | 0.55 | Balanced; some creative latitude acceptable |
| Creative | 0.40 | Permissive; hallucination is less harmful |
Pitfalls¶
- Threshold too high (> 0.75): correct responses get rejected (false positives). Users see "hallucination detected" on accurate text. Reduce threshold or improve KB coverage.
- Threshold too low (< 0.35): hallucinations pass through. The guardrail becomes decorative. Raise threshold or enable NLI.
- Empty KB: without ground truth facts, factual divergence defaults to 0.5 (neutral), and the scorer relies entirely on logical divergence. Always populate your `GroundTruthStore`.
- Short responses: NLI models need sufficient text to make meaningful entailment judgments. Responses under 5 words may produce unreliable scores.
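A simple defence against the short-response pitfall is a word-count pre-check before trusting NLI output (a sketch; the 5-word cutoff comes from the note above, and `nli_reliable` is an illustrative helper, not library API):

```python
def nli_reliable(response: str, min_words: int = 5) -> bool:
    """NLI entailment judgments are unreliable below a minimum response length."""
    return len(response.split()) >= min_words

print(nli_reliable("Yes."))                                       # → False
print(nli_reliable("The dosage should not exceed 40 mg daily."))  # → True
```

Responses that fail the check can fall back to the heuristic score rather than an unreliable NLI verdict.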