Threshold Tuning Guide

When to Use Heuristic vs NLI

Mode	Latency	Accuracy	Best For
Heuristic only (`use_nli=False`)	< 0.1 ms	Moderate — catches obvious off-topic responses	High-throughput, cost-sensitive, or when KB coverage is strong
NLI (`use_nli=True`)	15-200 ms (GPU/CPU)	High — catches subtle contradictions	Medical, legal, finance, or any domain where factual precision matters
Chunked NLI (`score_chunked`)	30-400 ms	Highest — catches localized hallucinations	Long responses where a single hallucinated sentence hides in correct text
Hybrid (`scorer_backend="hybrid"`)	200-500 ms	~78% est.	High-stakes pipelines, dialogue tasks where extra precision is needed
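
The "Chunked NLI" row is easier to see with a toy example. The sketch below splits a response into sentences, scores each one, and reports the lowest-scoring chunk. How `score_chunked` actually splits and aggregates is not described on this page, so the sentence splitter, the min aggregation, and the `toy_score` stand-in (which replaces a real NLI call) are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunked_score(response, score_fn):
    """Score each sentence separately and return (worst_score, worst_chunk).

    Taking the minimum is an assumed aggregation rule: one bad sentence
    is enough to pull the whole response down.
    """
    chunks = split_sentences(response)
    if not chunks:
        raise ValueError("response contains no text to score")
    scores = [score_fn(c) for c in chunks]
    worst = min(range(len(scores)), key=scores.__getitem__)
    return scores[worst], chunks[worst]


def toy_score(chunk):
    """Hypothetical stand-in for an NLI scorer: penalise one planted error."""
    return 0.2 if "Mars" in chunk else 0.9


response = "Aspirin reduces fever. It was discovered on Mars. It is taken orally."
print(chunked_score(response, toy_score))
# (0.2, 'It was discovered on Mars.')
```

A whole-response score would average the planted error together with two correct sentences; scoring per chunk keeps it visible.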

Rule of thumb: start with heuristic mode during development, then switch to NLI for production if your domain has high factual stakes. Summarisation FPR is 10.5% (v3.5, bidirectional NLI + baseline calibration). CoherenceScorer produces scores in [0.25, 0.55]; domain profiles use threshold=0.30, based on measured PubMedQA and FinanceBench results (2026-03-20). Tune on your own data.

For per-backend latency numbers and cadence combinations, see Streaming Overhead.

Score Components

The coherence score is a weighted combination:

score = 1 - (w_logic * h_logical + w_fact * h_factual)

Where:

h_logical — NLI-derived logical divergence (0 = entailed, 1 = contradicted)
h_factual — NLI-derived factual divergence from KB retrieval
w_logic, w_fact — configurable weights (default 0.6, 0.4; must sum to 1.0)

Adjusting weights:

High w_logic → penalizes logical contradictions more (good for reasoning tasks)
High w_fact → penalizes factual divergence from KB more (good for RAG pipelines)
Both low → relies primarily on heuristic word overlap
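
As a worked instance of the formula above, the following minimal sketch computes the score in plain Python. The function name `coherence_score` is illustrative and not part of the library; the defaults mirror the documented weights.

```python
def coherence_score(h_logical, h_factual, w_logic=0.6, w_fact=0.4):
    """Combine two divergences in [0, 1] into a single coherence score."""
    # The weights are documented as summing to 1.0; allow for float rounding.
    if abs(w_logic + w_fact - 1.0) > 1e-9:
        raise ValueError("w_logic and w_fact must sum to 1.0")
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)


# Mild logical divergence, neutral factual divergence (0.5).
print(round(coherence_score(0.2, 0.5), 2))  # 0.68
# Same inputs with the weight shifted toward the factual term.
print(round(coherence_score(0.2, 0.5, w_logic=0.3, w_fact=0.7), 2))  # 0.59
```

Because the weights sum to 1.0, shifting weight toward `w_fact` lowers the score whenever factual divergence exceeds logical divergence, as in the second call.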

Running a Threshold Sweep

python -m benchmarks.e2e_eval --sweep-thresholds --max-samples 200

This scores all samples once, then evaluates catch rate / FPR at thresholds from 0.30 to 0.80:

 Threshold    Catch      FPR     Prec       F1
      0.30   89.2%    45.1%    66.3%    76.1%
      0.35   82.4%    32.0%    72.0%    76.9%
      0.40   74.1%    21.3%    77.8%    75.9%
      0.45   65.2%    14.5%    81.8%    72.5%
      0.50   55.0%     8.7%    86.3%    67.2%
      ...

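Pick the threshold that maximizes F1. If your risk tolerance weighs missed hallucinations and false positives unequally, pick the threshold that meets your target catch rate or FPR instead.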
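
To make the selection step concrete, here is a small sketch that picks a threshold from sweep rows in two ways: by maximum F1, and by highest catch rate under an FPR cap. The rows are copied from the sample output above; the helper names `best_by_f1` and `best_under_fpr_cap` are hypothetical and not part of the benchmark tooling.

```python
# (threshold, catch rate %, FPR %, precision %, F1 %) from the sample sweep.
SWEEP = [
    (0.30, 89.2, 45.1, 66.3, 76.1),
    (0.35, 82.4, 32.0, 72.0, 76.9),
    (0.40, 74.1, 21.3, 77.8, 75.9),
    (0.45, 65.2, 14.5, 81.8, 72.5),
    (0.50, 55.0, 8.7, 86.3, 67.2),
]


def best_by_f1(rows):
    """Return the threshold of the row with the highest F1."""
    return max(rows, key=lambda r: r[4])[0]


def best_under_fpr_cap(rows, max_fpr):
    """Return the threshold with the highest catch rate whose FPR <= max_fpr.

    Returns None when no row satisfies the cap.
    """
    eligible = [r for r in rows if r[2] <= max_fpr]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[1])[0]


print(best_by_f1(SWEEP))                # 0.35
print(best_under_fpr_cap(SWEEP, 15.0))  # 0.45
```

The two criteria disagree on this sample: F1 alone favours 0.35, while a 15% FPR budget pushes the choice to 0.45.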

Decision Matrix

Symptom	Action
High false-positive rate (correct responses rejected)	Lower `coherence_threshold` by 0.05-0.10
Missing hallucinations (low catch rate)	Raise `coherence_threshold` by 0.05-0.10
Good catch rate but noisy warnings	Lower `soft_limit` closer to threshold
Streaming halts too aggressively	Increase `window_size` or lower `trend_threshold`
Streaming misses gradual degradation	Decrease `window_size` or raise `trend_threshold`
NLI scores all cluster near 0.5	Check KB coverage — scorer needs grounding facts to differentiate

Domain-Specific Presets

Medical

scorer = CoherenceScorer(
    threshold=0.6,
    soft_limit=0.7,
    use_nli=True,
    ground_truth_store=medical_kb,
)

See Medical Cookbook and examples/medical_guard.py.

Customer Support

scorer = CoherenceScorer(
    threshold=0.5,
    soft_limit=0.6,
    use_nli=True,
    ground_truth_store=support_kb,
)

See Customer Support Cookbook and examples/customer_support_guard.py.

Finance

scorer = CoherenceScorer(
    threshold=0.55,
    soft_limit=0.65,
    use_nli=True,
)

See Finance Cookbook.
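
The three presets share the same gap between `threshold` and `soft_limit`. The sketch below shows one way to read the two values together, assuming (this page does not state it explicitly) that scores below `threshold` are rejected and scores between `threshold` and `soft_limit` pass with a warning. The `PRESETS` dict and `classify` function are hypothetical helpers, not library API.

```python
# Values copied from the presets above.
PRESETS = {
    "medical": {"threshold": 0.6, "soft_limit": 0.7},
    "customer_support": {"threshold": 0.5, "soft_limit": 0.6},
    "finance": {"threshold": 0.55, "soft_limit": 0.65},
}


def classify(score, domain):
    """Map a coherence score to an outcome under the assumed semantics."""
    preset = PRESETS[domain]
    if score < preset["threshold"]:
        return "reject"
    if score < preset["soft_limit"]:
        return "warn"
    return "approve"


for domain in PRESETS:
    print(domain, classify(0.62, domain))
# medical warn
# customer_support approve
# finance warn
```

Under this reading, the same score of 0.62 passes cleanly for customer support but lands in the warning band for the two stricter domains.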

Streaming Threshold Tuning

For streaming workloads, there are four parameters to tune:

Parameter	Default	Effect of Raising	Effect of Lowering
`hard_limit`	0.5	More immediate halts	Tolerates brief dips
`window_threshold`	0.55	Stricter sustained quality	Allows temporary degradation
`trend_threshold`	0.15	More sensitive to coherence drops	Ignores gradual decline
`window_size`	10	Smooths noise (less reactive)	Faster response to changes

Use streaming_debug=True to inspect per-token scores and identify which mechanism triggers a halt. See Streaming Halt.
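
To build intuition for how `hard_limit`, `window_threshold`, and `window_size` interact, the following sketch replays a list of per-token scores and reports the first halt. It assumes a halt fires when a single score falls below `hard_limit`, or when the mean of the last `window_size` scores falls below `window_threshold`. These semantics are inferred from the table above rather than taken from the library, and the trend check is left out because its exact definition is not given on this page. `first_halt` is a hypothetical name.

```python
from collections import deque


def first_halt(scores, hard_limit=0.5, window_threshold=0.55, window_size=10):
    """Return (token_index, mechanism) for the first halt, or (None, None)."""
    window = deque(maxlen=window_size)  # keeps only the most recent scores
    for i, score in enumerate(scores):
        # Mechanism 1: a single score below the hard limit halts at once.
        if score < hard_limit:
            return i, "hard_limit"
        window.append(score)
        # Mechanism 2: sustained low quality across a full window.
        if len(window) == window_size and sum(window) / window_size < window_threshold:
            return i, "window_threshold"
    return None, None


# A sudden dip trips the hard limit on the second token.
print(first_halt([0.7, 0.45, 0.7]))            # (1, 'hard_limit')
# A sustained sag stays above the hard limit but drags the window mean down.
print(first_halt([0.6] * 5 + [0.52] * 10))     # (11, 'window_threshold')
```

In the second call no single score is below 0.5, so only the windowed mean can halt the stream; a larger `window_size` would delay that halt further.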

Grid-Search Example

Iterate candidate thresholds on a labeled dataset and pick the one that maximizes F1:

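from director_ai.core import CoherenceScorer, GroundTruthStore

# Assumed to be defined already:
#   store        - a populated GroundTruthStore
#   labeled_data - a list of (prompt, response, is_hallucinated) tuples
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
best_t, best_f1 = None, -1.0
for t in thresholds:
    scorer = CoherenceScorer(threshold=t, ground_truth_store=store, use_nli=True)
    tp = fp = tn = fn = 0
    for prompt, response, is_hallucinated in labeled_data:
        approved, _ = scorer.review(prompt, response)
        # Positive class = hallucination; a rejection counts as a detection.
        if is_hallucinated and not approved:
            tp += 1
        elif is_hallucinated and approved:
            fn += 1
        elif not is_hallucinated and not approved:
            fp += 1
        else:
            tn += 1
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"  threshold={t:.1f}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
    # Keep the first threshold that achieves the highest F1 so far.
    if f1 > best_f1:
        best_t, best_f1 = t, f1
print(f"best threshold={best_t}  F1={best_f1:.2f}")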

Or use the CLI: director-ai bench --dataset e2e for an automated sweep.

Domain Recommendation Table

Domain	Threshold	Rationale
Medical	0.70	Patient safety demands low false-negative rate
Legal	0.65	Regulatory compliance; moderate tolerance
Finance	0.60	Quantitative claims must be grounded
Customer support	0.50	Balanced; some creative latitude acceptable
Creative	0.40	Permissive; hallucination is less harmful

Pitfalls

Threshold too high (> 0.75): correct responses get rejected (false positives). Users see "hallucination detected" on accurate text. Reduce threshold or improve KB coverage.
Threshold too low (< 0.35): hallucinations pass through. The guardrail becomes decorative. Raise threshold or enable NLI.
Empty KB: without ground truth facts, factual divergence defaults to 0.5 (neutral). The scorer relies entirely on logical divergence. Always populate your GroundTruthStore.
Short responses: NLI models need sufficient text to make meaningful entailment judgments. Responses under 5 words may produce unreliable scores.
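
The pitfalls above can be turned into a quick pre-flight check. The sketch below is one possible shape for such a check; `preflight_warnings` is a hypothetical helper, `kb_size` is assumed to be a count of facts you obtain from your own store, and the cut-offs (0.75, 0.35, 5 words) are the ones quoted in this section.

```python
def preflight_warnings(threshold, kb_size, response, min_words=5):
    """Return a list of human-readable warnings for a scoring setup."""
    warnings = []
    if threshold > 0.75:
        warnings.append("threshold above 0.75: correct responses may be rejected")
    elif threshold < 0.35:
        warnings.append("threshold below 0.35: hallucinations may pass through")
    if kb_size == 0:
        warnings.append("empty KB: factual divergence defaults to 0.5")
    # Whitespace tokenisation is a rough proxy for word count.
    if len(response.split()) < min_words:
        warnings.append(f"response under {min_words} words: score may be unreliable")
    return warnings


for line in preflight_warnings(threshold=0.8, kb_size=0, response="Yes."):
    print(line)
# threshold above 0.75: correct responses may be rejected
# empty KB: factual divergence defaults to 0.5
# response under 5 words: score may be unreliable
```

Running a check like this before a threshold sweep helps separate configuration problems from genuine scorer errors.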