Threshold Tuning Guide¶

Decision Flowchart¶

flowchart TD
    START["Choose scoring mode"] --> Q1{"Factual precision<br/>required?"}
    Q1 -->|"No (dev/prototype)"| HEUR["Heuristic<br/>use_nli=False<br/><0.1ms, moderate accuracy"]
    Q1 -->|"Yes"| Q2{"GPU available?"}
    Q2 -->|"No"| CPU["ONNX CPU batch<br/>383ms/pair, 75.6% BA"]
    Q2 -->|"Yes"| Q3{"Long documents?"}
    Q3 -->|"No"| NLI["NLI (ONNX GPU)<br/>14.6ms/pair, 75.6% BA"]
    Q3 -->|"Yes"| CHUNK["Chunked NLI<br/>30-400ms, catches localized hallucinations"]
    Q2 -->|"Yes + high stakes"| Q4{"Extra precision needed?"}
    Q4 -->|"Yes"| HYBRID["Hybrid (NLI + LLM judge)<br/>200-500ms, ~78% est."]

    HEUR --> TUNE["Tune threshold on your data"]
    CPU --> TUNE
    NLI --> TUNE
    CHUNK --> TUNE
    HYBRID --> TUNE

    style HEUR fill:#616161,color:#fff
    style NLI fill:#2e7d32,color:#fff
    style CHUNK fill:#1565c0,color:#fff
    style HYBRID fill:#ff8f00,color:#fff
    style CPU fill:#4a148c,color:#fff

When to Use Heuristic vs NLI¶

Mode	Latency	Accuracy	Best For
Heuristic only (`use_nli=False`)	< 0.1 ms	Moderate — catches obvious off-topic responses	High-throughput, cost-sensitive, or when KB coverage is strong
NLI (`use_nli=True`)	15-200 ms (GPU/CPU)	High — catches subtle contradictions	Medical, legal, finance, or any domain where factual precision matters
Chunked NLI (`score_chunked`)	30-400 ms	Highest — catches localized hallucinations	Long responses where a single hallucinated sentence hides in correct text
Hybrid (`scorer_backend="hybrid"`)	200-500 ms	~78% est.	High-stakes pipelines, dialogue tasks where extra precision is needed

Rule of thumb: start with heuristic for development. Switch to NLI for production if your domain has high factual stakes. Summarisation FPR at 10.5% (v3.5, bidirectional NLI + baseline calibration). NLI-only scoring (without KB) produces scores in [0.25, 0.55] and has high FPR on domain tasks (100% on PubMedQA and FinanceBench). Domain profiles require KB grounding or customer-specific calibration — the threshold alone is not sufficient. Tune on your own data.

For per-backend latency numbers and cadence combinations, see Streaming Overhead.

Score Components¶

The coherence score is a weighted combination:

score = 1 - (w_logic * h_logical + w_fact * h_factual)

Where:

h_logical — NLI-derived logical divergence (0 = entailed, 1 = contradicted)
h_factual — NLI-derived factual divergence from KB retrieval
w_logic, w_fact — configurable weights (default 0.6, 0.4; must sum to 1.0)

Adjusting weights:

High w_logic → penalizes logical contradictions more (good for reasoning tasks)
High w_fact → penalizes factual divergence from KB more (good for RAG pipelines)
Both low → relies primarily on heuristic word overlap

Running a Threshold Sweep¶

python -m benchmarks.e2e_eval --sweep-thresholds --max-samples 200

This scores all samples once, then evaluates catch rate / FPR at thresholds from 0.30 to 0.80:

 Threshold    Catch      FPR     Prec       F1
      0.30   89.2%    45.1%    66.3%    76.1%
      0.35   82.4%    32.0%    72.0%    76.9%
      0.40   74.1%    21.3%    77.8%    75.9%
      0.45   65.2%    14.5%    81.8%    72.5%
      0.50   55.0%     8.7%    86.3%    67.2%
      ...

Pick the threshold where F1 is maximized for your risk tolerance.

Decision Matrix¶

Symptom	Action
High false-positive rate (correct responses rejected)	Lower `coherence_threshold` by 0.05-0.10
Missing hallucinations (low catch rate)	Raise `coherence_threshold` by 0.05-0.10
Good catch rate but noisy warnings	Raise `soft_limit` closer to threshold
Streaming halts too aggressively	Increase `window_size` or lower `trend_threshold`
Streaming misses gradual degradation	Decrease `window_size` or raise `trend_threshold`
NLI scores all cluster near 0.5	Check KB coverage — scorer needs grounding facts to differentiate

Domain-Specific Presets¶

Medical¶

scorer = CoherenceScorer(
    threshold=0.6,
    soft_limit=0.7,
    use_nli=True,
    ground_truth_store=medical_kb,
)

See Medical Cookbook and examples/medical_guard.py.

Customer Support¶

scorer = CoherenceScorer(
    threshold=0.5,
    soft_limit=0.6,
    use_nli=True,
    ground_truth_store=support_kb,
)

See Customer Support Cookbook and examples/customer_support_guard.py.

Finance¶

scorer = CoherenceScorer(
    threshold=0.30,
    soft_limit=0.40,
    use_nli=True,
)

See Finance Cookbook.

Streaming Threshold Tuning¶

For streaming workloads, you tune 4 parameters:

Parameter	Default	Effect of Raising	Effect of Lowering
`hard_limit`	0.5	More immediate halts	Tolerates brief dips
`window_threshold`	0.55	Stricter sustained quality	Allows temporary degradation
`trend_threshold`	0.15	More sensitive to coherence drops	Ignores gradual decline
`window_size`	10	Smooths noise (less reactive)	Faster response to changes

Use streaming_debug=True to inspect per-token scores and identify which mechanism triggers. See Streaming Halt.

Grid-Search Example¶

Iterate candidate thresholds on a labeled dataset and pick the one that maximizes F1:

from director_ai.core import CoherenceScorer, GroundTruthStore

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
    scorer = CoherenceScorer(threshold=t, ground_truth_store=store, use_nli=True)
    tp = fp = tn = fn = 0
    for prompt, response, is_hallucinated in labeled_data:
        approved, _ = scorer.review(prompt, response)
        if is_hallucinated and not approved:
            tp += 1
        elif is_hallucinated and approved:
            fn += 1
        elif not is_hallucinated and not approved:
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    print(f"  threshold={t:.1f}  P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")

Or write a ready-to-use profile overlay from a labelled JSONL file:

director-ai tune --dataset my_eval.jsonl --profile medical --output medical_tuned.yaml

The output includes coherence_threshold, hard_limit, soft_limit, w_logic, w_fact, and an extra block with the tuning metrics and base profile. Load it with DirectorConfig.from_yaml(...) or merge it over DirectorConfig.from_profile("medical").

The tuner also prints and embeds a confidence report. It records:

the selected threshold and weight pair
95% Wilson score confidence intervals for balanced accuracy, precision, recall, and F1
the balanced-accuracy margin over the next-best candidate
class balance and the selected confusion matrix
false-positive and false-negative trade-offs for the top candidate thresholds

Human-Gated Adaptive Recommendations¶

For deployments with reviewed feedback, AdaptiveThresholdLearner can replay labelled score feedback across candidate thresholds and estimate each threshold with Beta-Bernoulli posteriors. It returns a recommendation and rollback threshold; it never changes runtime policy by itself.

Use it only after enough human-labelled examples exist, and constrain maximum false-positive and false-negative rates for the domain:

from director_ai.core import AdaptiveThresholdLearner, ThresholdFeedback

learner = AdaptiveThresholdLearner(
    candidate_thresholds=[0.3, 0.4, 0.5, 0.6],
    current_threshold=0.4,
    min_samples=50,
    max_false_negative_rate=0.05,
)
learner.observe_batch(
    [ThresholdFeedback(score=row.score, human_approved=row.human_approved) for row in rows]
)
recommendation = learner.recommend()

Route recommendation.to_profile_overlay() through human review/change control before deployment. Keep adaptive_rollback_threshold with the approved config so the previous threshold can be restored immediately. - boundary examples with the score that would flip each decision

Treat a low confidence level as a warning to add more labelled examples, rebalance the dataset, or inspect the boundary examples before enforcing the overlay in production.

Use director-ai bench --dataset e2e when you need the built-in regression benchmark sweep instead of a customer-specific overlay.

Domain Recommendation Table¶

Domain	Threshold	Rationale
Medical	0.30	NLI-only FPR=100% at this threshold. KB grounding required.
Legal	0.30	Not yet validated (no benchmark artifact). Aligned with medical/finance.
Finance	0.30	NLI-only FPR=100% at this threshold. KB grounding required.
Customer support	0.55	Balanced; some creative latitude acceptable
Creative	0.40	Permissive; hallucination is less harmful

Pitfalls¶

Threshold too high (> 0.75): correct responses get rejected (false positives). Users see "hallucination detected" on accurate text. Reduce threshold or improve KB coverage.
Threshold too low (< 0.35): hallucinations pass through. The guardrail becomes decorative. Raise threshold or enable NLI.
Empty KB: without ground truth facts, factual divergence defaults to 0.5 (neutral). The scorer relies entirely on logical divergence. Always populate your GroundTruthStore.
Short responses: NLI models need sufficient text to make meaningful entailment judgments. Responses under 5 words may produce unreliable scores.