Skip to content

Scoring

How Coherence Scoring Works

Director-AI computes a composite coherence score from two independent signals:

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)
Signal Weight Source Measures
H_logical 0.6 NLI model (DeBERTa) Contradiction probability between prompt and response
H_factual 0.4 RAG retrieval Deviation from ground-truth knowledge base

The score is in [0.0, 1.0]. Higher = more coherent.

Thresholds

Parameter Default Purpose
threshold 0.5 Below this = rejected
soft_limit threshold + 0.1 Between threshold and soft_limit = warning zone
scorer = CoherenceScorer(threshold=0.5, soft_limit=0.65)
approved, score = scorer.review(query, response)

if not approved:
    print("Rejected — below threshold")
elif score.warning:
    print("Warning — low confidence, consider verification")
else:
    print("Approved")

NLI Backends

Heuristic (default, no GPU)

Word-overlap scoring. Fast (<1ms) but limited to vocabulary-level detection.

scorer = CoherenceScorer(use_nli=False)

75.8% balanced accuracy on AggreFact. Uses instruction template + SummaC source chunking.

scorer = CoherenceScorer(use_nli=True)
Backend Latency Accuracy
ONNX GPU batch 14.6 ms/pair 75.8% BA
PyTorch GPU batch 19 ms/pair 75.8% BA
PyTorch GPU sequential 197 ms/pair 75.8% BA
ONNX CPU batch 383 ms/pair 75.8% BA

MiniCheck (lighter alternative)

72.6% balanced accuracy. Lower VRAM (~400MB vs ~1.5GB).

scorer = CoherenceScorer(
    use_nli=True,
    nli_model="lytang/MiniCheck-DeBERTa-L",
)

LiteScorer (CPU-only, ~65% accuracy)

Word overlap + length ratio + negation heuristics. <0.5 ms/pair, no dependencies.

scorer = CoherenceScorer(scorer_backend="lite")

Customizing Weights

Adjust the balance between logical and factual signals:

# Fact-heavy (for KB-grounded use cases)
scorer = CoherenceScorer(w_logic=0.3, w_fact=0.7)

# Logic-heavy (for free-form reasoning)
scorer = CoherenceScorer(w_logic=0.8, w_fact=0.2)

# Summarization (factual only, no logic duplication)
scorer = CoherenceScorer(w_logic=0.0, w_fact=1.0)

Constraint: w_logic + w_fact must equal 1.0.

Score Caching

Enable caching to avoid redundant NLI inference (60-80% cost reduction in streaming):

scorer = CoherenceScorer(
    cache_size=2048,
    cache_ttl=300.0,
)

# Monitor cache
print(f"Hit rate: {scorer.cache.hit_rate:.1%}")
print(f"Size: {scorer.cache.size}")

Batch Scoring

Score multiple pairs in 2 GPU forward passes (when NLI is available):

items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)

Chunked NLI

For long documents, sentence-level scoring catches localized hallucinations:

divergence = scorer._nli.score_chunked(
    premise="Paris is the capital of France. The Eiffel Tower is in Paris.",
    hypothesis="Berlin is the capital of France. The Eiffel Tower is in Berlin.",
)

Max-aggregation: the worst per-sentence contradiction drives the final score.

Next Steps