
Scoring

How Coherence Scoring Works

graph LR
    subgraph "Input"
        P["Prompt"]
        R["Response"]
    end
    subgraph "Logical Signal (W=0.6)"
        NLI["NLI Model<br/>DeBERTa / ONNX"]
        HL["H_logical<br/>(contradiction prob)"]
    end
    subgraph "Factual Signal (W=0.4)"
        KB["KB Retrieval<br/>Vector / Keyword"]
        HF["H_factual<br/>(fact deviation)"]
    end
    subgraph "Decision"
        SCORE["coherence =<br/>1 - (0.6·H_L + 0.4·H_F)"]
        GATE{≥ threshold?}
    end

    P --> NLI
    R --> NLI
    NLI --> HL --> SCORE
    P --> KB
    KB --> |"facts"| NLI2["NLI(facts, response)"]
    R --> NLI2
    NLI2 --> HF --> SCORE
    SCORE --> GATE
    GATE -->|Yes| OK["Approved"]
    GATE -->|No| FAIL["Rejected + Evidence"]

    style OK fill:#2e7d32,color:#fff
    style FAIL fill:#c62828,color:#fff
    style SCORE fill:#ff8f00,color:#fff

Director-AI computes a composite coherence score from two independent signals:

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)

Signal    | Weight | Source              | Measures
H_logical | 0.6    | NLI model (DeBERTa) | Contradiction probability between prompt and response
H_factual | 0.4    | RAG retrieval       | Deviation from ground-truth knowledge base

The score is in [0.0, 1.0]. Higher = more coherent.
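The composite formula above is plain weighted arithmetic; a minimal sketch (the `coherence` helper is illustrative, not part of the library API):

```python
def coherence(h_logical: float, h_factual: float,
              w_logic: float = 0.6, w_fact: float = 0.4) -> float:
    """Composite coherence: 1 minus the weighted sum of divergences.

    Both divergences are assumed to lie in [0, 1], so the result does too.
    """
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)

# A response with 20% contradiction probability and 10% fact deviation:
coherence(0.2, 0.1)  # 1.0 - (0.6*0.2 + 0.4*0.1) = 0.84
```

With zero divergence on both signals the score is exactly 1.0; with maximal divergence on both it is 0.0.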

Thresholds

Parameter  | Default         | Purpose
threshold  | 0.5             | Below this = rejected
soft_limit | threshold + 0.1 | Between threshold and soft_limit = warning zone

scorer = CoherenceScorer(threshold=0.5, soft_limit=0.65)
approved, score = scorer.review(query, response)

if not approved:
    print("Rejected — below threshold")
elif score.warning:
    print("Warning — low confidence, consider verification")
else:
    print("Approved")

NLI Backends

Heuristic (default, no GPU)

Word-overlap scoring. Fast (<1ms) but limited to vocabulary-level detection.

scorer = CoherenceScorer(use_nli=False)
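The exact heuristic is not spelled out here; a divergence of the same vocabulary-overlap flavor can be sketched as follows (`overlap_divergence` is a hypothetical illustration, not the shipped implementation):

```python
def overlap_divergence(premise: str, hypothesis: str) -> float:
    """Hypothetical word-overlap divergence: the fraction of hypothesis
    tokens that never appear in the premise (0.0 = full overlap)."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if not h:
        return 0.0
    return 1.0 - len(h & p) / len(h)

overlap_divergence("the sky is blue", "the sky is blue")    # 0.0
overlap_divergence("the sky is blue", "grass looks green")  # 1.0
```

This is why the heuristic is vocabulary-level only: a paraphrase with different words scores as divergent, and a contradiction that reuses the same words scores as coherent.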

NLI (DeBERTa)

75.6% per-dataset mean balanced accuracy (BA) on AggreFact. Uses an instruction template plus SummaC source chunking.

scorer = CoherenceScorer(use_nli=True)

Backend                | Latency      | Accuracy
ONNX GPU batch         | 14.6 ms/pair | 75.6% BA
PyTorch GPU batch      | 19 ms/pair   | 75.6% BA
PyTorch GPU sequential | 197 ms/pair  | 75.6% BA
ONNX CPU batch         | 383 ms/pair  | 75.6% BA

Embedding scorer (no GPU needed)

~65% balanced accuracy at 3ms/pair on CPU. Good for screening before NLI.

scorer = CoherenceScorer(scorer_backend="embed")
# requires: pip install director-ai[embed]

Rules engine (zero ML, <1ms)

8 configurable rules (entity grounding, numeric consistency, negation flip, etc.). Guardrails AI-style explicit control. Ships in the base package.

scorer = CoherenceScorer(scorer_backend="rules")
# no extra install needed
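The rule set itself is configurable; as an illustration of what one such rule (negation flip) might check, here is a hypothetical sketch, not the packaged rule:

```python
NEGATIONS = {"not", "no", "never", "cannot", "none"}

def negation_flipped(premise: str, hypothesis: str) -> bool:
    """Flag pairs where exactly one side is negated -- a cheap,
    explainable signal that the response contradicts the prompt."""
    def negated(text: str) -> bool:
        return any(tok in NEGATIONS for tok in text.lower().split())
    return negated(premise) != negated(hypothesis)

negation_flipped("Paris is the capital.", "Paris is not the capital.")  # True
negation_flipped("Paris is the capital.", "Paris is the capital.")      # False
```

Explicit rules like this trade recall for transparency: every rejection can cite the specific rule that fired.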

MiniCheck (lighter alternative)

72.6% balanced accuracy. Lower VRAM (~400MB vs ~1.5GB).

scorer = CoherenceScorer(
    use_nli=True,
    nli_model="lytang/MiniCheck-DeBERTa-L",
)

LiteScorer (CPU-only, ~65% accuracy)

Word overlap + length ratio + negation heuristics. <0.5 ms/pair, no dependencies.

scorer = CoherenceScorer(scorer_backend="lite")

Customizing Weights

Adjust the balance between logical and factual signals:

# Fact-heavy (for KB-grounded use cases)
scorer = CoherenceScorer(w_logic=0.3, w_fact=0.7)

# Logic-heavy (for free-form reasoning)
scorer = CoherenceScorer(w_logic=0.8, w_fact=0.2)

# Summarization (factual only, no logic duplication)
scorer = CoherenceScorer(w_logic=0.0, w_fact=1.0)

Constraint: w_logic + w_fact must equal 1.0.
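To see how the weights shift the outcome, apply the composite formula to the same pair of divergences under different presets (direct arithmetic, assuming the formula above; the `score` helper is illustrative):

```python
h_logical, h_factual = 0.3, 0.05  # fixed divergences, varying emphasis

def score(w_logic: float, w_fact: float) -> float:
    # weights must sum to 1.0, per the constraint above
    assert abs(w_logic + w_fact - 1.0) < 1e-9, "w_logic + w_fact must equal 1.0"
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)

score(0.6, 0.4)  # default:    0.80
score(0.3, 0.7)  # fact-heavy: 0.875 (downweights the logical divergence)
score(0.8, 0.2)  # logic-heavy: 0.75 (amplifies it)
```

The same response can clear the threshold under one preset and fail under another, so pick weights to match what your use case actually penalizes.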

Score Caching

Enable caching to avoid redundant NLI inference (60-80% cost reduction in streaming):

scorer = CoherenceScorer(
    cache_size=2048,
    cache_ttl=300.0,
)

# Monitor cache
print(f"Hit rate: {scorer.cache.hit_rate:.1%}")
print(f"Size: {scorer.cache.size}")
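Conceptually the cache keys on the (prompt, response) pair and combines LRU eviction with a TTL; a minimal sketch of that idea (a hypothetical `ScoreCache`, not the library's internal class):

```python
import time
from collections import OrderedDict

class ScoreCache:
    """Hypothetical LRU + TTL cache keyed on (prompt, response) pairs."""

    def __init__(self, max_size: int = 2048, ttl: float = 300.0):
        self.max_size, self.ttl = max_size, ttl
        self._data: OrderedDict = OrderedDict()
        self.hits = self.misses = 0

    def get(self, prompt: str, response: str):
        entry = self._data.get((prompt, response))
        if entry and time.monotonic() - entry[1] < self.ttl:
            self._data.move_to_end((prompt, response))  # refresh LRU position
            self.hits += 1
            return entry[0]
        self.misses += 1
        return None

    def put(self, prompt: str, response: str, score: float) -> None:
        self._data[(prompt, response)] = (score, time.monotonic())
        self._data.move_to_end((prompt, response))
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Streaming pipelines re-score overlapping prefixes of the same response, which is where the 60-80% reduction comes from: repeated pairs are served from the cache instead of re-running NLI.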

Batch Scoring

Score multiple pairs in 2 GPU forward passes (when NLI is available):

items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)

Chunked NLI

For long documents, sentence-level scoring catches localized hallucinations:

divergence = scorer._nli.score_chunked(
    premise="Paris is the capital of France. The Eiffel Tower is in Paris.",
    hypothesis="Berlin is the capital of France. The Eiffel Tower is in Berlin.",
)

Max-aggregation: the worst per-sentence contradiction drives the final score.
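That aggregation step is simple to state precisely: split the hypothesis into sentences, score each against the premise, and keep the maximum. A sketch (naive sentence splitter for illustration; real pipelines would use a proper tokenizer, and the per-sentence scores would come from your NLI backend):

```python
import re

def split_sentences(text: str) -> list[str]:
    # naive splitter: break after ., !, or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunked_divergence(scores_per_sentence: list[float]) -> float:
    """Max-aggregation: the single worst sentence sets the document score."""
    return max(scores_per_sentence, default=0.0)

sents = split_sentences(
    "Berlin is the capital of France. The Eiffel Tower is in Berlin."
)
# score each sentence against the premise with the NLI model, then:
chunked_divergence([0.02, 0.97])  # 0.97 -- one bad sentence dominates
```

Averaging instead of taking the max would let one hallucinated sentence hide inside a long, otherwise-faithful document, which is exactly the failure mode chunked scoring exists to catch.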

Next Steps