Scoring

How Coherence Scoring Works

graph LR
    subgraph "Input"
        P["Prompt"]
        R["Response"]
    end
    subgraph "Logical Signal (W=0.6)"
        NLI["NLI Model<br/>DeBERTa / ONNX"]
        HL["H_logical<br/>(contradiction prob)"]
    end
    subgraph "Factual Signal (W=0.4)"
        KB["KB Retrieval<br/>Vector / Keyword"]
        HF["H_factual<br/>(fact deviation)"]
    end
    subgraph "Decision"
        SCORE["coherence =<br/>1 - (0.6·H_L + 0.4·H_F)"]
        GATE{≥ threshold?}
    end

    P --> NLI
    R --> NLI
    NLI --> HL --> SCORE
    P --> KB
    KB --> |"facts"| NLI2["NLI(facts, response)"]
    R --> NLI2
    NLI2 --> HF --> SCORE
    SCORE --> GATE
    GATE -->|Yes| OK["Approved"]
    GATE -->|No| FAIL["Rejected + Evidence"]

    style OK fill:#2e7d32,color:#fff
    style FAIL fill:#c62828,color:#fff
    style SCORE fill:#ff8f00,color:#fff

Director-AI computes a composite coherence score from two independent signals:

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)

| Signal | Weight | Source | Measures |
|---|---|---|---|
| H_logical | 0.6 | NLI model (DeBERTa) | Contradiction probability between prompt and response |
| H_factual | 0.4 | RAG retrieval | Deviation from ground-truth knowledge base |

The score is in [0.0, 1.0]. Higher = more coherent.
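As a worked illustration of the formula, the following minimal sketch combines two divergence values into one score. The function name `composite_coherence` and the clamping step are assumptions made for illustration; they are not taken from Director-AI's source.

```python
def composite_coherence(h_logical: float, h_factual: float,
                        w_logic: float = 0.6, w_fact: float = 0.4) -> float:
    """Hypothetical helper: combine two divergences into one coherence score."""
    divergence = w_logic * h_logical + w_fact * h_factual
    # Clamp so the result stays in [0.0, 1.0] even for slightly out-of-range inputs.
    return min(1.0, max(0.0, 1.0 - divergence))


# Contradiction probability 0.9, factual deviation 0.5:
# 1 - (0.6 * 0.9 + 0.4 * 0.5) = 1 - 0.74 = 0.26
print(round(composite_coherence(0.9, 0.5), 2))
```

With the default threshold of 0.5, a score of 0.26 would be rejected.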

Heuristic Orchestration

When NLI is unavailable or disabled, CoherenceScorer keeps the same logical/factual score shape. The no-model path separates route selection from component scoring:

  • dialogue: route-specific factual divergence when a loaded NLI scorer is available for dialogue calibration.
  • summarisation: factual-only summarisation divergence when prompt-as-premise or summarisation auto-routing is active with NLI.
  • factual-only: skips logical divergence when w_logic is effectively zero.
  • parallel components: computes logical and factual heuristic divergences concurrently for the default no-model path.

The weighted combiner applies the no-KB calibration only when the factual score is neutral, no retrieval evidence is present, and the score did not come from the dialogue route. Local non-isolated regression evidence for this orchestration path is recorded by python -m benchmarks.heuristic_coherence_pipeline; the output lives at benchmarks/results/heuristic_coherence_pipeline.json and does not make a latency claim.
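The three conditions for the no-KB calibration can be read as a single predicate. The sketch below is hypothetical: the `FactualResult` type, the route labels, and the reading of "neutral" as a score of 0.5 are all assumptions, not Director-AI's internals.

```python
from dataclasses import dataclass

NEUTRAL = 0.5  # assumption: "neutral" means an uninformative midpoint score


@dataclass
class FactualResult:
    """Hypothetical container for a factual component score."""
    divergence: float
    has_evidence: bool   # True when retrieval returned supporting facts
    route: str           # e.g. "dialogue", "summarisation", "factual-only", "parallel"


def should_apply_no_kb_calibration(result: FactualResult, tol: float = 1e-6) -> bool:
    """All three documented conditions must hold at once."""
    return (
        abs(result.divergence - NEUTRAL) <= tol   # factual score is neutral
        and not result.has_evidence               # no retrieval evidence present
        and result.route != "dialogue"            # score did not come from the dialogue route
    )


print(should_apply_no_kb_calibration(FactualResult(0.5, False, "parallel")))
print(should_apply_no_kb_calibration(FactualResult(0.5, False, "dialogue")))
```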

Thresholds

| Parameter | Default | Purpose |
|---|---|---|
| threshold | 0.5 | Scores below this value are rejected |
| soft_limit | threshold + 0.1 | Scores between threshold and soft_limit fall in the warning zone |

scorer = CoherenceScorer(threshold=0.5, soft_limit=0.65)
approved, score = scorer.review(query, response)

if not approved:
    print("Rejected — below threshold")
elif score.warning:
    print("Warning — low confidence, consider verification")
else:
    print("Approved")
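The two parameters split the score range into three zones. A minimal sketch of that mapping follows; the `classify` helper is hypothetical, and the boundary handling (a score exactly at `threshold` counts as passing, one exactly at `soft_limit` counts as approved) is an assumption.

```python
from typing import Optional


def classify(score: float, threshold: float = 0.5,
             soft_limit: Optional[float] = None) -> str:
    """Hypothetical helper mapping a coherence score to one of three zones."""
    if soft_limit is None:
        soft_limit = threshold + 0.1  # documented default
    if score < threshold:
        return "rejected"
    if score < soft_limit:
        return "warning"
    return "approved"


for s in (0.42, 0.55, 0.80):
    print(s, classify(s))
```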

NLI Backends

Heuristic (default, no GPU)

Word-overlap scoring. Fast (<1ms) but limited to vocabulary-level detection.

scorer = CoherenceScorer(use_nli=False)
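To show what "vocabulary-level detection" means in practice, here is a minimal word-overlap sketch. It uses Jaccard similarity over word sets, which is an assumption chosen for illustration; the heuristic that Director-AI actually ships may differ.

```python
import re


def word_overlap_divergence(premise: str, response: str) -> float:
    """Hypothetical heuristic: 1 - Jaccard similarity over lowercase word sets."""
    a = set(re.findall(r"\w+", premise.lower()))
    b = set(re.findall(r"\w+", response.lower()))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Unrelated text shares no vocabulary, so divergence is maximal.
print(word_overlap_divergence("The drug is safe for children.",
                              "Paris hosts many museums."))
# A negation flips the meaning but adds only one word, so divergence stays low.
print(word_overlap_divergence("The drug is safe for children.",
                              "The drug is not safe for children."))
```

The second call is the failure mode to keep in mind: a contradiction that reuses the premise's vocabulary looks nearly identical to it under word overlap.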

With the NLI model enabled, the scorer reaches 75.6% per-dataset mean balanced accuracy (BA) on AggreFact. It uses an instruction template plus SummaC source chunking.

scorer = CoherenceScorer(use_nli=True)

| Backend | Latency | Accuracy |
|---|---|---|
| ONNX GPU batch | 14.6 ms/pair | 75.6% BA |
| PyTorch GPU batch | 19 ms/pair | 75.6% BA |
| PyTorch GPU sequential | 197 ms/pair | 75.6% BA |
| ONNX CPU batch | 383 ms/pair | 75.6% BA |

Embedding scorer (no GPU needed)

Roughly 65% balanced accuracy at 3 ms/pair on CPU. Good as a screening stage before NLI.

scorer = CoherenceScorer(scorer_backend="embed")
# requires: pip install director-ai[embed]
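One way to use a cheap scorer for screening is a two-stage cascade: accept the cheap score when it is clearly high or clearly low, and escalate only the ambiguous middle band to the slower model. The sketch below is an illustration under assumptions; `cascade_score`, the stub scorers, and the band limits are hypothetical and not part of Director-AI.

```python
from typing import Callable, Tuple

Scorer = Callable[[str, str], float]  # (prompt, response) -> coherence in [0, 1]


def cascade_score(prompt: str, response: str, cheap: Scorer, expensive: Scorer,
                  low: float = 0.35, high: float = 0.75) -> Tuple[float, str]:
    """Return (score, stage). Band limits are arbitrary placeholders."""
    score = cheap(prompt, response)
    if score <= low or score >= high:
        return score, "cheap"          # confident enough, skip the slow model
    return expensive(prompt, response), "expensive"


# Stub scorers standing in for an embedding backend and an NLI backend.
def embed_stub(prompt: str, response: str) -> float:
    return 0.5   # ambiguous on purpose, forces escalation


def nli_stub(prompt: str, response: str) -> float:
    return 0.2


print(cascade_score("Capital of France?", "Paris is in Germany.", embed_stub, nli_stub))
```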

Rules engine (zero ML, <1ms)

8 configurable rules (entity grounding, numeric consistency, negation flip, etc.). Guardrails AI-style explicit control. Ships in the base package.

scorer = CoherenceScorer(scorer_backend="rules")
# no extra install needed

MiniCheck (lighter alternative)

72.6% balanced accuracy. Lower VRAM (~400MB vs ~1.5GB).

scorer = CoherenceScorer(
    use_nli=True,
    nli_model="lytang/MiniCheck-DeBERTa-L",
)

LiteScorer (CPU-only heuristic baseline)

Word overlap + length ratio + negation heuristics. <0.5 ms/pair, no dependencies.

scorer = CoherenceScorer(scorer_backend="lite")

Distilled NLI Lite

The nli-lite backend is the Lite Scorer v2 distillation track. It is available as an experimental backend for local student artefacts and readiness testing, but public accuracy or latency claims require a held-out evaluation packet, ONNX export evidence, quantized latency evidence, a model card, a benchmark claim review, and the validator gate in tools/validate_lite_scorer_v2_plan.py. The current evidence placeholder is benchmarks/lite_scorer_v2_evidence_packet.toml; all evidence statuses remain pending until a trained student artefact is evaluated.

Use the validator in two modes:

python tools/validate_lite_scorer_v2_plan.py
python tools/validate_lite_scorer_v2_plan.py --require-recorded-evidence

The first command preserves the no-claim placeholder and claim-surface guard. The second command is the R2 release-evidence gate: it requires recorded or validated student, teacher, ONNX, held-out evaluation, quantized latency, model-card, and benchmark-review statuses. A pending packet may stay in the repository for planning, but it cannot close Lite Scorer v2 evidence.

After training a student artefact, measure it with tools/eval_lite_scorer_v2.py, then record the evidence packet with tools/record_lite_scorer_v2_evidence.py. The evaluator calculates held-out balanced accuracy, threshold, and latency percentiles. The recorder hashes the student, teacher, ONNX, model-card, and benchmark-claim review artefacts, can consume the evaluator JSON output via --eval-result, writes the measured values, and re-runs the Lite Scorer v2 validator before keeping the packet.

The reproducible command plan is defined in benchmarks/lite_scorer_v2_run_manifest.toml and emitted by tools/plan_lite_scorer_v2_run.py. The planner prints held-out dataset build, train, ONNX export, held-out evaluation, and evidence-recording argv arrays only; it does not mark the evidence packet as recorded or make any public score claim.

tools/build_lite_scorer_v2_heldout.py builds the held-out JSONL from the local evaluation split with deterministic sampling, balanced supported and unsupported labels, source counts, and a SHA-256 provenance manifest.

training/train_distillation.py records the deterministic training seed, selected row counts, model parameter counts, device, and claim boundary in training_run_manifest.json inside the student output directory. Its --device auto mode probes CUDA before loading training data or models and falls back to CPU when the local PyTorch build cannot execute on the visible GPU.

The run manifest points at the local training/output/minilm-safetensors student base so training does not depend on remote model lookup during evidence production.

Customizing Weights

Adjust the balance between logical and factual signals:

# Fact-heavy (for KB-grounded use cases)
scorer = CoherenceScorer(w_logic=0.3, w_fact=0.7)

# Logic-heavy (for free-form reasoning)
scorer = CoherenceScorer(w_logic=0.8, w_fact=0.2)

# Summarization (factual only, no logic duplication)
scorer = CoherenceScorer(w_logic=0.0, w_fact=1.0)

Constraint: w_logic + w_fact must equal 1.0.
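To see how the weighting changes the outcome, the sketch below applies the three presets to the same pair of signals. Both helpers are hypothetical: the tolerance-based check and the `ValueError` are assumptions about how such a constraint could be enforced, not a description of Director-AI's behaviour.

```python
import math


def validate_weights(w_logic: float, w_fact: float) -> None:
    """Hypothetical check for the constraint w_logic + w_fact == 1.0."""
    # Tolerant comparison, since decimal weights are not exact in binary floats.
    if not math.isclose(w_logic + w_fact, 1.0, abs_tol=1e-9):
        raise ValueError(f"w_logic + w_fact must equal 1.0, got {w_logic + w_fact}")


def coherence(h_logical: float, h_factual: float,
              w_logic: float, w_fact: float) -> float:
    validate_weights(w_logic, w_fact)
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)


# Same signals (logically consistent, factually off), three weightings.
h_logical, h_factual = 0.2, 0.8
presets = {
    "fact-heavy": (0.3, 0.7),
    "logic-heavy": (0.8, 0.2),
    "summarization": (0.0, 1.0),
}
for name, (w_l, w_f) in presets.items():
    print(f"{name}: {coherence(h_logical, h_factual, w_l, w_f):.2f}")
# fact-heavy: 0.38, logic-heavy: 0.68, summarization: 0.20
```

With the default threshold of 0.5, only the logic-heavy preset would approve this response, which is why KB-grounded deployments lean on the factual weight.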

Score Caching

Enable caching to avoid redundant NLI inference (60-80% cost reduction in streaming):

scorer = CoherenceScorer(
    cache_size=2048,
    cache_ttl=300.0,
)

# Monitor cache
print(f"Hit rate: {scorer.cache.hit_rate:.1%}")
print(f"Size: {scorer.cache.size}")
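For intuition about what `cache_size` and `cache_ttl` control, here is a minimal sketch of a size-bounded cache with per-entry expiry and a hit-rate counter. `TTLScoreCache` is a hypothetical class written for this page; the LRU eviction policy and the injectable clock are assumptions, and Director-AI's own cache may be built differently.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class TTLScoreCache:
    """Hypothetical LRU cache with per-entry expiry; not Director-AI's class."""

    def __init__(self, max_size: int = 2048, ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, score)
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        self._entries.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return None

    def put(self, key: Hashable, score: float) -> None:
        self._entries[key] = (self._clock(), score)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def size(self) -> int:
        return len(self._entries)


cache = TTLScoreCache(max_size=2, ttl=300.0)
key = ("What is 2+2?", "The answer is 4.")
if cache.get(key) is None:   # first lookup misses
    cache.put(key, 0.93)     # placeholder score, not a measured value
cache.get(key)               # second lookup hits
print(f"Hit rate: {cache.hit_rate:.1%}")
```

Streaming workloads tend to re-score overlapping prompt/response pairs, which is where a cache of this kind pays off.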

Batch Scoring

Score multiple pairs in 2 GPU forward passes (when NLI is available):

items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)

Chunked NLI

For long documents, sentence-level scoring catches localized hallucinations:

divergence = scorer._nli.score_chunked(
    premise="Paris is the capital of France. The Eiffel Tower is in Paris.",
    hypothesis="Berlin is the capital of France. The Eiffel Tower is in Berlin.",
)

Max-aggregation: the worst per-sentence contradiction drives the final score.
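The sketch below illustrates max-aggregation with a toy scoring function in place of a real NLI model. Everything in it is an assumption made for illustration: the naive sentence splitter, the choice to chunk the hypothesis rather than the premise, and the `toy_divergence` stand-in. It does not reproduce `score_chunked`.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_chunked_max(premise: str, hypothesis: str,
                      pair_divergence: Callable[[str, str], float]) -> float:
    """Score each hypothesis sentence against the full premise; keep the worst."""
    sentences = split_sentences(hypothesis)
    if not sentences:
        return 0.0
    return max(pair_divergence(premise, s) for s in sentences)


def toy_divergence(premise: str, sentence: str) -> float:
    """Toy stand-in for an NLI call: share of sentence words absent from the premise."""
    known = set(re.findall(r"\w+", premise.lower()))
    words = re.findall(r"\w+", sentence.lower())
    return sum(w not in known for w in words) / len(words) if words else 0.0


premise = "Paris is the capital of France. The Eiffel Tower is in Paris."
hypothesis = "Paris is the capital of France. The Eiffel Tower is in Berlin."
print(score_chunked_max(premise, hypothesis, toy_divergence))
```

Here the first sentence scores 0 and the second scores 1/6, so the maximum reports 1/6 where a mean would halve it. That is the reason a single bad sentence in a long document is not averaged away.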

Next Steps