Scoring¶
How Coherence Scoring Works¶
graph LR
subgraph "Input"
P["Prompt"]
R["Response"]
end
subgraph "Logical Signal (W=0.6)"
NLI["NLI Model<br/>DeBERTa / ONNX"]
HL["H_logical<br/>(contradiction prob)"]
end
subgraph "Factual Signal (W=0.4)"
KB["KB Retrieval<br/>Vector / Keyword"]
HF["H_factual<br/>(fact deviation)"]
end
subgraph "Decision"
SCORE["coherence =<br/>1 - (0.6·H_L + 0.4·H_F)"]
GATE{≥ threshold?}
end
P --> NLI
R --> NLI
NLI --> HL --> SCORE
P --> KB
KB --> |"facts"| NLI2["NLI(facts, response)"]
R --> NLI2
NLI2 --> HF --> SCORE
SCORE --> GATE
GATE -->|Yes| OK["Approved"]
GATE -->|No| FAIL["Rejected + Evidence"]
style OK fill:#2e7d32,color:#fff
style FAIL fill:#c62828,color:#fff
style SCORE fill:#ff8f00,color:#fff
Director-AI computes a composite coherence score from two independent signals:
| Signal | Weight | Source | Measures |
|---|---|---|---|
| H_logical | 0.6 | NLI model (DeBERTa) | Contradiction probability between prompt and response |
| H_factual | 0.4 | RAG retrieval | Deviation from ground-truth knowledge base |
The score is in [0.0, 1.0]. Higher = more coherent.
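As a minimal sketch of the formula shown in the diagram, coherence = 1 - (0.6·H_L + 0.4·H_F), the helper below combines two divergence values into one score. The name `composite_coherence` and the clamping to [0, 1] are illustrative assumptions, not part of the Director-AI API.

```python
def composite_coherence(
    h_logical: float,
    h_factual: float,
    w_logic: float = 0.6,
    w_fact: float = 0.4,
) -> float:
    """Combine two divergences in [0, 1] into a coherence score in [0, 1]."""
    divergence = w_logic * h_logical + w_fact * h_factual
    # Clamping is an assumption: it guards against inputs slightly out of range.
    return min(1.0, max(0.0, 1.0 - divergence))


# Low contradiction probability, moderate factual deviation:
# 1 - (0.6 * 0.1 + 0.4 * 0.5) = 1 - 0.26 = 0.74
score = composite_coherence(h_logical=0.1, h_factual=0.5)
print(f"coherence = {score:.2f}")
```

Because both signals are divergences, a higher H value always lowers the final score.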
Heuristic Orchestration¶
When NLI is unavailable or disabled, CoherenceScorer still uses the same
logical/factual score shape. The no-model path now separates route selection
from component scoring:
- dialogue: route-specific factual divergence when a loaded NLI scorer is available for dialogue calibration.
- summarisation: factual-only summarisation divergence when prompt-as-premise or summarisation auto-routing is active with NLI.
- factual-only: skips logical divergence when w_logic is effectively zero.
- parallel components: computes logical and factual heuristic divergences concurrently for the default no-model path.
The weighted combiner applies the no-KB calibration only when the factual score
is neutral, no retrieval evidence is present, and the score did not come from
the dialogue route. Local non-isolated regression evidence for this
orchestration path is recorded by
python -m benchmarks.heuristic_coherence_pipeline; the output lives at
benchmarks/results/heuristic_coherence_pipeline.json and does not make a
latency claim.
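The gating rule for the no-KB calibration reads as a three-part predicate, sketched below. The function name, the route strings, and the choice of 0.5 as the "neutral" factual value are assumptions made for illustration. The calibration itself is not described on this page and is not modelled here.

```python
import math

NEUTRAL_FACTUAL = 0.5  # assumption: the value treated as a "neutral" factual score


def applies_no_kb_calibration(h_factual: float, has_evidence: bool, route: str) -> bool:
    """Return True only when all three conditions named in the text hold."""
    factual_is_neutral = math.isclose(h_factual, NEUTRAL_FACTUAL, abs_tol=1e-9)
    return factual_is_neutral and not has_evidence and route != "dialogue"


# Neutral factual score, no retrieval evidence, non-dialogue route: calibration applies.
print(applies_no_kb_calibration(0.5, has_evidence=False, route="default"))
# Same inputs on the dialogue route: calibration is skipped.
print(applies_no_kb_calibration(0.5, has_evidence=False, route="dialogue"))
```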
Thresholds¶
| Parameter | Default | Purpose |
|---|---|---|
| threshold | 0.5 | Below this = rejected |
| soft_limit | threshold + 0.1 | Between threshold and soft_limit = warning zone |
scorer = CoherenceScorer(threshold=0.5, soft_limit=0.65)
approved, score = scorer.review(query, response)
if not approved:
    print("Rejected: below threshold")
elif score.warning:
    print("Warning: low confidence, consider verification")
else:
    print("Approved")
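The three zones implied by the table can be sketched as a standalone function. This is an illustration under assumptions: the name `classify` is hypothetical, and treating the warning zone as `threshold <= score < soft_limit` is one reading of the table, not confirmed library behaviour.

```python
from typing import Optional


def classify(
    score: float,
    threshold: float = 0.5,
    soft_limit: Optional[float] = None,
) -> str:
    """Map a coherence score to 'rejected', 'warning', or 'approved'."""
    if soft_limit is None:
        soft_limit = threshold + 0.1  # documented default
    if score < threshold:
        return "rejected"
    if score < soft_limit:  # assumption: the upper bound of the zone is exclusive
        return "warning"
    return "approved"


for s in (0.42, 0.55, 0.80):
    print(s, classify(s))
```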
NLI Backends¶
Heuristic (default, no GPU)¶
Word-overlap scoring. Fast (<1ms) but limited to vocabulary-level detection.
FactCG-DeBERTa-v3-Large (recommended)¶
75.6% per-dataset mean balanced accuracy (BA) on AggreFact. Uses an instruction template plus SummaC source chunking.
| Backend | Latency | Accuracy |
|---|---|---|
| ONNX GPU batch | 14.6 ms/pair | 75.6% BA |
| PyTorch GPU batch | 19 ms/pair | 75.6% BA |
| PyTorch GPU sequential | 197 ms/pair | 75.6% BA |
| ONNX CPU batch | 383 ms/pair | 75.6% BA |
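For readers unfamiliar with the metric, balanced accuracy is the mean of recall on the positive class and recall on the negative class. "Per-dataset mean" is read here as the average of that value across datasets. The sketch below uses made-up toy labels, not AggreFact data, and the function names are illustrative.

```python
from statistics import mean


def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Mean of positive-class recall and negative-class recall (labels are 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)


def per_dataset_mean_ba(datasets: dict[str, tuple[list[int], list[int]]]) -> float:
    """Average the balanced accuracy over datasets, weighting each dataset equally."""
    return mean(balanced_accuracy(t, p) for t, p in datasets.values())


# Toy data only: two tiny hypothetical datasets of (true labels, predictions).
toy = {
    "dataset_a": ([1, 1, 0, 0], [1, 0, 0, 0]),  # BA = 0.5 * (1/2 + 2/2) = 0.75
    "dataset_b": ([1, 0, 0, 0], [1, 0, 0, 1]),  # BA = 0.5 * (1/1 + 2/3)
}
print(f"{per_dataset_mean_ba(toy):.3f}")
```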
Embedding scorer (no GPU needed)¶
~65% balanced accuracy at 3 ms/pair on CPU. Good for screening before NLI.
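One way to read "screening before NLI" is a two-stage cascade: a cheap scorer handles the clear-cut pairs and only the uncertain middle band goes to the expensive model. The sketch below is a hypothetical pattern, not a Director-AI feature. The band limits (0.2 and 0.8) and both scorer callables are placeholders.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # returns a divergence in [0, 1]


def cascade_divergence(
    premise: str,
    hypothesis: str,
    cheap: Scorer,
    expensive: Scorer,
    low: float = 0.2,   # assumption: at or below this, trust the cheap scorer
    high: float = 0.8,  # assumption: at or above this, trust the cheap scorer
) -> tuple[float, str]:
    """Return (divergence, stage), where stage names the scorer that decided."""
    d = cheap(premise, hypothesis)
    if d <= low or d >= high:
        return d, "cheap"
    return expensive(premise, hypothesis), "expensive"


# Stand-in scorers for demonstration only.
confident = lambda p, h: 0.05
uncertain = lambda p, h: 0.5
nli_stub = lambda p, h: 0.9

print(cascade_divergence("a", "b", confident, nli_stub))
print(cascade_divergence("a", "b", uncertain, nli_stub))
```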
Rules engine (zero ML, <1ms)¶
8 configurable rules (entity grounding, numeric consistency, negation flip, etc.). Guardrails AI-style explicit control. Ships in the base package.
MiniCheck (lighter alternative)¶
72.6% balanced accuracy. Lower VRAM (~400MB vs ~1.5GB).
LiteScorer (CPU-only heuristic baseline)¶
Word overlap + length ratio + negation heuristics. <0.5 ms/pair, no dependencies.
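To make the three ingredients concrete, here is a toy divergence function built from word overlap, length ratio, and a negation check. It is a sketch, not the LiteScorer implementation: the 0.7/0.3 mix, the 0.8 negation floor, and the negation word list are all invented for illustration.

```python
import re

NEGATIONS = {"not", "no", "never", "none", "cannot"}  # illustrative list


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9']+", text.lower())


def lite_divergence(premise: str, hypothesis: str) -> float:
    """Toy divergence in [0, 1]; higher means the hypothesis looks less supported."""
    p, h = _tokens(premise), _tokens(hypothesis)
    if not p or not h:
        return 1.0
    p_set, h_set = set(p), set(h)
    overlap = len(p_set & h_set) / len(h_set)  # share of hypothesis words found in premise
    length_ratio = min(len(p), len(h)) / max(len(p), len(h))
    support = 0.7 * overlap + 0.3 * length_ratio  # assumption: arbitrary mix
    divergence = 1.0 - support
    # Negation present on one side only is treated as a strong contradiction hint.
    if bool(p_set & NEGATIONS) != bool(h_set & NEGATIONS):
        divergence = max(divergence, 0.8)
    return divergence


print(lite_divergence("The door is open", "The door is open"))
print(lite_divergence("The door is open", "The door is not open"))
```

The second call shows the vocabulary-level limitation: without the negation rule, near-identical wording would score as well supported.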
Distilled NLI Lite¶
The nli-lite backend is the Lite Scorer v2 distillation track. It is available
as an experimental backend for local student artefacts and readiness testing,
but public accuracy or latency claims require a held-out evaluation packet,
ONNX export evidence, quantized latency evidence, a model card, a benchmark
claim review, and the validator gate in tools/validate_lite_scorer_v2_plan.py.
The current evidence placeholder is benchmarks/lite_scorer_v2_evidence_packet.toml;
all evidence statuses remain pending until a trained student artefact is
evaluated.
Use the validator in two modes:
python tools/validate_lite_scorer_v2_plan.py
python tools/validate_lite_scorer_v2_plan.py --require-recorded-evidence
The first command preserves the no-claim placeholder and claim-surface guard. The second command is the R2 release-evidence gate: it requires recorded or validated student, teacher, ONNX, held-out evaluation, quantized latency, model-card, and benchmark-review statuses. A pending packet may stay in the repository for planning, but it cannot close Lite Scorer v2 evidence.
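The release-evidence gate amounts to "every required status is recorded or validated". The sketch below illustrates that rule only. The key names and the flat dictionary layout are hypothetical and do not reflect the actual schema of the evidence packet or the logic of the validator script.

```python
# Hypothetical keys standing in for the seven evidence items named above.
REQUIRED_EVIDENCE = (
    "student",
    "teacher",
    "onnx",
    "heldout_eval",
    "quantized_latency",
    "model_card",
    "benchmark_review",
)
ACCEPTED_STATUSES = {"recorded", "validated"}


def missing_evidence(statuses: dict[str, str]) -> list[str]:
    """Return the evidence items that would block the release gate."""
    return [
        key
        for key in REQUIRED_EVIDENCE
        if statuses.get(key, "pending") not in ACCEPTED_STATUSES
    ]


# A planning-stage packet: everything is still pending, so the gate stays closed.
pending_packet = {key: "pending" for key in REQUIRED_EVIDENCE}
print(missing_evidence(pending_packet))
```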
After training a student artefact, measure it with
tools/eval_lite_scorer_v2.py, then record the evidence packet with
tools/record_lite_scorer_v2_evidence.py. The evaluator calculates held-out
balanced accuracy, threshold, and latency percentiles. The recorder hashes the
student, teacher, ONNX, model-card, and benchmark-claim review artefacts, can
consume the evaluator JSON output via --eval-result, writes the measured
values, and re-runs the Lite Scorer v2 validator before keeping the packet.
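Two of the steps described above, hashing artefacts and summarising latency as percentiles, can be sketched with the standard library. This is not the code in the evaluator or recorder scripts. The function names, the choice of p50/p95/p99, and the interpolation method are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from statistics import quantiles


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model artefacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Summarise per-pair latencies; needs at least two samples."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Demo on a throwaway file standing in for a student artefact.
with tempfile.TemporaryDirectory() as tmp:
    artefact = Path(tmp) / "student.bin"
    artefact.write_bytes(b"placeholder weights")
    demo_hash = sha256_file(artefact)

# Synthetic latencies of 1..101 ms, for illustration only.
demo_latency = latency_percentiles([float(ms) for ms in range(1, 102)])
print(demo_hash[:12], demo_latency)
```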
The reproducible command plan is defined in
benchmarks/lite_scorer_v2_run_manifest.toml and emitted by
tools/plan_lite_scorer_v2_run.py. The planner prints held-out dataset build,
train, ONNX export, held-out evaluation, and evidence-recording argv arrays
only; it does not mark the evidence packet as recorded or make any public score
claim. tools/build_lite_scorer_v2_heldout.py builds the held-out JSONL from
the local evaluation split with deterministic sampling, balanced supported and
unsupported labels, source counts, and a SHA-256 provenance manifest.
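Deterministic, label-balanced sampling can be illustrated with a seeded random generator. The sketch below does not reproduce the held-out builder. The row fields (`id`, `label`, `source`) and the function name are hypothetical.

```python
import random
from collections import Counter


def balanced_sample(rows: list[dict], per_label: int, seed: int = 0) -> list[dict]:
    """Pick the same number of supported and unsupported rows, reproducibly."""
    rng = random.Random(seed)  # local generator: no global state is touched
    picked: list[dict] = []
    for label in ("supported", "unsupported"):
        # Sorting by id makes the result independent of input order.
        pool = sorted((r for r in rows if r["label"] == label), key=lambda r: r["id"])
        if len(pool) < per_label:
            raise ValueError(f"not enough rows with label {label!r}")
        picked.extend(rng.sample(pool, per_label))
    return picked


# Toy rows for illustration only.
rows = [
    {"id": i, "label": "supported" if i % 2 == 0 else "unsupported", "source": f"src{i % 3}"}
    for i in range(10)
]
sample = balanced_sample(rows, per_label=2, seed=7)
print(Counter(r["label"] for r in sample))   # label balance
print(Counter(r["source"] for r in sample))  # source counts for a manifest
```

Running the function again with the same seed returns the same rows, which is what makes a recorded hash of the output file meaningful.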
training/train_distillation.py records the deterministic training seed,
selected row counts, model parameter counts, device, and claim boundary in
training_run_manifest.json inside the student output directory. Its
--device auto mode probes CUDA before loading training data or models and
falls back to CPU when the local PyTorch build cannot execute on the visible
GPU. The run manifest points at the local
training/output/minilm-safetensors student base so training does not depend on
remote model lookup during evidence production.
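The auto-device behaviour described above can be approximated as "check for CUDA, run one tiny kernel, fall back to CPU on any failure". The sketch below is an assumption about how such a probe might look, not the code in the training script. It also returns "cpu" when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return 'cuda' only if a trivial tensor op actually runs on the GPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: nothing to probe
    if not torch.cuda.is_available():
        return "cpu"
    try:
        # A visible GPU is not enough: the installed build may lack kernels
        # for its architecture, which only shows up when an op executes.
        probe = torch.zeros(1, device="cuda")
        probe.add_(1)
        torch.cuda.synchronize()
    except Exception:
        return "cpu"
    return "cuda"


print(pick_device())
```

Probing before any data or model is loaded keeps a failed GPU start cheap.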
Customizing Weights¶
Adjust the balance between logical and factual signals:
# Fact-heavy (for KB-grounded use cases)
scorer = CoherenceScorer(w_logic=0.3, w_fact=0.7)
# Logic-heavy (for free-form reasoning)
scorer = CoherenceScorer(w_logic=0.8, w_fact=0.2)
# Summarization (factual only, no logic duplication)
scorer = CoherenceScorer(w_logic=0.0, w_fact=1.0)
Constraint: w_logic + w_fact must equal 1.0.
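A small check for this constraint, plus the effect of different weightings on the same pair of signals, is sketched below. The helper name, the tolerance, and the requirement that each weight lies in [0, 1] are assumptions. The library may validate differently.

```python
import math


def validate_weights(w_logic: float, w_fact: float) -> None:
    """Raise ValueError unless the weights are in [0, 1] and sum to 1.0."""
    if not (0.0 <= w_logic <= 1.0 and 0.0 <= w_fact <= 1.0):
        raise ValueError("weights must lie in [0, 1]")
    # Compare with a tolerance: sums such as 0.1 + 0.9 are not always exact in floating point.
    if not math.isclose(w_logic + w_fact, 1.0, abs_tol=1e-9):
        raise ValueError(f"w_logic + w_fact must equal 1.0, got {w_logic + w_fact}")


# Same signals, three weightings: H_logical = 0.2, H_factual = 0.6.
for w_logic, w_fact in ((0.3, 0.7), (0.8, 0.2), (0.0, 1.0)):
    validate_weights(w_logic, w_fact)
    coherence = 1.0 - (w_logic * 0.2 + w_fact * 0.6)
    print(f"w_logic={w_logic}, w_fact={w_fact} -> coherence={coherence:.2f}")
```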
Score Caching¶
Enable caching to avoid redundant NLI inference (60-80% cost reduction in streaming):
scorer = CoherenceScorer(
    cache_size=2048,
    cache_ttl=300.0,
)
# Monitor cache
print(f"Hit rate: {scorer.cache.hit_rate:.1%}")
print(f"Size: {scorer.cache.size}")
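The two parameters suggest a bounded cache whose entries expire. The sketch below shows one common way to build that (LRU eviction plus a time-to-live) with hit-rate tracking. It is an illustrative stand-in. The class name, the eviction policy, and the internals are assumptions, not the library's cache.

```python
import time
from collections import OrderedDict
from typing import Callable, Hashable, Optional


class TTLCache:
    """Bounded cache: least recently used entries are evicted, old entries expire."""

    def __init__(self, max_size: int = 2048, ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._data: OrderedDict = OrderedDict()
        self._max_size, self._ttl, self._clock = max_size, ttl, clock
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._data.get(key)
        if entry is not None and self._clock() - entry[0] <= self._ttl:
            self._data.move_to_end(key)  # mark as recently used
            self.hits += 1
            return entry[1]
        self._data.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return None

    def put(self, key: Hashable, value: float) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def size(self) -> int:
        return len(self._data)


cache = TTLCache(max_size=2, ttl=10.0)
cache.put(("prompt", "response"), 0.74)
print(cache.get(("prompt", "response")), f"{cache.hit_rate:.1%}")
```

Keying on the (prompt, response) pair is what lets repeated streaming checks of an unchanged prefix skip inference.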
Batch Scoring¶
Score multiple pairs in 2 GPU forward passes (when NLI is available):
items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)
Chunked NLI¶
For long documents, sentence-level scoring catches localized hallucinations:
divergence = scorer._nli.score_chunked(
    premise="Paris is the capital of France. The Eiffel Tower is in Paris.",
    hypothesis="Berlin is the capital of France. The Eiffel Tower is in Berlin.",
)
Max-aggregation: the worst per-sentence contradiction drives the final score.
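Max-aggregation can be sketched independently of any model: split the hypothesis into sentences, score each against the premise, and keep the highest divergence. The splitter, the function names, and the stub scorer below are illustrative assumptions. How the library chunks premise and hypothesis internally is not specified here.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., !, ? followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_chunked_max(
    premise: str,
    hypothesis: str,
    pair_scorer: Callable[[str, str], float],
) -> float:
    """Return the worst (highest) per-sentence divergence."""
    sentences = split_sentences(hypothesis)
    if not sentences:
        return 0.0
    return max(pair_scorer(premise, sentence) for sentence in sentences)


def stub_scorer(premise: str, sentence: str) -> float:
    """Stand-in for an NLI model: flags sentences that mention Berlin."""
    return 0.9 if "Berlin" in sentence else 0.1


premise = "Paris is the capital of France. The Eiffel Tower is in Paris."
hypothesis = "Paris is the capital of France. The Eiffel Tower is in Berlin."
print(score_chunked_max(premise, hypothesis, stub_scorer))
```

With a mean instead of a max, the one wrong sentence would be diluted by the correct one. Taking the max keeps a single localized error visible.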
Next Steps¶
- Threshold Tuning — domain-specific calibration
- Streaming Halt — claim-level oversight
- KB Ingestion — populate the factual signal