
CoherenceScorer

The central scoring engine. Computes a composite coherence score from two independent signals — NLI contradiction probability (H_logical) and RAG fact deviation (H_factual) — then accepts or rejects the response.

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)

Default weights: W_LOGIC = 0.6, W_FACT = 0.4.
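As a quick sanity check, the formula above can be reproduced with plain arithmetic (a standalone sketch, not library code):

```python
# Composite coherence score, as defined above, with the default weights.
W_LOGIC, W_FACT = 0.6, 0.4

def coherence(h_logical: float, h_factual: float) -> float:
    """1.0 means fully coherent; below the threshold the response is rejected."""
    return 1.0 - (W_LOGIC * h_logical + W_FACT * h_factual)

# A strong NLI contradiction combined with a grounded-fact mismatch:
score = coherence(h_logical=0.8, h_factual=0.5)
print(round(score, 3))  # 0.32, below a 0.6 threshold, so rejected
```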

Usage

from director_ai import CoherenceScorer, GroundTruthStore

store = GroundTruthStore()
store.add("capital", "Paris is the capital of France.")

scorer = CoherenceScorer(
    threshold=0.6,
    ground_truth_store=store,
    use_nli=True,
)

approved, score = scorer.review(
    "What is the capital of France?",
    "The capital of France is Berlin.",
)

print(f"Approved: {approved}")        # False
print(f"Score: {score.score:.3f}")    # ~0.35
print(f"Evidence: {score.evidence}")  # Retrieved context + NLI details

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.5 | Minimum coherence to approve (0.0–1.0) |
| soft_limit | float \| None | threshold + 0.1 | Warning zone upper bound |
| w_logic | float | 0.6 | Weight for NLI divergence |
| w_fact | float | 0.4 | Weight for factual divergence |
| strict_mode | bool | False | Reject if NLI unavailable (no heuristic fallback) |
| use_nli | bool \| None | None | True = force NLI, False = disable, None = auto-detect |
| nli_model | str \| None | None | HuggingFace model ID (default: FactCG-DeBERTa-v3-Large) |
| ground_truth_store | GroundTruthStore \| None | None | Fact store for RAG retrieval |
| cache_size | int | 0 | LRU cache max entries (0 = disabled) |
| cache_ttl | float | 300.0 | Cache entry TTL in seconds |
| scorer_backend | str | "deberta" | Backend: deberta, onnx, minicheck, hybrid, lite, rust |
| nli_quantize_8bit | bool | False | 8-bit quantization (reduces VRAM from ~1.5 GB to ~400 MB) |
| nli_device | str \| None | None | Torch device ("cuda", "cuda:0", "cpu") |
| nli_torch_dtype | str \| None | None | Torch dtype ("float16", "bfloat16") |
| history_window | int | 5 | Rolling history size for trend detection |
| llm_judge_enabled | bool | False | Escalate to LLM when NLI confidence is low |
| llm_judge_confidence_threshold | float | 0.3 | Softmax margin below which to escalate |
| llm_judge_provider | str | "" | "openai" or "anthropic" |
| privacy_mode | bool | False | Redact PII before sending to LLM judge |
| onnx_path | str \| None | None | Directory with exported ONNX model |
| nli_devices | str \| None | None | Multi-GPU sharding (comma-separated: "cuda:0,cuda:1") |
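The cache_size and cache_ttl semantics (LRU eviction plus per-entry expiry) can be sketched in plain Python. This is an illustrative model of the behavior described above, not the library's implementation:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Illustrative sketch of cache_size / cache_ttl semantics."""

    def __init__(self, max_entries: int, ttl: float):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:  # entry expired
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        if self.max_entries == 0:  # cache_size=0 disables caching
            return
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

cache = TTLLRUCache(max_entries=2, ttl=300.0)
cache.put(("prompt", "response"), 0.92)
print(cache.get(("prompt", "response")))  # 0.92
```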

Methods

review()

approved, score = scorer.review(prompt: str, action: str, session=None, tenant_id: str = "") -> tuple[bool, CoherenceScore]

Score a single prompt/response pair. Returns (approved, CoherenceScore).

review_batch()

results = scorer.review_batch(items: list[tuple[str, str]]) -> list[tuple[bool, CoherenceScore]]

Score multiple pairs. When NLI is available, logical and factual divergence are batched through NLIScorer.score_batch(); otherwise items fall back to sequential review(). For parallel execution, wrap the scorer in BatchProcessor.

items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)
for approved, score in results:
    print(f"approved={approved}  score={score.score:.3f}")

score_chunked()

Sentence-level NLI scoring with max-aggregation. Catches localized hallucinations that full-text comparison would miss.
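A minimal sketch of the max-aggregation idea, using a stand-in nli_contradiction function in place of a real NLI model (both function names here are hypothetical, for illustration only):

```python
import re

def nli_contradiction(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: flags one known contradiction."""
    return 0.9 if "Berlin" in hypothesis else 0.05

def chunked_divergence(context: str, response: str) -> float:
    """Score each sentence separately and keep the worst (max) contradiction."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    return max(nli_contradiction(context, s) for s in sentences)

context = "Paris is the capital of France."
response = "France is in Europe. Its capital is Berlin. It borders Spain."
print(chunked_divergence(context, response))  # 0.9: catches the one bad sentence
```

Scoring the full response as one block would let the two true sentences dilute the contradiction; taking the max over sentence-level scores preserves it.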

Scorer Backends

| Backend | Install | Latency | Accuracy | GPU |
| --- | --- | --- | --- | --- |
| deberta | pip install director-ai[nli] | 19 ms/pair (GPU batch) | 75.8% BA | Yes |
| onnx | pip install director-ai[onnx] | 14.6 ms/pair (GPU batch) | 75.8% BA | Yes |
| minicheck | pip install director-ai[minicheck] | ~60 ms/pair | 72.6% BA | Yes |
| lite | included | <0.5 ms/pair | ~65% BA | No |
| hybrid | [nli] + LLM API key | 20–50 ms/pair | ~78% BA | Yes |
| rust | build backfire-kernel | ~1 ms/pair | ~65% BA | No |

Validation Rules

  • threshold must be in [0.0, 1.0]
  • soft_limit must be >= threshold
  • w_logic + w_fact must equal 1.0
  • hybrid backend requires llm_judge_provider
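These rules can be expressed as a small standalone checker (a hypothetical mirror of the documented constraints, not the library's actual validation code):

```python
import math

def validate(threshold, soft_limit, w_logic, w_fact, scorer_backend, llm_judge_provider):
    """Hypothetical standalone mirror of the documented validation rules."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    if soft_limit is not None and soft_limit < threshold:
        raise ValueError("soft_limit must be >= threshold")
    if not math.isclose(w_logic + w_fact, 1.0):
        raise ValueError("w_logic + w_fact must equal 1.0")
    if scorer_backend == "hybrid" and not llm_judge_provider:
        raise ValueError("hybrid backend requires llm_judge_provider")

validate(0.5, 0.6, 0.6, 0.4, "deberta", "")  # passes silently
```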

Full API

director_ai.core.scoring.scorer.CoherenceScorer

CoherenceScorer(threshold=0.5, history_window=5, use_nli=None, ground_truth_store=None, nli_model=None, soft_limit=None, w_logic=None, w_fact=None, strict_mode=False, cache_size=0, cache_ttl=300.0, nli_quantize_8bit=False, nli_device=None, nli_torch_dtype=None, llm_judge_enabled=False, llm_judge_confidence_threshold=0.3, llm_judge_provider='', llm_judge_model='', scorer_backend='deberta', onnx_path=None, nli_devices=None, onnx_batch_size=16, onnx_flush_timeout_ms=10.0, privacy_mode=False, cache=None, nli_max_length=512)

Weighted NLI divergence scorer for AI output verification.

Computes a composite coherence score from two NLI-based signals:

  • Logical divergence (H_logical): NLI contradiction probability between prompt and response.
  • Factual divergence (H_factual): NLI contradiction probability between retrieved context and response.

Final score: coherence = 1 - (0.6 * H_logical + 0.4 * H_factual). When coherence falls below threshold, the output is rejected.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | Minimum coherence to approve. | 0.5 |
| soft_limit | float \| None | Scores between threshold and soft_limit trigger a warning. Default: threshold + 0.1. | None |
| w_logic | float | Weight for logical divergence (default 0.6). | None |
| w_fact | float | Weight for factual divergence (default 0.4). Must satisfy w_logic + w_fact = 1.0. | None |
| strict_mode | bool | When True, disables heuristic fallbacks entirely. If the NLI model is unavailable and strict_mode is True, divergence returns 0.9 (reject) and sets strict_mode_rejected=True. | False |
| history_window | int | Rolling history size. | 5 |
| use_nli | bool \| None | True forces NLI, False disables it, None (default) auto-detects based on installed packages. | None |
| ground_truth_store | GroundTruthStore \| None | Fact store for RAG. | None |
| nli_model | str \| None | HuggingFace model ID or local path for NLI. | None |
| cache_size | int | LRU score cache max entries (0 to disable). | 0 |
| cache_ttl | float | Cache entry TTL in seconds. | 300.0 |
| nli_quantize_8bit | bool | Load NLI model with 8-bit quantization. | False |
| nli_device | str \| None | Torch device for NLI model. | None |
| nli_torch_dtype | str \| None | Torch dtype ("float16", "bfloat16"). | None |
| llm_judge_enabled | bool | Escalate to LLM when NLI margin is low. | False |
| llm_judge_confidence_threshold | float | Softmax margin below which to escalate. | 0.3 |
| llm_judge_provider | str | "openai" or "anthropic". | '' |
| privacy_mode | bool | Redact PII (emails, phones, SSN-like patterns) before sending text to external LLM judge. | False |

close

close() -> None

Shut down internal thread pool.

calculate_factual_divergence

calculate_factual_divergence(prompt, text_output, tenant_id: str = '', *, _inner_agg=None, _outer_agg=None)

Check output against the Ground Truth Store.

Returns 0.0 (aligned) to 1.0 (hallucinated). When strict_mode is True and NLI is unavailable, returns 0.9 (reject).

calculate_factual_divergence_with_evidence

calculate_factual_divergence_with_evidence(prompt, text_output, tenant_id: str = '', *, _inner_agg=None, _outer_agg=None) -> tuple[float, ScoringEvidence | None]

Like calculate_factual_divergence but also returns evidence.

calculate_logical_divergence

calculate_logical_divergence(prompt, text_output, *, _inner_agg=None, _outer_agg=None)

Compute logical contradiction probability via NLI.

When strict_mode is True and NLI is unavailable, returns 0.9 (reject).

compute_divergence

compute_divergence(prompt, action)

Compute composite divergence (lower is better).

Weighted sum: W_LOGIC * H_logical + W_FACT * H_factual.

review

review(prompt: str, action: str, session=None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Score an action and decide whether to approve it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| session | ConversationSession \| None | When provided, cross-turn divergence is blended into the logical score and the turn is recorded after scoring. | None |

review_batch

review_batch(items: list[tuple[str, str]], tenant_id: str = '') -> list[tuple[bool, CoherenceScore]]

Batch-review a list of (prompt, response) pairs.

When NLI is available, batches logical and factual divergence through NLIScorer.score_batch() (2 GPU forward passes total instead of 2*N). Falls back to sequential review() for items that need special handling (dialogue, summarization, rust backend, or when NLI is unavailable).

areview async

areview(prompt: str, action: str, session=None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Async version of review() — offloads NLI inference to a thread pool.
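The offloading pattern can be sketched with asyncio.to_thread and a stand-in blocking function (illustrative only; blocking_review here is a placeholder for the real synchronous scorer):

```python
import asyncio
import time

def blocking_review(prompt: str, action: str) -> tuple[bool, float]:
    """Stand-in for synchronous review(): pretend NLI inference blocks."""
    time.sleep(0.05)  # simulated model latency
    return True, 0.91

async def areview(prompt: str, action: str) -> tuple[bool, float]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_review, prompt, action)

async def main():
    # Concurrent reviews overlap in worker threads instead of
    # serializing on the event loop.
    results = await asyncio.gather(
        areview("What is 2+2?", "4."),
        areview("Capital of France?", "Paris."),
    )
    print(results)  # [(True, 0.91), (True, 0.91)]

asyncio.run(main())
```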