
CoherenceScorer

The central scoring engine. Computes a composite coherence score from two independent signals — NLI contradiction probability (H_logical) and RAG fact deviation (H_factual) — then accepts or rejects the response.

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)

Default weights: W_LOGIC = 0.6, W_FACT = 0.4.
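As a quick sanity check, the formula above can be reproduced with plain arithmetic (a standalone sketch, not library code):

```python
# Composite coherence score, as defined above, with the default weights.
W_LOGIC, W_FACT = 0.6, 0.4

def coherence(h_logical: float, h_factual: float) -> float:
    """1.0 means fully coherent; below the threshold the response is rejected."""
    return 1.0 - (W_LOGIC * h_logical + W_FACT * h_factual)

# A strong NLI contradiction combined with a grounded-fact mismatch:
score = coherence(h_logical=0.8, h_factual=0.5)
print(round(score, 3))  # 0.32, below a 0.6 threshold, so rejected
```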

Usage

from director_ai import CoherenceScorer, GroundTruthStore

store = GroundTruthStore()
store.add("capital", "Paris is the capital of France.")

scorer = CoherenceScorer(
    threshold=0.6,
    ground_truth_store=store,
    use_nli=True,
)

approved, score = scorer.review(
    "What is the capital of France?",
    "The capital of France is Berlin.",
)

print(f"Approved: {approved}")        # False
print(f"Score: {score.score:.3f}")    # ~0.35
print(f"Evidence: {score.evidence}")  # Retrieved context + NLI details

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.5 | Minimum coherence to approve (0.0–1.0) |
| soft_limit | float \| None | threshold + 0.1 | Warning zone upper bound |
| w_logic | float | 0.6 | Weight for NLI divergence |
| w_fact | float | 0.4 | Weight for factual divergence |
| strict_mode | bool | False | Reject if NLI unavailable (no heuristic fallback) |
| use_nli | bool \| None | None | True = force NLI, False = disable, None = auto-detect |
| nli_model | str \| None | None | HuggingFace model ID (default: FactCG-DeBERTa-v3-Large) |
| ground_truth_store | GroundTruthStore \| None | None | Fact store for RAG retrieval |
| cache_size | int | 0 | LRU cache max entries (0 = disabled) |
| cache_ttl | float | 300.0 | Cache entry TTL in seconds |
| scorer_backend | str | "deberta" | Backend: deberta, onnx, minicheck, hybrid, lite, rust |
| nli_quantize_8bit | bool | False | 8-bit quantization (reduces VRAM from ~1.5 GB to ~400 MB) |
| nli_device | str \| None | None | Torch device ("cuda", "cuda:0", "cpu") |
| nli_torch_dtype | str \| None | None | Torch dtype ("float16", "bfloat16") |
| history_window | int | 5 | Rolling history size for trend detection |
| llm_judge_enabled | bool | False | Escalate to LLM when NLI confidence is low |
| llm_judge_confidence_threshold | float | 0.3 | Softmax margin below which to escalate |
| llm_judge_provider | str | "" | "openai" or "anthropic" |
| privacy_mode | bool | False | Redact PII before sending to LLM judge |
| onnx_path | str \| None | None | Directory with exported ONNX model |
| nli_devices | str \| None | None | Multi-GPU sharding (comma-separated: "cuda:0,cuda:1") |
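The cache_size and cache_ttl semantics (LRU eviction plus per-entry expiry) can be sketched in plain Python. This is an illustrative model of the behavior described above, not the library's implementation:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Illustrative sketch of cache_size / cache_ttl semantics."""

    def __init__(self, max_entries: int, ttl: float):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:  # entry expired
            del self._data[key]
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        if self.max_entries == 0:  # cache_size=0 disables caching
            return
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

cache = TTLLRUCache(max_entries=2, ttl=300.0)
cache.put(("prompt", "response"), 0.92)
print(cache.get(("prompt", "response")))  # 0.92
```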

Methods

review()

approved, score = scorer.review(prompt: str, action: str, session=None, tenant_id: str = "") -> tuple[bool, CoherenceScore]

Score a single prompt/response pair. Returns (approved, CoherenceScore).

review_batch()

results = scorer.review_batch(items: list[tuple[str, str]]) -> list[tuple[bool, CoherenceScore]]

Score multiple pairs. When NLI is available, logical and factual divergence are batched through NLIScorer.score_batch(); otherwise items fall back to sequential review(). For parallel execution, wrap the scorer in BatchProcessor.

items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)
for approved, score in results:
    print(f"approved={approved}  score={score.score:.3f}")

score_chunked()

Sentence-level NLI scoring with max-aggregation. Catches localized hallucinations that full-text comparison would miss.
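A minimal sketch of the max-aggregation idea, using a stand-in nli_contradiction function in place of a real NLI model (both function names here are hypothetical, for illustration only):

```python
import re

def nli_contradiction(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: flags one known contradiction."""
    return 0.9 if "Berlin" in hypothesis else 0.05

def chunked_divergence(context: str, response: str) -> float:
    """Score each sentence separately and keep the worst (max) contradiction."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    return max(nli_contradiction(context, s) for s in sentences)

context = "Paris is the capital of France."
response = "France is in Europe. Its capital is Berlin. It borders Spain."
print(chunked_divergence(context, response))  # 0.9: catches the one bad sentence
```

Scoring the full response as one block would let the two true sentences dilute the contradiction; taking the max over sentence-level scores preserves it.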

Scorer Backends

| Backend | Install | Latency | Accuracy | GPU |
| --- | --- | --- | --- | --- |
| deberta | pip install director-ai[nli] | 19 ms/pair (GPU batch) | 75.8% BA | Yes |
| onnx | pip install director-ai[onnx] | 14.6 ms/pair (GPU batch) | 75.8% BA | Yes |
| minicheck | pip install director-ai[minicheck] | ~60 ms/pair | 72.6% BA | Yes |
| lite | included | <0.5 ms/pair | ~65% BA | No |
| hybrid | [nli] + LLM API key | 20–50 ms/pair | ~78% BA | Yes |
| rust | build backfire-kernel | ~1 ms/pair | ~65% BA | No |

Validation Rules

  • threshold must be in [0.0, 1.0]
  • soft_limit must be >= threshold
  • w_logic + w_fact must equal 1.0
  • hybrid backend requires llm_judge_provider
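These rules can be expressed as a small standalone checker (a hypothetical mirror of the documented constraints, not the library's actual validation code):

```python
import math

def validate(threshold, soft_limit, w_logic, w_fact, scorer_backend, llm_judge_provider):
    """Hypothetical standalone mirror of the documented validation rules."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    if soft_limit is not None and soft_limit < threshold:
        raise ValueError("soft_limit must be >= threshold")
    if not math.isclose(w_logic + w_fact, 1.0):
        raise ValueError("w_logic + w_fact must equal 1.0")
    if scorer_backend == "hybrid" and not llm_judge_provider:
        raise ValueError("hybrid backend requires llm_judge_provider")

validate(0.5, 0.6, 0.6, 0.4, "deberta", "")  # passes silently
```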

Full API

director_ai.core.scoring.scorer.CoherenceScorer

CoherenceScorer(threshold=0.5, history_window=5, use_nli=None, ground_truth_store=None, nli_model=None, soft_limit=None, w_logic=None, w_fact=None, strict_mode=False, cache_size=0, cache_ttl=300.0, nli_quantize_8bit=False, nli_device=None, nli_torch_dtype=None, llm_judge_enabled=False, llm_judge_confidence_threshold=0.3, llm_judge_provider='', llm_judge_model='', scorer_backend='deberta', onnx_path=None, nli_devices=None, onnx_batch_size=16, onnx_flush_timeout_ms=10.0, privacy_mode=False, cache=None, nli_max_length=512)

Weighted NLI divergence scorer for AI output verification.

Computes a composite coherence score from two NLI-based signals:

  • Logical divergence (H_logical): NLI contradiction probability between prompt and response.
  • Factual divergence (H_factual): NLI contradiction probability between retrieved context and response.

Final score: coherence = 1 - (0.6 * H_logical + 0.4 * H_factual). When coherence falls below threshold, the output is rejected.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | Minimum coherence to approve. | 0.5 |
| soft_limit | float \| None | Scores between threshold and soft_limit trigger a warning. Default: threshold + 0.1. | None |
| w_logic | float | Weight for logical divergence (default 0.6). | None |
| w_fact | float | Weight for factual divergence (default 0.4). Must satisfy w_logic + w_fact = 1.0. | None |
| strict_mode | bool | When True, disables heuristic fallbacks entirely. If the NLI model is unavailable and strict_mode is True, divergence returns 0.9 (reject) and sets strict_mode_rejected=True. | False |
| history_window | int | Rolling history size. | 5 |
| use_nli | bool \| None | True forces NLI, False disables it, None (default) auto-detects based on installed packages. | None |
| ground_truth_store | GroundTruthStore \| None | Fact store for RAG. | None |
| nli_model | str \| None | HuggingFace model ID or local path for NLI. | None |
| cache_size | int | LRU score cache max entries (0 to disable). | 0 |
| cache_ttl | float | Cache entry TTL in seconds. | 300.0 |
| nli_quantize_8bit | bool | Load NLI model with 8-bit quantization. | False |
| nli_device | str \| None | Torch device for NLI model. | None |
| nli_torch_dtype | str \| None | Torch dtype ("float16", "bfloat16"). | None |
| llm_judge_enabled | bool | Escalate to LLM when NLI margin is low. | False |
| llm_judge_confidence_threshold | float | Softmax margin below which to escalate. | 0.3 |
| llm_judge_provider | str | "openai" or "anthropic". | '' |
| privacy_mode | bool | Redact PII (emails, phones, SSN-like patterns) before sending text to external LLM judge. | False |

close

close() -> None

Shut down internal thread pool.

calculate_factual_divergence

calculate_factual_divergence(prompt, text_output, tenant_id: str = '', *, _inner_agg=None, _outer_agg=None)

Check output against the Ground Truth Store.

Returns 0.0 (aligned) to 1.0 (hallucinated). When strict_mode is True and NLI is unavailable, returns 0.9 (reject).

calculate_factual_divergence_with_evidence

calculate_factual_divergence_with_evidence(prompt, text_output, tenant_id: str = '', *, _inner_agg=None, _outer_agg=None) -> tuple[float, ScoringEvidence | None]

Like calculate_factual_divergence but also returns evidence.

calculate_logical_divergence

calculate_logical_divergence(prompt, text_output, *, _inner_agg=None, _outer_agg=None)

Compute logical contradiction probability via NLI.

When strict_mode is True and NLI is unavailable, returns 0.9 (reject).

compute_divergence

compute_divergence(prompt, action)

Compute composite divergence (lower is better).

Weighted sum: W_LOGIC * H_logical + W_FACT * H_factual.

review

review(prompt: str, action: str, session=None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Score an action and decide whether to approve it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| session | ConversationSession \| None | When provided, cross-turn divergence is blended into the logical score and the turn is recorded after scoring. | None |

review_batch

review_batch(items: list[tuple[str, str]], tenant_id: str = '') -> list[tuple[bool, CoherenceScore]]

Batch-review a list of (prompt, response) pairs.

When NLI is available, batches logical and factual divergence through NLIScorer.score_batch() (2 GPU forward passes total instead of 2*N). Falls back to sequential review() for items that need special handling (dialogue, summarization, rust backend, or when NLI is unavailable).

areview async

areview(prompt: str, action: str, session=None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Async version of review() — offloads NLI inference to a thread pool.
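The offloading pattern can be sketched with asyncio.to_thread and a stand-in blocking function (illustrative only; blocking_review here is a placeholder for the real synchronous scorer):

```python
import asyncio
import time

def blocking_review(prompt: str, action: str) -> tuple[bool, float]:
    """Stand-in for synchronous review(): pretend NLI inference blocks."""
    time.sleep(0.05)  # simulated model latency
    return True, 0.91

async def areview(prompt: str, action: str) -> tuple[bool, float]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_review, prompt, action)

async def main():
    # Concurrent reviews overlap in worker threads instead of
    # serializing on the event loop.
    results = await asyncio.gather(
        areview("What is 2+2?", "4."),
        areview("Capital of France?", "Paris."),
    )
    print(results)  # [(True, 0.91), (True, 0.91)]

asyncio.run(main())
```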