CoherenceScorer

The central scoring engine. It computes a composite coherence score from two independent signals, NLI contradiction probability (H_logical) and RAG fact deviation (H_factual), then approves or rejects the response by comparing that score with threshold.

coherence = 1.0 - (W_LOGIC × H_logical + W_FACT × H_factual)

Default weights: W_LOGIC = 0.6, W_FACT = 0.4.
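
As a worked example of the formula above, the following minimal sketch evaluates the composite score for a pair of illustrative divergence values. The function name `coherence` and the input values are assumptions made for this example; they are not part of the library API.

```python
def coherence(h_logical: float, h_factual: float,
              w_logic: float = 0.6, w_fact: float = 0.4) -> float:
    """Composite coherence from two divergence signals, each in [0, 1]."""
    return 1.0 - (w_logic * h_logical + w_fact * h_factual)


# Illustrative inputs only, not measured model outputs.
score = coherence(h_logical=0.5, h_factual=0.9)
print(f"{score:.2f}")  # 1 - (0.6*0.5 + 0.4*0.9) = 1 - 0.66 = 0.34
```

Because the weights sum to 1.0 and both signals lie in [0, 1], the result also stays in [0, 1].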

Usage

```python
from director_ai import CoherenceScorer, GroundTruthStore

# Register the fact that responses are checked against.
store = GroundTruthStore()
store.add("capital", "Paris is the capital of France.")

scorer = CoherenceScorer(
    threshold=0.6,
    ground_truth_store=store,
    use_nli=True,
)

# review() takes the prompt first, then the candidate response.
approved, score = scorer.review(
    "What is the capital of France?",
    "The capital of France is Berlin.",
)

print(f"Approved: {approved}")        # False
print(f"Score: {score.score:.3f}")    # ~0.35
print(f"Evidence: {score.evidence}")  # Retrieved context + NLI details
```

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| threshold | float | 0.5 | Minimum coherence to approve (0.0–1.0) |
| soft_limit | float \| None | threshold + 0.1 | Warning zone upper bound |
| w_logic | float | 0.6 | Weight for NLI divergence |
| w_fact | float | 0.4 | Weight for factual divergence |
| strict_mode | bool | False | Reject if NLI unavailable (no heuristic fallback) |
| use_nli | bool \| None | None | True = force NLI, False = disable, None = auto-detect |
| nli_model | str \| None | None | HuggingFace model ID (default: FactCG-DeBERTa-v3-Large) |
| ground_truth_store | GroundTruthStore \| None | None | Fact store for RAG retrieval |
| cache_size | int | 0 | LRU cache max entries (0 = disabled) |
| cache_ttl | float | 300.0 | Cache entry TTL in seconds |
| scorer_backend | str | "deberta" | Backend: deberta, onnx, minicheck, hybrid, lite, rust |
| nli_quantize_8bit | bool | False | 8-bit quantization (reduces VRAM from ~1.5GB to ~400MB) |
| nli_device | str \| None | None | Torch device ("cuda", "cuda:0", "cpu") |
| nli_torch_dtype | str \| None | None | Torch dtype ("float16", "bfloat16") |
| history_window | int | 5 | Rolling history size for trend detection |
| llm_judge_enabled | bool | False | Escalate to LLM when NLI confidence is low |
| llm_judge_confidence_threshold | float | 0.3 | Softmax margin below which to escalate |
| llm_judge_provider | str | "" | "openai" or "anthropic" |
| privacy_mode | bool | False | Redact PII before sending to LLM judge |
| onnx_path | str \| None | None | Directory with exported ONNX model |
| nli_devices | str \| None | None | Multi-GPU sharding (comma-separated: "cuda:0,cuda:1") |
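
The interaction between threshold and soft_limit can be pictured as three score bands. The sketch below is an illustration only: the function name `classify_score`, the band labels, and the exact handling of scores that sit on a boundary are assumptions, not documented library behaviour.

```python
from typing import Optional


def classify_score(score: float, threshold: float = 0.5,
                   soft_limit: Optional[float] = None) -> str:
    """Map a coherence score to a band (labels are hypothetical)."""
    if soft_limit is None:
        soft_limit = threshold + 0.1  # documented default
    if score < threshold:
        return "reject"   # below the approval threshold
    if score < soft_limit:
        return "warn"     # approved, but inside the warning zone
    return "approve"


# The usage example above scores roughly 0.35 against threshold=0.6.
print(classify_score(0.35, threshold=0.6))
```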

Methods

review()

review(prompt: str, action: str, session=None, tenant_id: str = "") -> tuple[bool, CoherenceScore]

Score a single prompt/response pair. Returns (approved, CoherenceScore).

review_batch()

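review_batch(items: list[tuple[str, str]], tenant_id: str = "") -> list[tuple[bool, CoherenceScore]]

Score multiple (prompt, response) pairs in one call. When NLI is available, logical and factual divergence are batched through NLIScorer.score_batch(); items that need special handling fall back to sequential review(). For parallel execution, wrap the scorer in BatchProcessor.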

```python
# Each item is a (prompt, response) pair.
items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
results = scorer.review_batch(items)
for approved, score in results:
    print(f"approved={approved}  score={score.score:.3f}")
```

score_chunked()

Sentence-level NLI scoring with max-aggregation. Catches localized hallucinations that full-text comparison would miss.
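
To illustrate why max-aggregation surfaces a localized error, here is a minimal sketch. It does not use the library: `max_aggregated_divergence`, the regex sentence splitter, and the `toy_prob` stand-in for an NLI model are all assumptions made for this example.

```python
import re
from typing import Callable


def max_aggregated_divergence(premise: str, response: str,
                              contradiction_prob: Callable[[str, str], float]) -> float:
    """Score each sentence separately and keep the worst (highest) value."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 0.0
    return max(contradiction_prob(premise, s) for s in sentences)


def toy_prob(premise: str, sentence: str) -> float:
    """Hypothetical stand-in for an NLI contradiction probability."""
    return 0.9 if "Berlin" in sentence else 0.1


text = "France is in Europe. Its capital is Berlin. It borders Spain."
worst = max_aggregated_divergence("Paris is the capital of France.", text, toy_prob)
print(worst)  # the single contradicting sentence determines the result
```

Averaging the same three per-sentence values would give roughly 0.37, so one wrong sentence among correct ones could slip under a threshold that the maximum clearly exceeds.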

Scorer Backends

| Backend | Install | Latency | Accuracy | GPU |
| --- | --- | --- | --- | --- |
| deberta | pip install director-ai[nli] | 19 ms/pair (GPU batch) | 75.6% BA | Yes |
| onnx | pip install director-ai[onnx] | 14.6 ms/pair (GPU batch) | 75.6% BA | Yes |
| minicheck | pip install director-ai[minicheck] | ~60 ms/pair | 72.6% BA | Yes |
| lite | included | <0.5 ms/pair | ~65% BA | No |
| hybrid | [nli] + LLM API key | 20-50 ms/pair | ~78% BA | Yes |
| rust | build backfire-kernel | ~1 ms/pair | ~65% BA | No |

Validation Rules

  • threshold must be in [0.0, 1.0]
  • soft_limit must be >= threshold
  • w_logic + w_fact must equal 1.0
  • hybrid backend requires llm_judge_provider
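
The rules above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the helper name `validate_scorer_args`, the use of `ValueError`, and the floating-point tolerance on the weight sum are choices made for this example, and the library's own checks may differ in exception type and messages.

```python
import math
from typing import Optional


def validate_scorer_args(threshold: float, soft_limit: Optional[float],
                         w_logic: float, w_fact: float,
                         scorer_backend: str, llm_judge_provider: str) -> None:
    """Raise ValueError if any documented validation rule is violated."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    if soft_limit is not None and soft_limit < threshold:
        raise ValueError("soft_limit must be >= threshold")
    # Tolerance is an assumption; it avoids false failures from float rounding.
    if not math.isclose(w_logic + w_fact, 1.0, abs_tol=1e-9):
        raise ValueError("w_logic + w_fact must equal 1.0")
    if scorer_backend == "hybrid" and not llm_judge_provider:
        raise ValueError("hybrid backend requires llm_judge_provider")


# Passes silently with the documented defaults.
validate_scorer_args(0.5, None, 0.6, 0.4, "deberta", "")
```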

OpenTelemetry

When OpenTelemetry is configured, review() emits a parent director_ai.review span plus optional stage spans for cache lookup, retrieval, NLI inference, calibration, and judge escalation. The stage spans are no-ops when OTel is unavailable or no collector is configured.

Stage span names and core attributes:

| Span | Key attributes |
| --- | --- |
| director_ai.cache | cache.hit, cache.scope_present |
| director_ai.retrieval | retrieval.top_k, retrieval.tenant_scoped, retrieval.has_context, retrieval.result_count |
| director_ai.nli | nli.stage, nli.model_available, nli.score, nli.token_count |
| director_ai.calibration | calibration.stage, calibration.threshold, calibration.verdict_confidence, calibration.signal_agreement |
| director_ai.judge | judge.provider, judge.cache_hit, judge.nli_score, judge.adjusted_score |

Full API

director_ai.core.scoring.scorer.CoherenceScorer

CoherenceScorer(threshold: float = 0.5, history_window: int = 5, use_nli: bool | None = None, ground_truth_store: Any | None = None, nli_model: str | None = None, soft_limit: float | None = None, w_logic: float | None = None, w_fact: float | None = None, strict_mode: bool = False, require_model_backed_nli: bool = False, cache_size: int = 0, cache_ttl: float = 300.0, nli_quantize_8bit: bool = False, nli_device: str | None = None, nli_torch_dtype: str | None = None, llm_judge_enabled: bool = False, llm_judge_confidence_threshold: float = 0.3, llm_judge_provider: str = '', llm_judge_model: str = '', llm_judge_model_revision: str | None = None, scorer_backend: str = 'deberta', onnx_path: str | None = None, nli_devices: list[str] | None = None, onnx_batch_size: int = 16, onnx_flush_timeout_ms: float = 10.0, privacy_mode: bool = False, cache: ScoreCache | None = None, nli_max_length: int = 512, nli_revision: str | None = None, reasoning_enabled: bool = False, reasoning_provider: str = '', reasoning_model: str = '', reasoning_model_revision: str | None = None, reasoning_escalation_margin: float = 0.15, minicheck_variant: str = 'deberta-v3-large')

Weighted NLI divergence scorer for AI output verification.

Computes a composite coherence score from two NLI-based signals:

  • Logical divergence (H_logical): NLI contradiction probability between prompt and response.
  • Factual divergence (H_factual): NLI contradiction probability between retrieved context and response.

Final score with the default weights: coherence = 1 - (0.6 * H_logical + 0.4 * H_factual). When coherence falls below threshold, the output is rejected.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | Minimum coherence to approve. | 0.5 |
| soft_limit | float \| None | Scores between threshold and soft_limit trigger a warning. Default: threshold + 0.1. | None |
| w_logic | float | Weight for logical divergence (default 0.6). | None |
| w_fact | float | Weight for factual divergence (default 0.4). Must satisfy w_logic + w_fact = 1.0. | None |
| strict_mode | bool | When True, disables heuristic fallbacks entirely. If NLI model is unavailable and strict_mode is True, divergence returns 0.9 (reject) and sets strict_mode_rejected=True. | False |
| require_model_backed_nli | bool | When True, fail closed unless a model-backed NLI backend is available (DeBERTa/ONNX/MiniCheck/Rust). | False |
| history_window | int | Rolling history size. | 5 |
| use_nli | bool \| None | True forces NLI, False disables it, None (default) auto-detects based on installed packages. | None |
| ground_truth_store | GroundTruthStore \| None | Fact store for RAG. | None |
| nli_model | str \| None | HuggingFace model ID or local path for NLI. | None |
| cache_size | int | LRU score cache max entries (0 to disable). | 0 |
| cache_ttl | float | Cache entry TTL in seconds. | 300.0 |
| nli_quantize_8bit | bool | Load NLI model with 8-bit quantization. | False |
| nli_device | str \| None | Torch device for NLI model. | None |
| nli_torch_dtype | str \| None | Torch dtype ("float16", "bfloat16"). | None |
| llm_judge_enabled | bool | Escalate to LLM when NLI margin is low. | False |
| llm_judge_confidence_threshold | float | Softmax margin below which to escalate (default 0.3). | 0.3 |
| llm_judge_provider | str | "openai" or "anthropic". | '' |
| privacy_mode | bool | Redact PII (emails, phones, SSN-like patterns) before sending text to external LLM judge. | False |
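
The llm_judge_confidence_threshold parameter is described as a softmax margin. One plausible reading, sketched below, is the gap between the two most probable NLI classes. That definition of "margin", along with the names `softmax` and `should_escalate`, is an assumption made for this example; the library may compute it differently.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_escalate(logits: Sequence[float],
                    confidence_threshold: float = 0.3) -> bool:
    """Escalate when the top-1 vs top-2 probability gap is below the threshold."""
    probs = sorted(softmax(logits), reverse=True)
    margin = probs[0] - probs[1]
    return margin < confidence_threshold


print(should_escalate([2.0, 1.9, -1.0]))  # two near-tied classes: small margin
print(should_escalate([5.0, 0.0, 0.0]))   # one dominant class: large margin
```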

Initialise backend, cache, threshold, and escalation state.

The constructor preserves the historical keyword surface while routing each backend option into one review pipeline. It validates score bounds, soft-limit ordering, and divergence weights before creating model or cache state.

from_config classmethod

from_config(config: ScorerConfig, *, ground_truth_store: Any = None, cache: Any = None) -> CoherenceScorer

Build a scorer from a grouped ScorerConfig.

Value settings come from config; the runtime dependencies (ground_truth_store and cache) are injected separately so the config stays serialisable. Equivalent to the per-argument constructor.

close

close() -> None

Shut down internal thread pool.
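
Since close() takes no arguments, one way to guarantee it runs is the standard library's `contextlib.closing`. The sketch below uses a stand-in class, `FakeScorer`, which is hypothetical and exposes only the two methods needed for the demonstration; it is not the real CoherenceScorer.

```python
from contextlib import closing


class FakeScorer:
    """Hypothetical stand-in with the same close() shape as the scorer."""

    def __init__(self) -> None:
        self.closed = False

    def review(self, prompt: str, action: str):
        return True, 1.0  # placeholder result, not a real score

    def close(self) -> None:
        self.closed = True


scorer = FakeScorer()
with closing(scorer) as s:
    approved, _ = s.review("question", "answer")

# close() has been called on exit from the with-block, even if review() raised.
print(scorer.closed)
```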

__del__

__del__() -> None

Release the lazily-created parallel scoring pool during teardown.

enable_injection_detection

enable_injection_detection(injection_threshold: float = 0.7, drift_threshold: float = 0.6, injection_claim_threshold: float = 0.75, baseline_divergence: float = 0.4, stage1_weight: float = 0.3, require_model_backed_nli: bool = False, fail_closed_on_error: bool = False) -> None

Enable output-side injection detection on every review() call.

enable_adaptive_retrieval

enable_adaptive_retrieval(threshold: float = 0.5, default_retrieve: bool = True) -> None

Enable adaptive retrieval routing.

When enabled, non-factual queries (creative, conversational) skip KB retrieval entirely, saving latency and avoiding false KB matches on queries that do not need grounding.

calculate_factual_divergence

calculate_factual_divergence(prompt: str, text_output: str, tenant_id: str = '', *, _inner_agg: str | None = None, _outer_agg: str | None = None) -> float

Check output against the Ground Truth Store.

Returns 0.0 (aligned) to 1.0 (hallucinated). When strict_mode is True and NLI is unavailable, returns 0.9 (reject).
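
The strict-mode behaviour described above amounts to a three-way decision. The following sketch shows that control flow only; `divergence_or_fallback` and its callable parameters are hypothetical names for this example and do not mirror the library's internals.

```python
from typing import Callable, Optional

STRICT_REJECT_DIVERGENCE = 0.9  # value stated in the documentation above


def divergence_or_fallback(nli_score: Optional[Callable[[], float]],
                           heuristic: Callable[[], float],
                           strict_mode: bool) -> float:
    """Prefer NLI; otherwise reject in strict mode, else use the heuristic."""
    if nli_score is not None:
        return nli_score()
    if strict_mode:
        return STRICT_REJECT_DIVERGENCE
    return heuristic()


# NLI unavailable (None): strict mode returns the fixed rejecting value.
print(divergence_or_fallback(None, lambda: 0.2, strict_mode=True))
# NLI unavailable, strict mode off: the heuristic value is used instead.
print(divergence_or_fallback(None, lambda: 0.2, strict_mode=False))
```

If both signals return 0.9, the composite coherence under the default weights is 1 - 0.9 = 0.1, which is below the default threshold of 0.5.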

calculate_factual_divergence_with_evidence

calculate_factual_divergence_with_evidence(prompt: str, text_output: str, tenant_id: str = '', *, _inner_agg: str | None = None, _outer_agg: str | None = None) -> tuple[float, ScoringEvidence | None]

Like calculate_factual_divergence but also returns evidence.

calculate_logical_divergence

calculate_logical_divergence(prompt: str, text_output: str, *, _inner_agg: str | None = None, _outer_agg: str | None = None) -> float

Compute logical contradiction probability via NLI.

When strict_mode is True and NLI is unavailable, returns 0.9 (reject).

compute_divergence

compute_divergence(prompt: str, action: str) -> float

Compute composite divergence (lower is better).

Weighted sum: W_LOGIC * H_logical + W_FACT * H_factual.

review

review(prompt: str, action: str, session: Any | None = None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Score an action and decide whether to approve it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| prompt | str | Source prompt, user request, or retrieved question that frames the response under review. | required |
| action | str | Candidate model output to score against the prompt and any configured grounding store. | required |
| session | ConversationSession \| None | When provided, cross-turn divergence is blended into the logical score and the turn is recorded after scoring. | None |
| tenant_id | str | Tenant scope for cache keys and tenant-aware grounding stores. | '' |

review_batch

review_batch(items: list[tuple[str, str]], tenant_id: str = '') -> list[tuple[bool, CoherenceScore]]

Batch-review a list of (prompt, response) pairs.

When NLI is available, batches logical and factual divergence through NLIScorer.score_batch() (2 GPU forward passes total instead of 2*N). Falls back to sequential review() for items that need special handling (dialogue, summarization, rust backend, or when NLI is unavailable).
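
The "2 forward passes instead of 2*N" idea can be sketched without the library. Here `score_batch` is a hypothetical callable that takes a list of (premise, hypothesis) pairs and returns one contradiction probability per pair; the real NLIScorer.score_batch() signature is not shown in this document and may differ.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def batched_coherence(items: Sequence[Pair], contexts: Sequence[str],
                      score_batch: Callable[[List[Pair]], List[float]],
                      w_logic: float = 0.6, w_fact: float = 0.4) -> List[float]:
    """Score N items with two batched calls rather than 2*N single calls."""
    logical_pairs = [(prompt, response) for prompt, response in items]
    factual_pairs = [(ctx, response) for ctx, (_, response) in zip(contexts, items)]
    h_logical = score_batch(logical_pairs)  # batched call 1
    h_factual = score_batch(factual_pairs)  # batched call 2
    return [1.0 - (w_logic * hl + w_fact * hf)
            for hl, hf in zip(h_logical, h_factual)]


call_sizes: List[int] = []


def stub_score_batch(pairs: List[Pair]) -> List[float]:
    """Toy scorer that records how many pairs each call received."""
    call_sizes.append(len(pairs))
    return [0.9 if "Germany" in hypothesis else 0.1 for _, hypothesis in pairs]


items = [
    ("What is 2+2?", "The answer is 4."),
    ("Capital of France?", "Paris is in Germany."),
]
contexts = ["2+2 equals 4.", "Paris is the capital of France."]
scores = batched_coherence(items, contexts, stub_score_batch)
print(len(call_sizes), [round(s, 2) for s in scores])
```

The stub is invoked twice regardless of how many items are passed, which is the saving the batching path aims for.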

areview async

areview(prompt: str, action: str, session: Any | None = None, tenant_id: str = '') -> tuple[bool, CoherenceScore]

Async version of review() – offloads NLI inference to a thread pool.
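
The offloading pattern can be illustrated with the standard library alone. In this sketch `blocking_review` is a hypothetical stand-in for a synchronous, CPU- or GPU-bound scoring call, and `asyncio.to_thread` is one way to move it off the event loop; the library's own thread pool handling may be implemented differently.

```python
import asyncio
import time
from typing import Tuple


def blocking_review(prompt: str, action: str) -> Tuple[bool, float]:
    """Hypothetical blocking scorer; sleeps to mimic inference latency."""
    time.sleep(0.05)
    score = 0.9 if "Paris" in action else 0.2
    return score >= 0.5, score


async def areview_sketch(prompt: str, action: str) -> Tuple[bool, float]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_review, prompt, action)


async def main():
    # Both reviews run concurrently; gather preserves argument order.
    return await asyncio.gather(
        areview_sketch("Capital of France?", "Paris."),
        areview_sketch("Capital of France?", "Berlin."),
    )


results = asyncio.run(main())
print(results)
```

With the real scorer, the call shape given by the signature above would be `approved, score = await scorer.areview(prompt, action)`.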