NLI Backends

Natural Language Inference scorer using FactCG-DeBERTa-v3-Large (75.8% balanced accuracy on AggreFact). Used internally by CoherenceScorer — direct use is only needed for custom pipelines or benchmarking.

Usage

```python
from director_ai.core.nli import NLIScorer, nli_available

if nli_available():
    nli = NLIScorer()
    divergence = nli.score("Paris is the capital of France.", "Berlin is the capital of France.")
    print(f"Divergence: {divergence:.3f}")  # ~0.85 (high contradiction)
```

NLIScorer

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "yaxili96/FactCG-DeBERTa-v3-Large" | HuggingFace model ID |
| device | str \| None | None | Torch device ("cuda", "cpu") |
| quantize_8bit | bool | False | 8-bit quantization |
| torch_dtype | str \| None | None | "float16", "bfloat16" |
| backend | str | "deberta" | "deberta", "onnx", "minicheck", "lite" |
| use_model | bool | True | False = heuristic-only mode |

Methods

  • score(premise, hypothesis) -> float — NLI divergence in [0, 1]
  • score_batch(pairs) -> list[float] — batch inference (single batched forward pass)
  • score_chunked(premise, hypothesis) -> tuple[float, list[float]] — sentence-level with max-aggregation; returns (aggregate_score, per_chunk_scores)
  • score_claim_coverage(source, summary, support_threshold=0.6) -> tuple[float, list[float], list[str]] — per-claim coverage; returns (coverage, per_claim_divergences, claims)

nli_available()

```python
from director_ai.core.nli import nli_available

if nli_available():
    print("torch + transformers installed — NLI model available")
```

Returns True if torch and transformers are importable.

Backend Comparison

| Backend | Model | Latency (GPU batch) | Accuracy | VRAM |
|---|---|---|---|---|
| deberta | FactCG-DeBERTa-v3-Large | 19 ms/pair | 75.8% BA | ~1.5 GB |
| onnx | Same (exported) | 14.6 ms/pair | 75.8% BA | ~1.2 GB |
| minicheck | MiniCheck-DeBERTa-L | ~60 ms/pair | 72.6% BA | ~400 MB |
| lite | word-overlap heuristic | <0.5 ms/pair | ~65% BA | 0 |
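The lite backend relies on a word-overlap heuristic rather than a model. A minimal stand-alone sketch of such a heuristic (hypothetical; the actual implementation in `director_ai` may normalise and weight tokens differently):

```python
def lite_divergence(premise: str, hypothesis: str) -> float:
    """Crude word-overlap divergence: 1 minus the fraction of
    hypothesis words that also appear in the premise. A sketch of
    what a 'lite' heuristic backend might do, not the library's code."""
    p_words = set(premise.lower().split())
    h_words = set(hypothesis.lower().split())
    if not h_words:
        return 0.0  # empty hypothesis: nothing to contradict
    overlap = len(h_words & p_words) / len(h_words)
    return 1.0 - overlap
```

This explains both the near-zero latency and the accuracy gap: the heuristic never sees word order or negation, so it trades recall of real contradictions for speed.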

Full API

director_ai.core.scoring.nli.NLIScorer

NLIScorer(use_model: bool = True, max_length: int = 512, model_name: str | None = None, backend: str = 'deberta', quantize_8bit: bool = False, device: str | None = None, torch_dtype: str | None = None, onnx_path: str | None = None, onnx_batch_size: int = 16, onnx_flush_timeout_ms: float = 10.0, cost_per_token: float = _DEFAULT_COST_PER_TOKEN, lora_adapter_path: str | None = None)

NLI-based logical divergence scorer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| use_model | bool | Attempt to load model on first score(). | True |
| max_length | int | Max token length for NLI input. | 512 |
| model_name | str \| None | HuggingFace model ID or local path. | None |
| backend | str \| ScorerBackend | "deberta", "onnx", "minicheck", "lite", or a ScorerBackend instance. | 'deberta' |
| quantize_8bit | bool | 8-bit quantization (requires bitsandbytes). | False |
| device | str \| None | Torch device ("cpu", "cuda", "cuda:0"). | None |
| torch_dtype | str \| None | "float16", "bfloat16", or "float32". | None |
| onnx_path | str \| None | Directory with exported ONNX model. | None |

score

score(premise: str, hypothesis: str) -> float

Compute logical divergence between premise and hypothesis.

Returns float in [0, 1]: 0 = entailment, 1 = contradiction.

ascore async

ascore(premise: str, hypothesis: str) -> float

Async score() — runs inference in a thread pool.

score_batch

score_batch(pairs: list[tuple[str, str]]) -> list[float]

Score multiple (premise, hypothesis) pairs.

Uses a single batched forward pass when a model backend is available (3-5x faster than sequential scoring).

ascore_batch async

ascore_batch(pairs: list[tuple[str, str]]) -> list[float]

Async batch scoring — runs in a thread pool.

score_chunked

score_chunked(premise: str, hypothesis: str, outer_agg: str = 'max', inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]

Bidirectional chunked scoring for long premises and hypotheses.

Returns (aggregated_score, per_hypothesis_chunk_scores).
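The aggregation mechanics can be sketched with plain Python, assuming chunking has already happened and a per-pair scorer is given (a hypothetical stand-in for the library's internals):

```python
def chunked_aggregate(score, premise_chunks, hypothesis_chunks,
                      inner_agg=max, outer_agg=max):
    """Sketch of chunked NLI aggregation (not the library's code):
    each hypothesis chunk is scored against every premise chunk and
    reduced with inner_agg; the per-hypothesis-chunk scores are then
    reduced with outer_agg, mirroring the (aggregate, per_chunk) return."""
    per_chunk = [
        inner_agg(score(p, h) for p in premise_chunks)
        for h in hypothesis_chunks
    ]
    return outer_agg(per_chunk), per_chunk
```

With `inner_agg=min`, a hypothesis chunk counts as consistent if any premise chunk supports it; with the default `max`, a single contradicting premise chunk dominates.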

score_batch_with_confidence

score_batch_with_confidence(pairs: list[tuple[str, str]]) -> list[tuple[float, float]]

Score pairs and return (divergence, confidence) tuples.

Confidence is 1 - entropy of the softmax distribution, normalised to [0, 1]. High confidence = model is certain about its prediction.
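The confidence formula stated above is straightforward to reproduce from a softmax probability vector (a sketch of the documented formula, not the library's code):

```python
import math

def confidence_from_probs(probs):
    """Confidence = 1 - normalised entropy of a softmax distribution.
    Entropy is normalised by log(K) so the result lies in [0, 1].
    Assumes probs is non-empty and sums to 1."""
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(k)
```

A one-hot distribution (the model is certain) gives confidence 1.0; a uniform distribution (the model is guessing) gives 0.0.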

score_chunked_confidence_weighted

score_chunked_confidence_weighted(premise: str, hypothesis: str, inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]

Chunked scoring with confidence-weighted outer aggregation.

Instead of max/mean over hypothesis chunks, weights each chunk's divergence by the model's confidence (1 - normalised entropy). Uncertain chunks contribute less to the aggregate.
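The weighting described above amounts to a confidence-weighted mean over chunk divergences. A minimal sketch, assuming per-chunk (divergence, confidence) pairs are already available (the fallback for the all-zero-confidence case is an assumption, not documented behaviour):

```python
def confidence_weighted_aggregate(chunk_scores):
    """chunk_scores: list of (divergence, confidence) pairs.
    Returns the confidence-weighted mean of the divergences, so
    uncertain chunks contribute less. Sketch only, not the library's code."""
    total = sum(c for _, c in chunk_scores)
    if total == 0.0:
        # assumption: if every chunk is fully uncertain, fall back to a plain mean
        return sum(d for d, _ in chunk_scores) / len(chunk_scores)
    return sum(d * c for d, c in chunk_scores) / total
```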

decompose_claims

decompose_claims(text: str) -> list[str]

Split text into individual claim sentences.

score_decomposed

score_decomposed(premise: str, hypothesis: str) -> tuple[float, list[float]]

Score each claim in hypothesis independently against premise.

Returns (max_score, per_claim_scores).

score_claim_coverage

score_claim_coverage(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str]]

Decompose summary into claims and compute coverage against source.

A claim is "supported" when its NLI divergence < support_threshold. Coverage = supported_claims / total_claims.

For long sources, each claim is scored with chunked NLI so that at least one source chunk can provide evidence.

Returns (coverage, per_claim_divergences, claims).
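The coverage arithmetic itself is simple; a sketch of the documented formula, given the per-claim divergences (the empty-summary return value is an assumption, labelled below):

```python
def claim_coverage(per_claim_divergences, support_threshold=0.6):
    """Coverage = supported_claims / total_claims, where a claim counts
    as supported when its NLI divergence is below the threshold.
    Sketch of the documented formula, not the library's code."""
    if not per_claim_divergences:
        return 1.0  # assumption: a summary with no claims is vacuously covered
    supported = sum(1 for d in per_claim_divergences if d < support_threshold)
    return supported / len(per_claim_divergences)
```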

score_claim_coverage_with_attribution

score_claim_coverage_with_attribution(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str], list]

Like score_claim_coverage but also returns sentence-level attributions.

For each claim, finds the source sentence with lowest divergence (best evidence match). Returns list of ClaimAttribution objects.

director_ai.core.scoring.nli.nli_available

nli_available() -> bool

Check whether torch + transformers are importable.