NLI Backends¶
Natural Language Inference scorer using FactCG-DeBERTa-v3-Large (75.8% balanced accuracy on AggreFact). Used internally by CoherenceScorer — direct use is only needed for custom pipelines or benchmarking.
Usage¶
```python
from director_ai.core.nli import NLIScorer, nli_available

if nli_available():
    nli = NLIScorer()
    divergence = nli.score(
        "Paris is the capital of France.",
        "Berlin is the capital of France.",
    )
    print(f"Divergence: {divergence:.3f}")  # ~0.85 (high contradiction)
```
NLIScorer¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"yaxili96/FactCG-DeBERTa-v3-Large"` | HuggingFace model ID |
| `device` | `str \| None` | `None` | Torch device (`"cuda"`, `"cpu"`) |
| `quantize_8bit` | `bool` | `False` | 8-bit quantization |
| `torch_dtype` | `str \| None` | `None` | `"float16"`, `"bfloat16"` |
| `backend` | `str` | `"deberta"` | `"deberta"`, `"onnx"`, `"minicheck"`, `"lite"` |
| `use_model` | `bool` | `True` | `False` = heuristic-only mode |
Methods¶
- `score(premise, hypothesis) -> float` — NLI divergence in [0, 1]
- `score_batch(pairs) -> list[float]` — batch inference (2 GPU kernels)
- `score_chunked(premise, hypothesis) -> tuple[float, list[float]]` — sentence-level with max-aggregation; returns `(aggregate_score, per_chunk_scores)`
- `score_claim_coverage(source, summary, support_threshold=0.6) -> tuple[float, list[float], list[str]]` — per-claim coverage; returns `(coverage, per_claim_divergences, claims)`
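The max-aggregation behaviour of `score_chunked` can be illustrated with a minimal sketch. This is not the library's implementation — the real scorer splits on proper sentence boundaries and batches chunk pairs through the model — but it shows why one contradicted sentence dominates the aggregate. The `pair_scorer` callable here is a hypothetical stand-in for the NLI model:

```python
def score_chunked_sketch(premise, hypothesis, pair_scorer):
    """Score each hypothesis sentence against the premise; aggregate with max."""
    chunks = [s.strip() for s in hypothesis.split(".") if s.strip()]
    per_chunk = [pair_scorer(premise, c) for c in chunks]
    # Max-aggregation: a single contradicted sentence flags the whole hypothesis.
    return max(per_chunk), per_chunk

# Toy scorer for illustration: flags any chunk mentioning "Berlin".
toy = lambda p, h: 0.9 if "Berlin" in h else 0.1

agg, per_chunk = score_chunked_sketch(
    "Paris is the capital of France.",
    "France is in Europe. Berlin is the capital of France.",
    toy,
)
print(agg, per_chunk)  # 0.9 [0.1, 0.9]
```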
nli_available()¶
```python
from director_ai.core.nli import nli_available

if nli_available():
    print("torch + transformers installed — NLI model available")
```
Returns True if torch and transformers are importable.
Backend Comparison¶
| Backend | Model | Latency (GPU batch) | Accuracy | VRAM |
|---|---|---|---|---|
| `deberta` | FactCG-DeBERTa-v3-Large | 19 ms/pair | 75.8% BA | ~1.5 GB |
| `onnx` | Same (exported) | 14.6 ms/pair | 75.8% BA | ~1.2 GB |
| `minicheck` | MiniCheck-DeBERTa-L | ~60 ms/pair | 72.6% BA | ~400 MB |
| `lite` | word-overlap heuristic | <0.5 ms/pair | ~65% BA | 0 |
Full API¶
director_ai.core.scoring.nli.NLIScorer¶

```python
NLIScorer(use_model: bool = True, max_length: int = 512, model_name: str | None = None, backend: str = 'deberta', quantize_8bit: bool = False, device: str | None = None, torch_dtype: str | None = None, onnx_path: str | None = None, onnx_batch_size: int = 16, onnx_flush_timeout_ms: float = 10.0, cost_per_token: float = _DEFAULT_COST_PER_TOKEN, lora_adapter_path: str | None = None)
```
NLI-based logical divergence scorer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `use_model` | `bool` | Attempt to load model on first `score()`. | `True` |
| `max_length` | `int` | Max token length for NLI input. | `512` |
| `model_name` | `str \| None` | HuggingFace model ID or local path. | `None` |
| `backend` | `str \| ScorerBackend` | `"deberta"`, `"onnx"`, `"minicheck"`, `"lite"`, or a `ScorerBackend` instance. | `'deberta'` |
| `quantize_8bit` | `bool` | 8-bit quantization (requires bitsandbytes). | `False` |
| `device` | `str \| None` | Torch device (`"cpu"`, `"cuda"`, `"cuda:0"`). | `None` |
| `torch_dtype` | `str \| None` | `"float16"`, `"bfloat16"`, or `"float32"`. | `None` |
| `onnx_path` | `str \| None` | Directory with exported ONNX model. | `None` |
score¶
Compute logical divergence between premise and hypothesis.
Returns float in [0, 1]: 0 = entailment, 1 = contradiction.
ascore (async)¶
Async score() — runs inference in a thread pool.
score_batch¶
Score multiple (premise, hypothesis) pairs.
Uses a single batched forward pass when a model backend is available (3-5x faster than sequential scoring).
ascore_batch (async)¶
Async batch scoring — runs in a thread pool.
score_chunked¶

```python
score_chunked(premise: str, hypothesis: str, outer_agg: str = 'max', inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]
```
Bidirectional chunked scoring for long premises and hypotheses.
Returns (aggregated_score, per_hypothesis_chunk_scores).
score_batch_with_confidence¶
Score pairs and return (divergence, confidence) tuples.
Confidence is 1 - entropy of the softmax distribution, normalised to [0, 1]. High confidence = model is certain about its prediction.
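The confidence measure described above is straightforward to compute. A minimal sketch, assuming a 3-class NLI softmax distribution (entailment, neutral, contradiction) and normalising Shannon entropy by `log(num_classes)` so the result lands in [0, 1]:

```python
import math

def confidence(probs: list[float]) -> float:
    """1 - normalised Shannon entropy of a softmax distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    return 1.0 - entropy / max_entropy

print(confidence([0.98, 0.01, 0.01]))  # near 1: model is certain
print(confidence([1/3, 1/3, 1/3]))     # 0.0: maximally uncertain
```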
score_chunked_confidence_weighted¶

```python
score_chunked_confidence_weighted(premise: str, hypothesis: str, inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]
```
Chunked scoring with confidence-weighted outer aggregation.
Instead of max/mean over hypothesis chunks, weights each chunk's divergence by the model's confidence (1 - normalised entropy). Uncertain chunks contribute less to the aggregate.
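The weighted aggregation amounts to a confidence-normalised average. A minimal sketch of that arithmetic (not the library's code; the real method derives the weights from per-chunk model entropy as described above):

```python
def confidence_weighted_agg(divergences: list[float], confidences: list[float]) -> float:
    """Weight each chunk's divergence by model confidence; uncertain chunks count less."""
    total = sum(confidences)
    if total == 0:
        return sum(divergences) / len(divergences)  # degenerate case: plain mean
    return sum(d * c for d, c in zip(divergences, confidences)) / total

# Same divergences, different confidence: an uncertain contradiction (conf 0.1)
# pulls the aggregate down far less than a confident one (conf 0.9).
print(confidence_weighted_agg([0.9, 0.1], [0.9, 0.9]))
print(confidence_weighted_agg([0.9, 0.1], [0.1, 0.9]))
```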
decompose_claims¶
Split text into individual claim sentences.
score_decomposed¶
Score each claim in hypothesis independently against premise.
Returns (max_score, per_claim_scores).
score_claim_coverage¶

```python
score_claim_coverage(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str]]
```
Decompose summary into claims and compute coverage against source.
A claim is "supported" when its NLI divergence < support_threshold. Coverage = supported_claims / total_claims.
For long sources, each claim is scored with chunked NLI so that at least one source chunk can provide evidence.
Returns (coverage, per_claim_divergences, claims).
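The coverage ratio itself is simple arithmetic once the per-claim divergences are in hand. A sketch of just that final step (claim decomposition and chunked NLI scoring are omitted):

```python
def claim_coverage(divergences: list[float], support_threshold: float = 0.6) -> float:
    """Coverage = fraction of claims whose NLI divergence is below the threshold."""
    if not divergences:
        return 0.0
    supported = sum(1 for d in divergences if d < support_threshold)
    return supported / len(divergences)

# Two well-supported claims (0.1, 0.3) and one unsupported claim (0.8).
print(claim_coverage([0.1, 0.3, 0.8]))  # 2/3 ≈ 0.667
```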
score_claim_coverage_with_attribution¶

```python
score_claim_coverage_with_attribution(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str], list]
```
Like score_claim_coverage but also returns sentence-level attributions.
For each claim, finds the source sentence with lowest divergence (best evidence match). Returns list of ClaimAttribution objects.
director_ai.core.scoring.nli.nli_available¶
Check whether torch + transformers are importable.