NLI Backends

Natural Language Inference scorer using FactCG-DeBERTa-v3-Large (75.8% balanced accuracy on AggreFact). Used internally by CoherenceScorer — direct use is only needed for custom pipelines or benchmarking.

Usage

```python
from director_ai.core.nli import NLIScorer, nli_available

if nli_available():
    nli = NLIScorer()
    divergence = nli.score("Paris is the capital of France.", "Berlin is the capital of France.")
    print(f"Divergence: {divergence:.3f}")  # ~0.85 (high contradiction)
```

NLIScorer

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "yaxili96/FactCG-DeBERTa-v3-Large" | HuggingFace model ID |
| device | str \| None | None | Torch device ("cuda", "cpu") |
| quantize_8bit | bool | False | 8-bit quantization |
| torch_dtype | str \| None | None | "float16", "bfloat16" |
| backend | str | "deberta" | "deberta", "onnx", "minicheck", "lite" |
| use_model | bool | True | False = heuristic-only mode |

Methods

  • score(premise, hypothesis) -> float — NLI divergence in [0, 1]
  • score_batch(pairs) -> list[float] — batch inference (single batched forward pass)
  • score_chunked(premise, hypothesis) -> tuple[float, list[float]] — sentence-level with max-aggregation; returns (aggregate_score, per_chunk_scores)
  • score_claim_coverage(source, summary, support_threshold=0.6) -> tuple[float, list[float], list[str]] — per-claim coverage; returns (coverage, per_claim_divergences, claims)

nli_available()

```python
from director_ai.core.nli import nli_available

if nli_available():
    print("torch + transformers installed — NLI model available")
```

Returns True if torch and transformers are importable.

Backend Comparison

| Backend | Model | Latency (GPU batch) | Accuracy | VRAM |
|---|---|---|---|---|
| deberta | FactCG-DeBERTa-v3-Large | 19 ms/pair | 75.8% BA | ~1.5 GB |
| onnx | Same (exported) | 14.6 ms/pair | 75.8% BA | ~1.2 GB |
| minicheck | MiniCheck-DeBERTa-L | ~60 ms/pair | 72.6% BA | ~400 MB |
| lite | word-overlap heuristic | <0.5 ms/pair | ~65% BA | 0 |
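The lite backend relies on a word-overlap heuristic rather than a model. A minimal stand-alone sketch of such a heuristic (hypothetical; the actual implementation in `director_ai` may normalise and weight tokens differently):

```python
def lite_divergence(premise: str, hypothesis: str) -> float:
    """Crude word-overlap divergence: 1 minus the fraction of
    hypothesis words that also appear in the premise. A sketch of
    what a 'lite' heuristic backend might do, not the library's code."""
    p_words = set(premise.lower().split())
    h_words = set(hypothesis.lower().split())
    if not h_words:
        return 0.0  # empty hypothesis: nothing to contradict
    overlap = len(h_words & p_words) / len(h_words)
    return 1.0 - overlap
```

This explains both the near-zero latency and the accuracy gap: the heuristic never sees word order or negation, so it trades recall of real contradictions for speed.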

Full API

director_ai.core.scoring.nli.NLIScorer

NLIScorer(use_model: bool = True, max_length: int = 512, model_name: str | None = None, backend: str = 'deberta', quantize_8bit: bool = False, device: str | None = None, torch_dtype: str | None = None, onnx_path: str | None = None, onnx_batch_size: int = 16, onnx_flush_timeout_ms: float = 10.0, cost_per_token: float = _DEFAULT_COST_PER_TOKEN, lora_adapter_path: str | None = None)

NLI-based logical divergence scorer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| use_model | bool | Attempt to load model on first score(). | True |
| max_length | int | Max token length for NLI input. | 512 |
| model_name | str \| None | HuggingFace model ID or local path. | None |
| backend | str \| ScorerBackend | "deberta", "onnx", "minicheck", "lite", or a ScorerBackend instance. | 'deberta' |
| quantize_8bit | bool | 8-bit quantization (requires bitsandbytes). | False |
| device | str \| None | Torch device ("cpu", "cuda", "cuda:0"). | None |
| torch_dtype | str \| None | "float16", "bfloat16", or "float32". | None |
| onnx_path | str \| None | Directory with exported ONNX model. | None |

score

score(premise: str, hypothesis: str) -> float

Compute logical divergence between premise and hypothesis.

Returns float in [0, 1]: 0 = entailment, 1 = contradiction.

ascore async

ascore(premise: str, hypothesis: str) -> float

Async score() — runs inference in a thread pool.

score_batch

score_batch(pairs: list[tuple[str, str]]) -> list[float]

Score multiple (premise, hypothesis) pairs.

Uses a single batched forward pass when a model backend is available (3-5x faster than sequential scoring).

ascore_batch async

ascore_batch(pairs: list[tuple[str, str]]) -> list[float]

Async batch scoring — runs in a thread pool.

score_chunked

score_chunked(premise: str, hypothesis: str, outer_agg: str = 'max', inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]

Bidirectional chunked scoring for long premises and hypotheses.

Returns (aggregated_score, per_hypothesis_chunk_scores).
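The aggregation mechanics can be sketched with plain Python, assuming chunking has already happened and a per-pair scorer is given (a hypothetical stand-in for the library's internals):

```python
def chunked_aggregate(score, premise_chunks, hypothesis_chunks,
                      inner_agg=max, outer_agg=max):
    """Sketch of chunked NLI aggregation (not the library's code):
    each hypothesis chunk is scored against every premise chunk and
    reduced with inner_agg; the per-hypothesis-chunk scores are then
    reduced with outer_agg, mirroring the (aggregate, per_chunk) return."""
    per_chunk = [
        inner_agg(score(p, h) for p in premise_chunks)
        for h in hypothesis_chunks
    ]
    return outer_agg(per_chunk), per_chunk
```

With `inner_agg=min`, a hypothesis chunk counts as consistent if any premise chunk supports it; with the default `max`, a single contradicting premise chunk dominates.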

score_batch_with_confidence

score_batch_with_confidence(pairs: list[tuple[str, str]]) -> list[tuple[float, float]]

Score pairs and return (divergence, confidence) tuples.

Confidence is 1 - entropy of the softmax distribution, normalised to [0, 1]. High confidence = model is certain about its prediction.
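The confidence formula stated above is straightforward to reproduce from a softmax probability vector (a sketch of the documented formula, not the library's code):

```python
import math

def confidence_from_probs(probs):
    """Confidence = 1 - normalised entropy of a softmax distribution.
    Entropy is normalised by log(K) so the result lies in [0, 1].
    Assumes probs is non-empty and sums to 1."""
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(k)
```

A one-hot distribution (the model is certain) gives confidence 1.0; a uniform distribution (the model is guessing) gives 0.0.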

score_chunked_confidence_weighted

score_chunked_confidence_weighted(premise: str, hypothesis: str, inner_agg: str = 'max', premise_ratio: float = 0.4, overlap_ratio: float = 0.0) -> tuple[float, list[float]]

Chunked scoring with confidence-weighted outer aggregation.

Instead of max/mean over hypothesis chunks, weights each chunk's divergence by the model's confidence (1 - normalised entropy). Uncertain chunks contribute less to the aggregate.
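The weighting described above amounts to a confidence-weighted mean over chunk divergences. A minimal sketch, assuming per-chunk (divergence, confidence) pairs are already available (the fallback for the all-zero-confidence case is an assumption, not documented behaviour):

```python
def confidence_weighted_aggregate(chunk_scores):
    """chunk_scores: list of (divergence, confidence) pairs.
    Returns the confidence-weighted mean of the divergences, so
    uncertain chunks contribute less. Sketch only, not the library's code."""
    total = sum(c for _, c in chunk_scores)
    if total == 0.0:
        # assumption: if every chunk is fully uncertain, fall back to a plain mean
        return sum(d for d, _ in chunk_scores) / len(chunk_scores)
    return sum(d * c for d, c in chunk_scores) / total
```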

decompose_claims

decompose_claims(text: str) -> list[str]

Split text into individual claim sentences.

score_decomposed

score_decomposed(premise: str, hypothesis: str) -> tuple[float, list[float]]

Score each claim in hypothesis independently against premise.

Returns (max_score, per_claim_scores).

score_claim_coverage

score_claim_coverage(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str]]

Decompose summary into claims and compute coverage against source.

A claim is "supported" when its NLI divergence < support_threshold. Coverage = supported_claims / total_claims.

For long sources, each claim is scored with chunked NLI so that at least one source chunk can provide evidence.

Returns (coverage, per_claim_divergences, claims).
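The coverage arithmetic itself is simple; a sketch of the documented formula, given the per-claim divergences (the empty-summary return value is an assumption, labelled below):

```python
def claim_coverage(per_claim_divergences, support_threshold=0.6):
    """Coverage = supported_claims / total_claims, where a claim counts
    as supported when its NLI divergence is below the threshold.
    Sketch of the documented formula, not the library's code."""
    if not per_claim_divergences:
        return 1.0  # assumption: a summary with no claims is vacuously covered
    supported = sum(1 for d in per_claim_divergences if d < support_threshold)
    return supported / len(per_claim_divergences)
```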

score_claim_coverage_with_attribution

score_claim_coverage_with_attribution(source: str, summary: str, support_threshold: float = 0.6) -> tuple[float, list[float], list[str], list]

Like score_claim_coverage but also returns sentence-level attributions.

For each claim, finds the source sentence with lowest divergence (best evidence match). Returns list of ClaimAttribution objects.

director_ai.core.scoring.nli.nli_available

nli_available() -> bool

Check whether torch + transformers are importable.