Glossary

Key terms used throughout Director-AI documentation.


Balanced Accuracy (BA)
Macro-averaged recall across supported/not-supported classes. The standard metric for the LLM-AggreFact leaderboard. See Benchmarks.
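As a rough illustration of why macro-averaged recall differs from plain accuracy on imbalanced data, the sketch below computes BA by hand. The function name and the toy labels are made up for this example and are not part of Director-AI.

```python
def balanced_accuracy(y_true, y_pred):
    """Macro-averaged recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions of the samples whose true class is `label`.
        idx = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: 1 = supported, 0 = not supported.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

ba = balanced_accuracy(y_true, y_pred)  # mean of per-class recall (0.75 and 0.5)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"BA={ba:.3f} accuracy={accuracy:.3f}")
```

Each class contributes equally to BA regardless of how many samples it has, so a scorer cannot look good simply by favoring the majority class.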
Bidirectional NLI
Scoring the premise→hypothesis pair and the reverse (hypothesis→premise), then combining scores. Reduces false positives on paraphrases. See Threshold Tuning.
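A minimal sketch of the two-direction idea follows. The word-overlap scorer is only a stand-in for a real NLI model, and averaging the two directions is an assumption made for illustration; the glossary does not state which combination rule Director-AI applies.

```python
def toy_entailment(premise, hypothesis):
    """Stand-in for an NLI model: share of hypothesis words found in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


def bidirectional_score(premise, hypothesis, score_fn=toy_entailment,
                        combine=lambda fwd, rev: (fwd + rev) / 2):
    """Score both directions, then combine them (mean is an assumed default)."""
    forward = score_fn(premise, hypothesis)   # premise -> hypothesis
    reverse = score_fn(hypothesis, premise)   # hypothesis -> premise
    return combine(forward, reverse)


combined = bidirectional_score("the cat sat on the mat", "the cat sat")
print(round(combined, 3))
```

Passing a different `combine` callable (for example `min` or `max`) changes how strongly a one-sided match is penalized.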
Coherence Score
The final 0–1 output of CoherenceScorer.review(). Computed as 1 - (w_logic * H_logical + w_factual * H_factual). Higher means more coherent. See Scoring.
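The formula can be worked through directly. This is a sketch of the arithmetic only, not the CoherenceScorer implementation; the default weights below (0.6 and 0.4) are placeholder assumptions chosen so that they sum to 1.

```python
def coherence_score(h_logical, h_factual, w_logic=0.6, w_factual=0.4):
    """Apply 1 - (w_logic * H_logical + w_factual * H_factual).

    The weight defaults are illustrative assumptions, not documented values.
    """
    for name, h in (("h_logical", h_logical), ("h_factual", h_factual)):
        if not 0.0 <= h <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {h}")
    return 1.0 - (w_logic * h_logical + w_factual * h_factual)


# Low contradiction (0.2) and moderate lack of support (0.4).
score = coherence_score(h_logical=0.2, h_factual=0.4)
print(round(score, 2))  # roughly 0.72
```

With weights that sum to 1 and both entropies in the 0–1 range, the result also stays in 0–1, and it reaches 1.0 only when both entropies are 0.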
Contradiction
An NLI label indicating the hypothesis negates the premise. Director-AI treats high contradiction probability as evidence of hallucination. See NLI Backends.
DeBERTa
The transformer architecture (He et al., 2021) used by Director-AI's default NLI model. DeBERTa-v3-Large has 0.4B parameters. See Scoring.
DirectorConfig
Centralized configuration object for threshold, backend, cache, and device settings. Serializable to YAML/JSON. See Configuration.
Dual-Entropy Scoring
Director-AI's approach of combining two independent entropy signals — H_logical (NLI) and H_factual (RAG retrieval) — into a single coherence score.
Entailment
An NLI label indicating the hypothesis follows from the premise. High entailment probability means the LLM response is consistent with ground truth.
Evidence Chunk
A specific passage from the knowledge base returned with every rejection. Tells the user why the response was flagged. See Evidence & Fallback.
FactCG
The fine-tuned DeBERTa-v3-Large model from Li et al. (NAACL 2025) that Director-AI uses as its default NLI backend. 77.2% BA (paper) / 75.86% BA (our eval). See Benchmarks.
False-Halt Rate
Percentage of correct responses incorrectly halted by the streaming kernel. 0.0% on Wikipedia passages in heuristic mode. See Benchmarks — Streaming False-Halt.
False-Positive Rate (FPR)
Percentage of correct premise-hypothesis pairs incorrectly flagged as incoherent. See Benchmarks — False-Positive Rate.
Ground Truth Store
The knowledge base that CoherenceScorer checks LLM responses against. Implementations: GroundTruthStore (dict-based), VectorGroundTruthStore (vector DB). See KB Ingestion.
Guard
The top-level guard() function that wraps an LLM SDK client (OpenAI, Anthropic, etc.) with automatic coherence checking. See Quickstart.
H_factual
The retrieval-based entropy component. Measures how well the LLM response is supported by retrieved KB chunks. Range 0–1 (0 = fully supported).
H_logical
The NLI-based entropy component. Measures logical contradiction between the LLM response and ground truth. Range 0–1 (0 = no contradiction).
Hallucination
An LLM output that contradicts the provided ground truth or makes unsupported factual claims. Director-AI detects this via NLI + RAG, not via content moderation.
Hard Limit
The threshold value, used as a hard cutoff: responses scoring below it are rejected outright. See Threshold Tuning.
Heuristic Scoring
The fallback scoring mode (no NLI model loaded). Uses text overlap, entity matching, and n-gram similarity. Fast (<0.2 ms) but less accurate than NLI. See Scoring.
Hybrid Judge
A mode where NLI scoring is combined with an LLM-as-judge (GPT-4o-mini, Claude Sonnet, or local DeBERTa classifier) for higher catch rates. See Benchmarks.
Knowledge Base (KB)
Synonym for Ground Truth Store (see above). See KB Ingestion.
LLM-AggreFact
A 29,320-sample benchmark for factual consistency evaluation (Tang et al., 2024). Director-AI's primary accuracy benchmark. See Benchmarks.
MiniCheck
An alternative NLI model family (Tang et al., EMNLP 2024). Supported as a backend via scorer_backend="minicheck". See NLI Backends.
NLI (Natural Language Inference)
The task of determining whether a hypothesis is entailed, contradicted, or neutral given a premise. Director-AI uses NLI to detect factual inconsistency. See NLI Backends.
ONNX
Open Neural Network Exchange format. Director-AI exports DeBERTa to ONNX for faster inference via OnnxBackend. See ONNX Export.
Premise
In NLI, the known-true statement (from your KB). The LLM response is the hypothesis tested against it.
Hypothesis
In NLI, the statement being evaluated (the LLM's response). Tested against the premise (your KB facts).
RAG (Retrieval-Augmented Generation)
A pattern where an LLM's response is grounded in retrieved documents. Director-AI scores the output of RAG pipelines, not the retrieval itself.
Reranker
A cross-encoder model that re-scores retrieved chunks for relevance before they reach the scorer. See Vector Store API.
Scorer Backend
The engine that computes NLI scores. Options: deberta (default), onnx, minicheck, lite (heuristic), rust (PyO3 FFI). See NLI Backends.
Sliding Window
In streaming mode, the scorer evaluates the most recent N tokens rather than the full response. Keeps latency constant as responses grow. See Streaming Halt.
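One common way to keep only the most recent N tokens is a bounded deque. The sketch below assumes whitespace-joined string tokens and a hypothetical `window_size` parameter; it illustrates the windowing idea and is not Director-AI's streaming code.

```python
from collections import deque


def windowed_texts(tokens, window_size=8):
    """Yield the text of the most recent `window_size` tokens after each new token."""
    window = deque(maxlen=window_size)  # the oldest token is dropped automatically
    for token in tokens:
        window.append(token)
        yield " ".join(window)


windows = list(windowed_texts("a b c d".split(), window_size=2))
print(windows)
```

Because the window never grows beyond `window_size`, the amount of text handed to the scorer at each step is bounded no matter how long the response becomes.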
Soft Limit
The soft_limit setting, a second cutoff that sits above threshold. Scores between threshold and soft_limit trigger a warning but are not rejected. See Threshold Tuning.
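The three resulting zones can be sketched as a small decision function. The soft_limit value of 0.7, the outcome labels, and the handling of scores that fall exactly on a boundary are assumptions made for illustration.

```python
def classify(score, threshold=0.6, soft_limit=0.7):
    """Map a coherence score to one of three outcomes.

    threshold=0.6 follows the documented default; soft_limit=0.7 is assumed.
    """
    if score < threshold:
        return "reject"   # below the hard limit
    if score < soft_limit:
        return "warn"     # between threshold and soft_limit
    return "pass"


for s in (0.5, 0.65, 0.9):
    print(s, classify(s))
```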
Streaming Halt
Director-AI's signature feature: stopping LLM token generation mid-stream when coherence drops below threshold. See Streaming Halt.
Streaming Kernel
The StreamingKernel class that wraps a token stream and injects coherence checks every N tokens. See API — StreamingKernel.
Threshold
The coherence score cutoff (0–1). Responses below this are rejected. Domain-dependent: 0.55 for support, 0.6 default, 0.7+ for medical/legal. See Threshold Tuning.
Token-Level Scoring
Evaluating coherence incrementally as each token (or batch of tokens) arrives, rather than waiting for the complete response.
Trend Detection
In streaming mode, detecting a downward trend in coherence scores across consecutive windows — an early warning before the score crosses the threshold.
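One simple way to detect such a trend is to fit a least-squares slope to the last few window scores. This is a sketch of that general technique, not necessarily the method Director-AI uses; the `window` and `max_slope` parameters are hypothetical.

```python
def slope(scores):
    """Least-squares slope of scores against their index (0, 1, 2, ...)."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    numerator = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_trending_down(scores, window=4, max_slope=-0.03):
    """True when the last `window` scores fall by at least |max_slope| per step."""
    recent = scores[-window:]
    if len(recent) < window:
        return False  # not enough history yet
    return slope(recent) <= max_slope


falling = [0.90, 0.85, 0.80, 0.72]   # every score is still above 0.6
steady = [0.80, 0.82, 0.79, 0.81]
print(is_trending_down(falling), is_trending_down(steady))
```

In the falling series every score is still above the default threshold of 0.6, which is the situation an early warning is meant to catch.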
Vector Backend
The storage engine for VectorGroundTruthStore. Options: memory (in-process), chroma, faiss, qdrant, pinecone, weaviate, elasticsearch. See Vector Store API.