Glossary
Key terms used throughout Director-AI documentation.
- Balanced Accuracy (BA)
- Macro-averaged recall across supported/not-supported classes. The standard metric for the LLM-AggreFact leaderboard. See Benchmarks.
- Bidirectional NLI
- Scoring the premise→hypothesis pair and the reverse (hypothesis→premise), then combining scores. Reduces false positives on paraphrases. See Threshold Tuning.
- Coherence Score
- The final 0–1 output of CoherenceScorer.review(). Computed as 1 - (w_logic * H_logical + w_factual * H_factual). Higher means more coherent. See Scoring.
- Contradiction
- An NLI label indicating the hypothesis negates the premise. Director-AI treats high contradiction probability as evidence of hallucination. See NLI Backends.
- DeBERTa
- The transformer architecture (He et al., 2021) used by Director-AI's default NLI model. DeBERTa-v3-Large has 0.4B parameters. See Scoring.
- DirectorConfig
- Centralized configuration object for threshold, backend, cache, and device settings. Serializable to YAML/JSON. See Configuration.
- Dual-Entropy Scoring
- Director-AI's approach of combining two independent entropy signals — H_logical (NLI) and H_factual (RAG retrieval) — into a single coherence score.
- Entailment
- An NLI label indicating the hypothesis follows from the premise. High entailment probability means the LLM response is consistent with ground truth.
- Evidence Chunk
- A specific passage from the knowledge base returned with every rejection. Tells the user why the response was flagged. See Evidence & Fallback.
- FactCG
- The fine-tuned DeBERTa-v3-Large model from Lei et al. (NAACL 2025) that Director-AI uses as its default NLI backend. Reaches 77.2% BA in the paper and 75.86% BA in our evaluation. See Benchmarks.
- False-Halt Rate
- Percentage of correct responses incorrectly halted by the streaming kernel. 0.0% on Wikipedia passages in heuristic mode. See Benchmarks — Streaming False-Halt.
- False-Positive Rate (FPR)
- Percentage of correct premise-hypothesis pairs incorrectly flagged as incoherent. See Benchmarks — False-Positive Rate.
- Ground Truth Store
- The knowledge base that CoherenceScorer checks LLM responses against. Implementations: GroundTruthStore (dict-based), VectorGroundTruthStore (vector DB). See KB Ingestion.
- Guard
- The top-level guard() function that wraps an LLM SDK client (OpenAI, Anthropic, etc.) with automatic coherence checking. See Quickstart.
- H_factual
- The retrieval-based entropy component. Measures how well the LLM response is supported by retrieved KB chunks. Range 0–1 (0 = fully supported).
- H_logical
- The NLI-based entropy component. Measures logical contradiction between the LLM response and ground truth. Range 0–1 (0 = no contradiction).
- Hallucination
- An LLM output that contradicts the provided ground truth or makes unsupported factual claims. Director-AI detects this via NLI + RAG, not via content moderation.
- Hard Limit
- The threshold value. Responses scoring below this are rejected outright. See Threshold Tuning.
- Heuristic Scoring
- The fallback scoring mode (no NLI model loaded). Uses text overlap, entity matching, and n-gram similarity. Fast (<0.2 ms) but less accurate than NLI. See Scoring.
- Hybrid Judge
- A mode where NLI scoring is combined with an LLM-as-judge (GPT-4o-mini, Claude Sonnet, or local DeBERTa classifier) for higher catch rates. See Benchmarks.
- Knowledge Base (KB)
- Synonym for Ground Truth Store (see above). See KB Ingestion.
- LLM-AggreFact
- A 29,320-sample benchmark for factual consistency evaluation (Tang et al., 2024). Director-AI's primary accuracy benchmark. See Benchmarks.
- MiniCheck
- An alternative NLI model family (Tang et al., EMNLP 2024). Supported as a backend via scorer_backend="minicheck". See NLI Backends.
- NLI (Natural Language Inference)
- The task of determining whether a hypothesis is entailed, contradicted, or neutral given a premise. Director-AI uses NLI to detect factual inconsistency. See NLI Backends.
- ONNX
- Open Neural Network Exchange format. Director-AI exports DeBERTa to ONNX for faster inference via OnnxBackend. See ONNX Export.
- Premise
- In NLI, the known-true statement (from your KB). The LLM response is the hypothesis tested against it.
- Hypothesis
- In NLI, the statement being evaluated (the LLM's response). Tested against the premise (your KB facts).
- RAG (Retrieval-Augmented Generation)
- A pattern where an LLM's response is grounded in retrieved documents. Director-AI scores the output of RAG pipelines, not the retrieval itself.
- Reranker
- A cross-encoder model that re-scores retrieved chunks for relevance before they reach the scorer. See Vector Store API.
- Scorer Backend
- The engine that computes NLI scores. Options: deberta (default), onnx, minicheck, lite (heuristic), rust (PyO3 FFI). See NLI Backends.
- Sliding Window
- In streaming mode, the scorer evaluates the most recent N tokens rather than the full response. Keeps latency constant as responses grow. See Streaming Halt.
- Soft Limit
- The soft_limit threshold. Scores between threshold and soft_limit trigger a warning but are not rejected. See Threshold Tuning.
- Streaming Halt
- Director-AI's signature feature: stopping LLM token generation mid-stream when coherence drops below threshold. See Streaming Halt.
- Streaming Kernel
- The StreamingKernel class that wraps a token stream and injects coherence checks every N tokens. See API — StreamingKernel.
- Threshold
- The coherence score cutoff (0–1). Responses below this are rejected. Domain-dependent: 0.55 for support, 0.6 default, 0.7+ for medical/legal. See Threshold Tuning.
- Token-Level Scoring
- Evaluating coherence incrementally as each token (or batch of tokens) arrives, rather than waiting for the complete response.
- Trend Detection
- In streaming mode, detecting a downward trend in coherence scores across consecutive windows — an early warning before the score crosses the threshold.
- Vector Backend
- The storage engine for VectorGroundTruthStore. Options: memory (in-process), chroma, faiss, qdrant, pinecone, weaviate, elasticsearch. See Vector Store API.
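
To make the Coherence Score, Hard Limit, and Soft Limit entries concrete, here is a minimal sketch of how the formula and the two cutoffs might fit together. It is not Director-AI's implementation: the function names coherence_score and classify are hypothetical, the weights 0.6 and 0.4 and the soft limit of 0.7 are illustrative assumptions, and only the 0.6 default threshold comes from this glossary.

```python
def coherence_score(h_logical, h_factual, w_logic=0.6, w_factual=0.4):
    """Combine the two entropy signals as 1 - (w_logic * H_logical + w_factual * H_factual).

    The weight values are assumptions for illustration, not documented defaults.
    """
    for name, value in (("h_logical", h_logical), ("h_factual", h_factual)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 1.0 - (w_logic * h_logical + w_factual * h_factual)


def classify(score, threshold=0.6, soft_limit=0.7):
    """Map a coherence score to an action (hypothetical helper).

    Below threshold: reject. Between threshold and soft_limit: warn.
    At or above soft_limit: pass. The soft_limit value is an assumption.
    """
    if score < threshold:
        return "reject"
    if score < soft_limit:
        return "warn"
    return "pass"


if __name__ == "__main__":
    # No contradiction and full KB support give the maximum score.
    print(coherence_score(0.0, 0.0), classify(coherence_score(0.0, 0.0)))
    # Strong contradiction with weak support falls below the hard limit.
    print(classify(coherence_score(0.9, 0.8)))
```

Because both entropy terms are 0 when the response is fully supported and free of contradiction, the score is highest for grounded responses and falls as either signal rises.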