# Architecture

## Component Overview
```mermaid
graph TD
    subgraph "CoherenceScorer"
        NLI["NLIScorer<br/>(DeBERTa / ONNX / MiniCheck)"]
        GTS["GroundTruthStore<br/>(keyword + vector retrieval)"]
        LLM["LLM Judge<br/>(OpenAI / Anthropic)"]
        Cache["ScoreCache<br/>(LRU + TTL)"]
    end
    subgraph "StreamingKernel"
        Window["Sliding Window<br/>(avg, trend, hard limit)"]
        Halt["Halt Decision<br/>(hard / soft mode)"]
    end
    subgraph "VerifiedScorer"
        SentMatch["Sentence Matching<br/>(NLI pair-wise)"]
        Signals["5 Signals<br/>(NLI, Entity, Number, Negation, Traceability)"]
        Verdicts["Per-Claim Verdicts<br/>(supported / contradicted / fabricated)"]
    end
    subgraph "VectorBackend"
        IMB["InMemoryBackend"]
        Chroma["ChromaBackend"]
        ST["SentenceTransformerBackend"]
        Hybrid["HybridBackend<br/>(BM25 + Dense + RRF)"]
        ColBERT["ColBERTBackend<br/>(late interaction)"]
        Pinecone["PineconeBackend"]
        Reranker["RerankedBackend<br/>(cross-encoder wrapper)"]
    end

    Prompt["Prompt + Response"] --> CoherenceScorer
    Prompt --> |"/v1/verify"| VerifiedScorer
    VerifiedScorer --> SentMatch --> Signals --> Verdicts
    CoherenceScorer --> NLI
    CoherenceScorer --> GTS
    CoherenceScorer --> LLM
    CoherenceScorer --> Cache
    GTS --> VectorBackend
    CoherenceScorer --> |score| Halt
    TokenStream["Token Stream"] --> Window
    Window --> |coherence_callback| CoherenceScorer
    Window --> Halt
    Halt --> |approved / halted| Output["Output"]
```
## Data Flow

- Prompt arrives at `CoherenceScorer.review(prompt, response)`
- Cache check: if `(prompt, response)` was recently scored, return the cached result
- Logical divergence: NLI scores `(prompt, response)` for contradiction probability
- Factual divergence: `GroundTruthStore.retrieve_context(prompt)` fetches KB facts, then NLI scores `(context, response)`
- Composite score: `coherence = 1 - (W_LOGIC * H_logical + W_FACT * H_factual)`
- LLM judge (optional): if NLI confidence is low (or in `hybrid` mode), escalate to LLM-as-judge and blend scores
- Gate: `approved = coherence >= threshold`
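The composite score and gate above can be sketched in a few lines. This is an illustrative sketch, not the library's implementation: `W_LOGIC` and `W_FACT` use the documented defaults, while `THRESHOLD` and the function names are hypothetical.

```python
W_LOGIC = 0.6    # documented default: weight for logical divergence
W_FACT = 0.4     # documented default: weight for factual divergence
THRESHOLD = 0.7  # hypothetical gate threshold for illustration

def composite_score(h_logical: float, h_factual: float) -> float:
    """Combine the two divergence signals into a single coherence score."""
    return 1 - (W_LOGIC * h_logical + W_FACT * h_factual)

# Example: mild logical divergence (0.2), low factual divergence (0.1).
coherence = composite_score(h_logical=0.2, h_factual=0.1)  # 1 - (0.12 + 0.04) = 0.84
approved = coherence >= THRESHOLD
```

Because both weights sum to 1 and each divergence lies in [0, 1], the coherence score also stays in [0, 1].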
## Dual-Entropy Formula
The coherence score combines two independent divergence signals:
| Constant | Default | Description |
|---|---|---|
| `W_LOGIC` | 0.6 | Weight for logical divergence (NLI contradiction) |
| `W_FACT` | 0.4 | Weight for factual divergence (RAG retrieval) |
LLM judge blending (when activated):
| Constant | Value | Description |
|---|---|---|
| `LLM_JUDGE_NLI_WEIGHT` | 0.7 | NLI score weight in blend |
| `LLM_JUDGE_LLM_WEIGHT` | 0.3 | LLM judge weight in blend |
| `LLM_JUDGE_AGREE_DIVERGENCE` | 0.2 | Divergence when LLM says YES |
| `LLM_JUDGE_DISAGREE_DIVERGENCE` | 0.8 | Divergence when LLM says NO |
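A minimal sketch of how these blending constants might combine, assuming the judge returns a binary YES/NO verdict; the function shape is hypothetical, only the constant values come from the table above.

```python
LLM_JUDGE_NLI_WEIGHT = 0.7        # documented value
LLM_JUDGE_LLM_WEIGHT = 0.3        # documented value
LLM_JUDGE_AGREE_DIVERGENCE = 0.2  # documented value: judge says YES
LLM_JUDGE_DISAGREE_DIVERGENCE = 0.8  # documented value: judge says NO

def blended_divergence(nli_divergence: float, llm_says_yes: bool) -> float:
    """Blend the NLI divergence with the LLM judge's binary verdict."""
    llm_divergence = (LLM_JUDGE_AGREE_DIVERGENCE if llm_says_yes
                      else LLM_JUDGE_DISAGREE_DIVERGENCE)
    return (LLM_JUDGE_NLI_WEIGHT * nli_divergence
            + LLM_JUDGE_LLM_WEIGHT * llm_divergence)

# Example: NLI is unsure (0.5); a YES verdict pulls divergence down.
blended = blended_divergence(0.5, llm_says_yes=True)   # 0.7*0.5 + 0.3*0.2 = 0.41
```

With these weights, the judge can shift an uncertain NLI score by at most 0.3 in either direction.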
## Streaming Oversight

`StreamingKernel` processes tokens one by one with three halt mechanisms:
- Hard limit: immediate halt if any token's coherence < `hard_limit`
- Window average: halt if the sliding window mean < `window_threshold`
- Downward trend: halt if coherence drops > `trend_threshold` over `trend_window` tokens

Soft-halt mode (`halt_mode="soft"`) finishes the current sentence before halting (50-token safety cap).
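The three halt checks can be sketched as a single per-token decision function. This is an illustrative sketch: the threshold and window values below are stand-ins, not the library's defaults, and the trend check is simplified to a first-vs-last comparison.

```python
from collections import deque

HARD_LIMIT = 0.3        # illustrative value for hard_limit
WINDOW_THRESHOLD = 0.55 # illustrative value for window_threshold
TREND_THRESHOLD = 0.15  # illustrative value for trend_threshold
WINDOW_SIZE = 8         # illustrative sliding-window size
TREND_WINDOW = 4        # illustrative value for trend_window

def should_halt(window: deque, score: float) -> bool:
    """Apply hard-limit, window-average, and trend checks for one token."""
    window.append(score)
    if score < HARD_LIMIT:                            # hard limit
        return True
    if sum(window) / len(window) < WINDOW_THRESHOLD:  # window average
        return True
    if len(window) >= TREND_WINDOW:                   # downward trend
        recent = list(window)[-TREND_WINDOW:]
        if recent[0] - recent[-1] > TREND_THRESHOLD:
            return True
    return False

window = deque(maxlen=WINDOW_SIZE)
# A stream whose final token trips the hard limit.
halted = any(should_halt(window, s) for s in [0.9, 0.85, 0.8, 0.2])
```

The `deque(maxlen=...)` keeps only the most recent scores, so the window average naturally forgets old tokens.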
## Vector Backend ABC

All vector backends implement three methods:
```python
from abc import ABC, abstractmethod

class VectorBackend(ABC):
    @abstractmethod
    def add(self, doc_id: str, text: str, metadata: dict | None = None) -> None: ...

    @abstractmethod
    def query(self, text: str, n_results: int = 3) -> list[dict]: ...

    @abstractmethod
    def count(self) -> int: ...
```
`VectorGroundTruthStore` wraps any backend and falls back to keyword matching when vector search returns no results.