# Architecture

## Component Overview
```mermaid
graph TD
    subgraph "CoherenceScorer"
        NLI["NLIScorer<br/>(DeBERTa / ONNX / MiniCheck)"]
        GTS["GroundTruthStore<br/>(keyword + vector retrieval)"]
        LLM["LLM Judge<br/>(OpenAI / Anthropic)"]
        Cache["ScoreCache<br/>(LRU + TTL)"]
    end
    subgraph "StreamingKernel"
        Window["Sliding Window<br/>(avg, trend, hard limit)"]
        Halt["Halt Decision<br/>(hard / soft mode)"]
    end
    subgraph "VerifiedScorer"
        SentMatch["Sentence Matching<br/>(NLI pair-wise)"]
        Signals["5 Signals<br/>(NLI, Entity, Number, Negation, Traceability)"]
        Verdicts["Per-Claim Verdicts<br/>(supported / contradicted / fabricated)"]
    end
    subgraph "VectorBackend"
        IMB["InMemoryBackend"]
        Chroma["ChromaBackend"]
        ST["SentenceTransformerBackend"]
        Hybrid["HybridBackend<br/>(BM25 + Dense + RRF)"]
        ColBERT["ColBERTBackend<br/>(late interaction)"]
        Pinecone["PineconeBackend"]
        Reranker["RerankedBackend<br/>(cross-encoder wrapper)"]
    end

    Prompt["Prompt + Response"] --> CoherenceScorer
    Prompt --> |"/v1/verify"| VerifiedScorer
    VerifiedScorer --> SentMatch --> Signals --> Verdicts
    CoherenceScorer --> NLI
    CoherenceScorer --> GTS
    CoherenceScorer --> LLM
    CoherenceScorer --> Cache
    GTS --> VectorBackend
    CoherenceScorer --> |score| Halt
    TokenStream["Token Stream"] --> Window
    Window --> |coherence_callback| CoherenceScorer
    Window --> Halt
    Halt --> |approved / halted| Output["Output"]
```
## Data Flow

- Prompt arrives at `CoherenceScorer.review(prompt, response)`
- Cache check: if `(prompt, response)` was recently scored, return the cached result
- Logical divergence: NLI scores `(prompt, response)` for contradiction probability
- Factual divergence: `GroundTruthStore.retrieve_context(prompt)` fetches KB facts, then NLI scores `(context, response)`
- Composite score: `coherence = 1 - (W_LOGIC * H_logical + W_FACT * H_factual)`
- LLM judge (optional): if NLI confidence is low (or in `hybrid` mode), escalate to LLM-as-judge and blend scores
- Gate: `approved = coherence >= threshold`
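The composite score and gate above can be sketched in a few lines. This is an illustrative sketch, not the library's implementation: `W_LOGIC` and `W_FACT` use the documented defaults, while `THRESHOLD` and the function names are hypothetical.

```python
W_LOGIC = 0.6    # documented default: weight for logical divergence
W_FACT = 0.4     # documented default: weight for factual divergence
THRESHOLD = 0.7  # hypothetical gate threshold for illustration

def composite_score(h_logical: float, h_factual: float) -> float:
    """Combine the two divergence signals into a single coherence score."""
    return 1 - (W_LOGIC * h_logical + W_FACT * h_factual)

# Example: mild logical divergence (0.2), low factual divergence (0.1).
coherence = composite_score(h_logical=0.2, h_factual=0.1)  # 1 - (0.12 + 0.04) = 0.84
approved = coherence >= THRESHOLD
```

Because both weights sum to 1 and each divergence lies in [0, 1], the coherence score also stays in [0, 1].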
## Dual-Entropy Formula
The coherence score combines two independent divergence signals:
| Constant | Default | Description |
|---|---|---|
| `W_LOGIC` | 0.6 | Weight for logical divergence (NLI contradiction) |
| `W_FACT` | 0.4 | Weight for factual divergence (RAG retrieval) |
LLM judge blending (when activated):
| Constant | Value | Description |
|---|---|---|
| `LLM_JUDGE_NLI_WEIGHT` | 0.7 | NLI score weight in blend |
| `LLM_JUDGE_LLM_WEIGHT` | 0.3 | LLM judge weight in blend |
| `LLM_JUDGE_AGREE_DIVERGENCE` | 0.2 | Divergence when LLM says YES |
| `LLM_JUDGE_DISAGREE_DIVERGENCE` | 0.8 | Divergence when LLM says NO |
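A minimal sketch of how these blending constants might combine, assuming the judge returns a binary YES/NO verdict; the function shape is hypothetical, only the constant values come from the table above.

```python
LLM_JUDGE_NLI_WEIGHT = 0.7        # documented value
LLM_JUDGE_LLM_WEIGHT = 0.3        # documented value
LLM_JUDGE_AGREE_DIVERGENCE = 0.2  # documented value: judge says YES
LLM_JUDGE_DISAGREE_DIVERGENCE = 0.8  # documented value: judge says NO

def blended_divergence(nli_divergence: float, llm_says_yes: bool) -> float:
    """Blend the NLI divergence with the LLM judge's binary verdict."""
    llm_divergence = (LLM_JUDGE_AGREE_DIVERGENCE if llm_says_yes
                      else LLM_JUDGE_DISAGREE_DIVERGENCE)
    return (LLM_JUDGE_NLI_WEIGHT * nli_divergence
            + LLM_JUDGE_LLM_WEIGHT * llm_divergence)

# Example: NLI is unsure (0.5); a YES verdict pulls divergence down.
blended = blended_divergence(0.5, llm_says_yes=True)   # 0.7*0.5 + 0.3*0.2 = 0.41
```

With these weights, the judge can shift an uncertain NLI score by at most 0.3 in either direction.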
## Streaming Oversight

`StreamingKernel` processes tokens one by one with three halt mechanisms:
- Hard limit: immediate halt if any token's coherence < `hard_limit`
- Window average: halt if the sliding window mean < `window_threshold`
- Downward trend: halt if coherence drops > `trend_threshold` over `trend_window` tokens

Soft-halt mode (`halt_mode="soft"`) finishes the current sentence before halting (50-token safety cap).
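The three halt checks can be sketched as a single per-token decision function. This is an illustrative sketch: the threshold and window values below are stand-ins, not the library's defaults, and the trend check is simplified to a first-vs-last comparison.

```python
from collections import deque

HARD_LIMIT = 0.3        # illustrative value for hard_limit
WINDOW_THRESHOLD = 0.55 # illustrative value for window_threshold
TREND_THRESHOLD = 0.15  # illustrative value for trend_threshold
WINDOW_SIZE = 8         # illustrative sliding-window size
TREND_WINDOW = 4        # illustrative value for trend_window

def should_halt(window: deque, score: float) -> bool:
    """Apply hard-limit, window-average, and trend checks for one token."""
    window.append(score)
    if score < HARD_LIMIT:                            # hard limit
        return True
    if sum(window) / len(window) < WINDOW_THRESHOLD:  # window average
        return True
    if len(window) >= TREND_WINDOW:                   # downward trend
        recent = list(window)[-TREND_WINDOW:]
        if recent[0] - recent[-1] > TREND_THRESHOLD:
            return True
    return False

window = deque(maxlen=WINDOW_SIZE)
# A stream whose final token trips the hard limit.
halted = any(should_halt(window, s) for s in [0.9, 0.85, 0.8, 0.2])
```

The `deque(maxlen=...)` keeps only the most recent scores, so the window average naturally forgets old tokens.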
## Vector Backend ABC

All vector backends implement three methods:
```python
from abc import ABC, abstractmethod

class VectorBackend(ABC):
    @abstractmethod
    def add(self, doc_id: str, text: str, metadata: dict | None = None) -> None: ...

    @abstractmethod
    def query(self, text: str, n_results: int = 3) -> list[dict]: ...

    @abstractmethod
    def count(self) -> int: ...
```
`VectorGroundTruthStore` wraps any backend and falls back to keyword matching when vector search returns no results.