Why Director-AI
For the broader product, application, and market map, start with Product Overview. This page focuses on the technical reason Director-AI exists: generated text can become visible before post-hoc checks run.
The Streaming Problem
Every major LLM provider supports token streaming, and chat-style interfaces typically enable it. OpenAI, Anthropic, and Google all send tokens as they are generated, so users watch the response appear token by token.
Post-hoc guardrails check after generation completes. By then, the user may already have read the unsupported claim: a wrong medication dosage displayed for 3 seconds, a fabricated legal citation copied into a brief, or an incorrect refund policy quoted to a customer.
The industry standard — generate first, check later — is the wrong UX boundary for fact-critical streams.
What Director-AI Does Differently
Director-AI makes factual coherence a control point before output is accepted, stored, routed, or acted on.
Claim-level streaming halt. The production streaming signal is contradiction-driven:
completed streamed claims are checked against retrieved facts and the stream
halts when a claim contradicts governed knowledge. The latest local benchmark
artifact (benchmarks/results/streaming_contradiction_halt_base.json) reports
2/135 false halts and 2/3 caught contradiction passages on the small streaming
suite; broader held-out contradiction evidence is recorded in
benchmarks/results/contradiction_holdout_finetuned.json.
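The claim-level halt described above can be sketched in a few lines of Python. This is a minimal illustration, not Director-AI's API: `stream_with_halt`, `StreamResult`, and `toy_check` are hypothetical names, sentence-ending punctuation stands in for real claim segmentation, and the keyword-based checker stands in for NLI against retrieved facts. The sketch also assumes each claim is withheld until it has been checked, which is one possible design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical type: returns the contradicting fact, or None if the claim is fine.
ContradictionCheck = Callable[[str], Optional[str]]


@dataclass
class StreamResult:
    emitted: str             # text released to the user
    halted: bool             # True if a claim contradicted a fact
    evidence: Optional[str]  # the fact that triggered the halt, if any


def stream_with_halt(tokens: Iterable[str], check: ContradictionCheck) -> StreamResult:
    """Release text one completed claim at a time; stop at the first contradiction."""
    emitted: List[str] = []
    claim: List[str] = []

    def flush() -> Optional[str]:
        # Check the buffered claim; release it only if no fact contradicts it.
        text = "".join(claim)
        evidence = check(text)
        if evidence is None:
            emitted.append(text)
        claim.clear()
        return evidence

    for token in tokens:
        claim.append(token)
        # Assumption: a claim ends at sentence-final punctuation.
        if token.rstrip().endswith((".", "!", "?")):
            evidence = flush()
            if evidence is not None:
                return StreamResult("".join(emitted), True, evidence)

    if claim:  # trailing text without final punctuation
        evidence = flush()
        if evidence is not None:
            return StreamResult("".join(emitted), True, evidence)
    return StreamResult("".join(emitted), False, None)


def toy_check(claim: str) -> Optional[str]:
    """Stand-in checker: flags a 90-day refund claim against a 30-day policy."""
    fact = "Refunds are available within 30 days."
    if "refund" in claim.lower() and "90 days" in claim:
        return fact
    return None


result = stream_with_halt(
    ["Our ", "plan ", "is ", "monthly.", " Refunds ", "last ", "90 days."],
    toy_check,
)
print(result)
```

In this run the first sentence is released, the second is withheld, and the result carries the contradicting fact as evidence.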
Dual-entropy scoring. Two independent signals:
- H_logical — NLI contradiction detection via DeBERTa (0.4B params). Catches logical inconsistencies between the response and your facts.
- H_factual — RAG retrieval against your knowledge base. Catches claims that have no supporting evidence.
The final score combines both: coherence = 1 - (0.6 * H_logical + 0.4 * H_factual).
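As a worked example of the formula above, the following sketch computes the combined score. The 0.6 and 0.4 weights come from the formula; the function name `coherence` and the assumption that both signals lie in [0, 1] are illustrative and not taken from the Director-AI codebase.

```python
def coherence(h_logical: float, h_factual: float,
              w_logical: float = 0.6, w_factual: float = 0.4) -> float:
    """coherence = 1 - (w_logical * H_logical + w_factual * H_factual)."""
    # Assumption: both signals are normalised to [0, 1].
    for name, value in (("h_logical", h_logical), ("h_factual", h_factual)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 1.0 - (w_logical * h_logical + w_factual * h_factual)


# Weak contradiction signal, moderate lack of supporting evidence.
score = coherence(h_logical=0.1, h_factual=0.5)
print(round(score, 2))
```

With `H_logical = 0.1` and `H_factual = 0.5` the score is 1 - (0.06 + 0.20) = 0.74. Because the logical term carries the larger weight, a contradiction lowers the score more than an equally strong lack of evidence does.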
Evidence on rejection. Every halt includes the specific KB chunks that contradicted the response. No black-box "this was flagged" — your users or QA team see exactly why.
0.4B parameters, sub-millisecond per-pair latency. FactCG-DeBERTa-v3-Large runs at 0.5 ms/pair on an L40S (FP16, batch=32). No API calls, no metering, no rate limits.
When NOT to Use Director-AI
Director-AI solves one problem: factual coherence — does the LLM output match your ground truth?
It does not handle:
| Problem | Use Instead |
|---|---|
| Toxicity / hate speech | NeMo Guardrails, LLM-Guard |
| Prompt injection (input side) | Rebuff, LLM-Guard. Director-AI's InjectionDetector covers only output-side, NLI-based detection |
| PII leakage | Presidio, LLM-Guard |
| Content moderation | OpenAI Moderation API, Llama Guard |
| Code safety | Semgrep, Snyk Code |
You can (and should) combine Director-AI with these tools. Director-AI guards facts; the others guard behaviour.
Decision Matrix
| Your Situation | Recommendation |
|---|---|
| RAG chatbot with a knowledge base | Director-AI with VectorGroundTruthStore — KB Ingestion guide |
| Streaming LLM responses to users | Director-AI contradiction-driven StreamingKernel — Streaming guide |
| LLM agent making multi-step decisions | Director-AI CoherenceAgent — API reference |
| Customer support bot with product facts | Director-AI with domain-specific KB — Support cookbook |
| Medical / legal / finance compliance | Director-AI with curated KB plus a tuned profile; stock regulated profiles are calibration starting points |
| Toxicity filtering only | NeMo Guardrails or LLM-Guard instead |
| Prompt injection defence only | Rebuff or LLM-Guard instead |
Cost Comparison
| System | Cost per 1K calls | Latency | Local/Offline |
|---|---|---|---|
| Director-AI (NLI mode) | $0 | 0.5 ms (L40S) | Yes |
| Director-AI (hybrid + GPT-4o-mini) | $0.07 | 2.3 s | No |
| Director-AI (hybrid + Claude Sonnet) | $1.40 | 14.2 s | No |
| GPT-4o as judge | $1.16 | ~2 s | No |
| Claude Haiku 4.5 as judge | $0.37 | ~1.5 s | No |
| GuardrailsAI (LLM-as-judge) | LLM cost | 2.26 s | No |
| SelfCheckGPT (multi-call) | 3-5x LLM cost | 5-10 s | No |
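NLI-only mode is free, fast, and fully offline. Add an LLM judge only if you need the 90.7% hybrid catch rate. Even then, per the table above, hybrid mode with GPT-4o-mini matches Claude Sonnet at 20x lower cost ($0.07 vs. $1.40 per 1K calls) and roughly 6x lower latency (2.3 s vs. 14.2 s).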
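To turn the per-1K figures into a budget estimate, multiply by expected volume. The sketch below uses only the fixed-price rows of the table; the dictionary keys, the `monthly_cost` helper, and the one-million-call volume are illustrative assumptions.

```python
# Cost per 1K calls in USD, copied from the table above.
COST_PER_1K = {
    "director_nli": 0.00,
    "director_hybrid_gpt4o_mini": 0.07,
    "director_hybrid_claude_sonnet": 1.40,
    "gpt4o_judge": 1.16,
    "claude_haiku_judge": 0.37,
}


def monthly_cost(system: str, calls_per_month: int) -> float:
    """Estimated monthly spend in USD for a given call volume."""
    return COST_PER_1K[system] * calls_per_month / 1000


calls = 1_000_000  # hypothetical monthly volume
for system in COST_PER_1K:
    print(f"{system:32s} ${monthly_cost(system, calls):>9,.2f}")
```

At one million calls per month the hybrid GPT-4o-mini configuration comes to about $70, against about $1,400 for hybrid Claude Sonnet, while NLI-only mode stays at $0.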
Next: Quickstart — score your first response in 2 minutes.