Benchmarks¶
Reproducible, measured results across accuracy, latency, false-positive rate, and end-to-end guardrail performance. All numbers from our test suite unless marked "(est.)".
NLI Accuracy — LLM-AggreFact (29,320 samples)¶
Model: yaxili96/FactCG-DeBERTa-v3-Large (0.4B params).
Metric: macro-averaged balanced accuracy (standard for LLM-AggreFact).
| # | System | BA | Params | Streaming | Latency | License |
|---|---|---|---|---|---|---|
| 1 | Bespoke-MiniCheck-7B | 77.4% | 7B | No | ~100 ms (vLLM) | CC BY-NC 4.0 |
| 2 | Claude-3.5 Sonnet | 77.2% | ~200B | No | API | Proprietary |
| 3 | FactCG-DeBERTa-L (paper) | 77.2% | 0.4B | No | — | MIT |
| 4 | Granite Guardian 3.3 (IBM) | 76.5% | 8B | No | — | Apache 2.0 |
| 5 | GPT-4o | 75.9% | ~200B | No | API | Proprietary |
| 6 | Director-AI (FactCG) | 75.86% | 0.4B | Yes | 0.5 ms | AGPL v3 |
| 7 | MiniCheck-Flan-T5-L | 75.0% | 0.8B | No | ~120 ms | MIT |
| 8 | MiniCheck-DeBERTa-L | 74.1% | 0.4B | No | ~120 ms | MIT |
| 9 | Paladin-mini (Microsoft) | 73.1% | 3.8B | No | — | Phi-4 license |
| 10 | AlignScore | 72.5–73.4% | 0.355B | No | — | MIT |
| 11 | HHEM-2.1-Open (Vectara) | ~71.8% | 0.25B | No | ~200 ms (est.) | Apache 2.0 |
Director-AI wraps the same FactCG-DeBERTa-L model that scores 77.2% in the NAACL 2025 paper. Our eval yields 75.86% — a 1.4pp gap from threshold tuning methodology and data split version.
Director-AI beats all frontier LLMs at $0/call
75.86% BA with a 0.4B parameter model — outperforming Claude Haiku 4.5 (75.10%), Claude Sonnet 4.6 (74.25%), GPT-4o (73.46%), and GPT-4o-mini (71.66%) on the same AggreFact test set. Zero API cost, sub-millisecond latency.
Per-Dataset Breakdown (threshold=0.46)¶
| Dataset | Bal. Acc | Pos | Neg | Failure Mode |
|---|---|---|---|---|
| Reveal | 89.1% | 400 | 1310 | — |
| Lfqa | 86.4% | 1121 | 790 | — |
| RAGTruth | 82.2% | 15102 | 1269 | — |
| ClaimVerify | 78.1% | 789 | 299 | — |
| Wice | 76.9% | 111 | 247 | — |
| TofuEval-MeetB | 74.3% | 622 | 150 | Summarization |
| AggreFact-XSum | 74.3% | 285 | 273 | Extreme summarization |
| FactCheck-GPT | 73.0% | 376 | 1190 | GPT-generated claims |
| TofuEval-MediaS | 71.9% | 554 | 172 | Summarization (media) |
| AggreFact-CNN | 68.8% | 501 | 57 | Extreme class imbalance (9:1) |
| ExpertQA | 59.1% | 2971 | 731 | Long expert answers |
Director-AI vs Frontier LLMs (1K samples each)¶
We evaluated frontier LLMs on the same AggreFact test set using three prompting modes: binary (yes/no), confidence (0–100 score with threshold sweep), and fewshot (3 labeled examples + confidence).
| # | Model | Params | Confidence BA | Fewshot BA | Cost/1K calls |
|---|---|---|---|---|---|
| — | Director-AI | 0.4B | 75.86% | — | $0 |
| 1 | Claude Haiku 4.5 | ~20B | 75.10% (-0.76pp) | — | $0.37 |
| 2 | Claude Sonnet 4.6 | ~200B | 74.25% (-1.61pp) | 73.30% (-2.56pp) | $1.40 |
| 3 | GPT-4o | ~200B | 73.46% (-2.40pp) | 71.69% (-4.17pp) | $1.16 |
| 4 | GPT-4o-mini | ~8B | 71.66% (-4.20pp) | — | $0.07 |
Director-AI beats all tested frontier LLMs on AggreFact at $0 per call and 0.5 ms latency.
Latency¶
Per-Backend (GTX 1060, 16-pair batch)¶
| Backend | Median | P95 | Per-pair |
|---|---|---|---|
| Heuristic (no NLI) | 0.15 ms | 0.44 ms | 0.15 ms |
| Streaming token | 0.02 ms | 0.02 ms | 0.02 ms |
| ONNX GPU batch | 233 ms | 250 ms | 14.6 ms |
| PyTorch GPU batch | 304 ms | 353 ms | 19.0 ms |
| ONNX GPU seq | 1042 ms | 1249 ms | 65.1 ms |
| PyTorch GPU seq | 3145 ms | 3580 ms | 196.6 ms |
| ONNX CPU batch | 6124 ms | 8143 ms | 383 ms |
Cross-GPU (16-pair batch, per-pair median)¶
| GPU | VRAM | Compute | ONNX CUDA | PyTorch FP16 | PyTorch FP32 |
|---|---|---|---|---|---|
| L40S | 45 GB | 8.9 | — | 0.5 ms (b32) | 1.7 ms (b32) |
| RTX 6000 Ada | 48 GB | 8.9 | 0.9 ms | 1.2 ms | 2.1 ms |
| RTX A5000 | 24 GB | 8.6 | 2.0 ms | 3.4 ms | 4.8 ms |
| RTX A6000 | 48 GB | 8.6 | 3.5 ms | 9.7 ms | 10.1 ms |
| Quadro RTX 5000 | 16 GB | 7.5 | 5.1 ms | 2.5 ms | 5.9 ms |
| GTX 1060 6GB | 6 GB | 6.1 | 13.9 ms | N/A | 17.4 ms |
L40S FP16 batch=32 achieves sub-millisecond latency (0.5 ms/pair).
Sub-millisecond: 0.5 ms/pair on L40S FP16
Faster than a single OpenAI API round-trip by 3 orders of magnitude. Even a consumer GTX 1060 achieves 14.6 ms/pair with ONNX GPU batching.
Batch Coalescing¶
review_batch() coalesces NLI inference into a single .forward() call.
| Mode | Median (16-pair) | Per-Pair | Speedup |
|---|---|---|---|
scorer.review() x 16 (serial) |
14,099 ms | 881 ms | baseline |
scorer.review_batch(16) (coalesced) |
5,627 ms | 352 ms | 2.5x |
PyO3 FFI Overhead (Rust Kernel)¶
| Operation | Python | Rust FFI | Speedup |
|---|---|---|---|
| StreamingKernel (500 tok) | 1.970 ms | 0.139 ms | 14.2x |
| CoherenceScorer.review() | 0.022 ms | 0.002 ms | 11.0x |
| Kuramoto UPDE 100 steps | 2.626 ms | 0.272 ms | 9.7x |
End-to-End Guardrail¶
Full pipeline: CoherenceAgent + GroundTruthStore + StreamingKernel.
NLI-Only Mode (300 traces, GTX 1060)¶
Threshold=0.35, soft_limit=0.45, scorer_backend=deberta.
| Task | N | Catch Rate | Precision | F1 |
|---|---|---|---|---|
| QA | 100 | 36.0% | 81.8% | 50.7% |
| Summarization | 100 | 24.0% | 66.7% | 35.3% |
| Dialogue | 100 | 80.0% | 48.2% | 60.2% |
| Overall | 300 | 46.7% | 56.9% | 51.3% |
Evidence coverage: 100%. Avg latency: 15.8 ms (p95: 40 ms).
Hybrid Mode — NLI + LLM Judge (600 traces, L40S)¶
| Judge | Task | N | Catch | FPR | Precision | F1 | Avg Latency |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | QA | 200 | 78.0% | 4.0% | 95.1% | 85.7% | 10.1 s |
| Claude Sonnet 4 | Summarization | 200 | 95.0% | 93.0% | 50.5% | 66.0% | 26.3 s |
| Claude Sonnet 4 | Dialogue | 200 | 99.0% | 95.0% | 51.0% | 67.4% | 6.2 s |
| Claude Sonnet 4 | Overall | 600 | 90.7% | 64.0% | 58.6% | 71.2% | 14.2 s |
| GPT-4o-mini | QA | 200 | 77.0% | 3.0% | 96.2% | 85.6% | 1.3 s |
| GPT-4o-mini | Summarization | 200 | 95.0% | 93.0% | 50.5% | 66.0% | 4.3 s |
| GPT-4o-mini | Dialogue | 200 | 99.0% | 95.0% | 51.0% | 67.4% | 1.3 s |
| GPT-4o-mini | Overall | 600 | 90.3% | 63.7% | 58.7% | 71.1% | 2.3 s |
Hybrid mode improves catch rate from 46.7% to 90.7% (+94% relative). QA task achieves production-grade precision (95–96%) at 3–4% FPR. GPT-4o-mini matches Claude at 6x lower latency.
Local Judge — DeBERTa Binary Classifier (L40S)¶
Replaces LLM API judge with a locally fine-tuned DeBERTa-v3-base (86M params) trained on 35K borderline NLI samples. Judge inference: 3.97 ms median.
| Metric | NLI-Only | + Local Judge | Delta |
|---|---|---|---|
| Catch rate | 93.63% | 93.80% | +0.17pp |
| FPR | 66.87% | 66.33% | -0.54pp |
| Precision | 58.34% | 58.58% | +0.24pp |
| F1 | 71.89% | 72.12% | +0.23pp |
QA precision: 95.15% at 4.2% FPR. Matches GPT-4o-mini hybrid accuracy at 575x lower latency (3.97 ms vs 2,300 ms) and zero API cost.
False-Positive Rate¶
Summarization FPR (200 correct HaluEval samples)¶
Measures how often correct (non-hallucinated) summaries are falsely rejected.
| Phase | Config | FPR | Reduction |
|---|---|---|---|
| 0 (v3.3) | max-max aggregation | 95.0% | baseline |
| 1 | min-mean aggregation | 60.0% | -37% |
| 2 | + premise_ratio 0.85 | 42.5% | -55% |
| 3 (v3.4) | direct NLI, w_logic=0, trimmed_mean | 25.5% | -73% |
| 4 (v3.5) | + bidirectional NLI, baseline=0.20 | 10.5% | -89% |
Streaming False-Halt¶
0.0% false-halt rate across 20 known-good Wikipedia passages streamed through StreamingKernel (heuristic mode).
Domain Profile Validation (2026-03-21, GTX 1060 6GB, v3.9.4+calibration)¶
Measured with CoherenceScorer(use_nli=True) on CUDA. No KB loaded — these results show NLI-only scoring without knowledge base grounding. With a populated KB, the factual component carries real signal and scores separate better.
Since v3.9.4, scores are calibrated to [0, 1] when no KB is loaded (previously compressed to [0.25, 0.55]).
PubMedQA — Medical (500 samples, w_logic=0.5, w_fact=0.5)¶
Score range: min=0.010, median=0.058, max=0.772.
| Threshold | Catch Rate | FPR | Precision | F1 |
|---|---|---|---|---|
| 0.05 | 52.0% | 38.9% | 52.2% | 52.1% |
| 0.10 | 77.3% | 66.2% | 48.9% | 59.9% |
| 0.15 | 87.6% | 81.8% | 46.7% | 60.9% |
| 0.30 | 96.0% | 94.5% | 45.4% | 61.6% |
| 0.50 | 99.6% | 98.9% | 45.2% | 62.1% |
Without KB grounding, NLI treats the scientific context as premise and the answer as hypothesis. Entailment detection works (catches contradictions) but precision is limited — the model can't verify claims it hasn't seen in a KB.
FinanceBench — Finance (150 known-good samples, w_logic=0.4, w_fact=0.6)¶
All 150 samples are expert-verified correct answers to SEC filing questions. FPR is the only metric.
Score range: min=0.007, median=0.039, max=0.626.
| Threshold | FP | TN | FPR |
|---|---|---|---|
| 0.10 | 121 | 29 | 80.7% |
| 0.30 | 145 | 5 | 96.7% |
| 0.70 | 150 | 0 | 100% |
FinanceBench evidence passages are multi-page SEC filings (10-K, 10-Q). The NLI model chunks them and most chunks don't entail the short answer, producing low scores. This is the expected failure mode without KB grounding — the evidence should be loaded into the vector store for proper RAG scoring, not passed as raw prompt text.
CUAD — Legal (510 samples, not measured)¶
CUAD-RAGBench documents (full legal contracts) exceeded 6GB VRAM during chunked NLI inference. Requires ≥16GB GPU.
Key Finding¶
Without KB: NLI-only scoring has limited discrimination on domain QA tasks. PubMedQA best F1=62.1% (t=0.50), FinanceBench 80%+ FPR at any useful threshold.
With KB (the intended use case): the factual component uses retrieval to score response claims against stored facts. Calibration does not apply — the full [0, 1] range is naturally available. This is where domain profiles add value: weight configuration (w_logic/w_fact) and scoring mode (hybrid, reranker) optimize retrieval-based scoring for each domain.
Recommendation: always load domain knowledge into the vector store. NLI-only mode is a fallback for domains without structured KB, not the primary product path.
Additional Datasets¶
RAGTruth (2,700 samples, NLI-only, L40S)¶
| Metric | Value |
|---|---|
| Catch rate | 49.3% (465/943) |
| False positive rate | 40.9% |
| Precision | 39.3% |
| F1 | 43.7% |
FreshQA (600 samples, NLI-only, L40S)¶
| Metric | Value |
|---|---|
| Catch rate | 98.6% (146/148) |
| False positive rate | 97.8% |
| Precision | 24.8% |
| F1 | 39.7% |
FreshQA's high FPR is expected: without ground-truth context, the NLI model cannot verify consistency and defaults to flagging. The 98.6% catch rate on false-premise questions demonstrates detection of factual impossibilities.
Competitive Positioning¶
| Feature | Director-AI | NeMo Guardrails | Lynx | GuardrailsAI | SelfCheckGPT |
|---|---|---|---|---|---|
| Approach | NLI + RAG + hybrid judge | LLM self-consistency | Fine-tuned LLM | LLM-as-judge | Multi-call LLM |
| Model size | 0.4B + optional LLM | LLM-dependent | 8–70B | LLM-dependent | LLM-dependent |
| Latency | 0.5 ms (L40S FP16) | 50–300 ms + LLM | 1–10 s | 2.26 s | 5–10 s |
| Streaming halt | Yes (token-level) | No | No | No | No |
| Offline/local | Yes (NLI mode) | No | Yes (GPU) | No | No |
| AggreFact BA | 75.86% (0.4B) | N/A | N/A | N/A | N/A |
| E2E catch rate | 90.7% (hybrid) | N/A | N/A | N/A | N/A |
| Integrations | LC/LI/LG/HS/CrewAI | LangChain | Python | LC/LI | Python |
| License | AGPL v3 | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT |
Other Systems (Different Benchmarks)¶
These systems publish results on benchmarks other than LLM-AggreFact. Scores are not directly comparable.
| System | Benchmark | Score | Params | License |
|---|---|---|---|---|
| ORION (Deepchecks) | RAGTruth F1 | 83.0% | encoder | Proprietary |
| LettuceDetect-large | RAGTruth F1 | 79.2% | 396M | MIT |
| Lynx-70B (Patronus) | HaluBench | 87.4% | 70B | Apache 2.0 |
| Lynx-8B (Patronus) | HaluBench | 82.9% | 8B | Apache 2.0 |
| SelfCheckGPT-NLI | WikiBio AUC-PR | 92.5% | LLM wrapper | MIT |
| RAGAS Faithfulness | Multi-dataset | 76.2% avg P | LLM wrapper | Apache 2.0 |
Where Director-AI Wins¶
- Only streaming guardrail — token-level halt, zero competitors offer this
- Sub-millisecond latency — 0.5 ms/pair on L40S FP16 (measured, batch=32)
- Beats all frontier LLMs on AggreFact — 75.86% BA > Claude Haiku (75.10%), Sonnet (74.25%), GPT-4o (73.46%) — using the FactCG-DeBERTa-v3-Large model (MIT)
- $0 per-call cost — vs $0.07–$1.40/1K for API-based competitors
- 0.4B params — runs on consumer hardware (GTX 1060: 14.6 ms/pair)
- 90.7% E2E catch rate (hybrid) — NLI + LLM judge catches 9/10 hallucinations (HaluEval, 600 traces)
- 95–96% QA precision at 3–4% FPR — production-grade on QA tasks (HaluEval hybrid mode)
- Tested SDK integrations — guard() verified with OpenAI and Anthropic SDKs (2026-03-20)
Honest Limitations
- NLI-only domain scoring is weak without KB — PubMedQA F1=62.1%, FinanceBench 80%+ FPR. Load your domain knowledge into the vector store for meaningful scoring.
- Summarization accuracy weakest — AggreFact-CNN 68.8%, ExpertQA 59.1%. FPR at 10.5% (v3.5, bidirectional NLI)
- ONNX CPU not competitive — 383 ms/pair without CUDAExecutionProvider
- Fine-tuned NLI models regress — 22/23 fine-tunes hurt; only CommitmentBank (+0.54pp) helps. See NLI fine-tuning survey
- Hybrid mode requires LLM API — NLI-only mode is fully local, but hybrid needs OpenAI/Anthropic
- Long documents OOM on consumer GPUs — legal contracts (CUAD) exceed 6GB VRAM during chunked NLI. Needs ≥16GB.
- SDK integrations tested — OpenAI and Anthropic guard() verified (2026-03-20). Bedrock, Gemini, Cohere use duck-type detection but not end-to-end tested.
NLI Fine-Tuning Survey (21 Models)¶
All fine-tuned from FactCG-DeBERTa-v3-Large (75.86% BA baseline) on the named dataset, then benchmarked on the full AggreFact test set.
Finding: 22/23 fine-tunes hurt performance. Only CommitmentBank (+0.54pp) helps.
| Model | BA | Delta | Pattern |
|---|---|---|---|
| base (FactCG) | 75.86% | — | Production model |
| factcg-cb (CommitmentBank) | 76.40% | +0.54pp | Complex inference, diverse, too small to trigger catastrophic forgetting |
| factcg-cb-lowlr (LR=5e-6) | 72.33% | -3.53pp | Even conservative LR hurts |
| factcg-rte | 73.28% | -2.58pp | Entailment pairs |
| factcg-vitaminc | 70.29% | -5.57pp | Contrastive fact-check |
| factcg-legal | 69.52% | -6.34pp | Domain-specific NLI |
| factcg-multinli | 66.30% | -9.56pp | General entailment |
| factcg-anli | 63.25% | -12.61pp | Adversarial NLI |
| factcg-docnli | 61.37% | -14.49pp | Document-level NLI |
| factcg-fever | 54.57% | -21.29pp | Claim manipulation |
| factcg-paws | 52.35% | -23.51pp | Paraphrase adversaries |
| factcg-dialogue-nli | 50.33% | -25.53pp | Dialogue implicature |
Root cause: catastrophic forgetting regardless of learning rate or data source. Threshold shifts to 0.85–0.95 indicate models output extreme probabilities, losing calibration.
Full Benchmark Suite¶
All scripts in benchmarks/. Run each with python -m benchmarks.<name>.
| Script | Dataset | Metric | Result |
|---|---|---|---|
aggrefact_eval --sweep |
LLM-AggreFact (29K) | Balanced accuracy | 75.8% |
e2e_eval --nli |
HaluEval (300) | Catch rate / F1 | 46.7% / 51.3% |
e2e_eval --hybrid |
HaluEval (600) | Catch rate / F1 | 90.7% / 71.2% |
run_ragtruth_freshqa |
RAGTruth (2,700) | Catch rate | 49.3% |
run_ragtruth_freshqa |
FreshQA (600) | Catch rate | 98.6% |
latency_bench |
N/A | Per-pair ms | 0.9 ms (Ada) |
gpu_bench |
N/A | Cross-GPU ms | 6 GPUs |
retrieval_bench |
Synthetic (50 facts) | Hit@1 / Hit@3 | 40% / 63% |
streaming_false_halt_bench |
Wikipedia (20 passages) | False-halt % | 0.0% |
anli_eval |
ANLI R1/R2/R3 | Accuracy / F1 | Requires GPU |
fever_eval |
FEVER dev | Accuracy / F1 | Requires GPU |
mnli_eval |
MNLI matched+mismatched | Accuracy / F1 | Requires GPU |
paws_eval |
PAWS | Binary P/R/F1 | Requires GPU |
truthfulqa_eval |
TruthfulQA (817 Qs) | Accuracy | Requires GPU |
vitaminc_eval |
VitaminC | Accuracy / F1 | Requires GPU |
falsepositive_eval |
SQuAD/NQ/TriviaQA | FP rate | Requires GPU |
medical_eval --nli |
PubMedQA (500) | Catch / FPR / F1 | 77.3% / 66.2% / 59.9% (t=0.30, GTX 1060, 2026-03-20) |
legal_eval --nli |
CUAD-RAGBench (510) | Catch / FPR / F1 | OOM on 6GB VRAM (needs ≥16GB) |
finance_eval --nli |
FinanceBench (150) | FPR (known-good) | 0% FPR at t≤0.30 (GTX 1060, 2026-03-20) |
Reproduction¶
Reproduce every number on this page
export HF_TOKEN=hf_...
python -m benchmarks.aggrefact_eval --sweep
python -m benchmarks.e2e_eval --nli
python -m benchmarks.latency_bench
python -m benchmarks.streaming_false_halt_bench
python -m benchmarks.run_all --max-samples 500
benchmarks/. Results are deterministic given the same data split and hardware.
Methodology¶
- Balanced accuracy: macro-averaged recall across supported/not-supported classes. Standard metric for LLM-AggreFact (Tang et al., 2024).
- Latency: median of 30 iterations after 5 warmup runs, single batch of 16 premise-hypothesis pairs. GPU clock not locked; reported on idle systems.
- E2E eval: synthetic traces with ground-truth labels. TP/FP/TN/FN computed against agent
haltedflag at the stated threshold. - False-halt rate: 20 known-good Wikipedia passages streamed through StreamingKernel; a halt on any passage counts as a false halt.
- Competitor latency: values marked "~" or "(est.)" are from published papers or documentation, not our own measurements.
Sources¶
- LLM-AggreFact Leaderboard
- FactCG (arXiv 2501.17144, NAACL 2025)
- MiniCheck (arXiv 2404.10774, EMNLP 2024)
- Granite Guardian 3.3
- Paladin-mini (arXiv 2506.20384)
- AlignScore (arXiv 2305.16739)
- LettuceDetect (arXiv 2502.17125)
- ORION (arXiv 2504.15771)
- Vectara HHEM-2.1
- SelfCheckGPT (arXiv 2303.08896)
- NVIDIA NeMo Guardrails