External Validation Report
Last reviewed: 2026-06-02.
This report records the public validation state for Director-AI factuality and guardrail claims. It distinguishes official leaderboard evidence from local reproducibility packets and deployment-oriented diagnostics.
Summary
| Surface | Current status | Evidence |
|---|---|---|
| LLM-AggreFact public leaderboard | FactCG-DeBERTa-L is visible at 75.6 percent average balanced accuracy | https://llm-aggrefact.github.io/ |
| Director-AI submission | Submitted by email on 2026-06-02; awaiting maintainer response | docs/ROADMAP_STATUS.md |
| Local global-threshold packet | 75.8 percent average balanced accuracy at threshold 0.46 | benchmarks/results/aggrefact_yaxili96_FactCG-DeBERTa-v3-Large.json |
| Tuned-threshold replay | Deployment-oriented replay result; not an official default leaderboard row | benchmarks/results/aggrefact_sweep_lr_low.json |
| Validation packet | Commands, required artefacts, and claim boundaries documented | benchmarks/EXTERNAL_VALIDATION_PACKET.md |
LLM-AggreFact Position
LLM-AggreFact measures grounded factuality across 11 datasets and reports balanced accuracy per dataset together with the average across datasets. The live public board currently places the FactCG-DeBERTa-L row close to the top of the visible table, behind larger or specialised systems and ahead of several large general-purpose models.
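For reference, balanced accuracy is the mean of the recall on each class, so a scorer cannot score well simply by favouring the majority label. The sketch below is a minimal illustration, not the leaderboard's or Director-AI's evaluation code. The function name `balanced_accuracy`, the label convention (1 = supported, 0 = unsupported), and the `>=` comparison against the threshold are assumptions made for the example, and the labels and scores are toy values.

```python
from typing import Sequence


def balanced_accuracy(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> float:
    """Mean of recall on the supported class (1) and the unsupported class (0).

    Assumption: a score at or above the threshold is predicted as supported.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    preds = [1 if s >= threshold else 0 for s in scores]
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    true_positive_rate = sum(pos) / len(pos)
    true_negative_rate = (len(neg) - sum(neg)) / len(neg)
    return (true_positive_rate + true_negative_rate) / 2


# Toy data: three supported claims, one unsupported claim, every score high.
# Plain accuracy would be 0.75, but balanced accuracy exposes the missed class.
print(balanced_accuracy([1, 1, 1, 0], [0.9, 0.9, 0.9, 0.9], threshold=0.46))  # 0.5
```

Because each class contributes equally, the metric is sensitive to how the threshold trades recall on one class against the other, which is why results at different thresholds are not directly comparable.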
Director-AI uses the same default scorer family and adds:
- SDK guard and direct scoring APIs;
- REST, gRPC, middleware, and proxy deployment surfaces;
- RAG evidence integration;
- token-level streaming halt;
- structured verification;
- injection detection;
- audit and compliance artefacts;
- Rust-accelerated compute paths where installed.
The leaderboard measures only the factuality scorer. It does not measure these control-plane features.
Local Result Packet
The local result packet reports:
| Field | Value |
|---|---|
| Benchmark | LLM-AggreFact |
| Model | yaxili96/FactCG-DeBERTa-v3-Large |
| Threshold | 0.46 |
| Total samples | 29,320 |
| Average balanced accuracy | 75.8 percent |
Primary file:
benchmarks/results/aggrefact_yaxili96_FactCG-DeBERTa-v3-Large.json
This packet should be treated as reproducibility evidence for the submitted Director-AI row, not as an upstream acknowledgement.
Per-Dataset Local Results
| Dataset | Balanced accuracy |
|---|---|
| Reveal | 89.14 percent |
| Lfqa | 86.38 percent |
| RAGTruth | 82.18 percent |
| ClaimVerify | 78.13 percent |
| Wice | 76.94 percent |
| TofuEval-MeetB | 74.26 percent |
| AggreFact-XSum | 74.25 percent |
| FactCheck-GPT | 73.02 percent |
| TofuEval-MediaS | 71.85 percent |
| AggreFact-CNN | 68.82 percent |
| ExpertQA | 59.09 percent |
The weakness map matters for deployment planning. At the default threshold, ExpertQA (59.09 percent) is the clear outlier, and class-imbalanced summarisation sets such as AggreFact-CNN (68.82 percent) are the other main risk area.
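The per-dataset figures can be cross-checked against the reported average with a few lines of Python. This is a sketch that assumes the headline number is the unweighted mean of the 11 per-dataset values in the table; the names `PER_DATASET_BA`, `macro_average`, and `weakest` are illustrative and do not come from the Director-AI codebase.

```python
# Per-dataset balanced accuracy (percent), copied from the table above.
PER_DATASET_BA = {
    "Reveal": 89.14,
    "Lfqa": 86.38,
    "RAGTruth": 82.18,
    "ClaimVerify": 78.13,
    "Wice": 76.94,
    "TofuEval-MeetB": 74.26,
    "AggreFact-XSum": 74.25,
    "FactCheck-GPT": 73.02,
    "TofuEval-MediaS": 71.85,
    "AggreFact-CNN": 68.82,
    "ExpertQA": 59.09,
}


def macro_average(values: dict) -> float:
    """Unweighted mean over datasets (assumed aggregation rule)."""
    return sum(values.values()) / len(values)


def weakest(values: dict, n: int = 2) -> list:
    """Return the n lowest-scoring (dataset, score) pairs, lowest first."""
    return sorted(values.items(), key=lambda item: item[1])[:n]


print(round(macro_average(PER_DATASET_BA), 1))  # 75.8
print(weakest(PER_DATASET_BA))  # [('ExpertQA', 59.09), ('AggreFact-CNN', 68.82)]
```

Under that assumption the table reproduces the 75.8 percent average in the local result packet, and the two lowest rows are the natural starting point for deployment-specific threshold checks.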
External Claim Language
Use this wording until the upstream submission is acknowledged:
Director-AI's default FactCG scorer is near the top of the public LLM-AggreFact factuality results and ships with production guardrail surfaces for streaming halt, RAG evidence, structured verification, APIs, and audit evidence. A Director-AI result packet has been submitted for leaderboard review.
Do not state that Director-AI has a separate official leaderboard row until the maintainers add or acknowledge it.
Non-Comparable Metrics
These results are valuable but should not be merged into the LLM-AggreFact claim:
- HaluEval end-to-end catch, precision, false-positive rate, and F1;
- local judge benchmark results;
- streaming false-halt diagnostics;
- prompt-injection replication results;
- customer-specific acceptance tests;
- tuned-threshold replay results.
Each answers a different operational question and should keep its own dataset, metric, threshold, and hardware record.
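One way to keep these results from being merged by accident is to attach the dataset, metric, threshold, and hardware to every number and refuse to average records that differ. The sketch below is hypothetical: `MetricRecord`, `group_key`, and `average` are illustrative names rather than Director-AI APIs, the `hardware` strings are placeholders, and the HaluEval value is deliberately `nan` because no measured figure is given here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MetricRecord:
    benchmark: str
    dataset: str
    metric: str
    threshold: Optional[float]
    hardware: str
    value: float

    def group_key(self) -> tuple:
        # Records are comparable only if benchmark, metric, and threshold match.
        return (self.benchmark, self.metric, self.threshold)


def average(records: Sequence[MetricRecord]) -> float:
    """Average values, but only across records that share one group key."""
    if not records:
        raise ValueError("no records to average")
    keys = {r.group_key() for r in records}
    if len(keys) != 1:
        raise ValueError(f"refusing to merge non-comparable records: {sorted(keys, key=str)}")
    return sum(r.value for r in records) / len(records)


# Two rows from the per-dataset table; hardware is a placeholder string.
reveal = MetricRecord("LLM-AggreFact", "Reveal", "balanced_accuracy", 0.46, "unspecified", 89.14)
lfqa = MetricRecord("LLM-AggreFact", "Lfqa", "balanced_accuracy", 0.46, "unspecified", 86.38)

# Placeholder record for a different benchmark; nan is not a measured result.
halueval = MetricRecord("HaluEval", "end-to-end", "f1", None, "unspecified", float("nan"))

print(average([reveal, lfqa]))  # fine: same benchmark, metric, and threshold
try:
    average([reveal, halueval])
except ValueError as err:
    print(err)  # mixing benchmarks is rejected
```

Keying on the threshold as well as the benchmark means a tuned-threshold replay cannot silently be folded into the global-threshold average.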
Next Validation Actions
- Preserve the submitted packet unchanged unless the maintainer requests a different format.
- If there is no response after the follow-up window, open an upstream pull request with the same packet and command record.
- Add the upstream acknowledgement or pull-request URL to docs/ROADMAP_STATUS.md.
- Re-run the external packet before the next public release if the scorer, threshold, model revision, or dataset loader changes.
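The re-run condition can be made mechanical by comparing the configuration recorded with the submitted packet against the configuration of the release candidate. The following is a minimal sketch under stated assumptions: the dictionary keys, the `changed_fields` and `needs_rerun` helpers, and the `rev-a` and `loader-v1` values are all hypothetical, and only the model name and the 0.46 threshold come from this report.

```python
# Fields whose change should trigger a re-run of the external packet.
RERUN_FIELDS = ("scorer", "threshold", "model_revision", "dataset_loader")


def changed_fields(previous: dict, current: dict) -> list:
    """List the tracked fields whose values differ between two configs."""
    return [f for f in RERUN_FIELDS if previous.get(f) != current.get(f)]


def needs_rerun(previous: dict, current: dict) -> bool:
    """True if any tracked field changed since the submitted packet."""
    return bool(changed_fields(previous, current))


# Config recorded with the submitted packet (revision and loader are placeholders).
submitted = {
    "scorer": "yaxili96/FactCG-DeBERTa-v3-Large",
    "threshold": 0.46,
    "model_revision": "rev-a",
    "dataset_loader": "loader-v1",
}

# Hypothetical release candidate that only changes the threshold.
candidate = dict(submitted, threshold=0.50)

print(changed_fields(submitted, candidate))  # ['threshold']
print(needs_rerun(submitted, submitted))     # False
```

Reporting which field changed, rather than only a yes or no, makes it easier to record why a packet was regenerated.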