Guardrail Landscape

Last reviewed: 2026-06-02.

LLM guardrails are not one product category. They cover several distinct failure modes, and a system that performs well against one may offer no protection against another. Director-AI is positioned as a factual-coherence and evidence-control layer, not as a replacement for moderation, access control, or human review.

Guardrail Categories

Category	Main question	Typical tools	Director-AI fit
Content safety	Is the prompt or answer unsafe, abusive, or policy-violating?	Safety classifiers, provider moderation, human review queues	Complementary
Prompt-injection defence	Is the user or retrieved content trying to redirect the model?	Intent classifiers, prompt hardening, retrieval sanitisation, tool gating	Partial coverage through injection detection and policy gates
Grounded factuality	Is the answer supported by the supplied evidence?	NLI, claim checking, RAG faithfulness scoring, verifier models	Primary lane
Retrieval integrity	Did the retrieval layer expose the right evidence to the model?	Access control, document provenance, vector-store checks, citation audits	Supported through knowledge and evidence APIs
Structured-output control	Does the answer satisfy a schema or machine contract?	JSON schema validation, type checks, deterministic validators	Supported
Tool and agent control	Should the model be allowed to call this tool or hand off this trace?	Least-privilege tool policies, approval gates, trajectory review	Supported through guard control and trajectory surfaces
Audit and governance	Can an operator reconstruct what happened and why?	Telemetry, compliance reports, immutable evidence packs	Primary enterprise lane

The practical production pattern is layered. A content-safety classifier should not be expected to prove factual grounding. A factuality checker should not be expected to replace data-loss prevention or tool permissioning. A prompt filter should not be treated as a complete agent-security boundary.
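
The layering described above can be sketched as a chain of independent checks whose verdicts are all recorded. This is a minimal illustration, not Director-AI's API: `Verdict`, `safety_layer`, `grounding_layer`, and `run_layers` are hypothetical names, and both layers are toy stand-ins (a blocklist and a substring match) for a real safety classifier and a real NLI or claim-checking model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    layer: str
    passed: bool
    reason: str = ""


# Hypothetical interface: a layer takes (answer, evidence) and returns a Verdict.
Layer = Callable[[str, List[str]], Verdict]


def safety_layer(answer: str, evidence: List[str]) -> Verdict:
    # Toy stand-in for a content-safety classifier: a fixed blocklist.
    blocked = ("forbidden-term",)
    hit = any(term in answer.lower() for term in blocked)
    return Verdict("content_safety", not hit, "blocked phrase" if hit else "")


def grounding_layer(answer: str, evidence: List[str]) -> Verdict:
    # Toy stand-in for an NLI or claim-checking model: naive substring support.
    supported = any(answer.lower() in doc.lower() for doc in evidence)
    reason = "" if supported else "no supporting evidence"
    return Verdict("grounded_factuality", supported, reason)


def run_layers(layers: List[Layer], answer: str, evidence: List[str]) -> List[Verdict]:
    # Run every layer so each independent decision is available for audit.
    return [layer(answer, evidence) for layer in layers]


def first_failure(verdicts: List[Verdict]) -> Optional[Verdict]:
    # Return the first failing verdict, or None if every layer passed.
    return next((v for v in verdicts if not v.passed), None)


evidence = ["The warranty period is 12 months."]
verdicts = run_layers(
    [safety_layer, grounding_layer],
    "The warranty period is 24 months.",
    evidence,
)
for v in verdicts:
    print(v.layer, "passed" if v.passed else f"failed ({v.reason})")
```

In this example the answer is harmless, so the safety layer passes it, yet it contradicts the evidence and only the grounding layer rejects it. That is the reason no single layer is treated as sufficient.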

Where Director-AI Competes

Director-AI is most competitive when the protected workflow needs:

document-grounded answers;
private knowledge-base evidence;
streamed output that may need to halt before completion;
reusable guard policies across local, REST, SDK, and proxy integrations;
inspectable scores, evidence, and audit events;
customer-specific acceptance evidence before deployment.
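
To illustrate what "streamed output that may need to halt before completion" involves, the sketch below wraps a token stream and stops emitting once a scorer drops below a threshold. It is a generic pattern under stated assumptions, not Director-AI's streaming implementation: `guarded_stream`, the `score` callable, the `threshold` default, and the `window` size are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def guarded_stream(
    tokens: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
    window: int = 4,
) -> Iterator[str]:
    """Yield tokens in checked windows; halt when the score falls below threshold."""
    accepted: List[str] = []  # tokens already released to the client
    pending: List[str] = []   # tokens held back until the next check passes
    for token in tokens:
        pending.append(token)
        if len(pending) >= window:
            if score("".join(accepted + pending)) < threshold:
                return  # halt: the pending window is never emitted
            yield from pending
            accepted.extend(pending)
            pending = []
    # Check and flush any trailing partial window.
    if pending and score("".join(accepted + pending)) >= threshold:
        yield from pending


def toy_score(text: str) -> float:
    # Assumption for the demo only: any mention of "cheese" counts as unsupported.
    return 0.0 if "cheese" in text else 1.0


stream = ["The ", "moon ", "is ", "rocky ", "and ", "made ", "of ", "cheese"]
released = "".join(guarded_stream(stream, toy_score, window=2))
print(repr(released))
```

Holding each window back until it passes means unchecked text never reaches the client, at the cost of up to `window` tokens of added latency. A smaller window halts sooner but calls the scorer more often.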

It is less suitable as the only control for:

generic user-generated-content moderation;
image or video safety review;
broad social-platform trust and safety;
endpoint data-loss prevention;
authorisation for high-risk tools.

Those controls can still sit beside Director-AI in a complete deployment.

Current Public Benchmark Position

The live LLM-AggreFact leaderboard lists FactCG-DeBERTa-L at 75.6 percent per-dataset mean balanced accuracy. Director-AI's default scorer uses the same FactCG-DeBERTa-v3-Large model family and ships the surrounding guard, streaming, evidence, API, and deployment surfaces.

The repository also carries a local reproducibility packet reporting 75.8 percent at threshold 0.46:

benchmarks/results/aggrefact_yaxili96_FactCG-DeBERTa-v3-Large.json

That packet is close to the public leaderboard row but is not yet a separate official Director-AI leaderboard entry. A submission email was sent on 2026-06-02. Until the leaderboard maintainers acknowledge the submission, the conservative public statement is:

Director-AI's default FactCG scorer is near the top public LLM-AggreFact factuality results while adding production guardrail, streaming, evidence, and deployment features that the leaderboard does not measure.
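
For readers unfamiliar with the metric quoted in this section, the sketch below shows how a per-dataset mean balanced accuracy at a fixed threshold can be computed. It is an illustration on invented toy data, not the leaderboard's evaluation code or the repository's benchmark script. The function names, the `>=` decision rule, and the dataset names are assumptions; only the threshold value 0.46 comes from the text above.

```python
from statistics import mean
from typing import Dict, List, Tuple


def balanced_accuracy(labels: List[int], scores: List[float], threshold: float) -> float:
    """Mean of true-positive rate and true-negative rate at a fixed threshold."""
    # Assumption: a score at or above the threshold predicts "supported" (label 1).
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return (tpr + tnr) / 2


def per_dataset_mean_bacc(
    datasets: Dict[str, Tuple[List[int], List[float]]], threshold: float
) -> float:
    # Average the per-dataset scores so large datasets do not dominate the result.
    return mean(balanced_accuracy(y, s, threshold) for y, s in datasets.values())


# Invented toy data: (labels, scorer outputs) per dataset.
toy = {
    "dataset_a": ([1, 1, 0, 0], [0.9, 0.4, 0.3, 0.6]),
    "dataset_b": ([1, 0, 0, 0], [0.5, 0.1, 0.2, 0.45]),
}
result = per_dataset_mean_bacc(toy, threshold=0.46)
print(f"{result:.2f}")
```

Because the result depends on the threshold, a figure reported at one threshold is not directly interchangeable with a figure produced under a different thresholding convention.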

Feature Comparison

Capability	Content-safety classifiers	Programmable rail frameworks	Cloud managed guardrails	Director-AI
Harm and policy moderation	Primary	Depends on configured model	Primary	Complementary
Grounded factuality scoring	Usually limited	Depends on configured checks	Limited to service-specific features	Primary
Private RAG evidence links	Usually no	Possible with integration work	Platform-specific	Yes
Token-level streaming halt	Usually no	Usually post-call or flow-level	Often complete-response only	Yes
Local/offline scoring path	Often yes for open models	Depends on model provider	No	Yes
API/proxy/SDK deployment	Varies	Yes	Managed service APIs	Yes
Customer-specific evidence packs	Usually no	Operator-built	Platform-specific	Yes
Deterministic structured checks	Usually no	Possible	Platform-specific	Yes
Tool and handoff guard surfaces	Usually no	Flow-dependent	Platform-specific	Yes
Commercial audit narrative	Usually moderation-centred	Operator-built	Platform-provided	Evidence-centred

Claim Boundaries

Keep these statements separate in external material:

Official LLM-AggreFact leaderboard rows.
Local reproducibility packets stored in this repository.
Tuned-threshold replay results, which are useful for deployment but not the same as the default leaderboard convention.
End-to-end HaluEval and RAG guardrail metrics, which measure a different operational surface.
Streaming false-halt diagnostics, which are not hallucination catch-rate claims.
Customer-specific acceptance results, which require customer data and customer approval criteria.
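
One way to enforce this separation in tooling is to tag every reported number with its claim kind and refuse to compare numbers across kinds. The sketch below is a hypothetical data structure, not part of Director-AI: `ClaimKind`, `Claim`, `comparable`, and `describe` are illustrative names. The two example values are the figures quoted earlier on this page.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimKind(Enum):
    OFFICIAL_LEADERBOARD = "official LLM-AggreFact leaderboard row"
    LOCAL_REPRODUCIBILITY = "local reproducibility packet"
    TUNED_THRESHOLD = "tuned-threshold replay"
    END_TO_END = "end-to-end HaluEval or RAG guardrail metric"
    STREAMING_FALSE_HALT = "streaming false-halt diagnostic"
    CUSTOMER_ACCEPTANCE = "customer-specific acceptance result"


@dataclass(frozen=True)
class Claim:
    kind: ClaimKind
    metric: str
    value: float


def comparable(a: Claim, b: Claim) -> bool:
    # Two numbers may share a table or sentence only if kind and metric match.
    return a.kind is b.kind and a.metric == b.metric


def describe(claim: Claim) -> str:
    # Always print the claim kind next to the number so it cannot be detached.
    return f"{claim.metric}: {claim.value:.1f} ({claim.kind.value})"


official = Claim(ClaimKind.OFFICIAL_LEADERBOARD, "balanced accuracy", 75.6)
local = Claim(ClaimKind.LOCAL_REPRODUCIBILITY, "balanced accuracy", 75.8)
print(describe(official))
print(describe(local))
print(comparable(official, local))
```

Here the two numbers use the same metric but come from different claim kinds, so `comparable` returns `False` and each is printed with its provenance attached.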

Sources To Recheck Before External Claims

LLM-AggreFact leaderboard: https://llm-aggrefact.github.io/
OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
NVIDIA NeMo Guardrails documentation: https://docs.nvidia.com/nemo-guardrails/
Amazon Bedrock Guardrails automated reasoning documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-automated-reasoning-checks.html
Llama Guard model documentation: https://huggingface.co/meta-llama/Llama-Guard-4-12B