Evidence Firewall
The evidence firewall screens every retrieved chunk before the model sees it. Grounding only helps if the chunks reaching the model are ones the tenant is allowed to read, come from a verified write, are still fresh, and carry facts rather than injected instructions. The firewall enforces this as an admission gate between retrieval and the model: each chunk runs through ten checks and is either admitted into the grounding context or quarantined with a stable, tenant-safe reason code.
It is side-effect-free apart from counter metrics, deterministic for a given clock, and opt-in — retrieval behaves exactly as before unless a deployment enables it.
The ten admission checks
| Check | Fails when | Reason code |
|---|---|---|
| Tenant authorisation | the chunk is owned by a tenant the request may not read | tenant_mismatch |
| Provenance present | no content digest, version, or signature marker exists | provenance_missing |
| Signature verified | the write was not signature-verified at ingest | signature_unverified |
| Content-hash match | a recorded text digest no longer matches the text | content_hash_mismatch |
| Expiry | the recorded expires_at has passed | expired |
| Age | the chunk is older than the policy's max age | too_old |
| Source owner | no source owner/key is recorded | source_owner_unknown |
| Sensitivity | the label is not in the allowed set | sensitivity_blocked |
| Allowed use case | the chunk's use-case list excludes the request | use_case_not_allowed |
| Poisoning scan | the text reads as an injected instruction | poisoning_detected |
A chunk owned by an empty tenant denotes a shared, non-tenant corpus and passes the tenant check. A chunk with no recorded text digest passes the content-hash check — absence of a digest is the provenance check's job, not the integrity check's. A chunk with no allowed-use-case list is unrestricted.
The reason codes are drawn from a closed vocabulary and never contain chunk text, so a quarantine decision can be logged and shipped to a customer audit trail without leaking another tenant's data.
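The pass-through rules above can be sketched in a few lines. This is a minimal illustration covering four of the ten checks, not the library's implementation: `SketchChunk`, its field names, and `failed_reasons` are hypothetical, and the tenant rule is simplified to "a request may read only its own tenant's chunks and shared chunks".

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Hypothetical chunk record for illustration; the library's real chunk type may differ.
@dataclass(frozen=True)
class SketchChunk:
    text: str
    tenant_id: str = ""                      # empty means a shared, non-tenant corpus
    text_sha256: Optional[str] = None        # recorded digest, if any
    expires_at: Optional[float] = None       # Unix seconds
    allowed_use_cases: Optional[Tuple[str, ...]] = None


def failed_reasons(chunk: SketchChunk, tenant_id: str, use_case: str,
                   now_unix: float) -> List[str]:
    """Return reason codes for the checks this chunk fails (four of the ten)."""
    reasons = []
    # Tenant check: chunks with an empty owner are shared and pass for every tenant.
    if chunk.tenant_id and chunk.tenant_id != tenant_id:
        reasons.append("tenant_mismatch")
    # Content-hash check: compares only when a digest was recorded.
    if chunk.text_sha256 is not None:
        actual = hashlib.sha256(chunk.text.encode("utf-8")).hexdigest()
        if actual != chunk.text_sha256:
            reasons.append("content_hash_mismatch")
    # Expiry check: the clock is an argument, so the result is fixed for a given now_unix.
    if chunk.expires_at is not None and now_unix > chunk.expires_at:
        reasons.append("expired")
    # Use-case check: a missing list means the chunk is unrestricted.
    if chunk.allowed_use_cases is not None and use_case not in chunk.allowed_use_cases:
        reasons.append("use_case_not_allowed")
    return reasons
```

Every returned value is a fixed string from the table, so the result can be logged without carrying any chunk text.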
Policy
FirewallPolicy selects which checks are enforced and with what bounds.
Defaults are fail-closed on the integrity-critical checks (tenant, provenance,
signature, content hash, expiry, poisoning) and opt-in on the corpus-shape
checks (sensitivity labels, declared use case, source owner) that depend on a
customer taxonomy.
```python
import time

from director_ai.core.evidence_firewall import (
    EvidenceFirewall,
    FirewallContext,
    FirewallPolicy,
)

# Enforce the default checks plus a 90-day maximum chunk age.
firewall = EvidenceFirewall(FirewallPolicy(max_age_seconds=90 * 86_400))

now = time.time()  # current Unix time in seconds
# `results` is the batch of chunks returned by retrieval.
report = firewall.screen(results, FirewallContext(tenant_id="acme", now_unix=now))

for chunk in report.admitted:  # safe to hand to the model
    ...
for verdict in report.quarantined:  # held back, with reasons
    print(verdict.chunk.chunk_id, verdict.failed_reasons)
```
FirewallPolicy.permissive() is an explicit, named "firewall disabled" posture
for non-tenant development corpora; production should never use it.
The poisoning scan is dependency-free and runs on the hot path. It scores how
strongly a chunk's text reads as an instruction aimed at the model — "ignore
the previous instructions", "you are now …", a leaked system-prompt
fragment, or an embedded tool-call literal — rather than a fact aimed at the
user. The model-backed InjectionDetector can be supplied in its place.
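The heuristic scan described above can be pictured as weighted pattern matching. The following is a minimal sketch under stated assumptions, not the library's scanner: the patterns, weights, threshold, and the names `poison_score` and `is_poisoned` are all hypothetical.

```python
import re

# Hypothetical patterns and weights for illustration only; the library's real
# scanner and its scoring are not shown in this document.
_PATTERNS = (
    (re.compile(r"\bignore\s+(?:all\s+)?(?:the\s+)?previous\s+instructions\b", re.I), 0.6),
    (re.compile(r"\byou\s+are\s+now\b", re.I), 0.4),
    (re.compile(r"\bsystem\s+prompt\b", re.I), 0.3),
    (re.compile(r'"tool_call"\s*:', re.I), 0.5),
)


def poison_score(text: str) -> float:
    """Sum the weights of the patterns that match, capped at 1.0."""
    score = sum((weight for pattern, weight in _PATTERNS if pattern.search(text)), 0.0)
    return min(score, 1.0)


def is_poisoned(text: str, threshold: float = 0.5) -> bool:
    """Flag the chunk when its score reaches the assumed threshold."""
    return poison_score(text) >= threshold
```

Scoring with a threshold, rather than rejecting on any single match, lets a weak signal such as "you are now" pass on its own while a combination of signals trips the check.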
Wiring into retrieval
VectorGroundTruthStore takes an optional evidence_firewall. When supplied,
every retrieval batch is screened inside the active-results filter, so
quarantined chunks never reach retrieve_context or
retrieve_context_with_chunks. When omitted, retrieval is unchanged.
```python
from director_ai.core.evidence_firewall import EvidenceFirewall
from director_ai.core.retrieval.vector_store import VectorGroundTruthStore

store = VectorGroundTruthStore(evidence_firewall=EvidenceFirewall())
```
DirectorConfig builds and attaches the firewall automatically from
evidence_firewall_* settings. The firewall is off by default; set
evidence_firewall_enabled=True to turn it on:
```python
from director_ai.core.config import DirectorConfig

config = DirectorConfig(
    evidence_firewall_enabled=True,
    evidence_firewall_max_age_seconds=90 * 86_400,
    evidence_firewall_enforce_sensitivity=True,
    evidence_firewall_allowed_sensitivity=("public", "internal"),
)
store = config.build_store()  # firewall attached
```
Report shape
FirewallReport.to_dict() and ChunkVerdict.to_dict() serialise to JSON-safe,
tenant-safe dicts — chunk ids, per-check outcomes, and reason codes, never raw
text — suitable for an audit record:
```json
{
  "admitted_count": 1,
  "quarantined_count": 1,
  "verdicts": [
    {"chunk_id": "doc1", "admitted": true, "checks": [...], "failed_reasons": []},
    {"chunk_id": "doc2", "admitted": false,
     "checks": [...], "failed_reasons": ["signature_unverified"]}
  ]
}
```
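Because the report dict contains only ids and reason codes, an audit consumer can work with it directly. The sketch below assumes only the keys shown in the report above; `quarantine_reasons` is a hypothetical helper, and the per-check lists are left empty in the sample.

```python
import json
from collections import Counter


def quarantine_reasons(report: dict) -> Counter:
    """Count failed reason codes across the quarantined verdicts of a report dict."""
    counts = Counter()
    for verdict in report["verdicts"]:
        if not verdict["admitted"]:
            counts.update(verdict["failed_reasons"])
    return counts


# Sample shaped like the report above; the per-check lists are left empty here.
report = {
    "admitted_count": 1,
    "quarantined_count": 1,
    "verdicts": [
        {"chunk_id": "doc1", "admitted": True, "checks": [], "failed_reasons": []},
        {"chunk_id": "doc2", "admitted": False, "checks": [],
         "failed_reasons": ["signature_unverified"]},
    ],
}

audit_line = json.dumps(report, sort_keys=True)  # serialise as one JSON line
print(quarantine_reasons(report))
```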
Metrics
- evidence_firewall_chunks_screened_total: every chunk the firewall saw.
- evidence_firewall_chunks_quarantined_total{reason}: quarantines, labelled by reason code, so a dashboard can show which check is dropping chunks.
Performance
Screening is branching plus a native hashlib SHA-256 recompute, so there is
no Rust path to add — the digest is already a C path and the rest is control
flow. benchmarks/evidence_firewall.py reports per-chunk screening latency and
the poison-scan share of it; on the reference workstation a chunk screens in
roughly 12 µs, of which the poison scan is about a third.
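To get a feel for the digest cost on other hardware, the SHA-256 recompute can be timed in isolation. This is a rough sketch, not the contents of benchmarks/evidence_firewall.py: the helper names and the sample text are hypothetical, and it measures the digest check alone, so its numbers are not comparable with the per-chunk figure above.

```python
import hashlib
import timeit


def digest_matches(text: str, recorded_hex: str) -> bool:
    """Recompute the SHA-256 of the chunk text and compare it with the recorded digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest() == recorded_hex


def mean_seconds_per_call(text: str, recorded_hex: str, calls: int = 10_000) -> float:
    """Average wall-clock cost of one digest check over `calls` repetitions."""
    total = timeit.timeit(lambda: digest_matches(text, recorded_hex), number=calls)
    return total / calls


chunk_text = "Refunds are accepted within 30 days of purchase. " * 20  # about 1 KB
recorded = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
print(f"{mean_seconds_per_call(chunk_text, recorded) * 1e6:.2f} µs per digest check")
```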