Token-level hallucinated-span detection¶
Status: opt-in, model-backed (
director-ai[nli]). Off by default; setspan_detection_enabledto load the token classifier. A flagged span means "this text is not supported by the context", reported for review.
The response-level scorer and the claim-decompose path judge a whole answer or a
whole claim. They miss the failure mode that dominates RAGTruth: a short
baseless addition — a plausible detail absent from the source — embedded in
otherwise-grounded text. Entailment scoring treats the answer as mostly supported
and lets the span through. This detector works at the token level instead: it
reads [context] [SEP] [response] through a ModernBERT token classifier, labels
each response token supported or hallucinated, and reports the character spans
that are unsupported.
Measured results¶
benchmarks/ragtruth_token_detector_bench.py on the balanced RAGTruth test split
(2700 examples: 943 hallucinated / 1757 grounded), example level (a response is
flagged when it contains any hallucinated span):
| approach | F1 | balanced acc | FPR | precision |
|---|---|---|---|---|
| NLI / claim-decompose | 0.366 | — | 0.347 | — |
| token detector | 0.763 | 0.814 | 0.071 | 0.841 |
| LettuceDetect (reference) | 0.792 | — | — | — |
Operating point: token probability ≥ 0.95, at least one flagged token. The detector reaches parity with the open token-level baseline on balanced accuracy and sits a few F1 points behind LettuceDetect; it is not a differentiator over that open-source model, but it removes a real weakness in the grounding path. Context is truncated to 1024 tokens, which caps recall on long sources.
Usage¶
from director_ai.guard import ProductionGuard
from director_ai.core.config import DirectorConfig
guard = ProductionGuard(config=DirectorConfig(span_detection_enabled=True))
result = guard.detect_spans(
context="The restaurant serves Chinese and Szechuan dishes.",
response="It serves Szechuan dishes. The head chef won three Michelin stars in 2019.",
)
result.hallucinated # True
[s.text for s in result.spans]
# -> [' chef won', ' Michelin', ' in 2019'] # the fabricated detail, unsupported by the context
The default operating point flags a token only at P(hallucinated) >= 0.95
(span_token_threshold). A borderline addition that scores just under the cut is
left unflagged; lower the threshold to trade precision for recall.
Direct use without the guard:
from director_ai.core.scoring.span_detector import HallucinationSpanDetector
detector = HallucinationSpanDetector.from_pretrained()
detection = detector.detect(context, response)
detection.coverage # fraction of response tokens flagged
Configuration¶
| Field | Default | Meaning |
|---|---|---|
span_detection_enabled |
False |
load the detector (opt-in) |
span_model |
anulum/director-ragtruth-token-modernbert |
token classifier |
span_token_threshold |
0.95 |
per-token probability cut |
span_min_tokens |
1 |
flagged tokens needed to mark a response |
span_max_length |
1024 |
context+response token budget |
span_device |
-1 |
CUDA device index, -1 for CPU |
Performance¶
The transformer forward pass dominates the cost, but the token-probability →
character-span reduction that follows it has a Rust path
(backfire_kernel.rust_merge_flagged_spans) with a bit-identical pure-Python
floor. The detector uses the accelerator transparently when it is installed and
falls back to Python otherwise — span output is identical either way (a
20,000-case differential check finds no drift), so the Rust build is an
optimisation, never a correctness dependency.
Measured by benchmarks/rust_compute_bench.py (functional evidence — shared
workstation, not an isolated claim-grade run):
| reduction input | Python floor | Rust path | speedup |
|---|---|---|---|
| 30-token response | 4.32 µs | 1.68 µs | 2.6× |
| 400-token response | 59.52 µs | 22.04 µs | 2.7× |
Where it fits¶
This is a span-level grounding signal, complementary to the real-time streaming halt (which acts on contradiction during generation) and the response-level scoring. Use it for post-hoc review of RAG answers where you need to know which phrases are unsupported, not just whether the answer as a whole is grounded.