Token-level hallucinated-span detection¶

Status: opt-in, model-backed (director-ai[nli]). Off by default; set span_detection_enabled to load the token classifier. A flagged span means "this text is not supported by the context", reported for review.

The response-level scorer and the claim-decompose path judge a whole answer or a whole claim. They miss the failure mode that dominates RAGTruth: a short baseless addition — a plausible detail absent from the source — embedded in otherwise-grounded text. Entailment scoring treats the answer as mostly supported and lets the span through. This detector works at the token level instead: it reads [context] [SEP] [response] through a ModernBERT token classifier, labels each response token supported or hallucinated, and reports the character spans that are unsupported.

Measured results¶

benchmarks/ragtruth_token_detector_bench.py on the balanced RAGTruth test split (2700 examples: 943 hallucinated / 1757 grounded), example level (a response is flagged when it contains any hallucinated span):

approach	F1	balanced acc	FPR	precision
NLI / claim-decompose	0.366	—	0.347	—
token detector	0.763	0.814	0.071	0.841
LettuceDetect (reference)	0.792	—	—	—

Operating point: token probability ≥ 0.95, at least one flagged token. The detector reaches parity with the open token-level baseline on balanced accuracy and sits a few F1 points behind LettuceDetect; it is not a differentiator over that open-source model, but it removes a real weakness in the grounding path. Context is truncated to 1024 tokens, which caps recall on long sources.

Usage¶

from director_ai.guard import ProductionGuard
from director_ai.core.config import DirectorConfig

guard = ProductionGuard(config=DirectorConfig(span_detection_enabled=True))

result = guard.detect_spans(
    context="The restaurant serves Chinese and Szechuan dishes.",
    response="It serves Szechuan dishes. The head chef won three Michelin stars in 2019.",
)
result.hallucinated          # True
[s.text for s in result.spans]
# -> [' chef won', ' Michelin', ' in 2019']   # the fabricated detail, unsupported by the context

The default operating point flags a token only at P(hallucinated) >= 0.95 (span_token_threshold). A borderline addition that scores just under the cut is left unflagged; lower the threshold to trade precision for recall.

Direct use without the guard:

from director_ai.core.scoring.span_detector import HallucinationSpanDetector

detector = HallucinationSpanDetector.from_pretrained()
detection = detector.detect(context, response)
detection.coverage           # fraction of response tokens flagged

Configuration¶

Field	Default	Meaning
`span_detection_enabled`	`False`	load the detector (opt-in)
`span_model`	`anulum/director-ragtruth-token-modernbert`	token classifier
`span_token_threshold`	`0.95`	per-token probability cut
`span_min_tokens`	`1`	flagged tokens needed to mark a response
`span_max_length`	`1024`	context+response token budget
`span_device`	`-1`	CUDA device index, `-1` for CPU

Performance¶

The transformer forward pass dominates the cost, but the token-probability → character-span reduction that follows it has a Rust path (backfire_kernel.rust_merge_flagged_spans) with a bit-identical pure-Python floor. The detector uses the accelerator transparently when it is installed and falls back to Python otherwise — span output is identical either way (a 20,000-case differential check finds no drift), so the Rust build is an optimisation, never a correctness dependency.

Measured by benchmarks/rust_compute_bench.py (functional evidence — shared workstation, not an isolated claim-grade run):

reduction input	Python floor	Rust path	speedup
30-token response	4.32 µs	1.68 µs	2.6×
400-token response	59.52 µs	22.04 µs	2.7×

Where it fits¶

This is a span-level grounding signal, complementary to the real-time streaming halt (which acts on contradiction during generation) and the response-level scoring. Use it for post-hoc review of RAG answers where you need to know which phrases are unsupported, not just whether the answer as a whole is grounded.