Hallucination Root-Cause Analysis¶
Move from detective to prescriptive: where the mechanistic interpretability attributor (ReDeEP) says which layers and attention heads produced a hallucination signal, the root-cause analyzer says why and what to fix — classifying the dominant failure mode and emitting targeted fine-tuning recommendations that can feed the Customer Model Factory.
Failure modes¶
| Cause | Mechanistic signature | Prescription |
|---|---|---|
parametric_knowledge_override |
A Knowledge-FFN layer with high ffn_knowledge and low external_attention |
Fine-tune to down-weight parametric recall, up-weight retrieved context |
attention_ignores_evidence |
Low external_attention without strong parametric recall |
Add retrieval-grounding training examples |
underactive_copying_heads |
Copying-Heads with low copying_score |
Train on copy-from-context exemplars |
Quick start¶
from director_ai import ProductionGuard
from director_ai.core.config import DirectorConfig
guard = ProductionGuard(DirectorConfig())
# `report` is a MechanisticAttributionReport from the ReDeEP attributor.
diagnosis = guard.root_cause_analyzer.diagnose(report)
print(diagnosis.dominant_cause) # e.g. "parametric_knowledge_override"
print(diagnosis.implicated_layers) # e.g. (18, 19, 20)
for rec in diagnosis.recommendations:
print(rec.cause, "→", rec.action, rec.targets)
diagnose() returns a RootCauseDiagnosis with the dominant_cause, all
detected causes, per-cause Recommendations (cause, action, targets like
ffn_layer:18 / head:5.2), and the implicated_layers / implicated_heads.
When the report is not a hallucination the dominant cause is no_hallucination
with no recommendations.
Tenant safety¶
The diagnosis is computed from the attribution report alone — no model access and
no prompt/response text. to_dict() carries only component indices, scores, and
recommendations, so it is safe to log and to attach to audit records.
Tuning¶
from director_ai.core.interpretability import HallucinationRootCauseAnalyzer
analyzer = HallucinationRootCauseAnalyzer(
parametric_knowledge_threshold=0.5, # FFN knowledge at/above → strong recall
low_attention_threshold=0.3, # external attention at/below → ignoring evidence
low_copying_threshold=0.3, # copying score at/below → weak copying head
)
Notes¶
- Pure deterministic analysis over the attribution report; no ML stack required
(the report's signals come from the ReDeEP
ActivationProvider, which a real deployment backs with transformer activations/attention). - The recommendations are designed to be consumed by a targeted fine-tuning pipeline (Customer Model Factory), closing the loop from detection to repair.