Hallucination Root-Cause Analysis

Move from detection to prescription: where the mechanistic interpretability attributor (ReDeEP) reports which layers and attention heads produced a hallucination signal, the root-cause analyzer explains why and what to fix. It classifies the dominant failure mode and emits targeted fine-tuning recommendations that can feed the Customer Model Factory.

Failure modes

Cause	Mechanistic signature	Prescription
`parametric_knowledge_override`	A Knowledge-FFN layer with high `ffn_knowledge` and low `external_attention`	Fine-tune to down-weight parametric recall, up-weight retrieved context
`attention_ignores_evidence`	Low `external_attention` without strong parametric recall	Add retrieval-grounding training examples
`underactive_copying_heads`	Copying-Heads with low `copying_score`	Train on copy-from-context exemplars
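The table can be read as a small set of threshold rules. The sketch below is illustrative only and is not the library's implementation: the `LayerSignal` and `HeadSignal` containers, the `classify` function, and the way the rules are combined are assumptions. The signal names come from the table, and the default thresholds mirror the values shown in the Tuning section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSignal:
    """Hypothetical per-layer signals (not a director_ai type)."""
    layer: int
    ffn_knowledge: float
    external_attention: float


@dataclass(frozen=True)
class HeadSignal:
    """Hypothetical per-head signal (not a director_ai type)."""
    layer: int
    head: int
    copying_score: float


def classify(layers, heads, knowledge_thr=0.5, attention_thr=0.3, copying_thr=0.3):
    """Return the set of failure modes whose signature appears in the signals."""
    causes = set()
    for s in layers:
        low_attention = s.external_attention <= attention_thr
        strong_recall = s.ffn_knowledge >= knowledge_thr
        if low_attention and strong_recall:
            # High parametric recall while the retrieved context is ignored.
            causes.add("parametric_knowledge_override")
        elif low_attention:
            # Context is ignored, but not because parametric recall dominates.
            causes.add("attention_ignores_evidence")
    if any(h.copying_score <= copying_thr for h in heads):
        causes.add("underactive_copying_heads")
    return causes


layers = [LayerSignal(18, 0.8, 0.1), LayerSignal(4, 0.2, 0.2)]
heads = [HeadSignal(5, 2, 0.1)]
print(sorted(classify(layers, heads)))
```

How the real analyzer picks one dominant cause when several signatures fire is not specified here, so the sketch only returns the set of detected causes.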

Quick start

from director_ai import ProductionGuard
from director_ai.core.config import DirectorConfig

guard = ProductionGuard(DirectorConfig())

# `report` is a MechanisticAttributionReport from the ReDeEP attributor.
diagnosis = guard.root_cause_analyzer.diagnose(report)

print(diagnosis.dominant_cause)        # e.g. "parametric_knowledge_override"
print(diagnosis.implicated_layers)     # e.g. (18, 19, 20)
for rec in diagnosis.recommendations:
    print(rec.cause, "→", rec.action, rec.targets)

`diagnose()` returns a `RootCauseDiagnosis` containing the `dominant_cause`, all detected causes, one `Recommendation` per cause (with `cause`, `action`, and `targets` such as `ffn_layer:18` or `head:5.2`), and the `implicated_layers` and `implicated_heads`. When the report is not a hallucination, the dominant cause is `no_hallucination` and there are no recommendations.
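A downstream consumer will usually need the target strings as structured indices. The helper below is a minimal sketch, not part of `director_ai`: `parse_target` and `group_targets` are hypothetical names, and reading `head:5.2` as layer 5, head 2 is an assumption based only on the two example formats shown above.

```python
def parse_target(target: str):
    """Split a target string into (kind, indices).

    Assumes the two formats shown in the text: 'ffn_layer:<layer>' and
    'head:<layer>.<head>'. Anything else raises ValueError.
    """
    kind, _, index = target.partition(":")
    if kind == "ffn_layer":
        return kind, (int(index),)
    if kind == "head":
        layer, head = index.split(".")
        return kind, (int(layer), int(head))
    raise ValueError(f"unrecognised target: {target!r}")


def group_targets(targets):
    """Group parsed targets by kind, preserving input order."""
    grouped = {}
    for target in targets:
        kind, indices = parse_target(target)
        grouped.setdefault(kind, []).append(indices)
    return grouped


print(group_targets(["ffn_layer:18", "ffn_layer:19", "head:5.2"]))
# {'ffn_layer': [(18,), (19,)], 'head': [(5, 2)]}
```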

Tenant safety

The diagnosis is computed from the attribution report alone: it requires no model access and never sees prompt or response text. `to_dict()` carries only component indices, scores, and recommendations, so it is safe to log and to attach to audit records.
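For teams that want a defensive check before writing to an audit log, the sketch below serialises a diagnosis dictionary and refuses payloads that contain free-text fields. Everything here is an assumption for illustration: the `example` payload shape, the `action` value, the `FORBIDDEN_KEYS` names, and both helper functions are hypothetical, and the real `to_dict()` output may differ.

```python
import json

# Assumption: key names that free-text fields would use if they ever leaked in.
FORBIDDEN_KEYS = {"prompt", "response", "text"}


def find_forbidden_keys(payload, path=""):
    """Return the paths of any forbidden keys found in a nested payload."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else str(key)
            if key in FORBIDDEN_KEYS:
                found.append(here)
            found.extend(find_forbidden_keys(value, here))
    elif isinstance(payload, (list, tuple)):
        for i, item in enumerate(payload):
            found.extend(find_forbidden_keys(item, f"{path}[{i}]"))
    return found


def to_audit_line(payload):
    """Serialise a payload as one JSON line, rejecting free-text fields."""
    leaks = find_forbidden_keys(payload)
    if leaks:
        raise ValueError(f"refusing to log free-text fields: {leaks}")
    return json.dumps(payload, sort_keys=True)


# Hypothetical payload shape standing in for diagnosis.to_dict().
example = {
    "dominant_cause": "parametric_knowledge_override",
    "implicated_layers": [18, 19, 20],
    "recommendations": [
        {
            "cause": "parametric_knowledge_override",
            "action": "example_action",  # placeholder, not a real action name
            "targets": ["ffn_layer:18"],
        }
    ],
}
print(to_audit_line(example))
```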

Tuning

from director_ai.core.interpretability import HallucinationRootCauseAnalyzer

analyzer = HallucinationRootCauseAnalyzer(
    parametric_knowledge_threshold=0.5,   # FFN knowledge at/above → strong recall
    low_attention_threshold=0.3,          # external attention at/below → ignoring evidence
    low_copying_threshold=0.3,            # copying score at/below → weak copying head
)
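The comments above describe inclusive boundaries ("at/above", "at/below"). The short sketch below illustrates what that means for a score sitting exactly on a threshold and how raising a threshold widens what gets flagged. The helper names are hypothetical and the comparison operators are inferred from those comments, not taken from the analyzer's source.

```python
def at_or_above(score: float, threshold: float) -> bool:
    """Inclusive upper test, as described for parametric_knowledge_threshold."""
    return score >= threshold


def at_or_below(score: float, threshold: float) -> bool:
    """Inclusive lower test, as described for the two low_* thresholds."""
    return score <= threshold


copying_scores = [0.25, 0.30, 0.35]
for threshold in (0.3, 0.35):
    flagged = [s for s in copying_scores if at_or_below(s, threshold)]
    print(threshold, flagged)
```

With the threshold at 0.3 the score of exactly 0.30 is flagged. Raising the threshold to 0.35 also flags the 0.35 score.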

Notes

The analysis is purely deterministic over the attribution report and requires no ML stack. The report's signals come from the ReDeEP ActivationProvider, which a real deployment backs with transformer activations and attention.
The recommendations are designed to be consumed by a targeted fine-tuning pipeline (Customer Model Factory), closing the loop from detection to repair.