Mechanistic Interpretability (ReDeEP)
MechanisticAttributor traces a hallucination signal to the specific
transformer components that produced it, following the ReDeEP decoupling
(Sun et al., 2025): a model's output is driven by Knowledge FFNs (feed-forward
layers injecting parametric knowledge learned in training) and Copying
Heads (attention heads that copy from the external retrieved context). A
response is hallucination-prone when it leans on parametric knowledge while
under-using the supplied context.
The attributor consumes per-layer FFN-knowledge and external-attention signals plus per-head copying scores. It reports an overall risk together with the Knowledge-FFN layers and Copying Heads that drove it. This per-component breakdown is the regulator-facing explanation (EU AI Act, FDA) that a single score cannot give.
Decoupled risk
Per layer: risk = ffn_weight · ffn_knowledge + attention_weight · (1 − external_attention)
(weights normalised to sum to 1). High parametric injection with low external
attention → high risk. The overall risk is the mean across layers; at or above
risk_threshold the response is flagged.
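The formula above can be checked by hand with a few lines of plain Python. This is a minimal sketch written from the description in this section, not the library's implementation; the function names `layer_risk` and `overall_risk` and the sample values are illustrative assumptions.

```python
from statistics import fmean


def layer_risk(ffn_knowledge, external_attention, ffn_weight=0.5, attention_weight=0.5):
    """Decoupled risk for one layer, with the two weights normalised to sum to 1."""
    total = ffn_weight + attention_weight
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    w_ffn = ffn_weight / total
    w_att = attention_weight / total
    # Parametric injection raises risk; attention on the context lowers it.
    return w_ffn * ffn_knowledge + w_att * (1.0 - external_attention)


def overall_risk(ffn_knowledge, external_attention, **weights):
    """Mean of the per-layer risks over two parallel per-layer sequences."""
    if len(ffn_knowledge) != len(external_attention):
        raise ValueError("per-layer sequences must have the same length")
    risks = [layer_risk(f, e, **weights) for f, e in zip(ffn_knowledge, external_attention)]
    return fmean(risks)


# Illustrative values for a three-layer model (not measured data).
ffn = [0.20, 0.60, 0.93]
ext = [0.80, 0.40, 0.07]

risk = overall_risk(ffn, ext)
flagged = risk >= 0.5  # "at or above risk_threshold"
print(f"risk={risk:.2f} flagged={flagged}")
```

With equal weights, a layer with ffn_knowledge = 0.93 and external_attention = 0.07 scores 0.5 · 0.93 + 0.5 · 0.93 = 0.93, which is why a layer like that dominates the ranking.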
Injected signals
Signals are injected, so the attribution logic runs without an ML stack and is
fully deterministic under test. A real deployment extracts them from a
transformer's MLP activations and attention maps (HuggingFace
output_attentions / TransformerLens hooks) and feeds them in.
from director_ai.core.interpretability import MechanisticAttributor

attributor = MechanisticAttributor(risk_threshold=0.5, top_k=5)

# From per-layer arrays a real integration has already reduced:
layers = MechanisticAttributor.layer_signals_from_arrays(
    ffn_knowledge=per_layer_mlp_magnitude,            # [0, 1] per layer
    external_attention=per_layer_context_attention,   # [0, 1] per layer
)
heads = MechanisticAttributor.head_signals_from_matrix(per_layer_per_head_copying)

report = attributor.attribute(layers, heads)
if report.is_hallucination:
    # Contributions are ranked, so index 0 is the top Knowledge-FFN layer.
    worst = report.knowledge_ffn_layers[0]
    print("top layer", worst.layer_index, worst.decoupled_risk)
    print(report.reason)
    # "hallucination: risk=0.90 ...; top Knowledge-FFN layer 17 (ffn=0.93, external=0.07)"

for head in report.copying_heads:
    print("copying head", head.layer_index, head.head_index, head.copying_score)
Implement the ActivationProvider protocol to pull signals straight from a model
and call attribute_from(provider).
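Before anything is injected, a raw attention map has to be reduced to the single external-attention value each layer needs. This page does not specify the reduction, so the sketch below makes an assumption: it averages, over heads and query positions, the attention mass that falls on context tokens. The function name `external_attention_mass` and the sample map are hypothetical. The nested list stands in for one batch element of a per-layer attention tensor of shape (num_heads, seq_len, seq_len), as returned by HuggingFace models called with output_attentions=True.

```python
def external_attention_mass(attention, context_positions):
    """Mean attention mass placed on context keys.

    attention: nested list indexed [head][query][key]; each row sums to 1.
    context_positions: key indices that belong to the retrieved context.
    """
    context = set(context_positions)
    masses = []
    for head in attention:
        for row in head:
            # Sum the weights this query position assigns to context tokens.
            masses.append(sum(w for key, w in enumerate(row) if key in context))
    if not masses:
        raise ValueError("attention map is empty")
    # Clamp against floating-point drift so the result stays in [0, 1].
    return min(1.0, sum(masses) / len(masses))


# One layer, two heads, two query positions, four key positions.
# Keys 0 and 1 are retrieved context; keys 2 and 3 are everything else.
layer_attention = [
    [[0.7, 0.2, 0.1, 0.0], [0.5, 0.3, 0.1, 0.1]],  # head 0 reads the context
    [[0.1, 0.1, 0.4, 0.4], [0.0, 0.2, 0.3, 0.5]],  # head 1 mostly ignores it
]
score = external_attention_mass(layer_attention, context_positions=[0, 1])
print(round(score, 3))
```

Running this once per layer gives a sequence suitable for the external_attention argument of layer_signals_from_arrays. A deployment may prefer a different reduction, such as restricting the query positions to response tokens only.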
Full API
director_ai.core.interpretability.redeep.MechanisticAttributor
MechanisticAttributor(*, ffn_weight: float = 0.5, attention_weight: float = 0.5, risk_threshold: float = 0.5, top_k: int = 5)
Attribute a hallucination signal to Knowledge FFNs and Copying Heads.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ffn_weight | float | Relative weight of parametric reliance (FFN knowledge) in the per-layer risk. Normalised together with attention_weight to sum to 1. | 0.5 |
| attention_weight | float | Relative weight of external-context neglect in the per-layer risk. Normalised together with ffn_weight to sum to 1. | 0.5 |
| risk_threshold | float | Overall risk at or above which the response is flagged as a hallucination. | 0.5 |
| top_k | int | Number of top Knowledge-FFN layers and Copying Heads to report. | 5 |
attribute
attribute(layer_signals: Sequence[LayerSignal], head_signals: Sequence[HeadSignal] = ()) -> MechanisticAttributionReport
Return the mechanistic attribution for one analysed response.
attribute_from
Attribute using signals pulled from an ActivationProvider.
layer_signals_from_arrays
staticmethod
layer_signals_from_arrays(ffn_knowledge: Sequence[float], external_attention: Sequence[float]) -> list[LayerSignal]
Build per-layer signals from two parallel per-layer arrays.
Convenience for a real integration that has already reduced a transformer's MLP activations and context-attention into per-layer scalars.
head_signals_from_matrix
staticmethod
Build per-head signals from a [layer][head] copying-score matrix.
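To make the [layer][head] layout concrete, the sketch below flattens such a matrix into per-head records and picks the highest-scoring heads. It is an illustration only: `HeadScore`, `heads_from_matrix` and `top_copying_heads` are hypothetical stand-ins, with field names borrowed from the attributes the usage snippet reads off each reported head (layer_index, head_index, copying_score). How the library itself ranks heads is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadScore:
    """Hypothetical stand-in for a per-head record, not the library's HeadSignal."""
    layer_index: int
    head_index: int
    copying_score: float


def heads_from_matrix(matrix):
    """Flatten a [layer][head] score matrix; rows may have different lengths."""
    return [
        HeadScore(layer, head, float(score))
        for layer, row in enumerate(matrix)
        for head, score in enumerate(row)
    ]


def top_copying_heads(heads, top_k=5):
    """Return the top_k heads by descending copying score."""
    return sorted(heads, key=lambda h: h.copying_score, reverse=True)[:top_k]


# Two layers with two heads each (illustrative values).
matrix = [
    [0.10, 0.80],
    [0.45, 0.05],
]
heads = heads_from_matrix(matrix)
best = top_copying_heads(heads, top_k=2)
for h in best:
    print(h.layer_index, h.head_index, h.copying_score)
```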
director_ai.core.interpretability.redeep.MechanisticAttributionReport
dataclass
MechanisticAttributionReport(hallucination_risk: float, is_hallucination: bool, knowledge_ffn_layers: tuple[LayerContribution, ...], copying_heads: tuple[HeadContribution, ...], reason: str)
Where a hallucination signal came from, mechanistically.
director_ai.core.interpretability.redeep.LayerSignal
dataclass
Decoupled per-layer signals for one analysed response.
ffn_knowledge is the normalised magnitude of the layer's FFN update —
how much parametric knowledge it injected. external_attention is the
normalised attention mass the layer placed on the external context tokens —
how much it used the retrieved evidence. Both in [0, 1].
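The normalisation that maps raw FFN updates into [0, 1] is not spelled out here. One plausible choice, shown purely as an assumption, is to take the L2 norm of each layer's FFN update and divide by the largest norm across layers. The function name `normalised_ffn_magnitudes` and the toy vectors are hypothetical.

```python
import math


def normalised_ffn_magnitudes(ffn_updates):
    """Scale per-layer L2 norms into [0, 1] by dividing by the largest norm.

    ffn_updates: one vector (list of floats) per layer, e.g. the FFN output
    added to the residual stream at the analysed position.
    """
    norms = [math.sqrt(sum(x * x for x in update)) for update in ffn_updates]
    largest = max(norms, default=0.0)
    if largest == 0.0:
        # No layer injected anything; report zero magnitude everywhere.
        return [0.0 for _ in norms]
    return [n / largest for n in norms]


# Toy two-dimensional updates for three layers (illustrative values).
updates = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]
magnitudes = normalised_ffn_magnitudes(updates)
print(magnitudes)
```

Max-scaling makes the values relative to the most active layer in the same response. A fixed reference scale would be needed if scores must be comparable across responses.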
director_ai.core.interpretability.redeep.HeadSignal
dataclass
Per-head copying score — attention mass on external context tokens.
director_ai.core.interpretability.redeep.LayerContribution
dataclass
LayerContribution(layer_index: int, ffn_knowledge: float, external_attention: float, decoupled_risk: float)
A Knowledge-FFN layer ranked by its hallucination contribution.
director_ai.core.interpretability.redeep.HeadContribution
dataclass
A Copying-Head ranked by how much it grounded the response.