
Mechanistic Interpretability (ReDeEP)

MechanisticAttributor traces a hallucination signal to the specific transformer components that produced it, following the ReDeEP decoupling (Sun et al., 2025): a model's output is driven by Knowledge FFNs (feed-forward layers injecting parametric knowledge learned in training) and Copying Heads (attention heads that copy from the external retrieved context). A response is hallucination-prone when it leans on parametric knowledge while under-using the supplied context.

The attributor consumes per-layer FFN-knowledge and external-attention signals plus per-head copying scores. It reports an overall risk together with the Knowledge-FFN layers and Copying Heads that drove it. This is the per-component, regulator-facing explanation (EU AI Act, FDA) that a single score cannot give.

Decoupled risk

Per layer:

risk = ffn_weight · ffn_knowledge + attention_weight · (1 − external_attention)

The two weights are normalised to sum to 1. High parametric injection combined with low external attention gives high risk. The overall risk is the mean of the per-layer risks; the response is flagged when that mean is at or above risk_threshold.
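The formula can be checked by hand with a few lines of plain Python. This is a minimal sketch of the arithmetic described above, not the library's implementation; the helper names layer_risk and overall_risk are made up for illustration.

```python
from statistics import mean


def layer_risk(ffn_knowledge, external_attention, ffn_weight=0.5, attention_weight=0.5):
    """Decoupled risk for one layer, following the formula in the text."""
    total = ffn_weight + attention_weight
    # Normalise the two weights so they sum to 1.
    w_ffn = ffn_weight / total
    w_att = attention_weight / total
    return w_ffn * ffn_knowledge + w_att * (1.0 - external_attention)


def overall_risk(layers, risk_threshold=0.5):
    """Mean of the per-layer risks, plus the 'at or above threshold' flag."""
    risk = mean(layer_risk(ffn, ext) for ffn, ext in layers)
    return risk, risk >= risk_threshold


# (ffn_knowledge, external_attention) per layer, both in [0, 1].
layers = [(0.9, 0.1), (0.2, 0.8)]
risk, flagged = overall_risk(layers)
# Layer risks are about 0.9 and 0.2, so the mean is about 0.55 and the response is flagged.
```

The first layer injects a lot of parametric knowledge and barely attends to the context, so it dominates the mean; the second layer does the opposite and pulls the overall risk down.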

Injected signals

Signals are injected, so the attribution logic runs without an ML stack and is fully deterministic under test. A real deployment extracts them from a transformer's MLP activations and attention maps (HuggingFace output_attentions / TransformerLens hooks) and feeds them in.

from director_ai.core.interpretability import MechanisticAttributor

attributor = MechanisticAttributor(risk_threshold=0.5, top_k=5)

# The per_layer_* names below are placeholders for arrays a real integration has already reduced:
layers = MechanisticAttributor.layer_signals_from_arrays(
    ffn_knowledge=per_layer_mlp_magnitude,        # [0, 1] per layer
    external_attention=per_layer_context_attention,  # [0, 1] per layer
)
heads = MechanisticAttributor.head_signals_from_matrix(per_layer_per_head_copying)

report = attributor.attribute(layers, heads)
if report.is_hallucination:
    worst = report.knowledge_ffn_layers[0]
    print(report.reason)
    # "hallucination: risk=0.90 ...; top Knowledge-FFN layer 17 (ffn=0.93, external=0.07)"
    for head in report.copying_heads:
        print("copying head", head.layer_index, head.head_index, head.copying_score)

Implement the ActivationProvider protocol to pull signals straight from a model and call attribute_from(provider).
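This page does not list the protocol's methods, so the following only illustrates the general pattern with local stand-ins. LayerSignalSketch, SignalProviderSketch and its layer_signals() method are hypothetical names, not the library's API; check the real ActivationProvider definition for the method names it expects before adapting this.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class LayerSignalSketch:
    """Local stand-in mirroring the documented LayerSignal fields."""
    layer_index: int
    ffn_knowledge: float
    external_attention: float


class SignalProviderSketch(Protocol):
    """Hypothetical provider shape; the real protocol may differ."""

    def layer_signals(self) -> Sequence[LayerSignalSketch]: ...


class PrecomputedProvider:
    """Serves per-layer scalars that were extracted from a model elsewhere."""

    def __init__(self, ffn_knowledge: Sequence[float], external_attention: Sequence[float]):
        if len(ffn_knowledge) != len(external_attention):
            raise ValueError("both arrays need one entry per layer")
        self._ffn = list(ffn_knowledge)
        self._ext = list(external_attention)

    def layer_signals(self) -> Sequence[LayerSignalSketch]:
        # One record per layer, indexed in model order.
        return [
            LayerSignalSketch(index, ffn, ext)
            for index, (ffn, ext) in enumerate(zip(self._ffn, self._ext))
        ]


provider: SignalProviderSketch = PrecomputedProvider([0.9, 0.2], [0.1, 0.8])
signals = provider.layer_signals()
```

Because the protocol is structural, the provider class does not need to inherit from anything; it only has to expose the methods the consumer calls.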

Full API

director_ai.core.interpretability.redeep.MechanisticAttributor

MechanisticAttributor(*, ffn_weight: float = 0.5, attention_weight: float = 0.5, risk_threshold: float = 0.5, top_k: int = 5)

Attribute a hallucination signal to Knowledge FFNs and Copying Heads.

Parameters:

ffn_weight (float, default 0.5): Relative weight of parametric reliance (FFN knowledge) in the per-layer risk. Normalised together with attention_weight so the two sum to 1.

attention_weight (float, default 0.5): Relative weight of external-context neglect (1 − external_attention) in the per-layer risk. Normalised together with ffn_weight so the two sum to 1.

risk_threshold (float, default 0.5): Overall risk at or above which the response is flagged as a hallucination.

top_k (int, default 5): Number of top Knowledge-FFN layers and Copying Heads to report.

attribute

attribute(layer_signals: Sequence[LayerSignal], head_signals: Sequence[HeadSignal] = ()) -> MechanisticAttributionReport

Return the mechanistic attribution for one analysed response.

attribute_from

attribute_from(provider: ActivationProvider) -> MechanisticAttributionReport

Attribute using signals pulled from an ActivationProvider.

layer_signals_from_arrays staticmethod

layer_signals_from_arrays(ffn_knowledge: Sequence[float], external_attention: Sequence[float]) -> list[LayerSignal]

Build per-layer signals from two parallel per-layer arrays.

Convenience for a real integration that has already reduced a transformer's MLP activations and context-attention into per-layer scalars.

head_signals_from_matrix staticmethod

head_signals_from_matrix(copying_scores: Sequence[Sequence[float]]) -> list[HeadSignal]

Build per-head signals from a [layer][head] copying-score matrix.
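The shape of that conversion can be sketched in plain Python. The function below is an illustration only, not the library's code: it returns plain (layer_index, head_index, copying_score) tuples rather than HeadSignal objects, and flatten_copying_matrix is a made-up name.

```python
def flatten_copying_matrix(copying_scores):
    """Turn a [layer][head] matrix into (layer_index, head_index, score) records."""
    return [
        (layer_index, head_index, float(score))
        for layer_index, row in enumerate(copying_scores)
        for head_index, score in enumerate(row)
    ]


# Two layers with two heads each.
matrix = [[0.1, 0.8], [0.4, 0.3]]
records = flatten_copying_matrix(matrix)

# Strongest copying heads first, keeping the best two.
top_heads = sorted(records, key=lambda record: record[2], reverse=True)[:2]
# top_heads == [(0, 1, 0.8), (1, 0, 0.4)]
```

Rows may have different lengths here, since each row is enumerated independently.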

director_ai.core.interpretability.redeep.MechanisticAttributionReport dataclass

MechanisticAttributionReport(hallucination_risk: float, is_hallucination: bool, knowledge_ffn_layers: tuple[LayerContribution, ...], copying_heads: tuple[HeadContribution, ...], reason: str)

Where a hallucination signal came from, mechanistically.

director_ai.core.interpretability.redeep.LayerSignal dataclass

LayerSignal(layer_index: int, ffn_knowledge: float, external_attention: float)

Decoupled per-layer signals for one analysed response.

ffn_knowledge is the normalised magnitude of the layer's FFN update — how much parametric knowledge it injected. external_attention is the normalised attention mass the layer placed on the external context tokens — how much it used the retrieved evidence. Both in [0, 1].
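One way to obtain external_attention from raw attention maps is sketched below. It assumes attention weights arrive as nested lists indexed [layer][head][query][key] with each row summing to 1 (the layout HuggingFace returns per layer with output_attentions, minus the batch dimension), and it averages over all heads and query positions. That averaging choice and the function name are assumptions; the library may reduce the maps differently.

```python
def external_attention_per_layer(attention, context_positions):
    """Average attention mass each layer places on the external context tokens.

    attention: nested lists indexed [layer][head][query][key]; rows sum to 1.
    context_positions: key indices that belong to the retrieved context.
    """
    context = set(context_positions)
    per_layer = []
    for heads in attention:
        # Mass on context keys for every (head, query) row in this layer.
        masses = [
            sum(weight for key, weight in enumerate(row) if key in context)
            for head in heads
            for row in head
        ]
        per_layer.append(sum(masses) / len(masses))
    return per_layer


# One layer, two heads, one query position, four key positions.
attention = [
    [
        [[0.4, 0.4, 0.1, 0.1]],  # head 0 mostly attends to the context
        [[0.1, 0.1, 0.4, 0.4]],  # head 1 mostly ignores it
    ]
]
signal = external_attention_per_layer(attention, context_positions=[0, 1])
# signal[0] is about 0.5: the mean of 0.8 and 0.2.
```

Since every row is a probability distribution, the result stays in [0, 1] without further scaling.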

director_ai.core.interpretability.redeep.HeadSignal dataclass

HeadSignal(layer_index: int, head_index: int, copying_score: float)

Per-head copying score — attention mass on external context tokens.

director_ai.core.interpretability.redeep.LayerContribution dataclass

LayerContribution(layer_index: int, ffn_knowledge: float, external_attention: float, decoupled_risk: float)

A Knowledge-FFN layer ranked by its hallucination contribution.
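The ranking can be pictured as sorting layers by their decoupled risk and keeping the first top_k. The sketch below assumes exactly that ordering and uses plain tuples instead of LayerContribution objects; top_risk_layers is a hypothetical helper, and the library's tie-breaking is not documented here.

```python
def top_risk_layers(layers, top_k=5, ffn_weight=0.5, attention_weight=0.5):
    """Return (layer_index, decoupled_risk) pairs, highest risk first."""
    total = ffn_weight + attention_weight
    scored = [
        (
            index,
            (ffn_weight / total) * ffn + (attention_weight / total) * (1.0 - ext),
        )
        for index, (ffn, ext) in enumerate(layers)
    ]
    # sorted() is stable, so layers with equal risk keep their model order.
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# (ffn_knowledge, external_attention) per layer.
layers = [(0.2, 0.8), (0.93, 0.07), (0.5, 0.5)]
ranked = top_risk_layers(layers, top_k=2)
# Layer 1 (risk about 0.93) comes first, then layer 2 (risk 0.5).
```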

director_ai.core.interpretability.redeep.HeadContribution dataclass

HeadContribution(layer_index: int, head_index: int, copying_score: float)

A Copying-Head ranked by how much it grounded the response.