Long-Context Intent-Drift Interlock

IntentDriftInterlock catches slow-burn jailbreaks that no single turn reveals. A crescendo attack never trips a per-turn guard: each request drifts a little further from the declared intent than the last, every step individually benign. The signal lives in the trajectory — sustained drift, a rising slope, accumulating cross-turn contradiction — not in any one turn.

What it accumulates

The interlock folds three per-turn signals (all already produced during scoring) into a fixed-size state that survives a 100k-token conversation without keeping the raw history:

| Signal | Source |
| --- | --- |
| intent divergence | the cross-turn NLI divergence (cross_turn_divergence) |
| injection risk | the per-turn intent-grounded injection_risk |
| contradiction trend | the slope from the contradiction tracker |

It keeps an exponential moving average of divergence and of injection risk (old turns decay rather than being dropped) and a windowed slope of the divergence (the gradual-escalation signal), then combines them into a single drift_risk. The interlock trips when drift_risk reaches trigger_threshold after at least min_turns turns. As a result, a conversation creeping from 0.3 to 0.6 divergence over six turns is halted even though no single turn ever clears a per-turn block, while a lone off-topic opening turn never fires it.

  divergence per turn:  0.2  0.3  0.4  0.5  0.6  0.7    (each below a 0.7 block)
  drift_risk:           .12  .12  .14  .35  .42  .49 ◄── trips here
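The two accumulators described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `DriftSketch`, `ema_weight`, and `window_slope` are hypothetical names, the EMA is seeded with the first value, and the slope is an ordinary least-squares fit over the window. How the library weights these components into drift_risk is not documented here, so the sketch stops short of that step.

```python
from collections import deque


def ema_weight(half_life_turns: float) -> float:
    """Weight given to each new sample so that an old sample's
    contribution halves every `half_life_turns` turns."""
    return 1.0 - 0.5 ** (1.0 / half_life_turns)


def window_slope(values) -> float:
    """Least-squares slope of the values against turn index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


class DriftSketch:
    """Hypothetical stand-in (not the library class) holding fixed-size state."""

    def __init__(self, half_life_turns: float = 4.0, window: int = 8):
        self.alpha = ema_weight(half_life_turns)
        self.recent = deque(maxlen=window)  # bounded: oldest value falls off
        self.sustained = 0.0
        self.turns = 0

    def update(self, divergence: float):
        self.turns += 1
        if self.turns == 1:
            self.sustained = divergence  # assumption: seed EMA with first value
        else:
            self.sustained += self.alpha * (divergence - self.sustained)
        self.recent.append(divergence)
        return self.sustained, window_slope(self.recent)


sketch = DriftSketch()
for d in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    sustained, slope = sketch.update(d)
    print(f"divergence={d:.1f} sustained={sustained:.3f} slope={slope:.3f}")
```

The state is one float, one counter, and a deque capped at `window` entries, so memory stays constant however long the conversation runs.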

Enabling

The interlock is opt-in per session and off by default. When it is not enabled, the scorer leaves intent_drift_risk and intent_drift_triggered as None:

from director_ai.core import CoherenceScorer
from director_ai.core.runtime.session import ConversationSession

scorer = CoherenceScorer(threshold=0.5, use_nli=True)
session = ConversationSession(track_intent_drift=True)

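# `conversation` is your own iterable of (prompt, response) pairs;
# `halt` is whatever stop or escalation handler your application defines.
for prompt, response in conversation:
    approved, score = scorer.review(prompt, response, session=session)
    if score.intent_drift_triggered:
        halt("slow-burn intent drift detected")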

A meaningful divergence signal requires a model-backed NLI backend. The interlock itself is pure numeric code, so it is fully deterministic and can be tested without any model.
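Because the interlock only consumes numbers, a test can feed it synthetic divergences directly. The helper below is a sketch: `first_trip_turn` is a hypothetical name, and it relies only on the documented `update(intent_divergence=...)` call and the `triggered` field of the returned state.

```python
def first_trip_turn(interlock, divergences):
    """Feed synthetic per-turn divergences to an interlock-like object.

    Returns the 1-based turn on which `triggered` first becomes true,
    or None if the sequence never trips it.
    """
    for turn, divergence in enumerate(divergences, start=1):
        state = interlock.update(intent_divergence=divergence)
        if state.triggered:
            return turn
    return None


# With director_ai installed, this could be used as (not run here):
#   from director_ai.core.runtime.intent_drift import IntentDriftInterlock
#   first_trip_turn(IntentDriftInterlock(), [0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
```

Running the same sequence twice against fresh interlocks should give the same turn number, which is what makes the behaviour straightforward to pin down in a unit test.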

Tuning

from director_ai.core.runtime.intent_drift import IntentDriftInterlock

session.intent_drift = IntentDriftInterlock(
    half_life_turns=4.0,     # EMA decay horizon
    window=8,                # slope estimate window (bounds memory)
    trigger_threshold=0.6,   # drift_risk at/above which it trips
    min_turns=3,             # turns required before it can trip
)
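The last two parameters gate the trip decision together. The predicate below is a sketch of that rule as the comments above describe it; `should_trip` is a hypothetical helper, and treating "turns required" as `turn_count >= min_turns` is an assumption about the boundary.

```python
def should_trip(
    drift_risk: float,
    turn_count: int,
    *,
    trigger_threshold: float = 0.6,
    min_turns: int = 3,
) -> bool:
    """True only when enough turns have elapsed AND risk is at/above threshold."""
    return turn_count >= min_turns and drift_risk >= trigger_threshold


# A high-risk opening turn is ignored; the same risk later trips.
print(should_trip(0.9, turn_count=1))
print(should_trip(0.9, turn_count=3))
```

Lowering trigger_threshold makes the interlock more sensitive to gradual drift, while raising min_turns delays the earliest possible trip.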

Full API

director_ai.core.runtime.intent_drift.IntentDriftInterlock

IntentDriftInterlock(*, half_life_turns: float = 4.0, window: int = 8, trigger_threshold: float = 0.6, min_turns: int = 3)

Accumulate per-turn safety signals into a bounded drift state.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| half_life_turns | float | Turns over which an EMA signal decays to half weight. | 4.0 |
| window | int | Number of recent intent-divergence values kept for the slope estimate. Bounds memory to a fixed size regardless of conversation length. | 8 |
| trigger_threshold | float | drift_risk at or above which the interlock trips. | 0.6 |
| min_turns | int | Turns required before the interlock can trip, so a single off-topic opening turn never fires it. | 3 |

update

update(*, intent_divergence: float, injection_risk: float = 0.0, contradiction_trend: float = 0.0) -> DriftState

Fold one turn's signals and return the current drift state.

reset

reset() -> None

Clear all accumulated state (e.g. on a new conversation).

director_ai.core.runtime.intent_drift.DriftState dataclass

DriftState(turn_count: int, sustained_divergence: float, escalation: float, injection_pressure: float, contradiction_pressure: float, drift_risk: float, triggered: bool)

Compressed safety state after folding one turn (JSON-safe).
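Since the state is described as JSON-safe, it can be persisted between turns or attached to logs. The sketch below uses a stand-in dataclass, `DriftStateSketch`, that mirrors the documented field names and types; it is not the library class, and the field values are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DriftStateSketch:
    """Stand-in mirroring the documented DriftState fields."""

    turn_count: int
    sustained_divergence: float
    escalation: float
    injection_pressure: float
    contradiction_pressure: float
    drift_risk: float
    triggered: bool


# Illustrative values only, not output from the real interlock.
state = DriftStateSketch(
    turn_count=6,
    sustained_divergence=0.41,
    escalation=0.1,
    injection_pressure=0.0,
    contradiction_pressure=0.0,
    drift_risk=0.49,
    triggered=False,
)

payload = json.dumps(asdict(state))                 # plain ints, floats, bool
restored = DriftStateSketch(**json.loads(payload))  # round-trips losslessly
```

Every field is a plain int, float, or bool, so no custom encoder is needed.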