Multimodal Checks
Multimodal checks adapt image, audio, and video evidence into the shared
GuardDecision and SafetyEvent contracts. The adapter is explicitly opt-in:
a modality must be enabled before it can be checked, and a modality must be
marked benchmarked before a supported result may become allow.
Decision Boundary
MultimodalVerifierAdapter enforces production-safe defaults:
- disabled or unsupported modalities raise errors instead of silently passing
- uncertain evidence maps to warn, never allow
- hallucinated or temporally inconsistent evidence maps to halt
- unbenchmarked modalities map to warn even when the low-level checker says the claim is consistent
- optional caption and metadata grounding can reduce a modality score before a decision is emitted
- audit payloads and safety events include media references, not raw media, transcripts, frame data, captions, metadata values, or claim text
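The mapping rules above can be sketched as a small standalone function. This is an illustration only, not the adapter's actual implementation: the `Verdict` enum, the `SUPPORTED` set, and the `decide` function are hypothetical names introduced here, and the real adapter also takes the risk envelope and grounding scores into account.

```python
from enum import Enum

# Assumption: the three modalities named at the top of this page.
SUPPORTED = frozenset({"image", "audio", "video"})


class Verdict(Enum):
    """Hypothetical low-level checker outcomes (names are assumptions)."""

    CONSISTENT = "consistent"
    UNCERTAIN = "uncertain"
    HALLUCINATED = "hallucinated"
    TEMPORALLY_INCONSISTENT = "temporally_inconsistent"


def decide(modality, verdict, *, enabled, benchmarked):
    """Map a checker verdict to 'allow', 'warn', or 'halt'."""
    # Disabled or unsupported modalities fail loudly instead of passing.
    if modality not in SUPPORTED:
        raise ValueError(f"unsupported modality: {modality!r}")
    if modality not in enabled:
        raise ValueError(f"modality is not enabled: {modality!r}")
    # Hallucinated or temporally inconsistent evidence always halts.
    if verdict in (Verdict.HALLUCINATED, Verdict.TEMPORALLY_INCONSISTENT):
        return "halt"
    # Uncertain evidence warns; it can never become allow.
    if verdict is Verdict.UNCERTAIN:
        return "warn"
    # A consistent result is only allowed for benchmarked modalities.
    if modality not in benchmarked:
        return "warn"
    return "allow"


if __name__ == "__main__":
    print(decide("image", Verdict.CONSISTENT,
                 enabled={"image"}, benchmarked=set()))
```

The ordering of the checks is the point of the sketch: the only path to allow is the last line, so every earlier condition has to be ruled out first.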
from director_ai.core.guard_control import RiskEnvelope
from director_ai.core.multimodal_guard import (
    MultimodalCheckRequest,
    MultimodalVerifierAdapter,
)

adapter = MultimodalVerifierAdapter(
    image_guard=image_guard,
    caption_score_fn=caption_grounder,
    metadata_score_fn=metadata_grounder,
    enabled_modalities=("image",),
    benchmarked_modalities=("image",),
)

result = adapter.check(
    MultimodalCheckRequest(
        modality="image",
        claim_text="The image shows a labelled package.",
        media_ref="media://image-42",
        image_bytes=image_bytes,
        caption_text="Package label is absent.",
        metadata={"captured_at": "2026-05-13", "source": "inspection-rig"},
    ),
    risk_envelope=RiskEnvelope(
        action_category="multimodal",
        reversibility="reversible",
        domain="regulated",
        calibrated_threshold=0.5,
        no_go_threshold=0.85,
    ),
    policy_id="policy.multimodal.regulated",
)
Grounding callbacks receive either (caption_text, claim_text) or
(metadata, claim_text) and must return a finite score in [0, 1]. Scores
below the grounding floor halt the claim; scores below the grounding allow
threshold produce a warning unless the base verifier already found a stricter
verdict. Evidence references use suffixes such as #caption and
#metadata:captured_at, so downstream audit logs can identify which grounding
channel was used without storing private captions or metadata values.
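A minimal sketch of the callback contract and the threshold rules described above, under stated assumptions. The two grounders are toy lexical scorers written for illustration, not functions shipped with the library. `grounding_verdict`, `stricter`, and `evidence_refs` are hypothetical helpers. The threshold values are the constructor defaults (`grounding_floor=0.4`, `grounding_allow_threshold=0.75`). Appending the suffix directly to the media reference is an assumption about the evidence reference format.

```python
import math
import re
from typing import Mapping

GROUNDING_FLOOR = 0.4             # constructor default
GROUNDING_ALLOW_THRESHOLD = 0.75  # constructor default


def _tokens(text: str) -> set:
    """Lower-cased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def caption_grounder(caption_text: str, claim_text: str) -> float:
    """Toy callback: fraction of claim tokens that also occur in the caption."""
    claim = _tokens(claim_text)
    if not claim:
        return 0.0
    return len(claim & _tokens(caption_text)) / len(claim)


def metadata_grounder(metadata: Mapping[str, str], claim_text: str) -> float:
    """Toy callback: fraction of metadata values mentioned in the claim."""
    if not metadata:
        return 0.0
    claim = claim_text.lower()
    hits = sum(1 for value in metadata.values() if value.lower() in claim)
    return hits / len(metadata)


def grounding_verdict(score: float) -> str:
    """Apply the floor and allow threshold to a single grounding score."""
    # The contract requires a finite score in [0, 1].
    if not math.isfinite(score) or not 0.0 <= score <= 1.0:
        raise ValueError(f"grounding score must be finite and in [0, 1]: {score!r}")
    if score < GROUNDING_FLOOR:
        return "halt"
    if score < GROUNDING_ALLOW_THRESHOLD:
        return "warn"
    return "allow"


def stricter(base: str, grounding: str) -> str:
    """Keep whichever verdict is stricter (halt > warn > allow)."""
    order = {"allow": 0, "warn": 1, "halt": 2}
    return max(base, grounding, key=order.__getitem__)


def evidence_refs(media_ref: str, *, has_caption: bool, metadata_keys) -> list:
    """Build references that name the grounding channel but carry no values."""
    refs = [f"{media_ref}#caption"] if has_caption else []
    refs.extend(f"{media_ref}#metadata:{key}" for key in sorted(metadata_keys))
    return refs


if __name__ == "__main__":
    score = caption_grounder("Package label is absent.",
                             "The image shows a labelled package.")
    print(stricter("allow", grounding_verdict(score)))
    print(evidence_refs("media://image-42", has_caption=True,
                        metadata_keys=["captured_at", "source"]))
```

Token overlap cannot detect negation or paraphrase, so a production callback would more plausibly wrap an entailment or retrieval model; the sketch only shows the shape of the contract and how a grounding score can tighten, but never loosen, the base verdict.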
Full API
director_ai.core.multimodal_guard.adapter.MultimodalCheckRequest
dataclass
MultimodalCheckRequest(modality: Modality, claim_text: str, media_ref: str, image_bytes: bytes = b'', transcript_text: str = '', frame_similarities: Sequence[float] = (), caption_text: str = '', metadata: Mapping[str, str] = dict())
Input envelope for opt-in multimodal verification.
director_ai.core.multimodal_guard.adapter.MultimodalCheckResult
dataclass
MultimodalCheckResult(request: MultimodalCheckRequest, signal: VerifierSignal, guard_decision: GuardDecision)
Tenant-safe multimodal verification result.
to_safety_event
to_safety_event(*, hook_id: str, hook_scope: str = 'agent', request_id: str = '', tenant_id: str = '', latency_ms: float | None = None) -> SafetyEvent
Convert the decision into the shared tenant-safe event schema.
director_ai.core.multimodal_guard.adapter.MultimodalVerifierAdapter
MultimodalVerifierAdapter(*, image_guard: MultimodalGuard | Any | None = None, audio_score_fn: Callable[[str, str], float] | None = None, caption_score_fn: Callable[[str, str], float] | None = None, metadata_score_fn: Callable[[Mapping[str, str], str], float] | None = None, enabled_modalities: Sequence[str] = (), benchmarked_modalities: Sequence[str] = (), temporal_alpha: float = 0.5, temporal_floor: float = 0.2, grounding_floor: float = 0.4, grounding_allow_threshold: float = 0.75)
Opt-in adapter from modality-specific checks to guard decisions.
check
check(request: MultimodalCheckRequest, *, risk_envelope: RiskEnvelope, policy_id: str) -> MultimodalCheckResult
Run the modality check and return a shared guard decision.