llama.cpp)

Director-AI can guard a self-hosted LLM at the pre-sampling boundary: it scores a candidate continuation and masks or biases the token's logit before the server samples it. This is stronger than guarding the output stream after the fact, because an unsafe token can be prevented from ever being emitted rather than merely detected once it is already in the stream.

The hook is server-neutral: it returns a decision plus a per-server payload, and a thin adapter applies that to the server's logits processor / sampling callback. The same hook drives vLLM, TGI, and llama.cpp.

Why this matters

| Property | Output-stream halt (`StreamingKernel`) | Pre-sampling hook |
|---|---|---|
| Acts | after a token is emitted | before a token is sampled |
| Can suppress a token | no (only halts the stream) | yes (masks/biases the logit) |
| Integration point | response stream | server logits processor |

For anyone running vLLM, TGI, or llama.cpp in production, the pre-sampling hook is the tightest place to enforce a guard.

Mechanism vs contribution

Pre-sampling logits processing is a standard server feature, not a Director-AI invention — vLLM ships a custom logits-processor API, and others embed safety classifiers the same way. Director-AI's contribution is wiring its grounding / contradiction guard into that mechanism cleanly and identically across vLLM, TGI, and llama.cpp, with claim-boundary gating and a tenant-safe audit event — not the hook point itself.

Quickstart

from director_ai.core import CoherenceScorer, GroundTruthStore
from director_ai.integrations.inference_server_hooks import (
    InferenceHookRequest,
    build_inference_server_hook,
)

store = GroundTruthStore()
store.add("capital of France", "The capital of France is Paris.")
scorer = CoherenceScorer(threshold=0.3, ground_truth_store=store, use_nli=False)

# score_fn maps the candidate continuation to a coherence score in [0, 1].
hook = build_inference_server_hook(
    "vllm",
    score_fn=lambda text: scorer.review("capital of France", text)[1].score,
    hard_limit=0.4,          # below this, the candidate token is rejected
)

request = InferenceHookRequest(
    server="vllm",
    accumulated_text="The capital of France is ",
    candidate_token="Berlin",
    token_id=12345,
    request_id="req-1",
    tenant_id="tenant-a",
)

# current_logits is optional: the logit vector for this step, supplied by your server integration.
decision = hook.check(request, logits=current_logits)
if not decision.allow:
    # decision.adjusted_logits has the candidate token masked to block_logit;
    # decision.blocked_token_ids lists what was suppressed;
    # decision.safety_event is a tenant-safe audit record;
    # decision.server_payload carries the vLLM/TGI/llama.cpp-specific action.
    # apply_to_server is a placeholder for your own adapter code.
    apply_to_server(decision.server_payload)

hook.check() scores request.candidate_text (the accumulated text plus the candidate token). If the score is at or above hard_limit the token is allowed; otherwise its logit is set to block_logit (default -1e9, an effective mask), a SafetyEvent is emitted, and the blocked token id is reported.
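The decision rule described above can be sketched in a few lines of plain Python. This is an illustration of the thresholding and masking only, not Director-AI's implementation: `check_candidate` and its tuple return shape are hypothetical names chosen for this sketch, and the audit event is omitted.

```python
from typing import List, Optional, Tuple


def check_candidate(
    score: float,
    hard_limit: float,
    token_id: int,
    logits: Optional[List[float]] = None,
    block_logit: float = -1e9,
) -> Tuple[bool, Optional[List[float]], List[int]]:
    """Return (allow, adjusted_logits, blocked_token_ids) for one candidate token."""
    if score >= hard_limit:
        # At or above the limit: allow the token and leave the logits untouched.
        return True, logits, []
    adjusted = None
    if logits is not None:
        adjusted = list(logits)           # copy so the caller's vector is not mutated
        adjusted[token_id] = block_logit  # a very negative logit acts as a mask
    return False, adjusted, [token_id]


allow, adjusted, blocked = check_candidate(0.1, 0.4, token_id=2, logits=[0.5, 1.2, 3.0])
print(allow, adjusted, blocked)  # False [0.5, 1.2, -1000000000.0] [2]
```

Note that a score exactly equal to `hard_limit` is allowed, matching the "at or above" wording above.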

API

`build_inference_server_hook(server, score_fn, *, hard_limit=0.4, block_token_id=None, block_logit=-1e9)`

Builds an InferenceServerHook with the default policy. server is one of "vllm", "tgi", "llama_cpp". score_fn is Callable[[str], float] returning a coherence/grounding score in [0, 1].
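Any callable with the `Callable[[str], float]` shape can serve as `score_fn`. As a rough illustration of that contract, the sketch below builds a toy scorer from word overlap with a reference fact and clamps the result to [0, 1]. `make_overlap_score_fn` is a hypothetical helper, not part of Director-AI, and it is far cruder than a real grounding scorer.

```python
import re
from typing import Callable


def make_overlap_score_fn(reference: str) -> Callable[[str], float]:
    """Build a toy score_fn: fraction of the candidate's words found in the reference."""
    ref_words = set(re.findall(r"[a-z0-9]+", reference.lower()))

    def score_fn(text: str) -> float:
        words = re.findall(r"[a-z0-9]+", text.lower())
        if not words:
            return 1.0  # nothing asserted yet, so nothing to contradict
        hits = sum(1 for w in words if w in ref_words)
        return max(0.0, min(1.0, hits / len(words)))  # clamp to [0, 1]

    return score_fn


score_fn = make_overlap_score_fn("The capital of France is Paris.")
print(score_fn("The capital of France is Paris"))   # 1.0
print(score_fn("The capital of France is Berlin"))  # 5 of 6 words still match
```

The second call shows why a toy like this is not a usable guard: the wrong answer still scores well above a `hard_limit` of 0.4, so a real coherence or grounding scorer is needed in practice.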

`InferenceHookRequest`

The candidate token plus stream context: server, accumulated_text, candidate_token, optional token_id, request_id, tenant_id, and metadata. The candidate_text property returns accumulated_text + candidate_token — what score_fn receives.

`InferenceServerHook.check(request, logits=None) -> InferenceHookDecision`

Scores one candidate token and returns a server-neutral action. InferenceHookDecision carries allow, score, reason, adjusted_logits (the masked logit vector when logits is supplied), blocked_token_ids, safety_event, and server_payload.

`InferenceServerHookPolicy`

Tunes the decision: hard_limit, block_token_id, block_logit, steering_bias_logit, halt_reason, tenant_safe_explanation.

Drop-in logits processors (vLLM / TGI / llama.cpp)

director_ai.integrations.inference_logits_adapters wires the hook into each server's real per-step contract — (token_ids, logits) -> logits — so it is a drop-in logits processor rather than glue you write yourself. At a claim boundary the adapter decodes the text so far, scores it, and on a halt decision masks every logit except the end-of-sequence token, which makes the server stop on its next step. Between claim boundaries it passes the logits through untouched.

from director_ai.integrations.inference_server_hooks import build_inference_server_hook
from director_ai.integrations.inference_logits_adapters import build_vllm_logits_processor

hook = build_inference_server_hook("vllm", score_fn=my_coherence_score, hard_limit=0.4)
processor = build_vllm_logits_processor(
    hook, decode_fn=tokenizer.decode, eos_token_id=tokenizer.eos_token_id
)
# vLLM:        SamplingParams(logits_processors=[processor])
# TGI / HF:    LogitsProcessorList([processor])         (build_tgi_logits_processor)
# llama.cpp:   Llama(..., logits_processor=processor)   (build_llama_cpp_logits_processor)

The adapter only needs len(), indexing, and item assignment on logits, so a Python list, a NumPy array, and a torch tensor all work — no torch or server SDK is imported. The halt is sticky (once tripped, every later step stays halted) and emits a tenant-safe SafetyEvent to an optional on_halt sink.
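To make the behaviour described in this section concrete, here is a minimal standalone sketch of a `(token_ids, logits) -> logits` processor with claim-boundary gating, an EOS-only mask, and a sticky halt. It does not use Director-AI at all. `make_gated_processor` is a hypothetical name, and the boundary rule (the decoded text ends with sentence punctuation) is an assumption made for this sketch; the real adapter's boundary detection may differ.

```python
from typing import Callable, Sequence


def make_gated_processor(
    score_fn: Callable[[str], float],
    decode_fn: Callable[[Sequence[int]], str],
    eos_token_id: int,
    hard_limit: float = 0.4,
    block_logit: float = -1e9,
):
    """Return a (token_ids, logits) -> logits callable with a sticky halt."""
    state = {"halted": False}

    def processor(token_ids, logits):
        if not state["halted"]:
            text = decode_fn(token_ids)
            # Assumed boundary rule: the text so far ends a sentence.
            at_boundary = text.rstrip().endswith((".", "!", "?"))
            if at_boundary and score_fn(text) < hard_limit:
                state["halted"] = True  # sticky: never reset for this stream
        if state["halted"]:
            # Only len(), indexing, and item assignment are used on logits.
            for i in range(len(logits)):
                if i != eos_token_id:
                    logits[i] = block_logit
        return logits

    return processor


vocab = ["<eos>", "Paris", "Berlin", "."]


def decode(ids):
    return " ".join(vocab[i] for i in ids)


def toy_score(text):
    return 0.0 if "Berlin" in text else 1.0


proc = make_gated_processor(toy_score, decode, eos_token_id=0)
print(proc([2], [0.1, 0.2, 0.3, 0.4]))     # not a boundary: logits pass through
print(proc([2, 3], [0.1, 0.2, 0.3, 0.4]))  # boundary with a low score: only EOS survives
print(proc([2, 3, 1], [0.1, 0.2, 0.3, 0.4]))  # halt is sticky on later steps
```

Keeping the halt flag in a closure gives one processor per stream; a server that reuses a processor across requests would need per-request state instead.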

Measured overhead

Measured with benchmarks/inference_hook_overhead.py (heuristic scorer, no model); results are in benchmarks/results/inference_hook_overhead.json:

| Step | p50 | When it applies |
|---|---|---|
| Pass-through / token | 0.39 µs | every token between claim boundaries |
| Allow at boundary (1 local score) | ~128 µs | once per completed claim |
| EOS mask (vocab 32k / 128k) | 0.6 ms / 2.4 ms | one-off, only when a halt fires |

Steady-state per-token overhead is sub-microsecond; the coherence score is paid only at claim boundaries, and the EOS mask (a pure-Python loop here) is a one-off that a vectorised tensor write reduces to microseconds in production.
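For readers who want to reproduce this kind of measurement on their own hardware, the sketch below times a pure-Python EOS mask over a 32k-entry list and reports the median. It is not the contents of benchmarks/inference_hook_overhead.py; `p50_ns` and `eos_mask_loop` are hypothetical helpers written for this example, and the numbers it prints will differ from the table above.

```python
import statistics
import time
from typing import Callable, List


def p50_ns(fn: Callable[[], object], repeats: int = 200) -> float:
    """Median wall-clock time of fn() in nanoseconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples)


def eos_mask_loop(logits: List[float], eos_token_id: int, block_logit: float = -1e9) -> List[float]:
    """Pure-Python EOS mask: every logit except EOS is set to block_logit."""
    for i in range(len(logits)):
        if i != eos_token_id:
            logits[i] = block_logit
    return logits


logits = [0.0] * 32_000
median_ns = p50_ns(lambda: eos_mask_loop(logits, 2), repeats=50)
print(f"EOS mask p50 over 32k logits: {median_ns / 1e6:.2f} ms")
```

The median is used rather than the mean so that occasional scheduler or garbage-collection pauses do not dominate the result.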

Predictive pre-halt steering (advanced tier)

InferenceServerHook.steer(request, steering_decision, logits) applies a predictive PreHaltSteeringDecision (from the advanced core.trajectory module): proceed allows the token, escalate biases its logit by steering_bias_logit (discouraged but not blocked), and halt masks it. This lets the guard steer generation away from a predicted unsafe trajectory before it commits, rather than only reacting at the hard limit.
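The three outcomes can be pictured with a small standalone function. This is a sketch under stated assumptions, not the library's `steer()` implementation: `apply_steering` is a hypothetical name, the action is passed as a plain string, and the bias is assumed to be an additive, negative value (the default of -5.0 here is invented for illustration).

```python
from typing import List


def apply_steering(
    action: str,
    logits: List[float],
    token_id: int,
    steering_bias_logit: float = -5.0,  # assumed sign and magnitude
    block_logit: float = -1e9,
) -> List[float]:
    """Return a copy of logits adjusted for a proceed / escalate / halt action."""
    adjusted = list(logits)
    if action == "proceed":
        return adjusted                           # token allowed unchanged
    if action == "escalate":
        adjusted[token_id] += steering_bias_logit  # discouraged, still sampleable
        return adjusted
    if action == "halt":
        adjusted[token_id] = block_logit           # effectively masked
        return adjusted
    raise ValueError(f"unknown steering action: {action!r}")


print(apply_steering("proceed", [1.0, 2.0, 3.0], 1))   # [1.0, 2.0, 3.0]
print(apply_steering("escalate", [1.0, 2.0, 3.0], 1))  # [1.0, -3.0, 3.0]
print(apply_steering("halt", [1.0, 2.0, 3.0], 1))      # [1.0, -1000000000.0, 3.0]
```

An additive bias lowers the token's probability after softmax without removing it, which is what separates escalate from halt.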

Multi-tenant and audit

Every rejection produces a tenant-safe SafetyEvent (hook_scope = "inference_server") carrying the request id, tenant id, threshold, and observed score — never the raw prompt or response — so the pre-sampling guard feeds the same hash-chained audit trail as the rest of Director-AI.
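As a generic illustration of what "hash-chained" means here, the sketch below links audit records by including each record's predecessor hash in its own SHA-256 digest. It is not Director-AI's audit format: `append_event`, `verify_chain`, the record layout, and the event field names other than `hook_scope` are assumptions made for this example.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder predecessor hash for the first record


def _digest(event: Dict, prev_hash: str) -> str:
    body = {"event": event, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], event: Dict) -> Dict:
    """Append an event, linking it to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"event": event, "prev_hash": prev_hash, "hash": _digest(event, prev_hash)}
    chain.append(record)
    return record


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every digest and check each link to its predecessor."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(record["event"], record["prev_hash"]):
            return False
        prev_hash = record["hash"]
    return True


chain: List[Dict] = []
# The event holds ids, threshold, and score only: no prompt or response text.
append_event(chain, {"hook_scope": "inference_server", "request_id": "req-1",
                     "tenant_id": "tenant-a", "threshold": 0.4, "score": 0.12})
append_event(chain, {"hook_scope": "inference_server", "request_id": "req-2",
                     "tenant_id": "tenant-a", "threshold": 0.4, "score": 0.05})
print(verify_chain(chain))        # True
chain[0]["event"]["score"] = 0.9  # tamper with an earlier record
print(verify_chain(chain))        # False
```

Because each digest covers the previous record's hash, altering any earlier event invalidates that record and breaks verification for the chain as a whole.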