Tier-6 Reasoning Escalation

ReasoningScorer is an escalation-only reasoning tier above the NLI scorer (Tier 5). The NLI scorer is fast but verdict-only — a single divergence number with no rationale and no harm taxonomy. This tier adds a causal-LM safety chain-of-thought that fires only when the lower tier is borderline, so the median request never pays for it.

When it fires

The tier is consulted only when the composite coherence score sits within escalation_margin (default 0.15) of the decision boundary — the band where the lower tier's approve/reject call is least certain:

       reject                borderline                approve
   ───────────────|===============================|───────────────►  coherence
             threshold-m                     threshold+m
                  └── reasoning tier fires here ──┘

Confident scores on either side skip the tier entirely. The tier is disabled by default, and enabling it changes nothing for requests whose scores fall outside the band.
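The band check can be sketched as follows. This is a minimal illustration rather than the library's code: `in_escalation_band` is a hypothetical helper, and treating the band edges as inclusive is an assumption the text does not confirm.

```python
def in_escalation_band(score: float, threshold: float = 0.5, margin: float = 0.15) -> bool:
    """Return True when `score` lies within `margin` of `threshold`.

    Hypothetical helper for illustration; the inclusive comparison is an assumption.
    """
    return abs(score - threshold) <= margin


# Scores well inside and well outside the band around threshold=0.5, margin=0.15.
samples = (0.20, 0.40, 0.62, 0.90)
fired = {s: in_escalation_band(s) for s in samples}
print(fired)
```

With the defaults shown, the band runs from 0.35 to 0.65, so 0.40 and 0.62 would escalate while 0.20 and 0.90 would skip the tier.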

Structured verdict

When it fires, the reasoning model returns a ReasoningVerdict:

| Field | Meaning |
| --- | --- |
| `approved` | the reasoning approve/reject decision |
| `confidence` | 0–1 confidence in the verdict |
| `rationale` | one-sentence explanation |
| `harm_category` | a canonical `HarmCategory` (HarmBench taxonomy) or `None` |
| `detected_issues` | short specific issue strings |
| `adjusted_score` | the blended composite coherence |

The model's free-form category string is normalised through the same to_harm_category() taxonomy the input sanitizer uses, so a Tier-6 verdict and a sanitizer block speak the same seven-category language.

The verdict blends into the lower-tier score at the same 30/70 confidence-scaled ratio the LLM judge uses. Approval then requires two things: the blended score must clear the threshold, and the reasoning verdict must approve. A confident safety rejection therefore halts a borderline output, while an unavailable backend or an unparsable reply leaves the lower-tier verdict untouched (the tier never silently flips a decision on failure).
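The approval rule described above might look like the sketch below. `VerdictStub` and `final_approval` are hypothetical names introduced for illustration, the blended score is taken as already computed (the exact blending formula is not given here), and reading "clear the threshold" as `>=` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerdictStub:
    """Stand-in carrying only the two verdict fields the rule needs."""
    approved: bool
    adjusted_score: float  # blended composite coherence, computed elsewhere


def final_approval(lower_score: float, threshold: float,
                   verdict: Optional[VerdictStub]) -> bool:
    """Combine the lower-tier score with an optional reasoning verdict."""
    if verdict is None:
        # Backend unavailable or reply unparsable: keep the lower-tier decision.
        return lower_score >= threshold
    # Both conditions must hold: blended score clears the bar AND the verdict approves.
    return verdict.adjusted_score >= threshold and verdict.approved


print(final_approval(0.55, 0.5, None))
print(final_approval(0.55, 0.5, VerdictStub(approved=False, adjusted_score=0.52)))
```

In this sketch a rejection from the reasoning tier wins even when the blended score stays above the threshold, which matches the "halts a borderline output" behaviour described above.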

Backends

Three backends mirror the LLM judge:

"local" — a local causal-LM loaded with transformers (no API calls);
"openai" — OpenAI Chat Completions;
"anthropic" — Anthropic Messages.

With an external provider, the borderline prompt/response is sent to that provider; set privacy_mode=True to redact PII first.

Enabling

Via DirectorConfig:

from director_ai.core.config import DirectorConfig

config = DirectorConfig(
    reasoning_enabled=True,
    reasoning_provider="local",          # or "openai" / "anthropic"
    reasoning_model="Qwen/Qwen2.5-7B-Instruct",
    reasoning_escalation_margin=0.15,
)
scorer = config.build_scorer()

Or directly:

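from director_ai.core import CoherenceScorer

scorer = CoherenceScorer(
    threshold=0.5,
    reasoning_enabled=True,
    reasoning_provider="anthropic",
)

# Placeholder inputs for illustration; substitute the real prompt/response pair.
prompt = "What is the capital of France?"
response = "The capital of France is Paris."

approved, score = scorer.review(prompt, response)
# The reasoning fields are only meaningful when the tier actually fired.
if score.reasoning_escalated:
    print(score.reasoning_harm_category, score.reasoning_rationale)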

Full API

director_ai.core.scoring.reasoning_scorer.ReasoningScorer

ReasoningScorer(provider: str = '', model: str = '', model_revision: str | None = None, escalation_margin: float = 0.15, device: str | None = None, privacy_mode: bool = False, max_new_tokens: int = 256, cost_callback: CostCallback | None = None)

Causal-LM reasoning tier consulted only on borderline lower-tier scores.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `provider` | `str` | `"openai"`, `"anthropic"`, `"local"`, or `""` (disabled). | `''` |
| `model` | `str` | Model id (HuggingFace path for local, API model name otherwise). | `''` |
| `model_revision` | `str \| None` | Immutable revision to pin when the local backend loads a model from a remote hub. | `None` |
| `escalation_margin` | `float` | Half-width of the borderline band around the decision boundary within which the tier fires (default 0.15, slightly wider than the LLM judge's). | `0.15` |
| `device` | `str \| None` | Torch device for the local model. | `None` |
| `privacy_mode` | `bool` | Redact prompt/response before sending to an external provider. | `False` |
| `max_new_tokens` | `int` | Generation budget for the local causal-LM rationale. | `256` |

enabled `property`

enabled: bool

True when a provider is configured.

should_escalate

should_escalate(score: float, *, centre: float = 0.5) -> bool

True when score is within escalation_margin of centre.

centre is the lower tier's decision boundary (its effective threshold), so the tier fires precisely in the band where the lower tier's approve/reject call is least certain.

reason

reason(prompt: str, response: str, score: float, *, task_type: str = 'default', evidence_text: str = '', redactor: Redactor | None = None) -> ReasoningVerdict | None

Run the reasoning tier and return a structured verdict.

Returns None when the backend is unavailable or its reply cannot be parsed, so the caller keeps the lower tier's decision unchanged (the reasoning tier never silently flips a verdict on failure).

director_ai.core.scoring.reasoning_scorer.ReasoningVerdict `dataclass`

ReasoningVerdict(approved: bool, confidence: float, rationale: str, harm_category: HarmCategory | None = None, detected_issues: list[str] = list(), adjusted_score: float | None = None)

Structured outcome of a Tier-6 reasoning escalation.

to_dict

to_dict() -> dict[str, Any]

JSON-safe payload (never carries raw prompt/response text).
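To illustrate what a JSON-safe payload could look like, here is a self-contained sketch that mirrors the field list of the dataclass signature above. `VerdictSketch` and `HarmCategoryStub` are stand-ins, the enum member is hypothetical (the real `HarmCategory` members are not listed on this page), and the serialisation shown is an assumption rather than the library's implementation.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Optional


class HarmCategoryStub(Enum):
    # Hypothetical member for illustration only.
    EXAMPLE = "example_category"


@dataclass
class VerdictSketch:
    approved: bool
    confidence: float
    rationale: str
    harm_category: Optional[HarmCategoryStub] = None
    # default_factory gives every instance its own list.
    detected_issues: list[str] = field(default_factory=list)
    adjusted_score: Optional[float] = None

    def to_dict(self) -> dict[str, Any]:
        payload = asdict(self)
        # Enum members are not JSON-serialisable, so store the plain value.
        payload["harm_category"] = (
            self.harm_category.value if self.harm_category is not None else None
        )
        return payload


verdict = VerdictSketch(
    approved=False,
    confidence=0.9,
    rationale="Placeholder rationale.",
    harm_category=HarmCategoryStub.EXAMPLE,
    detected_issues=["placeholder issue"],
)
print(json.dumps(verdict.to_dict()))
```

The payload contains only the verdict fields, so no prompt or response text can leak through it in this sketch.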