Policy Evaluation
evaluate_policy_variants() compares thresholds, profiles, or scorer policies
on the same labelled sample set. It is intended for controlled internal A/B
evaluation before changing production thresholds.
from director_ai import (
    LabelledPolicySample,
    PolicyVariant,
    evaluate_policy_variants,
)

samples = [
    LabelledPolicySample(
        prompt="q1",
        response="supported answer",
        label=True,
        score=0.91,
        dataset="regression",
        benchmark_evidence=True,
    ),
    LabelledPolicySample(
        prompt="q2",
        response="unsupported answer",
        label=False,
        score=0.12,
        dataset="regression",
        benchmark_evidence=True,
    ),
]

report = evaluate_policy_variants(
    samples,
    variants=[
        PolicyVariant(name="balanced", threshold=0.5),
        PolicyVariant(name="strict", threshold=0.7, profile="medical"),
    ],
)
Each variant receives the same samples. The result contains balanced accuracy, precision, recall, false-positive rate, false-negative rate, and the confusion matrix.
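The metrics listed above can be sketched in plain Python to show how one threshold turns scores into a confusion matrix. This is a minimal illustration, not the harness's implementation: `ConfusionCounts`, `confusion_at_threshold`, and `summarise` are hypothetical names, and the sketch assumes that a score at or above the threshold predicts the positive label and that an empty denominator yields 0.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for the four confusion-matrix cells."""
    tp: int
    fp: int
    tn: int
    fn: int


def confusion_at_threshold(labels, scores, threshold):
    """Count outcomes, assuming score >= threshold predicts the positive label."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    # Assumption: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def summarise(c):
    """Derive the rate metrics from the confusion counts."""
    recall = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": recall,
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
    }


# Illustrative labels and scores, evaluated at two thresholds.
labels = [True, False, True, False]
scores = [0.91, 0.12, 0.55, 0.62]
balanced = summarise(confusion_at_threshold(labels, scores, 0.5))
strict = summarise(confusion_at_threshold(labels, scores, 0.7))
```

On this toy data both thresholds reach the same balanced accuracy, but the lower threshold pays for it with false positives and the higher one with false negatives, which is why the error rates are worth reading alongside the headline metric.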
A/B Comparison
Use compare_policy_variants() for a two-arm baseline/candidate comparison.
from director_ai import PolicyVariant, compare_policy_variants

comparison = compare_policy_variants(
    samples,
    baseline=PolicyVariant(name="current", threshold=0.7),
    candidate=PolicyVariant(name="candidate", threshold=0.5),
)

print(comparison.delta_balanced_accuracy)
print(comparison.winner)
Scoring Callback
If samples do not carry cached scores, pass a score function. The callback receives the sample and variant so deployments can build profile-specific scorers outside the harness.
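# `scorers` is assumed to map each variant name to a scorer built by the
# deployment; `variants` is the list of PolicyVariant objects under evaluation.
def score_fn(sample, variant):
    scorer = scorers[variant.name]
    approved, score = scorer.review(sample.prompt, sample.response)
    return score.score

report = evaluate_policy_variants(samples, variants=variants, score_fn=score_fn)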
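The relationship between cached scores and the callback can be sketched as a small resolution step. The sketch below is not the harness's code. `Sample`, `Variant`, `resolve_score`, and `length_score` are hypothetical stand-ins, and it assumes that a cached score takes precedence over the callback and that a sample with neither is an error.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for LabelledPolicySample (only the fields used here)."""
    prompt: str
    response: str
    score: Optional[float] = None


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for PolicyVariant (only the fields used here)."""
    name: str
    threshold: float


def resolve_score(
    sample: Sample,
    variant: Variant,
    score_fn: Optional[Callable[[Sample, Variant], float]] = None,
) -> float:
    # Assumption: a cached score on the sample wins over the callback.
    if sample.score is not None:
        return sample.score
    # Assumption: with no cached score and no callback there is nothing to score.
    if score_fn is None:
        raise ValueError("sample has no cached score and no score_fn was given")
    return float(score_fn(sample, variant))


def length_score(sample: Sample, variant: Variant) -> float:
    # Toy scorer for illustration only: longer responses score higher, capped at 1.0.
    return min(len(sample.response) / 20, 1.0)


variant = Variant(name="balanced", threshold=0.5)
cached = resolve_score(Sample("q1", "supported answer", score=0.91), variant)
computed = resolve_score(Sample("q2", "ten chars!"), variant, length_score)
```

Passing the variant into the callback is what lets a deployment pick a different scorer per profile, as in the `scorers[variant.name]` lookup above.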
Provenance Guardrails
The report separates real benchmark rows, internal rows, and synthetic rows.
public_benchmark_eligible is true only when every sample is real benchmark
evidence from exactly one named dataset.
Synthetic or mixed-provenance reports are valid for internal engineering decisions, but they must not be copied into public benchmark claims.
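The eligibility rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `Row`, `provenance_kind`, and `summarise_provenance` are hypothetical names, the category keys `"benchmark"`, `"internal"`, and `"synthetic"` are invented for the example, and the sketch assumes the `synthetic` flag takes precedence when both flags are set.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in exposing three LabelledPolicySample fields."""
    dataset: str = ""
    synthetic: bool = False
    benchmark_evidence: bool = False


def provenance_kind(row):
    # Assumed precedence: a synthetic row is never counted as benchmark evidence.
    if row.synthetic:
        return "synthetic"
    if row.benchmark_evidence:
        return "benchmark"
    return "internal"


def summarise_provenance(rows):
    """Return (counts per category, named datasets, eligibility flag)."""
    counts = Counter(provenance_kind(row) for row in rows)
    datasets = sorted({row.dataset for row in rows if row.dataset})
    eligible = (
        len(rows) > 0
        and counts["benchmark"] == len(rows)   # every row is real benchmark evidence
        and all(row.dataset for row in rows)   # every row names its dataset
        and len(datasets) == 1                 # and they all name the same one
    )
    return dict(counts), tuple(datasets), eligible


clean = [Row("regression", benchmark_evidence=True)] * 2
mixed = clean + [Row("regression", synthetic=True)]
two_sets = clean + [Row("holdout", benchmark_evidence=True)]

clean_counts, clean_datasets, clean_ok = summarise_provenance(clean)
mixed_counts, _, mixed_ok = summarise_provenance(mixed)
_, two_datasets, two_ok = summarise_provenance(two_sets)
```

A single synthetic row or a second dataset name is enough to clear the flag, which matches the rule that mixed-provenance reports stay internal.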
Full API
director_ai.core.evaluation.policy.LabelledPolicySample
dataclass
LabelledPolicySample(prompt: str, response: str, label: bool, score: float | None = None, dataset: str = '', synthetic: bool = False, benchmark_evidence: bool = False, metadata: Mapping[str, str] = dict())
One labelled prompt/response sample used for policy evaluation.
director_ai.core.evaluation.policy.PolicyVariant
dataclass
PolicyVariant(name: str, threshold: float, profile: str = '', w_logic: float = 0.6, w_fact: float = 0.4, metadata: Mapping[str, str] = dict())
Profile or threshold policy evaluated against the same labelled samples.
director_ai.core.evaluation.policy.PolicyEvaluationReport
dataclass
PolicyEvaluationReport(results: tuple[PolicyVariantResult, ...], sample_count: int, datasets: tuple[str, ...], provenance_counts: Mapping[str, int], public_benchmark_eligible: bool, public_claim_reason: str)
Multi-variant policy evaluation report with provenance guardrails.
director_ai.core.evaluation.policy.PolicyComparisonReport
dataclass
PolicyComparisonReport(baseline: PolicyVariantResult, candidate: PolicyVariantResult, delta_balanced_accuracy: float, delta_false_positive_rate: float, delta_false_negative_rate: float, winner: str, evaluation: PolicyEvaluationReport)
Two-arm A/B comparison extracted from a policy evaluation report.
director_ai.core.evaluation.policy.evaluate_policy_variants
evaluate_policy_variants(samples: Sequence[LabelledPolicySample], *, variants: Sequence[PolicyVariant], score_fn: ScoreFunction | None = None) -> PolicyEvaluationReport
Evaluate all variants on the same labelled sample set.
director_ai.core.evaluation.policy.compare_policy_variants
compare_policy_variants(samples: Sequence[LabelledPolicySample], *, baseline: PolicyVariant, candidate: PolicyVariant, score_fn: ScoreFunction | None = None) -> PolicyComparisonReport
Run a controlled two-arm policy comparison.