Policy Evaluation

evaluate_policy_variants() compares thresholds, profiles, or scorer policies on the same labelled sample set. It is intended for controlled internal A/B evaluation before changing production thresholds.

from director_ai import (
    LabelledPolicySample,
    PolicyVariant,
    evaluate_policy_variants,
)

samples = [
    LabelledPolicySample(
        prompt="q1",
        response="supported answer",
        label=True,
        score=0.91,
        dataset="regression",
        benchmark_evidence=True,
    ),
    LabelledPolicySample(
        prompt="q2",
        response="unsupported answer",
        label=False,
        score=0.12,
        dataset="regression",
        benchmark_evidence=True,
    ),
]

report = evaluate_policy_variants(
    samples,
    variants=[
        PolicyVariant(name="balanced", threshold=0.5),
        PolicyVariant(name="strict", threshold=0.7, profile="medical"),
    ],
)

Every variant is evaluated on the same samples. Each variant's result reports balanced accuracy, precision, recall, false-positive rate, false-negative rate, and the confusion matrix.
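To make the reported metrics concrete, here is a minimal standalone sketch of how they follow from a confusion matrix at a given threshold. It does not use director_ai and is not the library's implementation. It assumes a response counts as approved when `score >= threshold` and that `label=True` is the positive class; `Confusion`, `confusion_at`, and `metrics` are hypothetical names, and the third and fourth scores are illustrative values added so the two thresholds disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int


def confusion_at(scores, labels, threshold):
    """Count outcomes, assuming approval means score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        approved = score >= threshold
        if approved and label:
            tp += 1
        elif approved and not label:
            fp += 1
        elif not approved and label:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def _ratio(num, den):
    # Return 0.0 for an empty denominator instead of raising.
    return num / den if den else 0.0


def metrics(c):
    recall = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": _ratio(c.tp, c.tp + c.fp),
        "recall": recall,
        "false_positive_rate": _ratio(c.fp, c.fp + c.tn),
        "false_negative_rate": _ratio(c.fn, c.fn + c.tp),
    }


# Illustrative scores and labels, not real evaluation data.
scores = [0.91, 0.12, 0.62, 0.55]
labels = [True, False, True, False]
balanced = metrics(confusion_at(scores, labels, 0.5))
strict = metrics(confusion_at(scores, labels, 0.7))
```

On this toy data both thresholds reach the same balanced accuracy, but the lower threshold pays with a false positive and the higher one with a false negative, which is why the report lists both error rates alongside the headline number.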

A/B Comparison

Use compare_policy_variants() for a two-arm baseline/candidate comparison.

from director_ai import compare_policy_variants

comparison = compare_policy_variants(
    samples,
    baseline=PolicyVariant(name="current", threshold=0.7),
    candidate=PolicyVariant(name="candidate", threshold=0.5),
)

print(comparison.delta_balanced_accuracy)
print(comparison.winner)

Scoring Callback

If samples do not carry cached scores, pass a score function. The callback receives the sample and variant so deployments can build profile-specific scorers outside the harness.

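# `scorers` maps each variant name to a scorer built by the deployment.
def score_fn(sample, variant):
    scorer = scorers[variant.name]
    # review() returns the approval decision and a score object; only the score is needed here.
    _approved, score = scorer.review(sample.prompt, sample.response)
    return score.score

# `variants` is the list of PolicyVariant objects under evaluation.
report = evaluate_policy_variants(samples, variants=variants, score_fn=score_fn)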
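The relationship between cached scores and the callback can be sketched as follows. This is a standalone illustration, not the harness's code: `Sample`, `Variant`, and `resolve_score` are hypothetical stand-ins, and the precedence shown (a cached score wins, the callback is the fallback, and a missing score with no callback is an error) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:  # hypothetical stand-in for LabelledPolicySample
    prompt: str
    response: str
    score: Optional[float] = None


@dataclass(frozen=True)
class Variant:  # hypothetical stand-in for PolicyVariant
    name: str
    threshold: float


def resolve_score(sample, variant, score_fn=None):
    # Assumption: a cached score takes precedence over the callback.
    if sample.score is not None:
        return sample.score
    if score_fn is None:
        raise ValueError("sample has no cached score and no score_fn was given")
    return float(score_fn(sample, variant))


def fake_score_fn(sample, variant):
    # Toy scorer: the strict variant scores every response lower.
    return 0.4 if variant.name == "strict" else 0.8


cached = Sample("q1", "supported answer", score=0.91)
uncached = Sample("q2", "unsupported answer")
strict = Variant("strict", 0.7)
balanced = Variant("balanced", 0.5)
```

Because the callback receives the variant, the same uncached sample can get a different score under each policy, which is what makes profile-specific scorers possible.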

Provenance Guardrails

The report separates real benchmark rows, internal rows, and synthetic rows. public_benchmark_eligible is true only when every sample is real benchmark evidence from exactly one named dataset.

Synthetic or mixed-provenance reports are valid for internal engineering decisions, but they must not be copied into public benchmark claims.
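The eligibility rule above can be sketched as a small standalone check. This is an illustration under stated assumptions, not the library's implementation: `Row`, `provenance_bucket`, and `is_public_benchmark_eligible` are hypothetical names, the bucket labels are invented for the example, and treating the `synthetic` flag as overriding `benchmark_evidence` is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:  # hypothetical stand-in for the provenance fields of a sample
    dataset: str = ""
    synthetic: bool = False
    benchmark_evidence: bool = False


def provenance_bucket(row):
    # Assumption: the synthetic flag wins over benchmark_evidence.
    if row.synthetic:
        return "synthetic"
    if row.benchmark_evidence:
        return "benchmark"
    return "internal"


def is_public_benchmark_eligible(rows):
    if not rows:
        return False
    counts = Counter(provenance_bucket(r) for r in rows)
    datasets = {r.dataset for r in rows}
    only_benchmark = counts["benchmark"] == len(rows)
    # Exactly one dataset, and it must have a non-empty name.
    one_named_dataset = len(datasets) == 1 and "" not in datasets
    return only_benchmark and one_named_dataset


clean = [Row("regression", benchmark_evidence=True)] * 2
with_synthetic = clean + [Row("regression", synthetic=True)]
two_datasets = clean + [Row("holdout", benchmark_evidence=True)]
unnamed = [Row("", benchmark_evidence=True)]
```

Only `clean` passes in this sketch: a single synthetic row, a second dataset, or a missing dataset name is enough to make the whole report ineligible.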

Full API

director_ai.core.evaluation.policy.LabelledPolicySample dataclass

LabelledPolicySample(prompt: str, response: str, label: bool, score: float | None = None, dataset: str = '', synthetic: bool = False, benchmark_evidence: bool = False, metadata: Mapping[str, str] = dict())

One labelled prompt/response sample used for policy evaluation.

director_ai.core.evaluation.policy.PolicyVariant dataclass

PolicyVariant(name: str, threshold: float, profile: str = '', w_logic: float = 0.6, w_fact: float = 0.4, metadata: Mapping[str, str] = dict())

Profile or threshold policy evaluated against the same labelled samples.

director_ai.core.evaluation.policy.PolicyEvaluationReport dataclass

PolicyEvaluationReport(results: tuple[PolicyVariantResult, ...], sample_count: int, datasets: tuple[str, ...], provenance_counts: Mapping[str, int], public_benchmark_eligible: bool, public_claim_reason: str)

Multi-variant policy evaluation report with provenance guardrails.

director_ai.core.evaluation.policy.PolicyComparisonReport dataclass

PolicyComparisonReport(baseline: PolicyVariantResult, candidate: PolicyVariantResult, delta_balanced_accuracy: float, delta_false_positive_rate: float, delta_false_negative_rate: float, winner: str, evaluation: PolicyEvaluationReport)

Two-arm A/B comparison extracted from a policy evaluation report.

director_ai.core.evaluation.policy.evaluate_policy_variants

evaluate_policy_variants(samples: Sequence[LabelledPolicySample], *, variants: Sequence[PolicyVariant], score_fn: ScoreFunction | None = None) -> PolicyEvaluationReport

Evaluate all variants on the same labelled sample set.

director_ai.core.evaluation.policy.compare_policy_variants

compare_policy_variants(samples: Sequence[LabelledPolicySample], *, baseline: PolicyVariant, candidate: PolicyVariant, score_fn: ScoreFunction | None = None) -> PolicyComparisonReport

Run a controlled two-arm policy comparison.