Online Calibration¶
Improve the guardrail from production feedback — the longer you use it, the better it gets for your data.
Why Online Calibration?¶
Every guardrail ships with a default threshold (0.30 for director-ai). But the optimal threshold depends on your deployment: your documents, your domain, your tolerance for false positives vs false negatives. Online calibration collects human corrections and automatically adjusts the threshold to minimize errors on your actual data.
Workflow¶
Deploy → Collect Feedback → Calibrate → Apply → Deploy (improved)
↑ │
└────────────────────────────────────────────────┘
Collecting Feedback¶
from director_ai import FeedbackStore
store = FeedbackStore("my_deployment.db")
# After a human reviews a guardrail decision:
store.report(
prompt="What is our refund policy?",
response="We offer 60-day refunds on all products.",
guardrail_approved=True, # guardrail said: approved
human_approved=False, # human says: wrong (it's 30-day)
guardrail_score=0.62,
domain="customer_support",
)
# Corrections accumulate over time
print(f"Total corrections: {store.count()}")
print(f"Disagreements: {len(store.get_disagreements())}")
What Gets Stored¶
Each correction records: - The prompt and response text - The guardrail's score and verdict - The human's verdict (approved or not) - A domain tag (optional, for per-domain calibration) - Timestamp
Storage is SQLite by default (single file, no server needed). Thread-safe with WAL mode for concurrent access.
Calibrating¶
from director_ai import OnlineCalibrator, FeedbackStore
store = FeedbackStore("my_deployment.db")
calibrator = OnlineCalibrator(store, min_corrections=20)
# After enough corrections accumulate:
report = calibrator.calibrate()
print(f"Corrections: {report.correction_count}")
print(f"Current accuracy: {report.current_accuracy:.1%}")
print(f"FPR: {report.fpr:.3f} ± {report.fpr_ci:.3f}")
print(f"FNR: {report.fnr:.3f} ± {report.fnr_ci:.3f}")
if report.optimal_threshold is not None:
print(f"Optimal threshold: {report.optimal_threshold}")
else:
print("Insufficient data for threshold optimization")
Per-Domain Calibration¶
# Calibrate medical domain separately
med_report = calibrator.calibrate(domain="medical")
fin_report = calibrator.calibrate(domain="finance")
print(f"Medical FPR: {med_report.fpr:.3f}")
print(f"Finance FPR: {fin_report.fpr:.3f}")
Confidence Intervals¶
Error rates include Wilson score 95% confidence intervals. With 50 corrections where 3 are false positives:
This is what makes "we guarantee <X% hallucination rate" measurable per deployment — not a marketing claim, but a statistical fact with confidence bounds.
Exporting Training Data¶
After sufficient corrections accumulate (500+), export as a fine-tuning dataset:
training_data = store.export_training_data()
# [{"prompt": "...", "response": "...", "label": 0, "domain": "medical"}, ...]
import json
with open("finetune_data.jsonl", "w") as f:
for item in training_data:
f.write(json.dumps(item) + "\n")
This dataset can be used with finetune_nli() to train a domain-specific NLI model.
Calibration Report¶
@dataclass
class CalibrationReport:
correction_count: int
optimal_threshold: float | None # None if insufficient data
current_accuracy: float
tpr: float # true positive rate
tnr: float # true negative rate
fpr: float # false positive rate
fnr: float # false negative rate
fpr_ci: float # 95% CI half-width
fnr_ci: float # 95% CI half-width
Data Moat¶
The feedback loop creates a switching cost: the longer a customer uses director-ai, the more calibration data they accumulate. Switching to a competitor means losing that deployment-specific accuracy improvement.