Feature Depth Assessment¶
Honest audit of two features that external reviewers flagged as potentially "scaffolding" or "marketing claims": the agent loop monitor and online calibration system. Both were audited at source level on 2026-04-05.
1. Agent Loop Monitor¶
Location: src/director_ai/agentic/loop_monitor.py (240 lines)
What It Does¶
Five independent detection mechanisms running per agent step:
| Mechanism | Implementation | Trigger |
|---|---|---|
| Circular tool calls | Counter-based action hash tracking | Same tool+args repeated ≥3× |
| Goal drift | Jaccard similarity between goal words and action | Similarity drops below threshold |
| Budget exhaustion | Cumulative token tracking | Warns at 90%, halts at 100% |
| Wall-clock timeout | Monotonic time comparison | Elapsed > max_seconds |
| Step count limit | Simple counter | Steps > max_steps (default 50) |
Maturity¶
- Tests: 18 test classes, 181 lines (circular detection, drift scoring, budget, timeout, step limits, loop status)
- Integration: Wired into
server.pyas HTTP endpoint/v1/agent/step— receives step history, returnsStepVerdictwith halt/warn/continue - CLI integration: Used in
_cli_verify.pyfor goal-based verification - Verdict: Production-ready. Not scaffolding.
Known Limitations¶
- Default goal drift scorer uses Jaccard word overlap (fast, simple). For semantic drift detection, the scorer is pluggable — replace with NLI-based callback for production. Documented in docstring.
- No persistent state across requests — each
/v1/agent/stepcall receives full step history. Stateless by design (the caller tracks state).
2. Online Calibration (FeedbackStore + OnlineCalibrator)¶
Location:
- src/director_ai/core/calibration/feedback_store.py (188 lines)
- src/director_ai/core/calibration/online_calibrator.py (164 lines)
What It Does¶
FeedbackStore — SQLite3 database (WAL mode, thread-safe) storing human feedback:
- report() — stores (prompt, response, guardrail_score, guardrail_approved, human_approved, domain, timestamp)
- get_corrections() — retrieves with domain filter
- get_disagreements() — filters where guardrail ≠ human (FP/FN)
- export_training_data() — outputs fine-tuning format
OnlineCalibrator — threshold optimisation from feedback:
- Computes confusion matrix (TP, TN, FP, FN) from guardrail vs. human verdicts
- Calculates TPR, TNR, FPR, FNR with Wilson 95% confidence intervals
- Threshold sweep: tests 91 thresholds (5%–95%), finds threshold maximising balanced accuracy (TPR + TNR) / 2
- Returns CalibrationReport with optimal_threshold (requires ≥20 corrections with scores)
Maturity¶
- Tests: 24 test classes, 343 lines (empty store, perfect guardrail, all-FP, mixed errors, threshold separation, domain filtering, performance bounds)
- Integration: Wired into
ProductionGuardclass inguard.py, paired withConformalPredictorfor confidence intervals - Demo:
examples/online_calibration_demo.py— realistic 25-correction scenario with per-domain calibration - Verdict: Production-ready. Real threshold optimisation, real database, real statistics.
Known Limitations¶
FeedbackStore.report()must be called explicitly by the integration layer — it does not auto-hook into the scoring pipeline. The caller decides when to record feedback.- Calibrated threshold is computed on-demand via
calibrate()— it is not persisted or auto-applied. The caller must read theCalibrationReport.optimal_thresholdand update their scorer configuration. - Minimum 20 corrections required for threshold recommendation. Below that, returns
None.
Side-by-Side Summary¶
| Aspect | Loop Monitor | Online Calibration |
|---|---|---|
| Implementation | 240 lines | 352 lines |
| Tests | 18 classes (181 lines) | 24 classes (343 lines) |
| Scaffolding? | No — fully functional | No — fully functional |
| Server integration | /v1/agent/step endpoint |
ProductionGuard class |
| Statistical rigour | Jaccard heuristic (pluggable) | Wilson CIs, balanced accuracy |
| Known gaps | Default scorer is simple | Manual report() wiring |
Conclusion¶
Both features are implemented, tested, and integrated — not stubs or future placeholders. The implementation is intentionally pragmatic: simple defaults (Jaccard for drift, SQLite for feedback) with escape hatches for heavier backends (NLI scorer, external databases). This is a deliberate design choice, not incomplete work.
External reviewers questioning depth should inspect the test suites:
- tests/test_loop_monitor.py
- tests/test_calibration.py