Feature Depth Assessment

Honest audit of two features that external reviewers flagged as potentially "scaffolding" or "marketing claims": the agent loop monitor and online calibration system. Both were audited at source level on 2026-04-05.


1. Agent Loop Monitor

Location: src/director_ai/agentic/loop_monitor.py (240 lines)

What It Does

Five independent detection mechanisms running per agent step:

Mechanism           | Implementation                                   | Trigger
Circular tool calls | Counter-based action hash tracking               | Same tool+args repeated ≥3×
Goal drift          | Jaccard similarity between goal words and action | Similarity drops below threshold
Budget exhaustion   | Cumulative token tracking                        | Warns at 90%, halts at 100%
Wall-clock timeout  | Monotonic time comparison                        | Elapsed > max_seconds
Step count limit    | Simple counter                                   | Steps > max_steps (default 50)
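
To make the first row concrete, here is a minimal sketch of counter-based action-hash tracking. The names (action_hash, CircularCallDetector) are illustrative, not the module's actual API:

```python
import hashlib
import json
from collections import Counter


def action_hash(tool: str, args: dict) -> str:
    """Stable hash of a tool call: identical tool+args -> identical hash."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class CircularCallDetector:
    """Counts repeated identical calls; flags once a call repeats >=3 times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._counts = Counter()

    def observe(self, tool: str, args: dict) -> bool:
        h = action_hash(tool, args)
        self._counts[h] += 1
        return self._counts[h] >= self.max_repeats
```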

Maturity

  • Tests: 18 test classes, 181 lines (circular detection, drift scoring, budget, timeout, step limits, loop status)
  • Integration: Wired into server.py as HTTP endpoint /v1/agent/step — receives step history, returns StepVerdict with halt/warn/continue (a hypothetical example call follows this list)
  • CLI integration: Used in _cli_verify.py for goal-based verification
  • Verdict: Production-ready. Not scaffolding.
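
For reference, a call to that endpoint might look like the sketch below. The route and the halt/warn/continue verdict are documented above, but the payload field names (goal, steps, tool, args) and the response shape are assumptions for illustration only:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/agent/step",
    json={
        "goal": "summarise the quarterly report",
        "steps": [  # full step history: the endpoint is stateless
            {"tool": "read_file", "args": {"path": "q3.pdf"}},
            {"tool": "read_file", "args": {"path": "q3.pdf"}},
        ],
    },
)
verdict = resp.json()  # e.g. {"action": "warn", ...} -- shape assumed
```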

Known Limitations

  • Default goal drift scorer uses Jaccard word overlap (fast, simple). For semantic drift detection, the scorer is pluggable — replace with an NLI-based callback for production (see the sketch after this list). Documented in the docstring.
  • No persistent state across requests — each /v1/agent/step call receives full step history. Stateless by design (the caller tracks state).
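
The pluggable-scorer pattern could look like this sketch; jaccard_drift matches the documented default behaviour, while GoalDriftChecker and its constructor signature are illustrative assumptions:

```python
from typing import Callable


def jaccard_drift(goal: str, action_text: str) -> float:
    """Default scorer: word-set overlap between goal and action text."""
    g, a = set(goal.lower().split()), set(action_text.lower().split())
    return len(g & a) / len(g | a) if g | a else 0.0


class GoalDriftChecker:
    """Flags drift when similarity to the goal falls below a threshold."""

    def __init__(self, goal: str, threshold: float = 0.1,
                 scorer: Callable[[str, str], float] = jaccard_drift):
        self.goal, self.threshold, self.scorer = goal, threshold, scorer

    def drifted(self, action_text: str) -> bool:
        return self.scorer(self.goal, action_text) < self.threshold


# Swapping in a semantic scorer is a one-line change: any callback with
# the same (goal, action_text) -> similarity signature works, e.g.
# GoalDriftChecker(goal, scorer=nli_similarity).
```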

2. Online Calibration (FeedbackStore + OnlineCalibrator)

Location:

  • src/director_ai/core/calibration/feedback_store.py (188 lines)
  • src/director_ai/core/calibration/online_calibrator.py (164 lines)

What It Does

FeedbackStore — SQLite3 database (WAL mode, thread-safe) storing human feedback:

  • report() — stores (prompt, response, guardrail_score, guardrail_approved, human_approved, domain, timestamp)
  • get_corrections() — retrieves stored corrections, with an optional domain filter
  • get_disagreements() — filters to rows where guardrail ≠ human (false positives and false negatives)
  • export_training_data() — outputs a fine-tuning format
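
A minimal sketch of the storage pattern described above, assuming the documented columns; the table name and SQL layout are illustrative, not the actual schema:

```python
import sqlite3
import time

conn = sqlite3.connect("feedback.db", check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
conn.execute("""
    CREATE TABLE IF NOT EXISTS feedback (
        prompt TEXT, response TEXT,
        guardrail_score REAL, guardrail_approved INTEGER,
        human_approved INTEGER, domain TEXT, timestamp REAL
    )
""")

def report(prompt, response, score, guard_ok, human_ok, domain):
    with conn:  # implicit transaction keeps concurrent writes safe
        conn.execute(
            "INSERT INTO feedback VALUES (?, ?, ?, ?, ?, ?, ?)",
            (prompt, response, score, int(guard_ok), int(human_ok),
             domain, time.time()),
        )

def get_disagreements(domain=None):
    # Rows where guardrail and human disagreed (FPs and FNs).
    sql = "SELECT * FROM feedback WHERE guardrail_approved != human_approved"
    args = ()
    if domain:
        sql += " AND domain = ?"
        args = (domain,)
    return conn.execute(sql, args).fetchall()
```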

OnlineCalibrator — threshold optimisation from feedback:

  • Computes a confusion matrix (TP, TN, FP, FN) from guardrail vs. human verdicts
  • Calculates TPR, TNR, FPR, FNR with Wilson 95% confidence intervals
  • Threshold sweep: tests 91 thresholds (0.05–0.95 in steps of 0.01) and selects the one maximising balanced accuracy, (TPR + TNR) / 2
  • Returns a CalibrationReport with optimal_threshold (requires ≥20 corrections with scores)
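
A worked sketch of both statistics under the definitions above; the function names are illustrative, not the module's API:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score 95% confidence interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def sweep_threshold(scores, human_approved):
    """Test 91 thresholds (0.05-0.95, step 0.01); return the one that
    maximises balanced accuracy, (TPR + TNR) / 2."""
    pos = sum(human_approved)
    neg = len(human_approved) - pos
    best_t, best_bacc = None, -1.0
    for i in range(91):
        t = round(0.05 + 0.01 * i, 2)
        tp = sum(1 for s, h in zip(scores, human_approved) if s >= t and h)
        tn = sum(1 for s, h in zip(scores, human_approved) if s < t and not h)
        tpr = tp / pos if pos else 0.0
        tnr = tn / neg if neg else 0.0
        bacc = (tpr + tnr) / 2
        if bacc > best_bacc:
            best_t, best_bacc = t, bacc
    return best_t, best_bacc
```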

Maturity

  • Tests: 24 test classes, 343 lines (empty store, perfect guardrail, all-FP, mixed errors, threshold separation, domain filtering, performance bounds)
  • Integration: Wired into ProductionGuard class in guard.py, paired with ConformalPredictor for confidence intervals
  • Demo: examples/online_calibration_demo.py — realistic 25-correction scenario with per-domain calibration
  • Verdict: Production-ready. Real threshold optimisation, real database, real statistics.

Known Limitations

  • FeedbackStore.report() must be called explicitly by the integration layer — it does not auto-hook into the scoring pipeline. The caller decides when to record feedback.
  • Calibrated threshold is computed on-demand via calibrate() — it is not persisted or auto-applied. The caller must read CalibrationReport.optimal_threshold and update their scorer configuration (a minimal wiring sketch follows this list).
  • Minimum 20 corrections required for threshold recommendation. Below that, returns None.
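
A hypothetical caller-side wiring for both limitations above. The class names, module paths, and report() fields come from this document; the constructor signatures and the scorer-config update are assumptions:

```python
from director_ai.core.calibration.feedback_store import FeedbackStore
from director_ai.core.calibration.online_calibrator import OnlineCalibrator

store = FeedbackStore("feedback.db")  # assumed constructor signature
store.report(
    prompt="Summarise the Q3 report",
    response="...model output...",
    guardrail_score=0.82,
    guardrail_approved=True,
    human_approved=False,  # a false positive worth learning from
    domain="support",
)

calibrator = OnlineCalibrator(store)  # assumed constructor signature
report = calibrator.calibrate()

my_scorer_config = {"threshold": 0.5}  # stand-in for the caller's own config
if report is not None and report.optimal_threshold is not None:
    # Nothing is auto-applied: the caller persists the new threshold.
    my_scorer_config["threshold"] = report.optimal_threshold
```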

Side-by-Side Summary

Aspect             | Loop Monitor                   | Online Calibration
Implementation     | 240 lines                      | 352 lines
Tests              | 18 classes (181 lines)         | 24 classes (343 lines)
Scaffolding?       | No — fully functional          | No — fully functional
Server integration | /v1/agent/step endpoint        | ProductionGuard class
Statistical rigour | Jaccard heuristic (pluggable)  | Wilson CIs, balanced accuracy
Known gaps         | Default scorer is simple       | Manual report() wiring

Conclusion

Both features are implemented, tested, and integrated — not stubs or future placeholders. The implementation is intentionally pragmatic: simple defaults (Jaccard for drift, SQLite for feedback) with escape hatches for heavier backends (NLI scorer, external databases). This is a deliberate design choice, not incomplete work.

External reviewers questioning depth should inspect the test suites directly:

  • tests/test_loop_monitor.py
  • tests/test_calibration.py