Skip to content

Changelog

See the full changelog in CHANGELOG.md on GitHub.

v3.9.4 (2026-03-20)

Fixed

  • Domain profile thresholds: medical 0.75→0.30, finance 0.70→0.30, legal 0.68→0.30 (measured on PubMedQA and FinanceBench).
  • Score calibration: rescales [0.25, 0.55] → [0, 1] when NLI available but no KB loaded.
  • README claims verified: model attribution, hardware context, FPR 2.0%→10.5%.
  • Docker: removed dead registry links, GPU Dockerfile fixed (optimum dep).
  • Code scanning: blake2b for API key hash, pip deps pinned with hashes.
  • All notebooks fixed: correct API signatures, field access patterns, thresholds.
  • __init__.py: all PUBLIC_API.md symbols now importable from top-level.

v3.9.3 (2026-03-19)

Fixed

  • Rust scorer word-overlap heuristic tests.
  • Rust FFI borrow lifetime fix.
  • License tests use env-var signing key consistently.

v3.9.2 (2026-03-19)

Fixed

  • License validation hardened (UUID, HMAC, expiry).
  • Cache scope isolation prevents cross-session replay.
  • Tenant-aware retrieval consistency across all vector store methods.
  • Batch processor catches per-item exceptions gracefully.
  • Config from_profile() re-applies values after __post_init__ override.
  • Redis enterprise store: tenant-prefixed keys, TTL, connection handling.
  • Fine-tuning lazy imports for torch/transformers.

Added

  • 353 new tests — coverage from 81% to 90%+.

v3.9.1 (2026-03-19)

Fixed

  • Cross-tenant cache replay: cache key includes tenant_id.
  • Batch/single scoring parity: review_batch() routes through review().
  • Vector fallback cross-tenant leak: add_fact() prefixes tenant_id.
  • streaming_oversight crash: ingest_token()check_halt().
  • Timeout kills stream: all streaming paths catch TimeoutError.

v3.9.0 (2026-03-15)

Added

  • VerifiedScorer: sentence-level multi-signal fact verification. 5 independent signals: NLI, entity consistency, numerical consistency, negation detection, traceability (fabrication). POST /v1/verify endpoint.
  • Document ingestion API: POST /v1/knowledge/upload (PDF/DOCX/HTML), /ingest, /search, CRUD by doc ID. Chunker, parser, registry modules.
  • Mode selector: --mode general|grounded|auto. Single field replaces 8+ manual config settings.
  • Dataset-type classifier: DatasetTypeClassifier predicts per-input NLI threshold from text features.
  • ColBERT backend: late-interaction retrieval via RAGatouille for higher retrieval recall.
  • Domain embedding tuner: POST /v1/knowledge/tune-embeddings fine-tunes embeddings on ingested documents.
  • Calibrated abstention: returns neutral when retrieval confidence is below threshold.
  • Readiness probe: GET /v1/ready returns 503 when scorer/NLI not loaded.
  • HaltMonitor class (renamed from SafetyKernel, alias kept).
  • LLM judge confidence now scales blend weight.
  • clear_model_cache() for explicit GPU memory release.

Defaults Changed

  • Hybrid retrieval (BM25 + dense + RRF) enabled by default.
  • Cross-encoder reranker enabled by default.
  • RAG claim decomposition enabled by default for all grounded scoring.

Infrastructure

  • 180 files updated to canonical SPDX 5-line headers.
  • Connection pooling on PostgreSQL audit sink.
  • Redis-backed rate limiting when redis_url configured.
  • Latency gate tightened to 5ms avg / 15ms p95.
  • NLI pipeline consistency gate in regression suite.

v3.8.0 (2026-03-14)

Added

  • score() one-call convenience function.
  • DirectorGuard FastAPI middleware.
  • create_proxy_app() OpenAI-compatible guardrail proxy.
  • Config profiles: medical, finance, legal, creative, customer_support, summarization, lite.
  • from_env() environment variable configuration.

Security

  • Input sanitizer: prompt injection, unicode escape, YAML injection detection.
  • PII redactor for privacy mode.
  • Constant-time API key comparison via HMAC.

v3.7.0 (2026-03-10)

Added

  • Sentence-level attribution: ClaimAttribution dataclass maps each summary claim to the source sentence with lowest divergence. Available in ScoringEvidence.attributions and the /v1/review API response.
  • Cost transparency: ScoringEvidence.token_count and estimated_cost_usd track NLI token consumption per check.
  • Domain benchmarks: medical_eval.py (MedNLI + PubMedQA), legal_eval.py (ContractNLI + CUAD/RAGBench), finance_eval.py (FinanceBench + Financial PhraseBank).
  • Fine-tuning pipeline: finetune_nli(), FinetuneConfig, FinetuneResult. CLI: director-ai finetune train.jsonl. Install: pip install director-ai[finetune].
  • Load testing benchmark: concurrent RPS measurement with P50/P95/P99 latency.
  • export_tensorrt() — pre-builds TRT engine cache from ONNX model.
  • CLI director-ai export subcommand (--format onnx|tensorrt).

Performance

  • ONNX CUDA: 4.5ms/pair median (2.4x vs PyTorch 10.9ms), L4 GPU.
  • ONNX FP16: 4.2ms/pair. ONNX CPU: 4.1ms/pair (competitive at batch=4).

v3.6.0 (2026-03-10)

Fixed

  • Summarization FPR reduced from 10.5% to 2.0% — Layer C (claim decomposition + coverage scoring) decomposes summaries into atomic claims, scores each against source via NLI, computes coverage. Blended with Layer A: final = 0.4 * (1 - coverage) + 0.6 * layer_a. All three task types now below 5% FPR.

Added

  • NLIScorer.score_claim_coverage() method
  • Config: nli_claim_coverage_enabled, nli_claim_support_threshold (0.6), nli_claim_coverage_alpha (0.4)
  • ScoringEvidence includes claim_coverage, per_claim_divergences, claims
  • 21 new tests (2084 total)
  • Claim coverage FPR diagnostic benchmark

v3.5.0 (2026-03-10)

Fixed

  • Summarization FPR reduced from 25.5% to 10.5% — bidirectional NLI scores both source→summary and summary→source, takes min. Baseline calibration (0.20) shifts expected NLI noise to zero.
  • Dialogue FPR reduced from 97.5% to 4.5% — bidirectional NLI + baseline=0.80.

Added

  • _summarization_factual_divergence() method with bidirectional NLI scoring
  • nli_summarization_baseline config field (default 0.20)
  • _detect_task_type() static method for dialogue vs summarization routing
  • 13 new tests in tests/test_summarization_bidir.py
  • Bidirectional FPR diagnostic benchmark with 6 baseline profiles

v3.4.0 (2026-03-09)

Fixed

  • Summarization FPR reduced from 95% to 25.5% — three-phase fix: MIN inner aggregation, direct NLI scoring (bypass vector store), w_logic=0 (eliminate h_logic==h_fact duplication), trimmed_mean outer aggregation.

Added

  • trimmed_mean outer aggregation for chunked NLI scoring
  • _use_prompt_as_premise flag — direct document→summary NLI scoring
  • Configurable nli_fact_retrieval_top_k and nli_use_prompt_as_premise config fields
  • Summarization FPR diagnostic benchmark (benchmarks/summarization_fpr_diag.py)
  • 5 new tests, TestWLogicZeroShortCircuit test class

Changed

  • Summarization profile: w_logic=0.0, w_fact=1.0, coherence_threshold=0.15, nli_fact_outer_agg="trimmed_mean", nli_use_prompt_as_premise=True
  • _heuristic_coherence short-circuits logical divergence when W_LOGIC < 1e-9

v3.3.0 (2026-03-07)

Added

  • Generated gRPC protobuf stubs from proto/director.proto
  • CoherenceAgent.aprocess() and CoherenceAgent.astream() async methods
  • CoherenceScorer.review_batch() — coalesced batch NLI (2 GPU calls when NLI available)
  • ReviewQueue — server-level continuous batching with configurable flush window
  • --cors-origins flag on director-ai serve

Changed

  • cors_origins default changed from "*" to "" (no CORS by default)
  • H_logical and H_factual computed in parallel via ThreadPoolExecutor (~40% latency reduction)

v3.2.0 (2026-03-07)

Added

  • BatchProcessor.process_batch_async() and review_batch_async() — async batch processing
  • __aiter__ on Bedrock, Gemini, Cohere guarded streams (parity with OpenAI/Anthropic)
  • VectorBackend.aadd() / aquery() async defaults via run_in_executor
  • LiteScorer.review() returning (bool, CoherenceScore) matching CoherenceScorer interface
  • Config validation: reranker_model / embedding_model non-empty when feature enabled

v3.1.0 (2026-03-07)

Added

  • Hybrid scorer hardening: NLI confidence margin fix, LLM judge verdict caching, retry with back-off
  • Enterprise modules: PostgresAuditSink, RedisGroundTruthStore
  • WASM edge runtime: CI pipeline, browser integration tutorial, overhead benchmark
  • Rust backend: PyO3 0.24 upgrade, SIMD micro-cycle vectorization
  • Vector backends: FAISS (dense search), Elasticsearch (hybrid BM25 + dense)
  • RAGTruth + FreshQA GPU benchmark, cross-platform latency profiling

v3.0.0 (2026-03-07)

Breaking Changes

  • Minimum Python 3.11 (dropped 3.10)
  • Enterprise classes moved: TenantRouter, Policy, AuditLoggerdirector_ai.enterprise
  • Removed deprecated 1.x aliases (calculate_factual_entropy, review_action, etc.)
  • Slimmed root __all__: internal classes removed from public API surface

Added

  • director_ai.enterprise package re-exporting all 5 enterprise classes
  • director-ai tune adaptive threshold calibration
  • Python 3.13 in CI matrix

v2.0.0 (2026-03-02)

Fixed

  • Case-sensitivity bug in GroundTruthStore.retrieve_context() — mixed-case keys now match
  • LLM judge error handling: bare except Exception replaced with structured try/except
  • SafetyKernel validates hard_limit in [0, 1] range
  • OTel setup_otel() is now thread-safe
  • case-studies.md code snippets corrected (wrong constructors, phantom methods)

Added

  • Named constants for LLM judge blending formula
  • .editorconfig, .pre-commit-config.yaml, py.typed PEP 561 marker
  • Documentation URL in pyproject.toml
  • Non-root user in Dockerfiles
  • Histogram bucket_counts() O(n log n) optimization
  • New tests: knowledge, kernel validation, ingest, cache
  • 12 inspect.getsource fragile tests replaced with behavioral equivalents

v1.9.0 (2026-03-02)

Added

  • Soft-halt mode: StreamingKernel(halt_mode="soft")
  • JSON structured logging: log_json=True
  • OpenTelemetry integration: otel_enabled=True
  • Request ID propagation: X-Request-ID header
  • 100-passage false-halt benchmark
  • Coverage threshold raised to 80%

v1.7.0 (2026-03-01)

Added

  • Domain presets: DirectorConfig.from_profile()
  • Structured halt evidence
  • Pluggable scorer backend: deberta, onnx, minicheck
  • Batched MiniCheck support
  • False-halt assertion in CI

v1.6.0 (2026-03-01)

Added

  • API key auth, correlation IDs, audit logging
  • Tenant routing, rate limiting
  • Streaming WebSocket oversight
  • E2E benchmarks (300 traces)
  • 835 tests

v1.5.0 — v1.5.1 (2026-03-01)

Added

  • Bidirectional chunked NLI
  • Prometheus metric compliance
  • Real streaming halt with evidence
  • RAG retrieval bench

v1.4.0 — v1.4.1 (2026-03-01)

Added

  • Batched NLI inference (10.8x speedup)
  • ONNX export + runtime (14.6 ms/pair GPU)
  • GPU Docker image, TensorRT provider

v1.3.0 (2026-03-01)

  • Default NLI model: FactCG-DeBERTa-v3-Large (75.8% balanced accuracy)

v1.2.0 — v1.2.1 (2026-02-27)

  • Score caching, LangGraph/Haystack/CrewAI integrations
  • MkDocs documentation, strict_mode, configurable weights

v1.1.0 (2026-02-27)

  • SDK Guard guard() for OpenAI/Anthropic
  • Streaming guards, HallucinationError

v1.0.0 (2026-02-26)

  • Production stable release
  • Enterprise modules: Policy, AuditLogger, TenantRouter, InputSanitizer
  • LangChain + LlamaIndex integrations