Changelog¶
See the full changelog in CHANGELOG.md on GitHub.
v3.9.4 (2026-03-20)¶
Fixed¶
- Domain profile thresholds: medical 0.75→0.30, finance 0.70→0.30, legal 0.68→0.30 (measured on PubMedQA and FinanceBench).
- Score calibration: rescales [0.25, 0.55] → [0, 1] when NLI available but no KB loaded.
- README claims verified: model attribution, hardware context, FPR 2.0%→10.5%.
- Docker: removed dead registry links, GPU Dockerfile fixed (`optimum` dep).
- Code scanning: blake2b for API key hash, pip deps pinned with hashes.
- All notebooks fixed: correct API signatures, field access patterns, thresholds.
- `__init__.py`: all PUBLIC_API.md symbols now importable from top-level.
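The score calibration entry above describes a range rescale. A minimal sketch of such a mapping follows, assuming it is linear and that values outside the input range are clamped. The function name `rescale_score` and the clamping behaviour are illustrative assumptions, not the library's confirmed implementation.

```python
def rescale_score(raw: float, lo: float = 0.25, hi: float = 0.55) -> float:
    """Linearly map a raw score from [lo, hi] onto [0, 1].

    Hypothetical helper: the changelog only states the two ranges;
    the linear form and the clamping are assumptions.
    """
    scaled = (raw - lo) / (hi - lo)
    # Clamp so that raw scores outside [lo, hi] stay within [0, 1].
    return min(1.0, max(0.0, scaled))


if __name__ == "__main__":
    for raw in (0.10, 0.25, 0.40, 0.55, 0.90):
        print(raw, rescale_score(raw))
```

Under these assumptions the midpoint of the raw range (0.40) lands at roughly 0.5, and anything at or below 0.25 maps to 0.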
v3.9.3 (2026-03-19)¶
Fixed¶
- Rust scorer word-overlap heuristic tests.
- Rust FFI borrow lifetime fix.
- License tests use env-var signing key consistently.
v3.9.2 (2026-03-19)¶
Fixed¶
- License validation hardened (UUID, HMAC, expiry).
- Cache scope isolation prevents cross-session replay.
- Tenant-aware retrieval consistency across all vector store methods.
- Batch processor catches per-item exceptions gracefully.
- Config `from_profile()` re-applies values after `__post_init__` override.
- Redis enterprise store: tenant-prefixed keys, TTL, connection handling.
- Fine-tuning lazy imports for torch/transformers.
Added¶
- 353 new tests — coverage from 81% to 90%+.
v3.9.1 (2026-03-19)¶
Fixed¶
- Cross-tenant cache replay: cache key includes `tenant_id`.
- Batch/single scoring parity: `review_batch()` routes through `review()`.
- Vector fallback cross-tenant leak: `add_fact()` prefixes tenant_id.
- streaming_oversight crash: `ingest_token()` → `check_halt()`.
- Timeout kills stream: all streaming paths catch `TimeoutError`.
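The cache replay fix above comes down to making the tenant part of the cache key. One way this could look is sketched below. The function name `make_cache_key`, the choice of blake2b, and the length-prefixed encoding are all assumptions made for illustration and are not taken from the project's code.

```python
import hashlib


def make_cache_key(tenant_id: str, prompt: str, response: str) -> str:
    """Build a cache key that is scoped to one tenant.

    Hypothetical helper. Each field is length-prefixed before hashing so
    that ("ab", "c") and ("a", "bc") cannot produce the same key.
    """
    digest = hashlib.blake2b(digest_size=16)
    for part in (tenant_id, prompt, response):
        data = part.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


if __name__ == "__main__":
    print(make_cache_key("tenant-a", "What is the dose?", "10 mg"))
    print(make_cache_key("tenant-b", "What is the dose?", "10 mg"))
```

With the tenant folded into the key, an identical prompt and response from a different tenant misses the cache instead of replaying another tenant's result.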
v3.9.0 (2026-03-15)¶
Added¶
- VerifiedScorer: sentence-level multi-signal fact verification. 5 independent signals: NLI, entity consistency, numerical consistency, negation detection, traceability (fabrication). `POST /v1/verify` endpoint.
- Document ingestion API: `POST /v1/knowledge/upload` (PDF/DOCX/HTML), `/ingest`, `/search`, CRUD by doc ID. Chunker, parser, registry modules.
- Mode selector: `--mode general|grounded|auto`. Single field replaces 8+ manual config settings.
- Dataset-type classifier: `DatasetTypeClassifier` predicts per-input NLI threshold from text features.
- ColBERT backend: late-interaction retrieval via RAGatouille for higher retrieval recall.
- Domain embedding tuner: `POST /v1/knowledge/tune-embeddings` fine-tunes embeddings on ingested documents.
- Calibrated abstention: returns neutral when retrieval confidence is below threshold.
- Readiness probe: `GET /v1/ready` returns 503 when scorer/NLI not loaded.
- `HaltMonitor` class (renamed from `SafetyKernel`, alias kept).
- LLM judge confidence now scales blend weight.
- `clear_model_cache()` for explicit GPU memory release.
Defaults Changed¶
- Hybrid retrieval (BM25 + dense + RRF) enabled by default.
- Cross-encoder reranker enabled by default.
- RAG claim decomposition enabled by default for all grounded scoring.
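For readers unfamiliar with the RRF step in the hybrid retrieval default, here is a minimal sketch of reciprocal rank fusion over two ranked lists. The function name, the document IDs, and the constant `k = 60` (a commonly used value) are assumptions; the changelog does not state which constant or tie-breaking rule the project uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID (an assumption).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]   # hypothetical lexical results
    dense_ranking = ["b", "c", "a"]  # hypothetical embedding results
    print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

Because only ranks are used, the BM25 and dense scores never need to be put on a common scale before fusing.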
Infrastructure¶
- 180 files updated to canonical SPDX 5-line headers.
- Connection pooling on PostgreSQL audit sink.
- Redis-backed rate limiting when redis_url configured.
- Latency gate tightened to 5ms avg / 15ms p95.
- NLI pipeline consistency gate in regression suite.
v3.8.0 (2026-03-14)¶
Added¶
- `score()` one-call convenience function.
- `DirectorGuard` FastAPI middleware.
- `create_proxy_app()` OpenAI-compatible guardrail proxy.
- Config profiles: medical, finance, legal, creative, customer_support, summarization, lite.
- `from_env()` environment variable configuration.
Security¶
- Input sanitizer: prompt injection, unicode escape, YAML injection detection.
- PII redactor for privacy mode.
- Constant-time API key comparison via HMAC.
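A constant-time comparison like the one named in the last item can be sketched with Python's standard `hmac` module. The helper name `api_key_matches` and the step of MAC-ing both keys before comparing are assumptions for illustration; the project's actual code may differ.

```python
import hashlib
import hmac


def api_key_matches(presented: str, expected: str, secret: bytes) -> bool:
    """Compare two API keys without leaking timing information.

    Hypothetical helper: both keys are MAC-ed with a server-side secret so
    the compared values have equal length, then checked with
    hmac.compare_digest, whose running time does not depend on where the
    first mismatching byte occurs.
    """
    presented_mac = hmac.new(secret, presented.encode("utf-8"), hashlib.sha256).digest()
    expected_mac = hmac.new(secret, expected.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(presented_mac, expected_mac)


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder value for the demo
    print(api_key_matches("key-123", "key-123", secret))
    print(api_key_matches("key-124", "key-123", secret))
```

A plain `==` on strings can return early at the first differing character, which is the timing signal this kind of comparison is meant to remove.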
v3.7.0 (2026-03-10)¶
Added¶
- Sentence-level attribution: `ClaimAttribution` dataclass maps each summary claim to the source sentence with lowest divergence. Available in `ScoringEvidence.attributions` and the `/v1/review` API response.
- Cost transparency: `ScoringEvidence.token_count` and `estimated_cost_usd` track NLI token consumption per check.
- Domain benchmarks: `medical_eval.py` (MedNLI + PubMedQA), `legal_eval.py` (ContractNLI + CUAD/RAGBench), `finance_eval.py` (FinanceBench + Financial PhraseBank).
- Fine-tuning pipeline: `finetune_nli()`, `FinetuneConfig`, `FinetuneResult`. CLI: `director-ai finetune train.jsonl`. Install: `pip install director-ai[finetune]`.
- Load testing benchmark: concurrent RPS measurement with P50/P95/P99 latency.
- `export_tensorrt()`: pre-builds TRT engine cache from ONNX model.
- CLI `director-ai export` subcommand (`--format onnx|tensorrt`).
Performance¶
- ONNX CUDA: 4.5ms/pair median (2.4x vs PyTorch 10.9ms), L4 GPU.
- ONNX FP16: 4.2ms/pair. ONNX CPU: 4.1ms/pair (competitive at batch=4).
v3.6.0 (2026-03-10)¶
Fixed¶
- Summarization FPR reduced from 10.5% to 2.0%: Layer C (claim decomposition + coverage scoring) decomposes summaries into atomic claims, scores each against source via NLI, computes coverage. Blended with Layer A: `final = 0.4 * (1 - coverage) + 0.6 * layer_a`. All three task types now below 5% FPR.
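The blend in this entry can be worked through numerically. The sketch below assumes that coverage is the fraction of claims whose NLI support score meets the support threshold (0.6) and that the blend weight is 0.4, matching the defaults listed for this release. The function names and the handling of an empty claim list are illustrative assumptions.

```python
def claim_coverage(support_scores: list[float], support_threshold: float = 0.6) -> float:
    """Fraction of claims whose support score meets the threshold.

    Assumption: a summary with no extracted claims counts as fully covered.
    """
    if not support_scores:
        return 1.0
    supported = sum(1 for score in support_scores if score >= support_threshold)
    return supported / len(support_scores)


def blended_divergence(coverage: float, layer_a: float, alpha: float = 0.4) -> float:
    """final = alpha * (1 - coverage) + (1 - alpha) * layer_a."""
    return alpha * (1.0 - coverage) + (1.0 - alpha) * layer_a


if __name__ == "__main__":
    scores = [0.9, 0.7, 0.2, 0.65]       # hypothetical per-claim support scores
    coverage = claim_coverage(scores)    # 3 of 4 claims supported
    print(coverage, blended_divergence(coverage, layer_a=0.1))
```

In this made-up example three of four claims are supported, so coverage is 0.75 and the blended value is 0.4 * 0.25 + 0.6 * 0.1 = 0.16.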
Added¶
- `NLIScorer.score_claim_coverage()` method
- Config: `nli_claim_coverage_enabled`, `nli_claim_support_threshold` (0.6), `nli_claim_coverage_alpha` (0.4)
- `ScoringEvidence` includes `claim_coverage`, `per_claim_divergences`, `claims`
- 21 new tests (2084 total)
- Claim coverage FPR diagnostic benchmark
v3.5.0 (2026-03-10)¶
Fixed¶
- Summarization FPR reduced from 25.5% to 10.5% — bidirectional NLI scores both source→summary and summary→source, takes min. Baseline calibration (0.20) shifts expected NLI noise to zero.
- Dialogue FPR reduced from 97.5% to 4.5% — bidirectional NLI + baseline=0.80.
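The two fixes above share one idea: score both directions, keep the smaller divergence, then subtract a baseline. A rough sketch follows, assuming the baseline is subtracted and the result floored at zero. The function name and the exact calibration form are assumptions, and the real method may also rescale the result.

```python
def bidirectional_divergence(forward: float, backward: float, baseline: float = 0.20) -> float:
    """Combine two directional NLI divergences into one calibrated value.

    forward:  divergence for source -> summary
    backward: divergence for summary -> source
    The minimum of the two is taken, then the baseline (expected NLI noise)
    is subtracted and the result is floored at zero. Illustrative only.
    """
    raw = min(forward, backward)
    return max(0.0, raw - baseline)


if __name__ == "__main__":
    # Summarization baseline (0.20) and dialogue baseline (0.80) from the entries above.
    print(bidirectional_divergence(0.50, 0.30))
    print(bidirectional_divergence(0.85, 0.95, baseline=0.80))
```

Taking the minimum means a response is only penalised when both directions disagree with the source, which is one plausible reason the false-positive rate drops.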
Added¶
- `_summarization_factual_divergence()` method with bidirectional NLI scoring
- `nli_summarization_baseline` config field (default 0.20)
- `_detect_task_type()` static method for dialogue vs summarization routing
- 13 new tests in `tests/test_summarization_bidir.py`
- Bidirectional FPR diagnostic benchmark with 6 baseline profiles
v3.4.0 (2026-03-09)¶
Fixed¶
- Summarization FPR reduced from 95% to 25.5% — three-phase fix: MIN inner aggregation, direct NLI scoring (bypass vector store), w_logic=0 (eliminate h_logic==h_fact duplication), trimmed_mean outer aggregation.
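As an illustration of the trimmed-mean outer aggregation mentioned above, the sketch below drops a fixed fraction of the lowest and highest per-chunk scores before averaging. The trim fraction and the function signature are assumptions; the changelog does not state the values the project uses.

```python
def trimmed_mean(values: list[float], trim_fraction: float = 0.1) -> float:
    """Mean of the values after dropping the lowest and highest tails.

    trim_fraction is the share removed from EACH end (hypothetical default).
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of items dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    chunk_scores = [0.1, 0.2, 0.3, 0.4, 5.0]  # one outlier chunk
    print(trimmed_mean(chunk_scores, trim_fraction=0.2))
```

Compared with a plain mean, a single outlier chunk has much less influence on the aggregated score.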
Added¶
- `trimmed_mean` outer aggregation for chunked NLI scoring
- `_use_prompt_as_premise` flag: direct document→summary NLI scoring
- Configurable `nli_fact_retrieval_top_k` and `nli_use_prompt_as_premise` config fields
- Summarization FPR diagnostic benchmark (`benchmarks/summarization_fpr_diag.py`)
- 5 new tests, `TestWLogicZeroShortCircuit` test class
Changed¶
- Summarization profile: `w_logic=0.0, w_fact=1.0`, `coherence_threshold=0.15`, `nli_fact_outer_agg="trimmed_mean"`, `nli_use_prompt_as_premise=True`
- `_heuristic_coherence` short-circuits logical divergence when `W_LOGIC < 1e-9`
v3.3.0 (2026-03-07)¶
Added¶
- Generated gRPC protobuf stubs from `proto/director.proto`
- `CoherenceAgent.aprocess()` and `CoherenceAgent.astream()` async methods
- `CoherenceScorer.review_batch()`: coalesced batch NLI (2 GPU calls when NLI available)
- `ReviewQueue`: server-level continuous batching with configurable flush window
- `--cors-origins` flag on `director-ai serve`
Changed¶
- `cors_origins` default changed from `"*"` to `""` (no CORS by default)
- H_logical and H_factual computed in parallel via `ThreadPoolExecutor` (~40% latency reduction)
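The parallel computation described in the last item can be sketched with the standard library. The two scoring functions below are placeholders returning fixed values, and all names are hypothetical. Only the use of `ThreadPoolExecutor` comes from the changelog.

```python
from concurrent.futures import ThreadPoolExecutor


def logical_divergence(prompt: str, response: str) -> float:
    """Placeholder for the H_logical computation (fixed value for the demo)."""
    return 0.2


def factual_divergence(prompt: str, response: str) -> float:
    """Placeholder for the H_factual computation (fixed value for the demo)."""
    return 0.4


def score_in_parallel(prompt: str, response: str) -> tuple[float, float]:
    """Run both divergence computations concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        logic_future = pool.submit(logical_divergence, prompt, response)
        fact_future = pool.submit(factual_divergence, prompt, response)
        # result() blocks until each task finishes and re-raises its exception, if any.
        return logic_future.result(), fact_future.result()


if __name__ == "__main__":
    print(score_in_parallel("What is the capital of France?", "Paris."))
```

Threads help here only when the two computations spend most of their time outside the interpreter lock, for example in model inference or I/O.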
v3.2.0 (2026-03-07)¶
Added¶
- `BatchProcessor.process_batch_async()` and `review_batch_async()`: async batch processing
- `__aiter__` on Bedrock, Gemini, Cohere guarded streams (parity with OpenAI/Anthropic)
- `VectorBackend.aadd()` / `aquery()` async defaults via `run_in_executor`
- `LiteScorer.review()` returning `(bool, CoherenceScore)` matching `CoherenceScorer` interface
- Config validation: `reranker_model` / `embedding_model` non-empty when feature enabled
v3.1.0 (2026-03-07)¶
Added¶
- Hybrid scorer hardening: NLI confidence margin fix, LLM judge verdict caching, retry with back-off
- Enterprise modules: `PostgresAuditSink`, `RedisGroundTruthStore`
- WASM edge runtime: CI pipeline, browser integration tutorial, overhead benchmark
- Rust backend: PyO3 0.24 upgrade, SIMD micro-cycle vectorization
- Vector backends: FAISS (dense search), Elasticsearch (hybrid BM25 + dense)
- RAGTruth + FreshQA GPU benchmark, cross-platform latency profiling
v3.0.0 (2026-03-07)¶
Breaking Changes¶
- Minimum Python 3.11 (dropped 3.10)
- Enterprise classes moved: `TenantRouter`, `Policy`, `AuditLogger` → `director_ai.enterprise`
- Removed deprecated 1.x aliases (`calculate_factual_entropy`, `review_action`, etc.)
- Slimmed root `__all__`: internal classes removed from public API surface
Added¶
- `director_ai.enterprise` package re-exporting all 5 enterprise classes
- `director-ai tune` adaptive threshold calibration
- Python 3.13 in CI matrix
v2.0.0 (2026-03-02)¶
Fixed¶
- Case-sensitivity bug in `GroundTruthStore.retrieve_context()`: mixed-case keys now match
- LLM judge error handling: bare `except Exception` replaced with structured try/except
- `SafetyKernel` validates `hard_limit` in [0, 1] range
- OTel `setup_otel()` is now thread-safe
- `case-studies.md` code snippets corrected (wrong constructors, phantom methods)
Added¶
- Named constants for LLM judge blending formula
- `.editorconfig`, `.pre-commit-config.yaml`, `py.typed` PEP 561 marker
- Documentation URL in `pyproject.toml`
- Non-root user in Dockerfiles
- Histogram `bucket_counts()` O(n log n) optimization
- New tests: knowledge, kernel validation, ingest, cache
- 12 fragile `inspect.getsource` tests replaced with behavioral equivalents
v1.9.0 (2026-03-02)¶
Added¶
- Soft-halt mode: `StreamingKernel(halt_mode="soft")`
- JSON structured logging: `log_json=True`
- OpenTelemetry integration: `otel_enabled=True`
- Request ID propagation: `X-Request-ID` header
- 100-passage false-halt benchmark
- Coverage threshold raised to 80%
v1.7.0 (2026-03-01)¶
Added¶
- Domain presets: `DirectorConfig.from_profile()`
- Structured halt evidence
- Pluggable scorer backend: `deberta`, `onnx`, `minicheck`
- Batched MiniCheck support
- False-halt assertion in CI
v1.6.0 (2026-03-01)¶
Added¶
- API key auth, correlation IDs, audit logging
- Tenant routing, rate limiting
- Streaming WebSocket oversight
- E2E benchmarks (300 traces)
- 835 tests
v1.5.0 — v1.5.1 (2026-03-01)¶
Added¶
- Bidirectional chunked NLI
- Prometheus metric compliance
- Real streaming halt with evidence
- RAG retrieval bench
v1.4.0 — v1.4.1 (2026-03-01)¶
Added¶
- Batched NLI inference (10.8x speedup)
- ONNX export + runtime (14.6 ms/pair GPU)
- GPU Docker image, TensorRT provider
v1.3.0 (2026-03-01)¶
- Default NLI model: FactCG-DeBERTa-v3-Large (75.8% balanced accuracy)
v1.2.0 — v1.2.1 (2026-02-27)¶
- Score caching, LangGraph/Haystack/CrewAI integrations
- MkDocs documentation, strict_mode, configurable weights
v1.1.0 (2026-02-27)¶
- SDK Guard `guard()` for OpenAI/Anthropic
- Streaming guards, `HallucinationError`
v1.0.0 (2026-02-26)¶
- Production stable release
- Enterprise modules: Policy, AuditLogger, TenantRouter, InputSanitizer
- LangChain + LlamaIndex integrations