Knowledge Base Ingestion
Director-AI scores responses against a knowledge base (KB). The KB provides the "ground truth" that the scorer compares responses to. Without a KB, NLI still works (using the prompt as premise), but factual divergence (H_factual) degrades to a neutral 0.5. Loading domain facts substantially improves scoring discrimination.
Option 1: Simple Key-Value Store
For small KBs (< 1000 facts), use GroundTruthStore directly:
from director_ai import GroundTruthStore, CoherenceScorer
store = GroundTruthStore() # starts empty — add your own facts
store.add("boiling_point", "Water boils at 100°C at standard atmospheric pressure.")
store.add("speed_of_light", "The speed of light in vacuum is 299,792 km/s.")
store.add("capital_france", "The capital of France is Paris.")
scorer = CoherenceScorer(
    threshold=0.5,
    ground_truth_store=store,
)
approved, score = scorer.review(
    "What temperature does water boil at?",
    "Water boils at 100 degrees Celsius at standard pressure.",
)
# approved=True, score.score ≈ 0.9+
Facts are matched by key similarity to the prompt. The scorer retrieves the best-matching fact and computes word overlap (heuristic) or NLI entailment (when use_nli=True).
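To make the heuristic path concrete, here is a minimal standalone sketch of key matching followed by word overlap. It is an illustration only, not Director-AI's actual implementation: the function names tokenize, best_matching_key, and word_overlap are hypothetical, and the real scorer may tokenize and weight terms differently.

```python
import re
from typing import Optional


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_matching_key(prompt: str, keys: list[str]) -> Optional[str]:
    """Return the key sharing the most tokens with the prompt, or None if none overlap."""
    prompt_tokens = tokenize(prompt)
    # Underscores are not matched by the token pattern, so "boiling_point"
    # splits into {"boiling", "point"}.
    scored = [(len(tokenize(key) & prompt_tokens), key) for key in keys]
    best_count, best_key = max(scored, default=(0, None))
    return best_key if best_count > 0 else None


def word_overlap(fact: str, response: str) -> float:
    """Fraction of the fact's tokens that also appear in the response (0.0 to 1.0)."""
    fact_tokens = tokenize(fact)
    if not fact_tokens:
        return 0.0
    return len(fact_tokens & tokenize(response)) / len(fact_tokens)


# Hypothetical stand-in for the facts added to the store above.
facts = {
    "boiling_point": "Water boils at 100°C at standard atmospheric pressure.",
    "speed_of_light": "The speed of light in vacuum is 299,792 km/s.",
    "capital_france": "The capital of France is Paris.",
}
key = best_matching_key("What is the boiling point of water?", list(facts))
overlap = word_overlap(
    facts[key], "Water boils at 100 degrees Celsius at standard pressure."
)
```

A plain token-set overlap like this rewards shared vocabulary rather than meaning, which is one reason to prefer NLI entailment when paraphrase matters.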
Option 2: Grounded Retrieval Recipe for RAG Pipelines
For larger KBs or when you need semantic search, build the store through
DirectorConfig. This exposes the default grounded recipe as an explicit
contract:
- dense retrieval from the configured vector backend
- BM25 sparse retrieval fused with dense retrieval by Reciprocal Rank Fusion (hybrid_rrf_k=60)
- cross-encoder reranking when director-ai[reranker] dependencies are installed
- retrieval abstention for low-quality matches instead of false confidence (retrieval_abstention_threshold=0.3 by default)
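The fusion step in the recipe above can be illustrated with the standard Reciprocal Rank Fusion formula, where each document scores the sum of 1 / (k + rank) over every ranking it appears in. This is a generic sketch of the technique with k=60 to mirror hybrid_rrf_k. The function name and document ids are made up for illustration, and how Director-AI wires the fusion internally is not shown here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits], k=60)
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks enter the formula, the dense and sparse scores never need to be put on a common scale. A larger k flattens the difference between adjacent ranks.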
Use the in-memory backend for local development and test corpora. Use a persistent backend such as ChromaDB for production storage.
from director_ai.core.config import DirectorConfig
cfg = DirectorConfig(
    mode="grounded",
    vector_backend="chroma",
    chroma_persist_dir="/data/chroma",
)
store = cfg.build_store()
# Ingest documents (splits into chunks, embeds, indexes)
store.ingest([
    "Water boils at 100°C at standard atmospheric pressure.",
    "The speed of light in vacuum is 299,792 km/s.",
    "DNA has four nucleotide bases: adenine, thymine, guanine, cytosine.",
])
# Or add individual facts
store.add_fact("gravity", "Earth's gravitational acceleration is 9.81 m/s².")
scorer = cfg.build_scorer(store=store)
For user-supplied or externally synchronized facts, place
ConflictAwareKnowledgeGuard in front of the store so contradictory updates are
checked before they become retrievable context:
from director_ai import ConflictAwareKnowledgeGuard, KnowledgeFact
# contradiction_score is a caller-supplied scoring callable; it is not defined in this snippet
guard = ConflictAwareKnowledgeGuard(store, score_fn=contradiction_score)
result = guard.add_fact(
    KnowledgeFact(
        key="refund_policy",
        value="Refunds are unavailable after delivery confirmation.",
        metadata={"contradicts": "claim-refund-v1"},
    )
)
if result.decision == "block":
    raise RuntimeError("KB update rejected before retrieval admission")
Conflict reports carry hashes and evidence references, not raw fact text. Use this guard at ingestion boundaries where tenant users, signed facts, passport claims, or upstream synchronizers can introduce mutually incompatible facts.
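As a rough illustration of what a report that carries hashes and evidence references instead of raw fact text might look like, consider the sketch below. Everything in it is hypothetical: ConflictReportSketch, its field names, and the evidence reference format are assumptions made for the example and are not the library's actual report schema.

```python
import hashlib
from dataclasses import asdict, dataclass


def fact_digest(value: str) -> str:
    """Return the SHA-256 hex digest of a fact's text."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ConflictReportSketch:
    """Hypothetical report shape: identifies both facts without storing their text."""

    key: str
    incoming_hash: str
    existing_hash: str
    evidence_ref: str


def build_report(
    key: str, incoming: str, existing: str, evidence_ref: str
) -> ConflictReportSketch:
    """Build a report that references the conflicting facts by digest only."""
    return ConflictReportSketch(
        key=key,
        incoming_hash=fact_digest(incoming),
        existing_hash=fact_digest(existing),
        evidence_ref=evidence_ref,
    )


incoming_text = "Refunds are unavailable after delivery confirmation."
existing_text = "Refunds within 30 days of purchase."
report = build_report(
    "refund_policy", incoming_text, existing_text, evidence_ref="claim-refund-v1"
)
```

A report shaped like this can be logged or forwarded across tenant boundaries while the fact text stays in the store. Anyone holding the original text can still confirm which fact was involved by recomputing the digest.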
Inspect the active recipe without exposing API keys:
Production deployments should still benchmark the configured corpus and domain thresholds before enforcement. The default recipe is a strong retrieval baseline, not a substitute for corpus-specific calibration.
Option 3: ChromaDB Persistent Backend
from director_ai.core.vector_store import ChromaBackend, VectorGroundTruthStore
backend = ChromaBackend(
    collection_name="prod_facts",
    persist_directory="/data/chroma",
    embedding_model="BAAI/bge-large-en-v1.5",
)
store = VectorGroundTruthStore(backend=backend)
# Ingest once — data persists across restarts
store.ingest(your_documents)
# Later, just connect to the same persist_directory
Install ChromaDB:
Multi-Tenant KB
Use TenantRouter to isolate KBs per tenant:
from director_ai.enterprise import TenantRouter
router = TenantRouter()
# Each tenant gets an isolated KB
router.add_fact("acme_corp", "refund_policy", "Refunds within 30 days of purchase.")
router.add_fact("globex", "refund_policy", "No refunds after delivery confirmation.")
# Get a scorer scoped to one tenant
scorer = router.get_scorer("acme_corp", threshold=0.6)
# This scorer only sees acme_corp's facts
approved, score = scorer.review(
    "What is the refund policy?",
    "You can get a refund within 30 days.",
)
When using the Director-AI server with tenant_routing=True, pass the X-Tenant-ID header to route requests to the correct KB:
curl -X POST http://localhost:8080/v1/review \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme_corp" \
  -d '{"prompt": "refund policy?", "response": "Refunds within 30 days."}'
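The same request can be built from Python with only the standard library. This sketch mirrors the curl call above. The helper name build_review_request is hypothetical, and the base URL is assumed to be the local server from the example. The request is constructed but not sent, so the block runs without a server.

```python
import json
import urllib.request


def build_review_request(
    tenant_id: str,
    prompt: str,
    response: str,
    base_url: str = "http://localhost:8080",
) -> urllib.request.Request:
    """Build a POST to /v1/review that is routed by the X-Tenant-ID header."""
    body = json.dumps({"prompt": prompt, "response": response}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/review",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": tenant_id,
        },
        method="POST",
    )


req = build_review_request("acme_corp", "refund policy?", "Refunds within 30 days.")
# To send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         result = json.load(resp)
```

Keeping the tenant id in a header rather than in the JSON body lets the routing decision happen before the payload is parsed.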
Ingestion Best Practices
- Chunk size: keep documents under 500 tokens each. Long documents dilute retrieval precision.
- Deduplication: avoid ingesting the same content twice; it inflates retrieval scores without adding information.
- Metadata: use add_fact(key, value) with meaningful keys for small KBs. The key helps with retrieval matching.
- Incremental sync: keep a stable upstream document id and call PUT /v1/knowledge/documents/{doc_id} when the upstream file changes. Director-AI stores a SHA-256 content_hash; unchanged PUT requests return unchanged: true without deleting chunks or re-embedding.
- Update strategy: use PUT /v1/knowledge/documents/{doc_id} to replace one document and DELETE /v1/knowledge/documents/{doc_id} to remove its chunks. Re-ingest the full corpus only when the embedding model, chunking policy, or parser configuration changes.
- Test coverage: run python -m benchmarks.regression_suite after KB changes to verify the scorer still behaves correctly.
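The chunk-size, deduplication, and incremental-sync points above can be sketched as a small client-side preprocessing step. This is a sketch under stated assumptions: whitespace-separated words stand in for model tokens, the helper names are hypothetical, and the server already performs its own content_hash comparison, so needs_sync is an optional pre-check and not a required part of the protocol.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the text, used to detect identical content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_by_words(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def deduplicate(chunks: list[str]) -> list[str]:
    """Drop chunks that repeat an earlier one, ignoring case and extra whitespace."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        digest = content_hash(" ".join(chunk.split()).lower())
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


def needs_sync(doc_id: str, text: str, known_hashes: dict[str, str]) -> bool:
    """True when the document is new or its content differs from the recorded hash."""
    return known_hashes.get(doc_id) != content_hash(text)


# Hypothetical record of what was last uploaded, keyed by upstream document id.
known = {"doc-1": content_hash("Refunds within 30 days of purchase.")}
changed = needs_sync("doc-1", "No refunds after delivery confirmation.", known)
```

Word counts usually undercount model tokens, so a limit somewhat below 500 words leaves headroom for the tokenizer actually in use.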