Knowledge Base Ingestion
Director-AI scores responses against a knowledge base (KB). The KB provides the "ground truth" that the scorer compares responses to. Without a KB, NLI still works (using the prompt as premise), but factual divergence (H_factual) degrades to a neutral 0.5. Loading domain facts substantially improves scoring discrimination.
Option 1: Simple Key-Value Store
For small KBs (< 1000 facts), use GroundTruthStore directly:
from director_ai import GroundTruthStore, CoherenceScorer
store = GroundTruthStore() # starts empty — add your own facts
store.add("boiling_point", "Water boils at 100°C at standard atmospheric pressure.")
store.add("speed_of_light", "The speed of light in vacuum is 299,792 km/s.")
store.add("capital_france", "The capital of France is Paris.")
scorer = CoherenceScorer(
    threshold=0.5,
    ground_truth_store=store,
)
approved, score = scorer.review(
    "What temperature does water boil at?",
    "Water boils at 100 degrees Celsius at standard pressure.",
)
# approved=True, score.score ≈ 0.9+
Facts are matched by key similarity to the prompt. The scorer retrieves the best-matching fact and computes word overlap (heuristic) or NLI entailment (when use_nli=True).
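The retrieve-then-compare flow in heuristic mode can be illustrated with a minimal, self-contained sketch. It is not Director-AI's implementation: the helper names (`tokenize`, `overlap`, `best_fact`), the choice of Jaccard overlap, and the decision to match the prompt against key and fact text combined are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the word sets of two strings (0.0 to 1.0)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_fact(prompt: str, facts: dict[str, str]) -> tuple[str, str]:
    """Return the (key, value) pair that best overlaps the prompt.

    Assumption: the key (underscores read as spaces) and the fact text
    are compared against the prompt together.
    """
    def score(item: tuple[str, str]) -> float:
        key, value = item
        return overlap(prompt, key.replace("_", " ") + " " + value)

    return max(facts.items(), key=score)


facts = {
    "boiling_point": "Water boils at 100°C at standard atmospheric pressure.",
    "speed_of_light": "The speed of light in vacuum is 299,792 km/s.",
    "capital_france": "The capital of France is Paris.",
}

# Step 1: retrieve the fact closest to the prompt.
key, fact = best_fact("What temperature does water boil at?", facts)

# Step 2: compare the response against the retrieved fact.
response = "Water boils at 100 degrees Celsius at standard pressure."
similarity = overlap(fact, response)
print(key, round(similarity, 2))
```

Word overlap only rewards shared surface tokens, so a paraphrase with different vocabulary scores low even when it is correct. That limitation is the usual motivation for switching to NLI entailment.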
Option 2: Vector Store for RAG Pipelines
For larger KBs or when you need semantic search, use VectorGroundTruthStore:
from director_ai import VectorGroundTruthStore, CoherenceScorer
store = VectorGroundTruthStore()
# Ingest documents (splits into chunks, embeds, indexes)
store.ingest([
    "Water boils at 100°C at standard atmospheric pressure.",
    "The speed of light in vacuum is 299,792 km/s.",
    "DNA has four nucleotide bases: adenine, thymine, guanine, cytosine.",
])
# Or add individual facts
store.add_fact("gravity", "Earth's gravitational acceleration is 9.81 m/s².")
scorer = CoherenceScorer(
    threshold=0.6,
    use_nli=True,
    ground_truth_store=store,
)
The in-memory backend works for up to ~10K documents. For production, use ChromaDB.
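To make the idea of an in-memory vector index concrete, here is a toy sketch of embed, store, and query by cosine similarity. It is an illustration only: `TinyVectorIndex` is a hypothetical class, and the bag-of-words `embed` function stands in for a real embedding model. It does not describe how VectorGroundTruthStore works internally.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class TinyVectorIndex:
    """Hypothetical in-memory index: keeps (vector, text) pairs in a list."""

    def __init__(self) -> None:
        self._entries: list[tuple[Counter, str]] = []

    def ingest(self, documents: list[str]) -> None:
        """Embed each document once and keep it in memory."""
        for doc in documents:
            self._entries.append((embed(doc), doc))

    def query(self, text: str, top_k: int = 1) -> list[tuple[float, str]]:
        """Score every stored document against the query (linear scan)."""
        q = embed(text)
        scored = [(cosine(q, vec), doc) for vec, doc in self._entries]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


index = TinyVectorIndex()
index.ingest([
    "Water boils at 100°C at standard atmospheric pressure.",
    "The speed of light in vacuum is 299,792 km/s.",
    "DNA has four nucleotide bases: adenine, thymine, guanine, cytosine.",
])
hits = index.query("How many nucleotide bases does DNA have?")
print(hits[0][1])
```

A linear scan like this touches every stored vector on each query and loses its data when the process exits. Those are the general reasons to move to a persistent, indexed backend as a collection grows.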
Option 3: ChromaDB Persistent Backend
from director_ai.core.vector_store import ChromaBackend, VectorGroundTruthStore
backend = ChromaBackend(
    collection_name="prod_facts",
    persist_directory="/data/chroma",
    embedding_model="BAAI/bge-large-en-v1.5",
)
store = VectorGroundTruthStore(backend=backend)
# Ingest once — data persists across restarts
store.ingest(your_documents)
# Later, just connect to the same persist_directory
This backend requires ChromaDB to be installed.
Multi-Tenant KB
Use TenantRouter to isolate KBs per tenant:
from director_ai.enterprise import TenantRouter
router = TenantRouter()
# Each tenant gets an isolated KB
router.add_fact("acme_corp", "refund_policy", "Refunds within 30 days of purchase.")
router.add_fact("globex", "refund_policy", "No refunds after delivery confirmation.")
# Get a scorer scoped to one tenant
scorer = router.get_scorer("acme_corp", threshold=0.6)
# This scorer only sees acme_corp's facts
approved, score = scorer.review(
    "What is the refund policy?",
    "You can get a refund within 30 days.",
)
When using the Director-AI server with tenant_routing=True, pass the X-Tenant-ID header to route requests to the correct KB:
curl -X POST http://localhost:8080/v1/review \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme_corp" \
  -d '{"prompt": "refund policy?", "response": "Refunds within 30 days."}'
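The same request can be built from Python with only the standard library. The sketch below reuses the URL, headers, and payload of the curl call above. `build_review_request` is a hypothetical helper, and the commented-out send step assumes the server replies with JSON, which the text above does not specify.

```python
import json
import urllib.request


def build_review_request(
    base_url: str, tenant_id: str, prompt: str, response: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/review with the tenant header set."""
    body = json.dumps({"prompt": prompt, "response": response}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/review",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": tenant_id,  # selects which tenant's KB is used
        },
    )


request = build_review_request(
    "http://localhost:8080", "acme_corp", "refund policy?", "Refunds within 30 days."
)
print(request.get_method(), request.full_url)

# Sending requires a running server, so it is left commented out.
# Assumption: the reply body is JSON.
# with urllib.request.urlopen(request, timeout=10) as reply:
#     print(json.load(reply))
```

Setting the header in one helper makes it harder to send a request without a tenant ID by accident.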
Ingestion Best Practices
- Chunk size: keep documents under 500 tokens each. Long documents dilute retrieval precision.
- Deduplication: avoid ingesting the same content twice; it inflates retrieval scores without adding information.
- Metadata: use add_fact(key, value) with meaningful keys for small KBs. The key helps with retrieval matching.
- Update strategy: re-ingest the full corpus when facts change. VectorGroundTruthStore does not support incremental deletion (yet).
- Test coverage: run python -m benchmarks.regression_suite after KB changes to verify the scorer still behaves correctly.
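The chunk-size and deduplication guidelines can be applied in a small preprocessing step before ingestion. The sketch below is one possible approach, not part of Director-AI: the helper names are hypothetical, and it counts whitespace-separated words as a rough proxy for tokens. The default limit of 350 words is an assumption chosen to stay conservatively under 500 tokens; the real ratio depends on the tokenizer.

```python
import hashlib
import re


def chunk_words(text: str, max_words: int = 350) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prepare(documents: list[str], max_words: int = 350) -> list[str]:
    """Chunk every document, then drop chunks whose normalized text was already seen."""
    seen: set[str] = set()
    chunks: list[str] = []
    for doc in documents:
        for chunk in chunk_words(doc, max_words):
            digest = fingerprint(chunk)
            if digest not in seen:
                seen.add(digest)
                chunks.append(chunk)
    return chunks


docs = [
    "Water boils at 100°C at standard atmospheric pressure.",
    "water boils at 100°C  at standard atmospheric pressure.",  # duplicate up to case and spacing
    " ".join(f"w{i}" for i in range(800)),  # long document that needs splitting
]
chunks = prepare(docs)
print(len(chunks), [len(c.split()) for c in chunks])
```

The resulting list can then be passed to store.ingest(chunks). Hash-based deduplication only catches copies that are identical after normalization; paraphrased duplicates would need a similarity check instead.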