Document Ingestion Pipeline
DocumentIngestionPipeline is the reusable Python API behind document
ingestion workflows. It parses bytes, chunks text, writes chunks into a
VectorGroundTruthStore, and keeps a DocRegistry in sync for update and
delete operations.
from director_ai.core.ingestion import DocumentIngestionPipeline
from director_ai.core.retrieval.vector_store import VectorGroundTruthStore

store = VectorGroundTruthStore()
pipeline = DocumentIngestionPipeline(store=store)

result = pipeline.ingest_bytes(
    b"Refund policy: 30 days.",
    filename="policy.txt",
    doc_id="refund-policy",
    source="policy.txt",
    tenant_id="acme",
)
print(result.chunk_ids)
Use update_text() for replacement sync. If the content hash is unchanged, the
pipeline returns unchanged=True and does not re-embed. If content changes, new
chunks are staged before old chunks are removed, so a failed replacement does
not silently orphan the document.
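The replacement behaviour described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: ToyStore, ToyRecord, toy_update, the fixed-width chunking, and the use of SHA-256 are all assumptions made for illustration. It only shows the general pattern of skipping work when the content hash matches and staging new chunks before removing old ones.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for a vector store (no embeddings)."""
    chunks: dict[str, str] = field(default_factory=dict)

    def add(self, chunk_id: str, text: str) -> None:
        self.chunks[chunk_id] = text

    def remove(self, chunk_id: str) -> None:
        self.chunks.pop(chunk_id, None)


@dataclass
class ToyRecord:
    """Hypothetical registry entry: content hash plus the chunk ids it owns."""
    content_hash: str
    chunk_ids: list[str]


def toy_update(store, registry, doc_id, text, chunk_size=16):
    """Replace a document's chunks; return (chunk_ids, unchanged)."""
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    old = registry.get(doc_id)

    # Same content hash: nothing to re-chunk or re-store.
    if old is not None and old.content_hash == new_hash:
        return old.chunk_ids, True

    # Stage the new chunks under ids that cannot collide with the old ones.
    prefix = new_hash[:8]
    staged = []
    try:
        for start in range(0, len(text), chunk_size):
            chunk_id = f"{doc_id}:{prefix}:{start // chunk_size}"
            store.add(chunk_id, text[start:start + chunk_size])
            staged.append(chunk_id)
    except Exception:
        # Roll back the partial staging; the old chunks were never touched.
        for chunk_id in staged:
            store.remove(chunk_id)
        raise

    # Only after staging succeeded: switch the registry, then drop old chunks.
    registry[doc_id] = ToyRecord(new_hash, staged)
    if old is not None:
        for chunk_id in old.chunk_ids:
            store.remove(chunk_id)
    return staged, False


store, registry = ToyStore(), {}
toy_update(store, registry, "refund-policy", "Refund policy: 30 days.")
_, unchanged = toy_update(store, registry, "refund-policy", "Refund policy: 30 days.")
print(unchanged)
```

Because the old chunks are removed only after the new ones are stored and registered, a failure during staging leaves the previous version of the document fully intact in this sketch.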
Use delete() to remove both registry metadata and vector-store chunks.
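As a rough illustration of keeping the two sides in sync on deletion, the sketch below uses plain dictionaries in place of the vector store and the registry. The name toy_delete and the data layout are hypothetical, and the real delete() signature and return type are not shown here.

```python
def toy_delete(store: dict[str, str], registry: dict[str, list[str]], doc_id: str) -> list[str]:
    """Remove a document's chunks and registry entry; return the removed chunk ids."""
    chunk_ids = registry[doc_id]  # raises KeyError for an unregistered document

    # Remove chunks first, so that if removal fails partway the registry entry
    # still lists the document's chunks and the delete can be retried.
    for chunk_id in chunk_ids:
        store.pop(chunk_id, None)

    del registry[doc_id]
    return chunk_ids


store = {"a:0": "Refund policy:", "a:1": "30 days.", "b:0": "Shipping: 5 days."}
registry = {"a": ["a:0", "a:1"], "b": ["b:0"]}
removed = toy_delete(store, registry, "a")
print(removed, sorted(store), sorted(registry))
```

Removing chunks before the registry entry is a choice made for this sketch only; the library may order these steps differently.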
director_ai.core.ingestion.IngestionConfig
dataclass
director_ai.core.ingestion.IngestionResult
dataclass
IngestionResult(doc_id: str, source: str, tenant_id: str, chunk_ids: list[str], content_hash: str, unchanged: bool = False)
Metadata returned after ingesting or updating one document.
director_ai.core.ingestion.DeletedDocument
dataclass
Metadata returned after deleting one ingested document.
director_ai.core.ingestion.DocumentIngestionPipeline
DocumentIngestionPipeline(*, store: VectorGroundTruthStore, registry: DocRegistry | None = None, config: IngestionConfig | None = None)
Parse, chunk, store, update, and delete documents for a vector store.
ingest_bytes
ingest_bytes(content: bytes, *, filename: str, doc_id: str | None = None, source: str | None = None, tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult
Parse bytes by filename, then ingest the resulting text.
ingest_text
ingest_text(text: str, *, doc_id: str | None = None, source: str = 'text', tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult
Chunk and store a new text document.
update_text
update_text(text: str, *, doc_id: str, source: str = 'text', tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult
Replace a document's chunks while preserving registry identity.
delete
Delete a registered document and all stored chunks.