Document Ingestion Pipeline

DocumentIngestionPipeline is the reusable Python API behind document ingestion workflows. It parses bytes, chunks text, writes chunks into a VectorGroundTruthStore, and keeps a DocRegistry in sync for update and delete operations.

from director_ai.core.ingestion import DocumentIngestionPipeline
from director_ai.core.retrieval.vector_store import VectorGroundTruthStore

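# Create the vector store and attach the pipeline to it.
store = VectorGroundTruthStore()
pipeline = DocumentIngestionPipeline(store=store)

# Parse the raw bytes by filename, chunk the text, and store the chunks.
result = pipeline.ingest_bytes(
    b"Refund policy: 30 days.",
    filename="policy.txt",
    doc_id="refund-policy",
    source="policy.txt",
    tenant_id="acme",
)

# IDs of the chunks written for this document.
print(result.chunk_ids)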

Use update_text() to replace the content of an existing document. If the content hash is unchanged, the pipeline returns a result with unchanged=True and does not re-embed. If the content has changed, new chunks are staged before old chunks are removed, so a failed replacement does not silently orphan the document.
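The hash check and the stage-then-remove ordering described above can be illustrated with a small in-memory model. This is a sketch only, not the library's implementation: ToyIndex, toy_update_text, the embed callback, the line-based chunking, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical in-memory stand-in for a store plus registry."""
    chunks: dict = field(default_factory=dict)    # chunk_id -> chunk text
    registry: dict = field(default_factory=dict)  # doc_id -> (hash, [chunk_id, ...])
    generation: int = 0                           # keeps new chunk ids distinct from old ones


def toy_update_text(index, doc_id, text, embed=lambda piece: None):
    """Return True if chunks were replaced, False if the content was unchanged."""
    # SHA-256 is an assumption; the source only says "content hash".
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    old_hash, old_ids = index.registry.get(doc_id, (None, []))
    if old_hash == content_hash:
        return False  # unchanged: skip chunking and embedding entirely

    # Stage the new chunks first. If embed() raises here, nothing has been
    # written yet and the old chunks are still fully in place.
    index.generation += 1
    staged = {}
    for n, piece in enumerate(text.split("\n")):
        embed(piece)
        staged[f"{doc_id}-g{index.generation}-{n}"] = piece

    # Commit the staged chunks, then remove the old ones.
    index.chunks.update(staged)
    index.registry[doc_id] = (content_hash, list(staged))
    for chunk_id in old_ids:
        del index.chunks[chunk_id]
    return True
```

In this model a failure during staging leaves the previous chunks and registry entry untouched, which is the property the paragraph above describes.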

Use delete() to remove both registry metadata and vector-store chunks.
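As a rough picture of what removing "both registry metadata and vector-store chunks" involves, the sketch below models the registry and the store as plain dictionaries. ToyDeleted, toy_delete, the (tenant_id, doc_id) key, and the KeyError for unknown documents are assumptions for illustration; the real classes may behave differently.

```python
from dataclasses import dataclass


@dataclass
class ToyDeleted:
    """Hypothetical mirror of the DeletedDocument fields."""
    doc_id: str
    tenant_id: str
    chunks_removed: int


def toy_delete(registry, chunks, doc_id, tenant_id=""):
    """Remove one document's registry entry and all of its chunks."""
    # Assumption: documents are scoped per tenant, so the same doc_id
    # under another tenant is left alone.
    key = (tenant_id, doc_id)
    chunk_ids = registry.pop(key)  # assumption: unknown documents raise KeyError
    for chunk_id in chunk_ids:
        chunks.pop(chunk_id, None)
    return ToyDeleted(doc_id=doc_id, tenant_id=tenant_id, chunks_removed=len(chunk_ids))
```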

director_ai.core.ingestion.IngestionConfig dataclass

IngestionConfig(chunk_size: int = 512, overlap: int = 64, semantic: bool = False, similarity_threshold: float = 0.3)

Configuration for parser-to-vector-store ingestion.

__post_init__

__post_init__() -> None

Validate chunking parameters through ChunkConfig construction.

to_chunk_config

to_chunk_config() -> ChunkConfig

Return the retrieval chunker configuration.
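To make chunk_size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap and of validation at construction time. It is not the library's chunker: ToyChunkSettings and toy_chunks are hypothetical names, the unit is assumed to be characters, and the validation rules are guesses at what ChunkConfig might enforce.

```python
from dataclasses import dataclass


@dataclass
class ToyChunkSettings:
    """Hypothetical settings object using the same defaults as IngestionConfig."""
    chunk_size: int = 512
    overlap: int = 64

    def __post_init__(self) -> None:
        # Assumed rules: a positive window and an overlap smaller than the window.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")


def toy_chunks(text: str, settings: ToyChunkSettings) -> list:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if not text:
        return []
    step = settings.chunk_size - settings.overlap
    # Stop once a new window would contain no characters beyond the previous one.
    last_start = max(len(text) - settings.overlap, 1)
    return [text[i:i + settings.chunk_size] for i in range(0, last_start, step)]
```

With chunk_size=4 and overlap=1, each window starts three characters after the previous one, so adjacent chunks share exactly one character.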

director_ai.core.ingestion.IngestionResult dataclass

IngestionResult(doc_id: str, source: str, tenant_id: str, chunk_ids: list[str], content_hash: str, unchanged: bool = False)

Metadata returned after ingesting or updating one document.

chunk_count property

chunk_count: int

Return the number of chunks stored for this document.
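A derived property like this is commonly written as a thin wrapper over the stored ID list. The sketch below assumes chunk_count is simply the length of chunk_ids; ToyResult is a hypothetical, reduced stand-in for IngestionResult.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResult:
    """Hypothetical subset of the IngestionResult fields."""
    doc_id: str
    chunk_ids: list = field(default_factory=list)
    unchanged: bool = False

    @property
    def chunk_count(self) -> int:
        # Assumption: the count is derived from chunk_ids, not stored separately.
        return len(self.chunk_ids)
```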

director_ai.core.ingestion.DeletedDocument dataclass

DeletedDocument(doc_id: str, tenant_id: str, chunks_removed: int)

Metadata returned after deleting one ingested document.

director_ai.core.ingestion.DocumentIngestionPipeline

DocumentIngestionPipeline(*, store: VectorGroundTruthStore, registry: DocRegistry | None = None, config: IngestionConfig | None = None)

Parse, chunk, store, update, and delete documents for a vector store.

ingest_bytes

ingest_bytes(content: bytes, *, filename: str, doc_id: str | None = None, source: str | None = None, tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult

Parse bytes by filename, then ingest the resulting text.
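Parsing "by filename" usually means choosing a parser from the file extension. The sketch below shows that dispatch in its simplest form. toy_parse is hypothetical, and the set of handled extensions and the UTF-8 decoding are assumptions; the source does not list which formats the real parser supports.

```python
from pathlib import Path


def toy_parse(content: bytes, filename: str) -> str:
    """Pick a decoder from the filename extension and return plain text."""
    suffix = Path(filename).suffix.lower()
    if suffix in {".txt", ".md"}:
        # Assumption: plain-text formats are UTF-8 encoded.
        return content.decode("utf-8")
    raise ValueError(f"no toy parser registered for {suffix or 'extensionless files'}")
```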

ingest_text

ingest_text(text: str, *, doc_id: str | None = None, source: str = 'text', tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult

Chunk and store a new text document.

update_text

update_text(text: str, *, doc_id: str, source: str = 'text', tenant_id: str = '', config: IngestionConfig | None = None) -> IngestionResult

Replace a document's chunks while preserving registry identity.

delete

delete(doc_id: str, *, tenant_id: str = '') -> DeletedDocument

Delete a registered document and all stored chunks.