Synthetic Distillation

Synthetic distillation creates reviewed training examples from reviewed source events. Synthetic rows are useful for hard-negative coverage, but they are not real benchmark evidence and must not be reported as such.

Provenance Contract

SyntheticExample requires:

  • at least one reviewed source event ID
  • reviewer identity
  • generator ID
  • deterministic seed
  • an explicit synthetic=True training marker
  • benchmark_evidence=False
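
The contract above can be read as a set of construction-time checks. The following is a minimal sketch of that idea, not the library's implementation: ProvenanceSketch is a hypothetical stand-in, and the exact validation rules and error types used by SyntheticExample are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical stand-in for the provenance fields; not the library class."""

    source_event_ids: Sequence[str]
    reviewer_id: str
    generator_id: str
    seed: int
    synthetic: bool = True
    benchmark_evidence: bool = False

    def __post_init__(self) -> None:
        # Assumed checks, one per item in the contract list.
        if not self.source_event_ids or not all(self.source_event_ids):
            raise ValueError("at least one reviewed source event ID is required")
        if not self.reviewer_id.strip():
            raise ValueError("reviewer identity is required")
        if not self.generator_id.strip():
            raise ValueError("generator ID is required")
        if isinstance(self.seed, bool) or not isinstance(self.seed, int):
            raise ValueError("seed must be an integer")
        if self.synthetic is not True:
            raise ValueError("synthetic rows must be marked synthetic=True")
        if self.benchmark_evidence is not False:
            raise ValueError("synthetic rows must keep benchmark_evidence=False")
```

Failing at construction time means a row without provenance never reaches the dataset builder, which is one plausible way to enforce the contract.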

The default audit payload excludes generated prompt and response text. Use to_training_row() only inside the controlled training dataset builder.

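from director_ai.core.self_evolving import SyntheticDistillationBuilder

# reviewed_events is an iterable of reviewed FeedbackEvent objects
# prepared by the caller; it is not defined in this snippet.
builder = SyntheticDistillationBuilder(generator_id="deterministic-v1")
examples = builder.generate(
    reviewed_events,
    reviewer_id="reviewer-passport-1",
    seed=123,  # deterministic seed recorded in each example's provenance
    max_examples=32,
)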

SyntheticDistillationManifest deduplicates generated examples, separates real and synthetic counts, and remains marked benchmark_evidence=False.
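
To illustrate the bookkeeping described above, here is a small sketch that drops duplicates, counts labels, and keeps real and synthetic counts in separate fields. The function summarise_manifest and the plain-dict input shape are hypothetical; whether the library drops duplicates silently or rejects them is not stated here, so the sketch assumes the former.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarise_manifest(
    examples: Iterable[Mapping[str, str]],
    *,
    real_event_count: int,
) -> dict[str, Any]:
    """Hypothetical manifest summary: dedupe first, then count."""
    seen = set()
    label_counts = Counter()
    for example in examples:
        key = example["dedupe_key"]
        if key in seen:
            continue  # assumed behaviour: duplicates are skipped
        seen.add(key)
        label_counts[example["label"]] += 1
    return {
        "synthetic_event_count": len(seen),
        "real_event_count": real_event_count,  # never merged with synthetic
        "label_counts": dict(label_counts),
        "benchmark_evidence": False,  # fixed: synthetic sets are not evidence
    }
```

Keeping the two counts in separate fields makes it hard to report a mixed total as if it were real benchmark data.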

Use build_training_plan() to bind reviewed feedback, synthetic rows, manifest metadata, and a managed training job spec without submitting compute. The plan keeps the generated rows available for the controlled dataset writer, while to_dict() remains tenant-safe and omits prompt and response text.

# Reuses builder and reviewed_events from the previous example.
# real_event_count is supplied by the caller.
plan = builder.build_training_plan(
    reviewed_events,
    reviewer_id="reviewer-passport-1",
    seed=123,
    max_examples=32,
    real_event_count=real_event_count,
    manifest_id="distill-20260513-a",
    dataset_uri="env://DIRECTOR_SYNTHETIC_DISTILLATION_DATASET",
    output_uri="env://DIRECTOR_SYNTHETIC_DISTILLATION_OUTPUT",
    base_model_ref="factcg-deberta-v3-large",
    schedule_id="nightly-reviewed-feedback",
)

# Rows for the controlled dataset writer.
rows = plan.training_rows()
# Request object only; nothing has been submitted.
training_job = plan.training_job

The training job is a request object only. It is not submitted by the builder, and dataset/output URIs with embedded credentials are rejected.
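
One way to detect embedded credentials is to parse the URI and look for a userinfo component. This is a sketch using the standard library only; reject_embedded_credentials is a hypothetical helper, and the check the builder actually performs may differ.

```python
from urllib.parse import urlsplit


def reject_embedded_credentials(uri: str) -> str:
    """Return the URI unchanged, or raise if it carries user:password@ info."""
    parts = urlsplit(uri)
    # username/password are None when the authority has no userinfo part.
    if parts.username is not None or parts.password is not None:
        raise ValueError("URI must not embed credentials")
    return uri


# An indirection such as env://NAME carries no secret in the URI itself.
reject_embedded_credentials("env://DIRECTOR_SYNTHETIC_DISTILLATION_DATASET")
```

A URI of the form scheme://user:secret@host/path would raise ValueError under this check.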

Full API

director_ai.core.self_evolving.synthetic_distillation.SyntheticExample dataclass

SyntheticExample(prompt: str, response: str, label: FeedbackLabel, source_event_ids: Sequence[str], reviewer_id: str, generator_id: str, seed: int)

Synthetic training example linked to reviewed source events.

dedupe_key property

dedupe_key: str

Normalised key used to prevent duplicate synthetic rows.
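
How the key is normalised is not specified here. As an illustration only, a key could be derived by normalising Unicode, case, and whitespace, then hashing the result together with the label. The function below is a hypothetical sketch, and every normalisation step in it is an assumption.

```python
import hashlib
import unicodedata


def sketch_dedupe_key(prompt: str, response: str, label: str) -> str:
    """Hypothetical normalised key; not the library's dedupe_key."""

    def normalise(text: str) -> str:
        # Fold Unicode variants and case, then collapse runs of whitespace.
        folded = unicodedata.normalize("NFKC", text).casefold()
        return " ".join(folded.split())

    # A unit separator keeps field boundaries unambiguous before hashing.
    payload = "\x1f".join((normalise(prompt), normalise(response), label))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this scheme, rows that differ only in spacing or letter case share a key, while rows with different labels do not.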

to_dict

to_dict(*, include_generated_text: bool = False) -> dict[str, Any]

Serialise tenant-safe audit metadata by default.
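
The default-safe pattern behind include_generated_text can be sketched as follows. The helper audit_payload and the dict-based input are hypothetical; the real payload fields are not listed on this page.

```python
from typing import Any

# Fields treated as generated text in this sketch (an assumption).
GENERATED_TEXT_FIELDS = ("prompt", "response")


def audit_payload(
    example: dict[str, Any],
    *,
    include_generated_text: bool = False,
) -> dict[str, Any]:
    """Copy metadata; add generated text only on explicit opt-in."""
    payload = {
        key: value
        for key, value in example.items()
        if key not in GENERATED_TEXT_FIELDS
    }
    if include_generated_text:
        for field in GENERATED_TEXT_FIELDS:
            payload[field] = example[field]
    return payload
```

Making the flag keyword-only with a False default means a caller has to opt in visibly before any prompt or response text leaves the object.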

to_training_row

to_training_row() -> dict[str, Any]

Return the row shape consumed by training jobs.

director_ai.core.self_evolving.synthetic_distillation.SyntheticDistillationBuilder

SyntheticDistillationBuilder(*, generator_id: str)

Deterministically derive reviewed synthetic examples from feedback.

generate

generate(events: Iterable[FeedbackEvent], *, reviewer_id: str, seed: int, max_examples: int) -> tuple[SyntheticExample, ...]

Generate deterministic synthetic examples with source provenance.
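
Determinism of this kind is usually achieved with a locally seeded random generator and a stable input order. The sketch below shows that general technique; pick_source_events is a hypothetical function and does not reflect how generate() selects or transforms events.

```python
import random
from typing import Sequence


def pick_source_events(
    event_ids: Sequence[str],
    *,
    seed: int,
    max_examples: int,
) -> tuple[str, ...]:
    """Choose up to max_examples event IDs reproducibly for a given seed."""
    rng = random.Random(seed)    # local generator; global state is untouched
    ordered = sorted(event_ids)  # stable order, independent of input order
    count = min(max_examples, len(ordered))
    return tuple(rng.sample(ordered, count))
```

Sorting before sampling matters: without it, the same seed could yield different selections when the caller passes events in a different order.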

build_training_plan

build_training_plan(events: Iterable[FeedbackEvent], *, reviewer_id: str, seed: int, max_examples: int, real_event_count: int, manifest_id: str, dataset_uri: str, output_uri: str, base_model_ref: str, schedule_id: str) -> SyntheticDistillationPlan

Build a synthetic distillation plan without submitting a job.

director_ai.core.self_evolving.synthetic_distillation.SyntheticDistillationManifest dataclass

SyntheticDistillationManifest(manifest_id: str, synthetic_event_count: int, real_event_count: int, label_counts: Mapping[str, int], source_event_ids: Sequence[str], generator_ids: Sequence[str], benchmark_evidence: bool = False)

Tenant-safe manifest for a mixed real/synthetic distillation set.

from_examples classmethod

from_examples(*, examples: Sequence[SyntheticExample], real_event_count: int, manifest_id: str) -> SyntheticDistillationManifest

Build a manifest after duplicate and provenance checks.

to_dict

to_dict() -> dict[str, Any]

Serialise manifest metadata without generated prompt text.

director_ai.core.self_evolving.synthetic_distillation.SyntheticDistillationPlan dataclass

SyntheticDistillationPlan(examples: Sequence[SyntheticExample], manifest: SyntheticDistillationManifest, training_job: TrainingJobSpec)

Reviewed synthetic rows, tenant-safe manifest, and managed job request.

training_rows

training_rows() -> tuple[dict[str, Any], ...]

Return synthetic rows for the controlled dataset writer.

to_dict

to_dict() -> dict[str, Any]

Serialise the plan without generated prompt or response text.