Evaluation Onboarding

This guide gives evaluators, pilot teams, and new contributors a concrete path from first install to evidence-backed deployment. Use it as the shared checklist for a first Director-AI trial.

Pilot Charter

Write this charter before tuning thresholds:

| Field | Question to answer |
| --- | --- |
| Workflow | Which exact model output will Director-AI inspect? |
| Consequence | What happens if this output is wrong? |
| Evidence source | Which governed facts, documents, or labelled examples define correctness? |
| Control action | Should the guard raise, annotate, halt, reject, or route to review? |
| Acceptance metric | Which false-positive, false-negative, latency, and review-cost limits matter? |
| Owner | Who approves threshold changes and production rollout? |

The first pilot is successful when every field in this charter is backed by measured evidence, not when every optional integration has been enabled.
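
One way to keep the charter reviewable is to hold it as a small structured record and check that no field is still blank. The sketch below is illustrative only: `PilotCharter` and `unanswered_fields` are hypothetical names, not part of Director-AI, and the example values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class PilotCharter:
    """Hypothetical record mirroring the charter fields above."""
    workflow: str = ""
    consequence: str = ""
    evidence_source: str = ""
    control_action: str = ""
    acceptance_metric: str = ""
    owner: str = ""


def unanswered_fields(charter: PilotCharter) -> list[str]:
    """Return the names of charter fields that are still blank."""
    return [f.name for f in fields(charter) if not getattr(charter, f.name).strip()]


# Placeholder values for illustration; three fields are deliberately left blank.
charter = PilotCharter(
    workflow="Support-bot answers about the refund policy",
    consequence="A customer is promised a refund the policy does not allow",
    control_action="route to review",
)
print(unanswered_fields(charter))
```

A non-empty result is a signal to finish the charter before any threshold tuning starts.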

1. Choose The Pilot Shape

Pick one primary workflow before changing thresholds or adding integrations.

| Pilot shape | Best first artefact | Success signal |
| --- | --- | --- |
| SDK wrapper | `guard()` around one existing client | Unsupported claims are blocked or annotated |
| RAG assistant | Vector-backed facts plus batch checks | Rejections cite the right source chunks |
| Streaming app | `StreamingKernel` or proxy stream guard | Bad partial output halts before completion |
| Internal review queue | Human review plus calibration store | Review decisions improve threshold selection |
| Enterprise gateway | REST/gRPC service behind auth and rate limits | Tenant-safe audit events are produced |

Keep the first pilot narrow. One workflow with good evidence is more useful than many integrations with unclear acceptance criteria.

2. Install The Right Extras

| Need | Install |
| --- | --- |
| In-process scoring and SDK wrapping | `pip install director-ai` |
| NLI-backed factual scoring | `pip install "director-ai[nli]"` |
| RAG and vector stores | `pip install "director-ai[nli,vector]"` |
| REST service | `pip install "director-ai[nli,server]"` |
| Full local evaluation stack | `pip install -e ".[dev,nli,vector,server]"` |

Start without optional runtimes unless the pilot requires them. Rust, ONNX, TensorRT, Go, Julia, Lean, gRPC, and WASM paths are deployment accelerators or specialised proof surfaces, not required for the first evaluation.
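
Before running a pilot, it can help to confirm that the modules it depends on are importable in the target environment. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the list of required names is an assumption to replace with whatever your pilot needs (`director_ai` is the import name used in the first check below).

```python
from importlib.util import find_spec


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumption: top-level module names only; adjust the list to your pilot.
required = ["json", "director_ai"]
print(missing_modules(required))
```

An empty list means every named module can be located; anything else points at an extra that still needs installing.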

3. Run A First Check

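```python
from director_ai import score

# Score one prompt/answer pair against a governed fact.
result = score(
    "What is the refund policy?",
    "Refunds are available for 90 days.",  # answer under test
    facts={"policy": "Refunds are available for 30 days only."},
    threshold=0.3,
)

# Inspect the decision, the score, and the supporting evidence.
print(result.approved, result.score, result.evidence)
```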

Expected outcome: the answer should be rejected or assigned a low coherence score because it contradicts the governed fact.

4. Add Domain Evidence

For a real pilot, replace toy facts with governed data:

  • product policy extracts;
  • medical, legal, finance, or scientific reference snippets;
  • customer-support macros;
  • approved retrieval chunks from an internal knowledge base;
  • labelled examples of acceptable and unacceptable answers.

Every benchmark or threshold decision should record:

  • dataset source and ownership;
  • split policy;
  • scorer configuration;
  • threshold;
  • false-positive and false-negative examples;
  • latency and hardware context;
  • review owner and approval date.
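
The record above can be captured as a small serialisable structure so that each decision is stored alongside its context. This is a sketch under stated assumptions: `ThresholdDecision` and its field names are hypothetical, and every value shown is a placeholder rather than a measured result.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ThresholdDecision:
    """Hypothetical record of one benchmark or threshold decision."""
    dataset_source: str
    dataset_owner: str
    split_policy: str
    scorer_config: dict
    threshold: float
    review_owner: str
    approval_date: str  # ISO 8601 date string
    hardware: str = ""
    latency_ms: Optional[float] = None  # fill in from your own measurements
    false_positive_examples: list = field(default_factory=list)
    false_negative_examples: list = field(default_factory=list)


# Placeholder values only; replace with the pilot's real context.
record = ThresholdDecision(
    dataset_source="<dataset name or path>",
    dataset_owner="<team or person>",
    split_policy="<how tuning and held-out examples are separated>",
    scorer_config={"mode": "<scorer mode>"},
    threshold=0.3,
    review_owner="<approver>",
    approval_date="<YYYY-MM-DD>",
)
serialised = json.dumps(asdict(record), indent=2)
print(serialised)
```

Storing the JSON next to the benchmark output keeps the threshold, its evidence, and its approver in one reviewable artefact.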

5. Pick A Control Action

Director-AI can fail closed or fail soft. Choose based on consequence:

| Failure mode | Use when |
| --- | --- |
| `raise` | A wrong answer must not leave the application boundary |
| `metadata` | The application decides how to display or route the risk |
| `log` | Early observation without user-visible enforcement |
| HTTP reject | Proxy or middleware owns the response boundary |
| Human review | Low-confidence outputs need operator approval |
| Streaming halt | Partial output must stop before completion |
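
To make the difference between the first three modes concrete, the sketch below shows how an application might act on a verdict. It is not Director-AI's implementation: `Verdict`, `UnsupportedAnswerError`, and `apply_control_action` are hypothetical names, and the HTTP, human-review, and streaming modes are left out because they depend on the surrounding infrastructure.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pilot.guard")


class UnsupportedAnswerError(Exception):
    """Raised when a rejected answer must not leave the application boundary."""


@dataclass
class Verdict:
    """Hypothetical stand-in for a scoring result."""
    approved: bool
    score: float


def apply_control_action(answer: str, verdict: Verdict, mode: str) -> dict:
    """Apply one of three illustrative actions: 'raise', 'metadata', or 'log'."""
    if mode not in {"raise", "metadata", "log"}:
        raise ValueError(f"unknown mode: {mode}")
    if verdict.approved:
        return {"answer": answer}
    if mode == "raise":
        # Fail closed: the caller never receives the answer.
        raise UnsupportedAnswerError(f"score {verdict.score:.2f} was rejected")
    if mode == "log":
        # Observe only: the answer passes through unchanged.
        logger.warning("rejected answer observed (score=%.2f)", verdict.score)
        return {"answer": answer}
    # mode == "metadata": fail soft and let the application route the risk.
    return {"answer": answer, "risk": {"approved": False, "score": verdict.score}}
```

Starting in `log` mode and moving to `metadata` or `raise` once false-positive rates are known is one possible rollout order; the right choice depends on the consequence recorded in the charter.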

6. Validate The Pilot

Minimum evidence before a production pilot:

  • quickstart or API smoke test passes in the target environment;
  • a labelled domain sample has been scored;
  • at least one known bad answer is rejected;
  • at least one known good answer is approved;
  • logs contain no secrets;
  • evidence records are understandable to a domain reviewer;
  • operational ownership is assigned for threshold changes and false-positive review.
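
The labelled-sample checks above can be sketched as a small harness that counts agreements and disagreements between labels and guard decisions. Everything here is illustrative: `LabelledExample`, `confusion_counts`, and `toy_is_approved` are hypothetical, and the toy scorer is a stand-in for a real call such as the `score(...)` example in section 3. The sketch assumes a false positive means a good answer that was rejected and a false negative means a bad answer that was approved; adjust if your team defines them the other way.

```python
from typing import Callable, Iterable, NamedTuple


class LabelledExample(NamedTuple):
    prompt: str
    answer: str
    should_approve: bool  # label from a domain reviewer


def confusion_counts(
    examples: Iterable[LabelledExample],
    is_approved: Callable[[str, str], bool],
) -> dict[str, int]:
    """Compare guard decisions with reviewer labels."""
    counts = {"true_approve": 0, "true_reject": 0,
              "false_positive": 0, "false_negative": 0}
    for ex in examples:
        approved = is_approved(ex.prompt, ex.answer)
        if ex.should_approve and approved:
            counts["true_approve"] += 1
        elif not ex.should_approve and not approved:
            counts["true_reject"] += 1
        elif ex.should_approve:
            counts["false_positive"] += 1  # good answer was rejected
        else:
            counts["false_negative"] += 1  # bad answer was approved
    return counts


def toy_is_approved(prompt: str, answer: str) -> bool:
    # Stand-in scorer: approves only answers that quote the 30-day policy.
    return "30 days" in answer


sample = [
    LabelledExample("What is the refund policy?", "Refunds are available for 30 days.", True),
    LabelledExample("What is the refund policy?", "Refunds are available for 90 days.", False),
]
counts = confusion_counts(sample, toy_is_approved)
# Minimum bar from the checklist: one good answer approved, one bad answer rejected.
meets_minimum = counts["true_approve"] >= 1 and counts["true_reject"] >= 1
print(counts, meets_minimum)
```

Swapping `toy_is_approved` for a function that wraps the real scorer keeps the harness unchanged while the evidence becomes real.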

7. Evidence Packet Template

Use this packet for internal review, procurement, or a customer pilot closeout:

| Section | Contents |
| --- | --- |
| Use case | Workflow, users, output boundary, and consequence of a wrong answer |
| Dataset | Source, ownership, split policy, examples excluded from tuning |
| Configuration | Package version, scorer mode, threshold, runtime extras, KB version |
| Results | Good-answer approvals, bad-answer rejections, false positives, false negatives |
| Operations | Auth, tenant binding, metrics, logs, review queue, rollback owner |
| Claim boundary | Which claims are supported by this pilot and which require more evidence |
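
If the packet is assembled from several contributors, a small renderer can enforce that every section is present and appears in a fixed order. This is only a sketch: `PACKET_SECTIONS` and `render_packet` are hypothetical names, and the plain-text layout is an assumption rather than a required format.

```python
PACKET_SECTIONS = [
    "Use case",
    "Dataset",
    "Configuration",
    "Results",
    "Operations",
    "Claim boundary",
]


def render_packet(contents: dict[str, str]) -> str:
    """Render the packet as plain text, refusing to skip any section."""
    missing = [s for s in PACKET_SECTIONS if not contents.get(s, "").strip()]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    # Emit each section as a heading line followed by its content.
    return "\n\n".join(f"{s}\n{contents[s].strip()}" for s in PACKET_SECTIONS)
```

Failing loudly on a missing section makes it harder to close out a pilot with, for example, no stated claim boundary.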

8. Continue By Role

| Role | Next page |
| --- | --- |
| Application developer | Quickstart |
| RAG engineer | KB Ingestion |
| Evaluation engineer | BatchProcessor API |
| Platform operator | Production Guide |
| Compliance reviewer | Compliance Reporting |
| Commercial evaluator | Product Overview |

9. Business Evidence Package

For commercial pilots, pair technical checks with a short evidence packet:

  • one-page use-case statement (input, output, and consequence);
  • fixed test split and acceptance criteria;
  • false-positive and false-negative examples from production-like samples;
  • incident and rollback flow;
  • review owner and escalation path;
  • a short post-pilot report with the expected reduction in factual-risk incidents.

Publishing this package early is often the fastest way to get procurement and security teams aligned.