Evaluation Onboarding
This guide gives evaluators, pilots, and new contributors a concrete path from first install to evidence-backed deployment. Use it as the shared checklist for a first Director-AI trial.
Pilot Charter
Write this charter before tuning thresholds:
| Field | Question to answer |
|---|---|
| Workflow | Which exact model output will Director-AI inspect? |
| Consequence | What happens if this output is wrong? |
| Evidence source | Which governed facts, documents, or labelled examples define correctness? |
| Control action | Should the guard raise, annotate, halt, reject, or route to review? |
| Acceptance metric | Which false-positive, false-negative, latency, and review-cost limits matter? |
| Owner | Who approves threshold changes and production rollout? |
The first pilot is successful when every charter field is backed by measured evidence, not when every optional integration has been enabled.
1. Choose The Pilot Shape
Pick one primary workflow before changing thresholds or adding integrations.
| Pilot shape | Best first artefact | Success signal |
|---|---|---|
| SDK wrapper | `guard()` around one existing client | Unsupported claims are blocked or annotated |
| RAG assistant | Vector-backed facts plus batch checks | Rejections cite the right source chunks |
| Streaming app | `StreamingKernel` or proxy stream guard | Bad partial output halts before completion |
| Internal review queue | Human review plus calibration store | Review decisions improve threshold selection |
| Enterprise gateway | REST/gRPC service behind auth and rate limits | Tenant-safe audit events are produced |
Keep the first pilot narrow. One workflow with good evidence is more useful than many integrations with unclear acceptance criteria.
2. Install The Right Extras
| Need | Install |
|---|---|
| In-process scoring and SDK wrapping | pip install director-ai |
| NLI-backed factual scoring | pip install "director-ai[nli]" |
| RAG and vector stores | pip install "director-ai[nli,vector]" |
| REST service | pip install "director-ai[nli,server]" |
| Full local evaluation stack | pip install -e ".[dev,nli,vector,server]" |
Start without optional runtimes unless the pilot requires them. Rust, ONNX, TensorRT, Go, Julia, Lean, gRPC, and WASM paths are deployment accelerators or specialised proof surfaces, not required for the first evaluation.
3. Run A First Check
```python
from director_ai import score

result = score(
    "What is the refund policy?",
    "Refunds are available for 90 days.",
    facts={"policy": "Refunds are available for 30 days only."},
    threshold=0.3,
)
print(result.approved, result.score, result.evidence)
```
Expected outcome: the answer should be rejected or assigned a low coherence score because it contradicts the governed fact.
4. Add Domain Evidence
For a real pilot, replace toy facts with governed data:
- product policy extracts;
- medical, legal, finance, or scientific reference snippets;
- customer-support macros;
- approved retrieval chunks from an internal knowledge base;
- labelled examples of acceptable and unacceptable answers.
Every benchmark or threshold decision should record:
- dataset source and ownership;
- split policy;
- scorer configuration;
- threshold;
- false-positive and false-negative examples;
- latency and hardware context;
- review owner and approval date.
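The record fields listed above can be captured in a small structure so that incomplete decisions are easy to spot. This is a minimal sketch, not a Director-AI type: `ThresholdDecision`, its field names, and the example values (including the `scorer_config` contents) are hypothetical placeholders chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ThresholdDecision:
    """One benchmark or threshold decision (hypothetical record type)."""

    dataset_source: str
    dataset_owner: str
    split_policy: str
    scorer_config: dict
    threshold: float
    false_positive_examples: list = field(default_factory=list)
    false_negative_examples: list = field(default_factory=list)
    latency_ms: Optional[float] = None
    hardware: str = ""
    review_owner: str = ""
    approval_date: str = ""  # ISO 8601 date string

    def missing_fields(self) -> list:
        """Return the names of fields that are still empty or unset."""
        empty_values = ("", None, [], {})
        return [name for name, value in asdict(self).items() if value in empty_values]


# Placeholder values for illustration only.
record = ThresholdDecision(
    dataset_source="support-tickets-sample",
    dataset_owner="support-ops",
    split_policy="test split frozen before tuning",
    scorer_config={"mode": "nli"},
    threshold=0.3,
)
print(record.missing_fields())
print(json.dumps(asdict(record), indent=2))
```

Storing the record as JSON next to the benchmark output keeps the threshold and its evidence together; a non-empty `missing_fields()` result is a signal that the decision is not yet ready for review.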
5. Pick A Control Action
Director-AI can fail closed or fail soft. Choose based on consequence:
| Failure mode | Use when |
|---|---|
| `raise` | A wrong answer must not leave the application boundary |
| `metadata` | The application decides how to display or route the risk |
| `log` | Early observation without user-visible enforcement |
| HTTP reject | Proxy or middleware owns the response boundary |
| Human review | Low-confidence outputs need operator approval |
| Streaming halt | Partial output must stop before completion |
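To make the difference between the first three failure modes concrete, the sketch below shows how application code might branch on a scoring result. It is an illustration under assumptions, not Director-AI's own dispatch logic: `CheckResult`, `UnsupportedAnswerError`, `apply_control_action`, and the response envelope are hypothetical names, and `CheckResult` only mirrors the `approved` and `score` fields used in the first check.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pilot.guard")


class UnsupportedAnswerError(Exception):
    """Raised when a rejected answer must not leave the application (hypothetical)."""


@dataclass
class CheckResult:
    """Stand-in for a scoring result; not the library's own class."""

    approved: bool
    score: float


def apply_control_action(answer: str, result: CheckResult, mode: str) -> dict:
    """Apply one failure mode to a scored answer and return a response envelope."""
    if mode not in {"raise", "metadata", "log"}:
        raise ValueError(f"unknown failure mode: {mode}")
    if result.approved:
        return {"answer": answer}
    if mode == "raise":
        # Fail closed: the rejected answer never reaches the caller.
        raise UnsupportedAnswerError(f"answer rejected (score={result.score:.2f})")
    if mode == "metadata":
        # Fail soft: pass the answer on with the risk attached for the app to route.
        return {"answer": answer, "guard": {"approved": False, "score": result.score}}
    # mode == "log": observe only, with no user-visible change.
    logger.warning("answer rejected, score=%.2f", result.score)
    return {"answer": answer}


rejected = CheckResult(approved=False, score=0.12)
print(apply_control_action("Refunds are available for 90 days.", rejected, "metadata"))
```

Keeping the mode as an explicit argument makes it straightforward to start a pilot in `log` mode and switch to `raise` once the false-positive rate is understood.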
6. Validate The Pilot
Minimum evidence before a production pilot:
- quickstart or API smoke passes in the target environment;
- a labelled domain sample has been scored;
- at least one known bad answer is rejected;
- at least one known good answer is approved;
- logs contain no secrets;
- evidence records are understandable to a domain reviewer;
- operational ownership is assigned for threshold changes and false-positive review.
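The scoring items in this checklist can be tallied with a small harness. The sketch below assumes a labelled sample and an `approve` callable that wraps whatever scorer the pilot uses; `LabelledExample`, `validate_sample`, and `stub_approve` are hypothetical names, and the stub is a placeholder for a real scoring call, not a model of one.

```python
from typing import Callable, Iterable, NamedTuple


class LabelledExample(NamedTuple):
    prompt: str
    answer: str
    is_good: bool  # label assigned by a domain reviewer


def validate_sample(
    examples: Iterable[LabelledExample],
    approve: Callable[[str, str], bool],
) -> dict:
    """Count approvals and rejections per label and check the minimum evidence."""
    counts = {
        "good_approved": 0,
        "good_rejected": 0,  # false positives: the guard flagged a good answer
        "bad_approved": 0,   # false negatives: the guard passed a bad answer
        "bad_rejected": 0,
    }
    for example in examples:
        approved = approve(example.prompt, example.answer)
        label = "good" if example.is_good else "bad"
        outcome = "approved" if approved else "rejected"
        counts[f"{label}_{outcome}"] += 1
    # At least one known good answer approved and one known bad answer rejected.
    counts["minimum_evidence_met"] = (
        counts["good_approved"] >= 1 and counts["bad_rejected"] >= 1
    )
    return counts


def stub_approve(prompt: str, answer: str) -> bool:
    """Placeholder scorer: swap in a call to the real scorer for an actual pilot."""
    return "30 days" in answer


sample = [
    LabelledExample("What is the refund policy?", "Refunds are available for 30 days.", True),
    LabelledExample("What is the refund policy?", "Refunds are available for 90 days.", False),
]
print(validate_sample(sample, stub_approve))
```

The `good_rejected` and `bad_approved` counts identify the false-positive and false-negative examples that belong in the evidence record.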
7. Evidence Packet Template
Use this packet for internal review, procurement, or a customer pilot closeout:
| Section | Contents |
|---|---|
| Use case | Workflow, users, output boundary, and consequence of a wrong answer |
| Dataset | Source, ownership, split policy, examples excluded from tuning |
| Configuration | Package version, scorer mode, threshold, runtime extras, KB version |
| Results | Good-answer approvals, bad-answer rejections, false positives, false negatives |
| Operations | Auth, tenant binding, metrics, logs, review queue, rollback owner |
| Claim boundary | Which claims are supported by this pilot and which require more evidence |
8. Continue By Role
| Role | Next page |
|---|---|
| Application developer | Quickstart |
| RAG engineer | KB Ingestion |
| Evaluation engineer | BatchProcessor API |
| Platform operator | Production Guide |
| Compliance reviewer | Compliance Reporting |
| Commercial evaluator | Product Overview |
9. Business Evidence Package
For commercial pilots, pair technical checks with a short evidence packet:
- a one-page use-case statement (input, output, and consequence);
- a fixed test split and acceptance criteria;
- false-positive and false-negative examples from production-like samples;
- an incident and rollback flow;
- a review owner and escalation path;
- a short post-pilot report with the expected reduction in factual-risk incidents.
Publishing this package early is often the fastest way to get procurement and security teams aligned.