CI Quality Gate

Run the Director-AI guardrail as a CI quality gate: score a labelled eval set on every pull request and fail the build when guard quality regresses. The idea is the same as gating on a test suite, except that the thing under test is your LLM app's factual behaviour.

The dataset

A JSONL file, one case per line — a prompt, the response to judge, and the label a correct guard should produce:

{"prompt": "What is the capital of France?", "response": "Paris is the capital of France.", "expected": "approve"}
{"prompt": "What is the capital of France?", "response": "The capital of France is Berlin.", "expected": "reject"}
  • expected: "approve" — a grounded answer the guard should let through.
  • expected: "reject" — a hallucination the guard should catch.
  • id is optional (defaults to the line number) and surfaces in the report.
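
Since a malformed line is reported as a data error, it can help to check the file before it reaches CI. The following is a minimal sketch of such a pre-check, not Director-AI's own validation: the helper name `validate_cases` is hypothetical, and both the required-field rule and the decision to skip blank lines are assumptions based on the format described above.

```python
import json

ALLOWED_LABELS = {"approve", "reject"}
REQUIRED_FIELDS = ("prompt", "response", "expected")


def validate_cases(lines):
    """Return a list of (line_number, message) problems found in JSONL lines."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # assumption: blank lines are ignored
        try:
            case = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(case, dict):
            problems.append((lineno, "case must be a JSON object"))
            continue
        # Every case needs the three string fields shown in the examples.
        for field in REQUIRED_FIELDS:
            if not isinstance(case.get(field), str):
                problems.append((lineno, f"missing or non-string field: {field}"))
        # The label must be one of the two documented values.
        label = case.get("expected")
        if isinstance(label, str) and label not in ALLOWED_LABELS:
            problems.append((lineno, f"unknown label: {label}"))
    return problems
```

To use it, pass an open file handle, for example `validate_cases(open("cases.jsonl", encoding="utf-8"))`, and fail early if the returned list is non-empty.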

The command

director-ai ci-gate --dataset cases.jsonl --min-accuracy 0.9

It scores every case with your configured scorer, compares the approve/reject decision to the label, prints a summary, and exits non-zero when a threshold is breached — so CI blocks the merge.

| Flag | Meaning |
|---|---|
| --dataset PATH | JSONL cases file (required) |
| --min-accuracy R | Minimum overall accuracy, 0–1 (default 0.9) |
| --min-catch-rate R | Optional: minimum hallucination catch rate on reject cases |
| --max-false-halt R | Optional: maximum false-halt rate on approve cases |
| --profile P | Optional config profile (e.g. medical, finance) |
| --output PATH | Optional: write the JSON report for a CI artefact |

Exit codes: 0 pass, 1 threshold breached, 2 usage/data error.
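
If the gate runs from a script rather than directly in a CI step, the exit codes above can be turned into readable messages. This is a rough sketch that only uses the flags and exit codes listed on this page; the helper names (`build_command`, `describe_exit`, `run_gate`) are hypothetical and not part of Director-AI.

```python
import subprocess
import sys

# Exit codes as documented above.
EXIT_MEANINGS = {0: "pass", 1: "threshold breached", 2: "usage/data error"}


def build_command(dataset, min_accuracy=0.9, output=None):
    """Assemble the ci-gate command line from the documented flags."""
    cmd = ["director-ai", "ci-gate", "--dataset", dataset,
           "--min-accuracy", str(min_accuracy)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def describe_exit(code):
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(dataset, **kwargs):
    """Run the gate, log the outcome to stderr, and return the exit code."""
    result = subprocess.run(build_command(dataset, **kwargs))
    print(f"ci-gate: {describe_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

A wrapper script would typically end with `sys.exit(run_gate("cases.jsonl"))` so that the original exit code still reaches CI and blocks the merge.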

Catch hallucinations, not just accuracy

--min-accuracy alone can be gamed by a guard that approves everything on a mostly-grounded set. Add --min-catch-rate to hold the guard's recall on the hallucination (reject) cases, and --max-false-halt to keep it from over-blocking grounded answers.
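
To see why accuracy alone is not enough, the three metrics can be worked through on a small example. The sketch below assumes the natural reading of the flag descriptions: catch rate is the share of reject cases the guard rejected, and false-halt rate is the share of approve cases the guard rejected. The tool's exact formulas are not stated here, and `gate_metrics` is a hypothetical helper.

```python
def gate_metrics(pairs):
    """Compute metrics from (expected, predicted) label pairs.

    Each label is "approve" or "reject". A rate is None when its
    denominator is empty (for example, no reject cases in the set).
    """
    pairs = list(pairs)
    reject_cases = [p for e, p in pairs if e == "reject"]
    approve_cases = [p for e, p in pairs if e == "approve"]
    correct = sum(e == p for e, p in pairs)
    caught = sum(p == "reject" for p in reject_cases)
    false_halts = sum(p == "reject" for p in approve_cases)
    return {
        "accuracy": correct / len(pairs) if pairs else None,
        "catch_rate": caught / len(reject_cases) if reject_cases else None,
        "false_halt_rate": false_halts / len(approve_cases) if approve_cases else None,
    }


# A guard that approves everything, on a set of 19 grounded answers
# and 1 hallucination: accuracy looks healthy, catch rate is zero.
lazy_guard = [("approve", "approve")] * 19 + [("reject", "approve")]
metrics = gate_metrics(lazy_guard)
print(metrics)
```

On this set the approve-everything guard reaches 0.95 accuracy and would pass --min-accuracy 0.9, while its catch rate of 0.0 would breach any non-zero --min-catch-rate.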

The GitHub Action

The repository ships a composite action, so a workflow is a few lines:

name: guardrail
on: [pull_request]

jobs:
  guardrail-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anulum/director-ai@v1
        with:
          dataset: tests/guardrail_cases.jsonl
          min-accuracy: "0.9"
          min-catch-rate: "0.85"
          max-false-halt: "0.1"

Action inputs mirror the CLI flags. The action also accepts extras (the pip extra to install; defaults to nli so that real NLI scoring is available), version (a director-ai version spec), python-version, and output.

Heuristic vs model-backed

Without the [nli] extra and a knowledge base, scoring falls back to heuristics and will miss most hallucinations. The action installs [nli] by default; for grounded checks, ingest your facts first (see KB ingestion).

The report

With --output, the gate writes a JSON report — counts, the metrics, the breached thresholds, and per-case outcomes — suitable for upload as a CI artefact or for trend tracking:

{
  "total": 2, "correct": 2, "accuracy": 1.0,
  "catch_rate": 1.0, "false_halt_rate": 0.0,
  "passed": true, "failures": [],
  "outcomes": [{"case_id": "1", "expected": "approve", "predicted": "approve", "score": 0.98, "correct": true}]
}
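
For trend tracking or a PR comment, the report can be condensed with a few lines of Python. This sketch relies only on the fields visible in the sample above; the helper names are hypothetical, and the structure of entries in "failures" is not shown in the sample, so it is left untouched.

```python
import json


def summarise(report):
    """One-line status built from the top-level report fields."""
    status = "PASS" if report["passed"] else "FAIL"
    return (f"{status}: {report['correct']}/{report['total']} correct, "
            f"accuracy {report['accuracy']:.2f}")


def misjudged(report):
    """IDs of the cases where the guard's decision differed from the label."""
    return [o["case_id"] for o in report["outcomes"] if not o["correct"]]


# The sample report from above, parsed from its JSON text.
sample = json.loads("""
{
  "total": 2, "correct": 2, "accuracy": 1.0,
  "catch_rate": 1.0, "false_halt_rate": 0.0,
  "passed": true, "failures": [],
  "outcomes": [{"case_id": "1", "expected": "approve", "predicted": "approve", "score": 0.98, "correct": true}]
}
""")
print(summarise(sample))
print(misjudged(sample))
```

In CI, the same functions would be applied to the file written via --output, loaded with `json.load`.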