FrontierFail Seed Packet

The FrontierFail packet is an intake and regression seed for production-failure benchmarking. It is not an externally validated benchmark and must not be used as a public production-failure score.

Files

File                                       Purpose
benchmarks/frontierfail_seed_packet.toml   Packet metadata, claim boundary, required categories
benchmarks/frontierfail_cases.jsonl        Seed regression cases
tools/validate_frontierfail_packet.py      Schema, provenance, and benchmark-eligibility gate

Current Boundary

The seed cases use source_type = "synthetic_regression" and benchmark_eligible = false. They exist to keep the taxonomy, row schema, and downstream evaluation hooks stable while real, sourced production cases are collected and redacted.

Rows may become benchmark eligible only when they are sourced from a sanitized production report or public incident with reviewable evidence. The validator rejects synthetic rows marked as benchmark eligible.
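
The eligibility rule above can be illustrated with a minimal sketch. It is not the code in tools/validate_frontierfail_packet.py: the function name eligibility_errors and the evidence field are hypothetical, and only source_type and benchmark_eligible are taken from the packet description.

```python
# Sketch only. `source_type` and `benchmark_eligible` come from the packet
# description; the `evidence` field name is a hypothetical placeholder.

def eligibility_errors(row: dict) -> list[str]:
    """Return reasons why a row's benchmark_eligible flag is not allowed."""
    errors: list[str] = []
    if not row.get("benchmark_eligible", False):
        return errors  # rows that do not claim eligibility pass this check
    if row.get("source_type") == "synthetic_regression":
        errors.append("synthetic row marked benchmark eligible")
    elif not row.get("evidence"):
        errors.append("benchmark-eligible row lacks reviewable evidence")
    return errors


seed_row = {"source_type": "synthetic_regression", "benchmark_eligible": False}
bad_row = {"source_type": "synthetic_regression", "benchmark_eligible": True}

print(eligibility_errors(seed_row))  # no errors for a plain seed row
print(eligibility_errors(bad_row))   # one error: synthetic but marked eligible
```

Returning a list of reasons rather than a boolean lets a caller report every problem in a row at once.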

Required Categories

  • numeric_contradiction
  • fabricated_policy
  • unsupported_citation
  • cross_turn_contradiction
  • retrieval_misattribution
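
A coverage check over these categories could look roughly like the sketch below. It assumes each JSONL row stores its category under a key named category, which is a guess at the row schema, and missing_categories is a hypothetical helper rather than a function from the validator.

```python
import json
from collections import Counter

# Category names are copied from the list above.
REQUIRED_CATEGORIES = [
    "numeric_contradiction",
    "fabricated_policy",
    "unsupported_citation",
    "cross_turn_contradiction",
    "retrieval_misattribution",
]


def missing_categories(jsonl_text: str) -> list[str]:
    """Return required categories that have no case in the JSONL text."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            # Assumption: the category lives under the key "category".
            counts[json.loads(line).get("category")] += 1
    return [name for name in REQUIRED_CATEGORIES if counts[name] == 0]


# Three of the five categories are covered in this sample.
sample = "\n".join(json.dumps({"category": c}) for c in REQUIRED_CATEGORIES[:3])
print(missing_categories(sample))
```

With the sample above, the two uncovered categories are reported in the order they appear in REQUIRED_CATEGORIES.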

Validation

uv run --frozen python tools/validate_frontierfail_packet.py .

The gate fails when:

  • the packet claims public benchmark eligibility;
  • required categories have zero cases;
  • synthetic seed rows are marked benchmark eligible;
  • sourced benchmark-eligible rows lack reviewable evidence;
  • required row fields are missing or empty;
  • expected decisions fall outside the supported decision set.

This keeps FrontierFail useful for engineering regression work without blurring the line between seed fixtures and independently validated production-failure benchmarks.
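
The packet-level and row-level conditions in the gate list can be sketched as follows. Every name here is an assumption for illustration: the actual required fields, the supported decision set, and the packet key that signals a public eligibility claim are defined by the packet files and validator, and are not reproduced in this document.

```python
# Hypothetical names throughout; the real field list, decision set, and
# packet key are not stated in this document.
REQUIRED_FIELDS = ("id", "category", "source_type", "expected_decision")
SUPPORTED_DECISIONS = {"allow", "flag", "block"}


def row_errors(row: dict) -> list[str]:
    """Check one row for missing or empty fields and unsupported decisions."""
    errors = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if row.get(name) in (None, "")
    ]
    decision = row.get("expected_decision")
    if decision not in (None, "") and decision not in SUPPORTED_DECISIONS:
        errors.append(f"unsupported decision: {decision}")
    return errors


def packet_errors(packet: dict) -> list[str]:
    """Reject packet metadata that claims public benchmark eligibility."""
    if packet.get("public_benchmark_eligible", False):
        return ["packet claims public benchmark eligibility"]
    return []


ok_row = {
    "id": "ff-001",
    "category": "fabricated_policy",
    "source_type": "synthetic_regression",
    "expected_decision": "flag",
}
print(row_errors(ok_row))
print(row_errors({**ok_row, "expected_decision": "maybe"}))
print(packet_errors({"public_benchmark_eligible": True}))
```

A gate built this way would exit non-zero whenever any of these lists is non-empty, which matches the fail-closed behavior described above.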