Chain-of-thought arithmetic verification¶
Status: core (Apache-2.0), deterministic, on by default in reasoning checks. Every recognised equation is evaluated and compared to its asserted result — no model, no heuristic, no tolerance games beyond floating-point rounding.
A model can reason fluently and still get the sums wrong. Where
numeric verification checks numeric plausibility (percentage
changes, date logic, magnitudes), this checks whether the explicit arithmetic in
a chain-of-thought answer actually holds: 3 + 4 = 8 is flagged, 12 × 5 = 60
passes.
How it works¶
- Equations are extracted with a regex —
left = right,left equals right,left is equal to right— where the left side must carry at least one operator, so a plainx is 5is not mistaken for arithmetic. - The left side is evaluated and compared to the asserted right side within a
floating-point tolerance (
math.isclose). - Thousands separators (
1,000), currency ($3,000), parentheses, unary minus, and the Unicode operators× ÷ ·are all understood; scientific notation and anything outside the arithmetic character set make an expression unevaluable (and so simply uncounted, never falsely flagged).
It is wired into the reasoning-chain verifier, so step-logic and arithmetic are checked together:
from director_ai.core import verify_arithmetic, verify_reasoning_chain
verify_arithmetic("Step 2: 3 + 4 = 8.").errors # -> [3 + 4 = 8]
result = verify_reasoning_chain("Step 1: ... Step 2: 3 + 4 = 8. ...")
result.math_errors # arithmetic mistakes
result.chain_valid # False — a wrong sum is a chain issue
Disable the arithmetic pass with verify_reasoning_chain(text, check_math=False).
Polyglot backend¶
The evaluator runs through the Rust rust_eval_arithmetic kernel
(backfire_core::compute::eval_arithmetic) with a pure-Python ast fallback.
The two are bit-for-bit identical: the same IEEE-754 doubles, the same
precedence and left-associativity, the same character set, and any non-finite
result (division by zero, overflow) reported as NaN on both paths. The dispatch
is therefore purely a speed choice. See Rust Acceleration.
Measured¶
python -m benchmarks.math_consistency:
| Metric | Value |
|---|---|
| Detection accuracy (labelled CoT snippets, n=12) | 1.00 |
| Error precision / recall | 1.00 / 1.00 |
| Rust ↔ Python parity | exact (NaN included) |
| Rust speed-up over Python | ~16× |
Detection is exact because the check is deterministic arithmetic, not a learned
judgement — the only failure mode is an equation the extractor does not
recognise, which is counted as "not checked", never as "correct". Numbers come
from the committed benchmark and benchmarks/results/math_consistency.json.