SC-NeuroCore Benchmarks¶
Performance measurements for SC-NeuroCore. Historical core engine rows come from v3.13.3, FPGA synthesis additions from v3.14.0, and documentation/release evidence is current for v3.15.8. All Python numbers are CPU-only unless a GPU environment is named. Rust numbers use Criterion and must be read with their recorded SIMD/environment context.
Evidence boundary¶
This page is a curated benchmark report, not an authority for untracked local measurements or marketing estimates. Every quoted performance, power, utilisation, timing, or
hardware-efficiency number must remain traceable to a committed raw artefact:
JSON or CSV under benchmarks/results/, a named tool report under hdl/ or
build/, or a companion paper artefact with command and environment
provenance.
When a newer local run exists but its raw output is not committed, cite it only as local exploratory evidence. Do not promote it to README, roadmap, release, or paper claims until the raw artefact and environment record are committed.
The 2026-06-04 local Python/Rust precision benchmark artefacts were captured on a workstation under concurrent load and without an exclusive isolated CPU core set. Treat those medians as contract/regression context, not final throughput claims. Any production performance claim must be rerun on reserved isolated cores, with CPU affinity, host-load, governor, and frequency evidence recorded in the raw artefact.
The 2026-06-05 refreshed mixed-dense, block-floating, and precision-envelope
artefacts were executed with process affinity pinned to CPUs 8-9. The raw
JSON records host load, affinity, CPU governor, and sampled frequency context,
but the workstation still did not expose kernel-reserved isolated cores to this
user session. Treat the refreshed medians as local regression evidence and the
proof fields as the authoritative contract evidence.
The 2026-06-05 live-control AXI4-Lite/PCIe-MMIO rerun was executed with
process affinity pinned to CPUs 8-9, and the raw artefact records that the
process affinity matched the requested benchmark cpuset. The workstation did
not expose kernel-reserved isolated cores to this user session, so these numbers
remain local regression evidence rather than production throughput claims.
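The host context described above (affinity, load, governor, frequency, and whether kernel-isolated cores exist) can be captured alongside a benchmark result. The following is a minimal sketch, assuming a Linux host with the standard sysfs cpufreq layout; the function name and the JSON field names are hypothetical and are not the fields used by the committed artefacts.

```python
import json
import os
from pathlib import Path


def read_text_or_none(path):
    """Return stripped file contents, or None when the file is unavailable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def capture_host_context(requested_cpus):
    """Collect affinity, load, governor, and frequency context (hypothetical schema)."""
    requested = sorted(requested_cpus)
    # sched_getaffinity is Linux-only; report None on other platforms.
    affinity = sorted(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None
    cpufreq = "/sys/devices/system/cpu/cpu{}/cpufreq/{}"
    return {
        "requested_cpus": requested,
        "process_affinity": affinity,
        "affinity_matches_request": affinity == requested,
        "load_average": list(os.getloadavg()) if hasattr(os, "getloadavg") else None,
        "governor": {str(c): read_text_or_none(cpufreq.format(c, "scaling_governor")) for c in requested},
        "frequency_khz": {str(c): read_text_or_none(cpufreq.format(c, "scaling_cur_freq")) for c in requested},
        # Empty string or None means no kernel-reserved isolated cores are visible.
        "isolated_cpus": read_text_or_none("/sys/devices/system/cpu/isolated"),
    }


context = capture_host_context([8, 9])
print(json.dumps(context, indent=2))
```

Recording `affinity_matches_request` and `isolated_cpus` separately keeps the distinction made above visible in the raw data: a pinned process is not the same as a kernel-isolated core.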
The module-level pytest throughput checks use load-tolerant smoke thresholds
unless SC_NEUROCORE_STRICT_THROUGHPUT=1 is set. Use strict mode only for
isolated-core benchmark captures and commit the raw artefact before treating the
numbers as release evidence.
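One way such a switch can be wired is sketched below. Only the environment variable name comes from this page; the helper name `throughput_floor` and the `smoke_fraction` value are hypothetical and do not describe the thresholds used by the actual test suite.

```python
import os


def throughput_floor(strict_floor, smoke_fraction=0.5):
    """Return the minimum acceptable throughput for the current mode.

    strict_floor is the isolated-core expectation; smoke_fraction (an assumed
    value) relaxes it for loaded workstations and shared CI runners.
    """
    strict = os.environ.get("SC_NEUROCORE_STRICT_THROUGHPUT") == "1"
    return strict_floor if strict else strict_floor * smoke_fraction


def check_throughput(measured, strict_floor):
    """Raise AssertionError when the measurement is below the active floor."""
    floor = throughput_floor(strict_floor)
    assert measured >= floor, f"throughput {measured} below floor {floor}"
```

Reading the variable at call time rather than import time lets a single test session switch modes per benchmark capture.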
The 2026-06-04 AER priority queue benchmark follows the same boundary. It was
run under a temporary runtime cpuset shield: system/user slices were moved off
the benchmark cores, the benchmark ran in its own benchmark.slice, and the raw
artefact records the process affinity plus cgroup effective CPU set. This is
stronger than taskset pinning but still distinct from boot-time
isolcpus/nohz_full kernel isolation. The artefact is
Python/SystemVerilog contract evidence for strict-priority ordering, FIFO ties,
backpressure drops, critical-deadline traps, and Yosys RTL elaboration; it is
not an FPGA throughput or latency measurement.
v3.14.0 additions: SHD FPGA synthesis on Zynq XC7Z020 uses 1,317 LUT (2.5%) and 848 FF (0.8%), with WNS +4.048 ns at 100 MHz. See
hdl/reports/vivado_util_xc7z020_100mhz.rpt for the full Vivado report.
1. Environment¶
| Field | Value |
|---|---|
| Date | 2026-03-15 |
| Git tag | v3.13.3 |
| OS | Windows 11 Pro 10.0.26200 |
| CPU | Intel Core i5-11600K (6C/12T, 3.9 GHz base, AVX-512, DL Boost) |
| RAM | 32 GB DDR4-3200 |
| Python | 3.12.5 |
| NumPy | 1.26.4 |
| Rust | 1.86.0 (stable) |
| SIMD tier | avx512-vpopcntdq |
| GPU (NVIDIA) | GeForce GTX 1060 6GB — Pascal sm_61, PyTorch 2.6.0+cu124 |
| GPU (AMD) | Radeon RX 6600 XT — no ROCm on Windows, torch-directml incompatible |
| CuPy | unavailable — CUDA Toolkit 13.1 dropped Pascal (sm_61) support |
2. Scalar Primitives (Python)¶
| Operation | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| LFSR step (16-bit) | 1,000,000 | 0.8 | 1.33 Mstep/s |
| Bitstream encoder step | 1,000,000 | 0.9 | 1.10 Mstep/s |
| LIF neuron step (Q8.8) | 1,000,000 | 0.9 | 1.07 Mstep/s |
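For readers unfamiliar with the primitives timed above, here is a minimal sketch of a 16-bit LFSR step and a comparator-style bitstream encoder. The tap positions (16, 14, 13, 11) and the seed are a common textbook choice and are an assumption; this page does not state which polynomial SC-NeuroCore uses.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (assumed taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def encode_step(state, probability):
    """Emit one bitstream bit: 1 when the new LFSR state is below the threshold."""
    state = lfsr16_step(state)
    threshold = round(probability * 65536)
    return state, int(state < threshold)


def lfsr16_period(seed=0xACE1):
    """Count steps until the register returns to its seed."""
    state = lfsr16_step(seed)
    steps = 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


print(lfsr16_period())  # 65535, the maximal period for a 16-bit register
```

Because the register visits every non-zero 16-bit state once per period, the fraction of ones over a full period tracks the encoded probability closely.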
3. Packed Bitstream Operations (Python/NumPy)¶
| Operation | Size | Iterations | Latency (µs) | Throughput |
|---|---|---|---|---|
| pack_bitstream 1-D | 1,024 | 10,000 | 8.7 | 0.12 Gbit/s |
| pack_bitstream 1-D | 65,536 | 2,000 | 123.1 | 0.53 Gbit/s |
| pack_bitstream 2-D | 64×1,024 | 2,000 | 121.6 | 0.54 Gbit/s |
| vec_and | 1,024 words | 50,000 | 1.6 | 41.0 Gbit/s |
| vec_popcount SWAR | 1,024 words | 50,000 | 30.2 | 2.17 Gbit/s |
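The SWAR (SIMD within a register) technique named in the last row counts set bits with masks and shifts instead of a per-bit loop. The sketch below shows the standard 64-bit form on a NumPy array; it illustrates the technique and is not a copy of the project's `vec_popcount`.

```python
import numpy as np


def popcount_swar(words):
    """Per-word population count of a uint64 array using SWAR masks."""
    x = np.asarray(words, dtype=np.uint64).copy()
    m1 = np.uint64(0x5555555555555555)
    m2 = np.uint64(0x3333333333333333)
    m4 = np.uint64(0x0F0F0F0F0F0F0F0F)
    h01 = np.uint64(0x0101010101010101)
    x = x - ((x >> np.uint64(1)) & m1)                # 2-bit partial sums
    x = (x & m2) + ((x >> np.uint64(2)) & m2)         # 4-bit partial sums
    x = (x + (x >> np.uint64(4))) & m4                # 8-bit partial sums
    # The multiply gathers all byte sums into the top byte (wraps modulo 2**64).
    return (x * h01) >> np.uint64(56)


sample = np.array([0, 1, 0xFFFFFFFFFFFFFFFF, 0x8000000000000001], dtype=np.uint64)
print(popcount_swar(sample))
```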
4. Dense Layer Forward Pass (Python)¶
| Configuration | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| 16×8, L=256 | 500 | 352.7 | 0.09 GOP/s (SC) |
| 64×32, L=1,024 | 100 | 2,405.8 | 0.87 GOP/s (SC) |
AER Priority Queue Backpressure Contract (2026-06-04)¶
This benchmark covers the NEU-C.4 event-control contract: AER fanout packets with lower numeric priority must overtake best-effort packets while preserving FIFO order within equal priority classes. The committed artefact also records finite-capacity backpressure, sticky drop traps, sticky critical-deadline traps, CPU affinity, cgroup effective CPU set, host load, CPU governor, and Yosys elaboration time.
| Path | Workload | Result | Raw evidence |
|---|---|---|---|
| Python reference model | 4,096 deterministic events x 100 repeats | 4.138 us/event under runtime cpuset shield 10-11; priority_violations=0, fifo_tie_violations=0 | benchmarks/results/local_python_2026-06-04_aer_priority_queue.json |
| SystemVerilog sc_aer_priority_queue | Yosys synthesis/elaboration | yosys.exit_code=0, 6,364 cells in the artefact | benchmarks/results/local_python_2026-06-04_aer_priority_queue.json |
No Rust, Julia, Go, or Mojo counterpart exists for this HDL-only queue surface as of 2026-06-04. Cross-language comparison therefore means Python reference contract versus SystemVerilog RTL elaboration/simulation for this task.
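The ordering and backpressure rules in this contract can be illustrated with a small heap-based model. This is a sketch under stated assumptions: the class and attribute names are hypothetical, the critical-deadline trap is omitted, and it is not the committed Python reference model.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class AerPriorityQueue:
    """Finite-capacity strict-priority queue with FIFO ties (illustrative only)."""
    capacity: int
    drop_trap: bool = False          # sticky: stays set once any packet is dropped
    dropped: int = 0
    _heap: list = field(default_factory=list)
    _seq: int = 0                    # arrival counter used as the FIFO tie-breaker

    def push(self, priority, event):
        """Enqueue an event; return False and latch the trap when full."""
        if len(self._heap) >= self.capacity:
            self.drop_trap = True
            self.dropped += 1
            return False
        heapq.heappush(self._heap, (priority, self._seq, event))
        self._seq += 1
        return True

    def pop(self):
        """Dequeue the lowest numeric priority, oldest first within a class."""
        priority, _, event = heapq.heappop(self._heap)
        return priority, event


queue = AerPriorityQueue(capacity=4)
for priority, name in [(2, "a"), (0, "b"), (2, "c"), (0, "d")]:
    queue.push(priority, name)
print([queue.pop()[1] for _ in range(4)])  # ['b', 'd', 'a', 'c']
```

The `(priority, sequence)` key is what gives both properties at once: lower numeric priority sorts first, and the arrival counter preserves FIFO order among equal priorities.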
Live-control AXI4-Lite / PCIe-MMIO Register Window (2026-06-05)¶
This benchmark covers the live-parameter update contract for hot-swappable weights and Kuramoto coupling parameters. Both protocols use the same CRC32-gated shadow-bank core: AXI4-Lite exposes the core directly, while the PCIe path emits a PCIe-MMIO register-window adapter that expects upstream PCIe hard IP to present decoded single-clock MMIO strobes.
| Path | Workload | Result | Raw evidence |
|---|---|---|---|
| Python update-sequence builder | 20,000 deterministic staged writes x 7 repeats | AXI4-Lite median 14.139 us/sequence; PCIe-MMIO median 14.216 us/sequence under process affinity 8-9 | benchmarks/results/local_python_2026-06-04_live_control_updates.json |
| SystemVerilog AXI4-Lite core | Generated trap-capture simulation | trap_capture.passed=true; staged overflow and underflow traps latched without mutating active coefficients | benchmarks/results/local_python_2026-06-04_live_control_updates.json |
| SystemVerilog PCIe-MMIO wrapper | Generated commit simulation | pcie_mmio_commit_capture.passed=true; partial write strobes raise sticky partial_write, stale CRC32 guard raises sticky checksum_mismatch, invalid bank selection and invalid active readback raise sticky invalid_selection, read-only bank writes raise sticky read_only_bank, and retargeting selection registers after shadow load cannot redirect the committed bank | benchmarks/results/local_python_2026-06-04_live_control_updates.json |
No Rust, Julia, Go, or Mojo counterpart exists for this HDL bus-adapter surface as of 2026-06-04. Cross-language comparison therefore means Python control contract generation versus SystemVerilog RTL simulation for AXI4-Lite and PCIe-MMIO.
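The CRC32-gated shadow-bank idea can be sketched in a few lines: writes land in a shadow copy, and the active coefficients change only when the host-supplied checksum matches the staged payload. The class name, the little-endian 32-bit word packing, and the method names below are assumptions for illustration; the real register map and CRC coverage are defined by the HDL core.

```python
import struct
import zlib


def crc32_of(words):
    """CRC32 over words packed as little-endian signed 32-bit values (assumed layout)."""
    return zlib.crc32(struct.pack(f"<{len(words)}i", *words))


class ShadowBank:
    """Stage writes in a shadow bank and commit them only behind a CRC32 guard."""

    def __init__(self, size):
        self.active = [0] * size
        self.shadow = [0] * size
        self.checksum_mismatch = False   # sticky trap

    def stage(self, index, value):
        """Write one coefficient into the shadow bank; active is untouched."""
        self.shadow[index] = value

    def commit(self, expected_crc32):
        """Swap shadow into active only when the checksum matches."""
        if crc32_of(self.shadow) != expected_crc32:
            self.checksum_mismatch = True
            return False
        self.active = list(self.shadow)
        return True


bank = ShadowBank(4)
for i, value in enumerate([10, -20, 30, -40]):
    bank.stage(i, value)
print(bank.commit(crc32_of([10, -20, 30, -40])), bank.active)
```

A stale or wrong checksum leaves `active` unchanged and latches the trap, which mirrors the "trap without mutating active coefficients" behaviour reported in the table.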
Mixed Q8.8/Q16.16 Dense Contract (2026-06-04)¶
This benchmark covers the deterministic dense mixed-precision contract: stored
Q8.8 weights, Q16.16 inputs and outputs, signed arithmetic product scaling, and
explicit saturation/overflow handling. The Python path is the deployment
reference and manifest writer; the Rust path is the low-latency integer mirror.
The Python path is run with QFormatMixed(scale_per_tensor=False) for this
benchmark so the Python, Rust, and HDL surfaces share the same raw Q8.8/Q16.16
arithmetic contract instead of Python-only per-tensor rescaling.
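To make the arithmetic concrete, the sketch below multiplies Q8.8 weight codes by Q16.16 input codes, rescales each product by 8 fractional bits, and saturates the accumulated sum to a signed 32-bit Q16.16 code. Whether the rescale happens per product or after accumulation, and the rounding mode, are assumptions here; the function name is hypothetical and this is not the `CompiledMixedDense` implementation.

```python
Q16_MAX = 2**31 - 1
Q16_MIN = -(2**31)


def mixed_dense_sketch(weights_q88, inputs_q1616):
    """Return (Q16.16 output codes, per-lane overflow flags) for integer codes."""
    outputs, overflow_vector = [], []
    for row in weights_q88:
        acc = 0
        for w, x in zip(row, inputs_q1616):
            # Q8.8 x Q16.16 has 24 fractional bits; an arithmetic right shift
            # by 8 (floor) brings the product back to Q16.16.
            acc += (w * x) >> 8
        saturated = min(max(acc, Q16_MIN), Q16_MAX)
        outputs.append(saturated)
        overflow_vector.append(saturated != acc)
    return outputs, overflow_vector


# 1.0 * 2.0 + 0.5 * 4.0 = 4.0, i.e. 4 * 65536 in Q16.16.
codes, flags = mixed_dense_sketch([[256, 128]], [2 * 65536, 4 * 65536])
print(codes, flags)  # [262144] [False]
```

Returning one flag per output lane, rather than a single Boolean, is what allows an exact overflow count to be reported downstream.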
| Path | Workload | Median | Raw evidence |
|---|---|---|---|
| Python CompiledMixedDense.forward_accumulator_codes | 64×32 dense, 2,000 calls × 7 repeats | 51.934 µs/call | benchmarks/results/local_python_2026-06-04_mixed_dense.json |
| Python CompiledMixedDense.forward_with_overflow | Same deterministic matrix/vector | 51.104 µs/call | benchmarks/results/local_python_2026-06-04_mixed_dense.json |
| NumPy float64 dot baseline | Same deterministic matrix/vector | 1.656 µs/call | benchmarks/results/local_python_2026-06-04_mixed_dense.json |
| Rust mixed_dense_q88_q1616 | 64×32 dense, 20,000 calls × 7 repeats | 2.659 µs/call | benchmarks/results/local_rust_2026-06-04_mixed_dense.json |
| HDL sc_mixed_precision_dense Yosys RTLIL stat | Default 64×32 parameters | 12,708 cells, 2,048 multipliers | hdl/reports/yosys_mixed_precision_dense_2026-06-04.json |
The Python mixed path reconstructed the float64 dot product with maximum
absolute error 0.0 on the committed deterministic workload.
The Python and Rust artefacts both recorded safe-workload overflow count 0
and saturating-probe overflow count 32, matching the lane-level HDL
overflow_vector contract. The same artefacts now record conservative
precision-envelope telemetry: Python and Rust safe max absolute bound 531400,
and saturating-probe max absolute bound
17454214414336; the HDL exports the matching per-output abs_bounds_q1616
vector. The refreshed Python and Rust artefacts also prove the signed
symmetric fixed-point width contract: the safe workload requires 21 signed
total bits, 5 Q16.16 integer bits, and has 11 bits of headroom; the
saturating probe requires 45 signed total bits and 29 Q16.16 integer bits,
has -13 bits of headroom, and records saturation_required=true.
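The width figures above are consistent with a simple rule: a signed symmetric range needs one sign bit plus enough magnitude bits to hold the bound, and headroom is measured against a 32-bit Q16.16 container. The sketch below reproduces the quoted numbers under that assumption; the function name and dictionary keys are hypothetical, and the artefacts remain the authoritative source for the proof fields.

```python
def signed_width_proof(max_abs_bound, frac_bits=16, container_bits=32):
    """Derive width fields from a conservative absolute bound (assumed rule)."""
    total = max_abs_bound.bit_length() + 1      # magnitude bits plus sign bit
    return {
        "signed_total_bits": total,
        "integer_bits": total - frac_bits,
        "headroom_bits": container_bits - total,
        "saturation_required": total > container_bits,
    }


print(signed_width_proof(531400))
# {'signed_total_bits': 21, 'integer_bits': 5, 'headroom_bits': 11, 'saturation_required': False}
print(signed_width_proof(17454214414336))
# {'signed_total_bits': 45, 'integer_bits': 29, 'headroom_bits': -13, 'saturation_required': True}
```

Negative headroom is the signal that the workload cannot fit the container without saturating, which is why the probe records saturation_required=true.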
Block-Floating Dense Contract (2026-06-04)¶
This benchmark covers dense BFP16E3X32 weights with Q16.16 inputs and
saturated Q16.16 outputs. The shared exponent path preserves larger dynamic
range per block than canonical Q8.8 weights, at the cost of dynamic shifts in
the HDL datapath.
| Path | Workload | Median | Raw evidence |
|---|---|---|---|
| Python CompiledBlockFloatingDense.forward_accumulator_codes | 64×32 dense, 2,000 calls × 7 repeats | 38.760 µs/call | benchmarks/results/local_python_2026-06-04_block_floating_dense.json |
| Python CompiledBlockFloatingDense.forward_with_overflow | Same deterministic matrix/vector | 42.356 µs/call | benchmarks/results/local_python_2026-06-04_block_floating_dense.json |
| NumPy float64 dot baseline | Same deterministic matrix/vector | 1.029 µs/call | benchmarks/results/local_python_2026-06-04_block_floating_dense.json |
| Rust block_floating_dense_q16 | 64×32 dense, 20,000 calls × 7 repeats | 11.471 µs/call | benchmarks/results/local_rust_2026-06-04_block_floating_dense.json |
| HDL sc_block_floating_dense Yosys RTLIL stat | Parameterised 2×2, BLOCK_SIZE=2 elaboration copy | 96 cells, 4 multipliers | hdl/reports/yosys_block_floating_dense_2026-06-04.json |
The deterministic block-floating workload recorded maximum absolute error
0.22306060791015625 versus float64 dot. This reflects BFP16E3 block-scale
quantisation on the committed synthetic workload, not runtime nondeterminism.
The Python and Rust artefacts both recorded safe-workload overflow count 0
and saturating-probe overflow count 32, matching the lane-level HDL
overflow_vector contract. Both languages now compare the same deterministic
BFP contract: mantissa checksum -15, exponent checksum 0, exponent code
range [0, 0], safe max absolute bound 610816, and saturating-probe max
absolute bound 1125865547104256. The refreshed proof fields record safe
width 21 signed total bits, 5 Q16.16 integer bits, and 11 bits of
headroom; the saturating probe records 51 signed total bits, 35 Q16.16
integer bits, -19 bits of headroom, and saturation_required=true. The
64×32 payload records
parameter_count=2048 and block_exponent_count=64 in both the Python
manifest and Rust artefact. The HDL exports per-output
abs_bounds_q1616; the
full-size 64×32 block-floating Yosys frontend path is documented as toolchain
debt because Yosys 0.33 elaborates the default procedural loops during
read_verilog before chparam can reduce the dimensions.
The same Python and Rust artefacts now include a seeded BFP16E3X2 exponent
edge sweep. Both languages agree on safe exponent codes [0, 7, 0, 7], safe
Q16.16 output codes [1056736, -1069024], safe overflow and underflow counts
0, conservative safe bound 1069024, and headroom 2146414623. The
max-exponent saturation probe records exponent code [7], saturated output
code [2147483647], overflow count 1, underflow count 0, and conservative
bound 2251662376828928. The safe edge sweep requires 22 signed total
bits and 6 Q16.16 integer bits with 10 bits of headroom; the max-exponent
saturation probe requires 52 signed total bits and 36 Q16.16 integer bits
with -20 bits of headroom, proving that max shared-exponent payloads trap
rather than silently wrapping.
Precision Trap Reports (2026-06-04)¶
This benchmark covers the saturation telemetry path for mixed Q8.8/Q16.16 and block-floating dense outputs. The workload intentionally saturates all 32 output channels so the trap report has to preserve an exact overflow count rather than only a collapsed Boolean.
| Path | Workload | Median | Raw evidence |
|---|---|---|---|
| Python mixed precision_trap_report | 64×32 dense, 2,000 calls × 7 repeats | 44.744 µs/call | benchmarks/results/local_python_2026-06-04_precision_traps.json |
| Python BFP precision_trap_report | 64×32 dense, 2,000 calls × 7 repeats | 45.906 µs/call | benchmarks/results/local_python_2026-06-04_precision_traps.json |
| Rust mixed PrecisionTrapReport | 64×32 dense, 20,000 calls × 7 repeats | 2.506 µs/call | benchmarks/results/local_rust_2026-06-04_precision_traps.json |
| Rust BFP PrecisionTrapReport | 64×32 dense, 20,000 calls × 7 repeats | 8.777 µs/call | benchmarks/results/local_rust_2026-06-04_precision_traps.json |
| HDL sc_precision_overflow_trap Yosys stat | Default TRAP_WIDTH=1 | 3 cells, 8 wire bits | hdl/reports/yosys_precision_overflow_trap_2026-06-04.json |
The committed Python and Rust trap workloads both report
mixed_overflow_count=32 and bfp_overflow_count=32, matching the number of
output channels. The HDL trap primitive synthesises to one $adff, one
$mux, and one $or cell at the default width.
The 2026-06-05 rerun also records matched sub-LSB underflow probes:
mixed_underflow_count=32 and bfp_underflow_count=32 in both Python and
Rust, while the saturating overflow workloads retain underflow_count=0.
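The difference between a sticky flag and an exact count can be shown with a small accumulator. This is an illustrative sketch only: the class and field names are hypothetical and do not reproduce the `precision_trap_report` or `PrecisionTrapReport` schemas.

```python
from dataclasses import dataclass


@dataclass
class PrecisionTrapSketch:
    """Accumulate per-lane overflow and underflow flags into sticky traps and counts."""
    overflow_trap: bool = False      # sticky, like an OR-fed register
    underflow_trap: bool = False     # sticky
    overflow_count: int = 0
    underflow_count: int = 0

    def observe(self, overflow_vector, underflow_vector):
        """Fold one forward pass worth of lane flags into the report."""
        overflows = sum(bool(f) for f in overflow_vector)
        underflows = sum(bool(f) for f in underflow_vector)
        self.overflow_count += overflows
        self.underflow_count += underflows
        self.overflow_trap = self.overflow_trap or overflows > 0
        self.underflow_trap = self.underflow_trap or underflows > 0


report = PrecisionTrapSketch()
report.observe([True] * 32, [False] * 32)    # all 32 channels saturate
report.observe([False] * 32, [False] * 32)   # a clean pass does not clear the trap
print(report.overflow_trap, report.overflow_count, report.underflow_count)  # True 32 0
```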
Precision Envelope Reports (2026-06-04)¶
This benchmark covers the conservative predeployment envelope path for mixed
Q8.8/Q16.16 and block-floating dense outputs. The Python and Rust workloads
use matched fixed-point codes and report the same maximum absolute bounds:
132850 for the mixed dense workload and 78032768 for the block-floating
workload.
| Path | Workload | Median | Raw evidence |
|---|---|---|---|
| Python mixed precision_envelope_report | 64×32 dense, 2,000 calls × 7 repeats | 87.578 µs/call | benchmarks/results/local_python_2026-06-04_precision_envelopes.json |
| Python BFP precision_envelope_report | 64×32 dense, 2,000 calls × 7 repeats | 90.475 µs/call | benchmarks/results/local_python_2026-06-04_precision_envelopes.json |
| Rust mixed PrecisionEnvelopeReport | 64×32 dense, 20,000 calls × 7 repeats | 2.991 µs/call | benchmarks/results/local_rust_2026-06-04_precision_envelopes.json |
| Rust BFP PrecisionEnvelopeReport | 64×32 dense, 20,000 calls × 7 repeats | 9.874 µs/call | benchmarks/results/local_rust_2026-06-04_precision_envelopes.json |
| HDL sc_precision_envelope_guard Yosys stat | Default N_OUTPUTS=32 | 67 cells, 1,701 wire bits | hdl/reports/yosys_precision_envelope_guard_2026-06-04.json |
Both Python and Rust envelope reports returned
conservative_overflow_free=true for the committed safe workload. The HDL
guard synthesises to two $adff, thirty-two $gt, thirty-two $mux, and one
$reduce_or cell at the default width.
The raw artefacts now include observed_underflow_free=true for the safe
workload and matched underflow probes with underflow_count=32 for mixed and
BFP dense paths in both Python and Rust. The refreshed manifests expose
proof_kind=signed_symmetric_fixed_point_width: mixed dense requires 19
signed total bits, 3 Q16.16 integer bits, and has 13 bits of headroom,
while block-floating dense requires 28 signed total bits, 12 Q16.16
integer bits, and has 4 bits of headroom.
5. Full Pipeline (Python)¶
encode → AND synapse → popcount → LIF neuron
| Configuration | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| 4 synapses, 256 steps | 200 | 1,830 | 139.9 Kstep/s |
| 16 synapses, 256 steps | 50 | 8,679 | 29.5 Kstep/s |
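The four pipeline stages can be sketched with unpacked NumPy bitstreams. The sketch assumes Bernoulli encoding from a seeded generator and a simple leaky integrator; the function names and the leak, threshold, and reset defaults are hypothetical and are not the benchmarked implementation.

```python
import numpy as np


def encode(probability, length, rng):
    """Encode a probability as a random bitstream of the given length."""
    return rng.random(length) < probability


def sc_multiply(x, w, length=4096, seed=0):
    """Stochastic multiply: AND two independent streams, then popcount."""
    rng = np.random.default_rng(seed)
    x_bits = encode(x, length, rng)
    w_bits = encode(w, length, rng)
    return np.count_nonzero(x_bits & w_bits) / length


def lif_step(v, current, leak=0.9, threshold=1.0, v_reset=0.0):
    """One leaky integrate-and-fire update; returns (new_v, spiked)."""
    v = leak * v + current
    if v >= threshold:
        return v_reset, True
    return v, False


current = sc_multiply(0.5, 0.5, length=65536)   # close to 0.25
v, spiked = lif_step(0.0, current)
print(round(current, 3), spiked)
```

The AND gate computes a product because two independent streams are both 1 with probability x·w; longer streams reduce the variance of that estimate at the cost of more bits per value.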
6. GPU Backend¶
6a. Local CPU fallback (NumPy, no CuPy)¶
| Operation | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| gpu_pack_bitstream (65,536) | 2,000 | 375.9 | 0.17 Gbit/s |
| gpu_vec_mac (64×32×16w) | 1,000 | 736.4 | 2.85 GOP/s |
6b. Cloud GPU — NVIDIA RTX A6000 (48 GB, CUDA 12.6)¶
Environment: JarvisLabs A6000, Xeon Silver 4216 (64 vCPU), PyTorch 2.6.0+cu124. 1000 ms simulation, 3 runs, AI regime (conn_prob=0.1).
| Neurons | Synapses | Wall (s) | Rate (Hz) | Syn events/s | Peak RSS |
|---|---|---|---|---|---|
| 1,000 | 100K | 1.55 | 99.0 | 3.2 M | 12 MB |
| 2,000 | 400K | 1.80 | 85.5 | 9.5 M | 24 MB |
| 5,000 | 2.5M | 2.74 | 63.6 | 29.0 M | 104 MB |
| 20,000 | 40M | 8.80 | 26.1 | 59.2 M | 775 MB |
| 50,000 | 250M | 35.4 | 14.7 | 51.9 M | 4,793 MB |
Source: benchmarks/results/jarvislabs_a6000/gpu_large_scale.json,
benchmarks/results/jarvislabs_a6000/scaling_4regime.json.
7. Rust Engine — Criterion Results¶
All benchmarks run with cargo bench --manifest-path engine/Cargo.toml on
AVX-512 hardware. Times are Criterion medians.
Updated 2026-04-05 with UpCloud EPYC 9575F measurements. Previous values (25.4 µs, 446 µs) were from an unidentified earlier run.
Bitstream Packing (1M bits = 1,048,576 bits)¶
| Variant | Time | Throughput | vs. Python |
|---|---|---|---|
| pack (scalar) | 897 µs | 1.17 Gbit/s | 2.2× |
| pack_fast (u64 chunks) | 286 µs | 3.67 Gbit/s | 7× |
| pack_dispatch (AVX-512) | 8.85 µs | 113 Gbit/s | 46.6× |
Popcount (16,384 u64 words = 1M bits)¶
| Variant | Time | Throughput | vs. Python |
|---|---|---|---|
| popcount_portable | 12.1 µs | 86.6 Gbit/s | 2.5× |
| popcount_simd (AVX-512) | 2.86 µs | 366 Gbit/s | 10.6× |
Fused AND+Popcount (16 words)¶
| Variant | Time |
|---|---|
| Scalar (iter + count_ones) | 19.1 ns |
| SIMD dispatch (AVX-512) | 9.58 ns |
Encoder / Neuron¶
| Operation | Time | Throughput |
|---|---|---|
| LFSR encoder (64K steps) | 131 µs | 500 Mstep/s |
| LIF neuron (10K steps) | 47.9 µs | 209 Mstep/s |
| LIF neuron (100K steps) | 219 µs | 456 Mstep/s |
Bernoulli Encoding (1,024-bit packed streams)¶
| Variant | Time | Notes |
|---|---|---|
| bernoulli_stream (unrolled) | 3.99 µs | generate bits then pack |
| bernoulli_stream + pack | 4.79 µs | two-pass |
| bernoulli_packed (ChaCha8) | 4.14 µs | direct packed generation |
| bernoulli_packed_fast (ChaCha8) | 1.72 µs | optimized threshold loop |
| bernoulli_packed_simd (ChaCha8) | 779 ns | SIMD comparison |
| bernoulli_packed_simd (Xoshiro) | 398 ns | fastest: SIMD + fast PRNG |
| encode_and_popcount (Xoshiro) | 285 ns | fused encode+AND+popcount |
Dense Layer (64 inputs → 32 neurons, L=1024)¶
| Variant | Time | vs. Python |
|---|---|---|
| forward (baseline) | 1.22 ms | 2.0× |
| forward_fast (packed) | 337 µs | 7.1× |
| forward_fused (encode+AND+pop) | 1.67 ms | 1.4× |
| forward_prepacked (pre-encoded) | 54.9 µs | 43.8× |
| forward_batch (100 samples) | 13.7 ms | 17.6× per sample |
PRNG Fill (1,024 bytes)¶
| Generator | Time | Throughput |
|---|---|---|
| ChaCha8 | 320 ns | 3.13 GB/s |
| Xoshiro256++ | 191 ns | 5.24 GB/s |
Domain-Specific¶
| Operation | Time |
|---|---|
| Kuramoto solver (100 osc, 1000 steps) | 199 ms |
| Stochastic attention (10×16 → 20×32) | 138 µs |
| Graph layer (20 nodes, 8 features) | 253 µs |
8. v2 (Python) vs v3 (Rust) Speedup¶
SIMD tier: avx512-vpopcntdq. The v3 engine wraps Rust via PyO3.
| Operation | v2 (ms) | v3 (ms) | Speedup |
|---|---|---|---|
| pack_bitstream (1M bits) | 8.26 | 52.66 | 0.2× |
| popcount (1M bits) | 0.12 | 0.44 | 0.3× |
| LIF neuron (10K steps) | 8.58 | 4.48 | 1.9× |
| Dense forward (16→8, L=1024) | 0.49 | 0.18 | 2.7× |
| Dense forward (64→32, L=1024) | 4.48 | 4.05 | 1.1× |
| Dense forward (128→64, L=1024) | 8.02 | 1.10 | 7.3× |
| Attention (10×16 → 20×32) | 0.03 | 0.28 | 0.1× |
| Attention (50×32 → 100×64) | 0.10 | 2.19 | 0.0× |
Geometric mean speedup: 0.5× — across all operations, the Rust FFI path is slower than pure Python on average. PyO3 call overhead (argument marshalling, GIL release/acquire) adds ~50–200 µs per invocation, which dominates when the payload is small.
The Rust engine amortises FFI cost above ~64K bits per call. On payloads >1M bits the SIMD kernel wins decisively (Dense 128→64 at 7.3×). For small networks (<64 neurons), pure Python is faster. The Rust engine targets large-payload inference (>=128 neurons, L>=1024).
The pure-Rust Criterion numbers in Section 7 show true engine throughput without FFI overhead.
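The geometric mean quoted in this section can be recomputed from the table. The sketch below uses the rounded millisecond values printed above, so the result (about 0.55) differs slightly from a calculation on unrounded measurements; the dictionary keys are shorthand labels, not identifiers from the benchmark script.

```python
import math

# (v2 ms, v3 ms) pairs copied from the speedup table.
timings_ms = {
    "pack_bitstream_1M": (8.26, 52.66),
    "popcount_1M": (0.12, 0.44),
    "lif_10k": (8.58, 4.48),
    "dense_16x8": (0.49, 0.18),
    "dense_64x32": (4.48, 4.05),
    "dense_128x64": (8.02, 1.10),
    "attention_small": (0.03, 0.28),
    "attention_large": (0.10, 2.19),
}

speedups = {name: v2 / v3 for name, (v2, v3) in timings_ms.items()}
# Geometric mean: exponential of the mean log-speedup.
geo_mean = math.exp(sum(math.log(s) for s in speedups.values()) / len(speedups))
print(f"geometric mean speedup: {geo_mean:.2f}x")
```

A geometric mean is the appropriate average for ratios because a 0.1× and a 10× result cancel out, which an arithmetic mean would not reflect.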
9. NeuroBench-Aligned Metrics¶
Aligned with the NeuroBench methodology (Yik et al., 2023; arXiv:2304.04640).
| Model | Neurons | SynOps | Act. Sparsity | Latency (µs) | Throughput (MOP/s) | Memory (B) |
|---|---|---|---|---|---|---|
| SCDenseLayer(8×4, L=256) | 4 | 409,600 | 0.00 | 1,293 | 6.3 | 256 |
| SCDenseLayer(16×8, L=512) | 8 | 1,966,080 | 0.00 | 2,446 | 26.8 | 1,024 |
| VectorizedSCLayer(16×8, L=512) | 8 | 3,276,800 | 0.00 | 348 | 188.1 | 1,024 |
| VectorizedSCLayer(64×32, L=1024) | 32 | 41,943,040 | 0.00 | 2,476 | 847.0 | 16,384 |
Activation sparsity is 0.00 because SC outputs are graded probabilities, not binary spikes — every neuron produces a non-zero output on every step.
10. SNN Comparison: Brunel Balanced Network¶
4-Variant Translator Benchmark¶
1000 neurons (800E/200I), conn_prob=0.1, adapted params (weight_exc=5.0 mV,
external_rate=200 Hz), 1000 ms simulation. Delta-PSC semantics: synaptic
events applied as instantaneous voltage jumps (v += w), matching Brian2's
on_pre="v_post += w".
| Variant | Spikes | Rate (Hz) | Brian2 Ratio | Wall (s) |
|---|---|---|---|---|
| Brian2 reference | 1,057,908 | 1057.9 | 1.00 | 1.11 |
| V1 StochasticLIF | 1,725,955 | 1726.0 | 1.63 | 30.23 |
| V2 RateMatched | N/A | 0.049 (prob) | — | 51.81 |
| V3 FixedPoint Q8.8 | 1,722,195 | 1722.2 | 1.63 | 15.41 |
| V4 Hybrid SC+LIF | 1,888,351 | 1888.4 | 1.78 | 46.23 |
Variant descriptions¶
- V1 StochasticLIF: Bug-fixed delta-PSC wiring. Previous benchmark passed input through R * I * dt (diluted by dt=0.1) and omitted v_reset. Fixed: synaptic events as neuron.v += weight, Poisson drive as voltage kicks, v_reset=10.0 passed correctly.
- V2 RateMatched: VectorizedSCLayer in probability domain. Weights mapped to p = w / v_threshold. 100-neuron subset, bitstream_length=1024. Not spike-comparable; mean output probability = 0.0488.
- V3 FixedPoint Q8.8: Hardware-faithful FixedPointLIFNeuron. Params mapped to Q8.8 integers (scale=256). Rate 1.63x Brian2 (higher due to different noise model).
- V4 Hybrid SC+LIF: BitstreamSynapse AND gates → popcount → voltage → StochasticLIFNeuron. Higher rate due to stochastic amplification in the bitstream encoding.
Historical note (v3.9.0, resolved)¶
Prior to v3.10.0, three wiring bugs prevented the Brunel network
from firing. All three were fixed in v3.10.0:
1. v_reset never passed (defaulted to 0.0 instead of 10.0)
2. Delta-PSC diluted through R * I * dt instead of direct v += w
3. Poisson drive fed as steady current instead of voltage kicks
10b. 20-Variant Brunel Translator Suite¶
Adapted Brunel parameters (weight_exc=5.0, ext_rate=200 Hz), 1000 neurons, 1000 ms simulation. Brian2 2.10.1 reference: 748,777 spikes, 748.8 Hz.
| # | Variant | Spikes | Rate (Hz) | Brian2 Ratio | Wall (s) | Note |
|---|---|---|---|---|---|---|
| — | Brian2 reference | 748,777 | 748.8 | 1.00 | 1.60 | |
| V1 | StochasticLIF | 1,725,955 | 1726.0 | 2.31 | 49.33 | delta-PSC baseline |
| V2 | RateMatched | — | 0.0488 (prob) | — | 80.98 | probability domain |
| V3 | FixedPoint Q8.8 | 1,722,195 | 1722.2 | 2.30 | 20.79 | hardware-faithful |
| V4 | Hybrid SC+LIF | 1,571,994 | 1572.0 | 2.10 | 42.46 | bitstream synapse |
| V5 | Izhikevich | 15,331 | 15.3 | 0.02 | 11.49 | burst dynamics |
| V6 | Homeostatic LIF | 1,727,113 | 1727.1 | 2.31 | 39.36 | adaptive threshold |
| V7 | Noisy LIF | 1,714,361 | 1714.4 | 2.29 | 44.62 | noise_std=1.0 |
| V8 | Refractory LIF | 114,317 | 114.3 | 0.15 | 26.56 | 5-step refractory |
| V9 | Post-kick LIF | 1,671,636 | 1671.6 | 2.23 | 36.16 | Brian2 timing |
| V10 | Exact-leak LIF | 1,713,399 | 1713.4 | 2.29 | 32.78 | exp(-dt/tau) |
| V11 | Q16.12 FixedPoint | 464,644 | 464.6 | 0.62 | 18.19 | 32-bit, 12 frac |
| V12 | STDP LIF | 1,689,552 | 1689.6 | 2.26 | 758.21 | 2000 STDP synapses |
| V13 | DotProduct LIF | 497,647 | 9952.9 | — | 261.40 | n=50, bl=256 |
| V14 | Sobol bitstream | 780,390 | 780.4 | 1.04 | 220.46 | low-discrepancy |
| V15 | JAX vectorized | — | — | — | — | skipped: JAX not installed |
| V16 | Recurrent reservoir | — | 0.9997 (prob) | — | 16.21 | probability domain |
| V17 | Memristive defects | — | 48.7 | — | 51.62 | stuck=1%, var=5% |
| V18 | Numba JIT | 1,685,521 | 1685.5 | 2.25 | 5.20 | 9.5× vs V1 |
| V19 | PyTorch CUDA | 1,725,955 | 1726.0 | 2.31 | 5.70 | GTX 1060 6GB |
| V20 | Vectorized NumPy | 1,725,955 | 1726.0 | 2.31 | 10.27 | batch update |
| V21 | Sparse Numba (CSR) | 1,685,521 | 1685.5 | 2.25 | 0.49 | 10% connectivity |
Acceleration comparison (1000 neurons, 1000 ms)¶
| Backend | Wall (s) | Speedup vs V1 |
|---|---|---|
| V1 per-neuron Python | 49.33 | 1.0× |
| V20 vectorized NumPy | 10.27 | 4.8× |
| V18 Numba JIT | 5.20 | 9.5× |
| V19 PyTorch CUDA (GTX 1060) | 5.70 | 8.7× |
| V21 Sparse Numba (CSR) | 0.49 | 100.7× |
| Brian2 (Cython) | 1.60 | 30.8× |
10K neuron scaling¶
| Backend | Wall (s) | Memory |
|---|---|---|
| V18 Numba JIT (dense) | 15.3 | 800 MB (N²×8) |
| V21 Sparse Numba (CSR) | 22.6 | 80 MB (10% nnz) |
| Brian2 (C++ codegen) | 9.6 | sparse (internal) |
At 10K, Brian2's compiled C++ sparse codegen wins. V21 CSR reduces memory 10× but scattered index access prevents SIMD vectorization. The Rust SIMD CSR engine (planned) targets this gap.
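The memory column follows from simple arithmetic, sketched below. The helper names are hypothetical, and reading the 80 MB figure as the float64 values of the non-zero entries only is an assumption; a real CSR matrix also stores column indices and row pointers, which the last helper estimates for 32-bit indices.

```python
def dense_bytes(n, itemsize=8):
    """Bytes for a dense n x n float64 weight matrix."""
    return n * n * itemsize


def csr_value_bytes(n, density, itemsize=8):
    """Bytes for the non-zero values only of an n x n CSR matrix."""
    return round(n * n * density) * itemsize


def csr_index_bytes(n, density, index_size=4):
    """Extra bytes for CSR column indices and row pointers (32-bit indices assumed)."""
    return round(n * n * density) * index_size + (n + 1) * index_size


n = 10_000
print(dense_bytes(n) / 1e6)            # 800.0 MB
print(csr_value_bytes(n, 0.1) / 1e6)   # 80.0 MB
print(csr_index_bytes(n, 0.1) / 1e6)   # index overhead on top of the values
```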
Variant notes¶
- V5 Izhikevich: Low spike rate expected — Izhikevich dynamics (quadratic nonlinearity, v range -65 to +30) respond differently to delta-PSC drive. Tonic baseline current of 5.0 added for sub-threshold depolarization.
- V8 Refractory: 5-step (0.5 ms) dead time reduces max firing rate to ~2000 Hz, cutting observed rate by 15×.
- V11 Q16.12: Higher precision fixed-point produces fewer spikes than Q8.8 due to more accurate leak computation (less rounding-induced depolarization).
- V12 STDP: Online weight learning with 2000 STDP synapses. 15× slower due to per-synapse process_step() calls.
- V14 Sobol: Low-discrepancy bitstream achieves 1.04× Brian2 ratio — closest match to reference among all spiking variants.
- V18/V19/V20: Acceleration variants show 5–10× speedup over per-neuron Python loop. Numba JIT and PyTorch CUDA achieve similar wall times on this workload (1000 neurons); GPU advantage grows with N.
- V21 Sparse Numba: scipy.sparse CSR connectivity. At 1K (10% connectivity): 100× faster than V1 and about 10.6× faster than V18 dense (0.49 s vs 5.20 s). At 10K: 1.5× slower than V18 due to scattered CSR index access, but uses 10× less memory (80 MB vs 800 MB).
11. Advanced Module Performance¶
| Module | Configuration | Latency (100 runs) | Per-run | Key metric |
|---|---|---|---|---|
| Quantum-Classical Hybrid | 64 qubits, L=1024 | 76.8 ms | 0.77 ms | cos²(θ/2) error < 0.03 |
| Event-Based GNN | 100 nodes, 5% density | 6.6 ms | 0.07 ms | 17× sparse reduction |
| Stochastic Transformer | d=64, 4 heads, L=512 | 1,691 ms | 16.9 ms | 196× energy vs FP32 MAC |
| BCI Decoder | 64 ch, 1s signal | 19.5 ms | 0.20 ms | Native bitstream encoding |
| DVS Input Layer | 128×128, 1000 events | 1,249 ms | 12.5 ms | 492× data reduction |
| Chaotic RNG | 100K samples | 13.5 ms | — | 7.42 Msample/s |
| Predictive World Model | 32-dim state, 50-step | 34.5 ms | 0.34 ms | 1000× sample efficiency |
12. FPGA Resource Utilization¶
Synthesis tooling (tools/yosys_synth.py) targets Xilinx 7-series via Yosys
synth_xilinx. Yosys is not installed on this machine; run when available:
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown
CI runs this command with --allow-skips so timeout-limited hosted runners
still publish benchmarks/results/yosys_synth.json as evidence. A skipped
module is not a hardware timing claim; it records that the bounded CI runner
could not complete that module inside the configured synthesis timeout. Local
or release-grade FPGA evidence should rerun without --allow-skips on an
isolated synthesis host.
Target modules: sc_bitstream_encoder, sc_lif_neuron, sc_bitstream_synapse,
sc_dotproduct_to_current, sc_firing_rate_bank, sc_dense_layer_core,
sc_neurocore_top.
Estimated: sc_bitstream_encoder < 100 LUTs (pending Yosys validation).
13. Bitstream Length Scaling (32x16 Dense)¶
Fixed network: 32 inputs, 16 neurons. Mean of 5 runs per length. Expected: roughly linear scaling (2x L = 2x time).
| L | Mean Time (ms) | Throughput (Mbit/s) |
|---|---|---|
| 128 | 0.43 | 151 |
| 256 | 0.66 | 197 |
| 512 | 1.05 | 250 |
| 1024 | 1.14 | 459 |
| 2048 | 1.36 | 773 |
| 4096 | 4.22 | 497 |
Scaling is sub-linear up to L=2048 due to NumPy vectorization amortizing fixed overhead. At L=4096, packed array allocation begins to dominate.
14. Memory Footprint (L=1024)¶
Peak allocation measured via tracemalloc (includes layer construction
and one forward pass). Weight matrix size is the float64 weight array only.
| Config | Weight Matrix (MB) | Peak Alloc (MB) | Forward Time (ms) |
|---|---|---|---|
| 32x16 (tiny) | 0.004 | 4.63 | 3.1 |
| 64x32 (small) | 0.016 | 18.33 | 7.5 |
| 128x64 (medium) | 0.062 | 73.13 | 13.2 |
| 256x128 (large) | 0.250 | 292.31 | 26.2 |
Peak allocation scales as O(N_neurons * N_inputs * L / 8) bytes for the packed bitstream arrays, which dominate the weight matrix by ~1000x.
15. Reproducing¶
# Python benchmark suite (quick ~15s, full ~120s)
python benchmarks/benchmark_suite.py --full --markdown
# Rust Criterion benchmarks (~5 min)
cargo bench --manifest-path engine/Cargo.toml
# v2 vs v3 comparison (requires Rust wheel)
PYTHONPATH=src python benchmarks/bench_v2_vs_v3.py
# NeuroBench-aligned metrics
python benchmarks/neurobench_harness.py --json benchmarks/results/neurobench.json --markdown
# SNN comparison — 20 variants (requires brian2: pip install brian2)
python benchmarks/snn_comparison.py --all --adapted --sim-ms 1000 \
--json benchmarks/results/snn_translator_20v.json --markdown
# Advanced modules
python benchmarks/benchmark_advanced_modules.py
# FPGA synthesis (requires yosys in PATH)
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown
# CI artifact mode: record timeout skips without treating hosted-runner
# synthesis incompletion as a design failure.
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown --allow-skips
12. snnTorch Head-to-Head Comparison¶
Artifact: benchmarks/results/snntorch_vs_sc_microbench.json
Three-way comparison: SC-NeuroCore (NumPy), SC-NeuroCore (Rust SIMD), snnTorch 0.9.4.
| Test | SC NumPy (us/step) | SC Rust SIMD (us/step) | snnTorch (us/step) |
|---|---|---|---|
| Single neuron (1000 steps) | 3.7 | — | 876 |
| Dense 100->50 (500 steps) | 2,280 | 1,059 | 1,103 |
| Scale 500->500 (100 steps) | 158,741 | 17,473 | 35,998 |
| Scale 1000->1000 (50 steps) | 602,730 | 28,882 | 9,421 |
Paradigm difference: SC-NeuroCore performs bit-true stochastic computation (uint64 popcount on packed bitstreams, L=256-512 bits per value). snnTorch does float32 matrix multiply. SC-NeuroCore is hardware-faithful (maps directly to Verilog RTL); snnTorch is GPU-optimized but not synthesizable.
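- For a single neuron, SC-NeuroCore's zero-overhead Python step is 237x faster than snnTorch's PyTorch dispatch overhead (3.7 vs 876 us/step).
- At Dense 100->50, the Rust SIMD engine is roughly on par with snnTorch (1,059 vs 1,103 us/step), while the NumPy path is about 2x slower than snnTorch.
- At medium scale (500 neurons), the Rust SIMD engine is 2x faster than snnTorch.
- At large scale (1000+), snnTorch's O(n^2) float matmul beats bitstream packing at O(n^2 * L): snnTorch is about 3x faster than the Rust engine at 1000->1000.
- The Rust engine is 2.2x faster than the NumPy path at Dense 100->50 and 9-21x faster at the 500 and 1000 scales.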
```bash
python benchmarks/snntorch_vs_sc_microbench.py --runs 5 --scales 100 500 1000
```
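The ratios quoted for this comparison follow from the table's us/step columns. The sketch below recomputes them from the values as printed in the table. It does not read the JSON artefact, and the dictionary keys are illustrative labels, not field names from that file.

```python
# us/step values copied from the comparison table; None marks a missing cell.
TIMINGS = {
    "single_neuron": {"numpy": 3.7, "rust": None, "snntorch": 876.0},
    "dense_100_50": {"numpy": 2280.0, "rust": 1059.0, "snntorch": 1103.0},
    "scale_500": {"numpy": 158741.0, "rust": 17473.0, "snntorch": 35998.0},
    "scale_1000": {"numpy": 602730.0, "rust": 28882.0, "snntorch": 9421.0},
}


def speedup(slow, fast):
    """Return slow/fast, or None when either measurement is missing."""
    if slow is None or fast is None:
        return None
    return slow / fast


# Rust SIMD engine relative to the NumPy path (values > 1 mean Rust is faster).
rust_vs_numpy = {
    name: speedup(row["numpy"], row["rust"]) for name, row in TIMINGS.items()
}
# Rust SIMD engine relative to snnTorch (values < 1 mean snnTorch is faster).
rust_vs_snntorch = {
    name: speedup(row["snntorch"], row["rust"]) for name, row in TIMINGS.items()
}
# NumPy path relative to snnTorch for the single-neuron test.
numpy_vs_snntorch_single = speedup(
    TIMINGS["single_neuron"]["snntorch"], TIMINGS["single_neuron"]["numpy"]
)

for name in TIMINGS:
    print(name, rust_vs_numpy[name], rust_vs_snntorch[name])
print("single neuron, NumPy vs snnTorch:", round(numpy_vs_snntorch_single))
```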
16. Spike Codec Library (2026-03-25)¶
Compression ratios for the spike codec library. All codecs are lossless. Measured on (2000 x 64) rasters at various firing rates.
ISI Codec vs General-Purpose Compressors¶
Auto entropy selection (varint for sparse, Huffman for dense):
| Firing Rate | ISI (auto) | zlib-9 | lzma | ISI Advantage |
|---|---|---|---|---|
| 0.1% | 401x | 359x | 194x | +12% over zlib |
| 1% | 78x | 65x | 48x | +20% over zlib |
| 5% | 24x | 19x | 20x | +28% over zlib |
| 10% | 16x | 12x | 13x | +30% over zlib |
| 30% | 8.8x | 7.0x | 7.8x | +24% over zlib |
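To illustrate why inter-spike-interval (ISI) coding with a varint entropy stage suits sparse rasters, here is a minimal sketch of one channel being encoded and decoded losslessly. It is not the library's codec: the function names, the LEB128-style varint layout, and the convention that the first gap is measured from step 0 are all assumptions made for the example.

```python
def spikes_to_isi(spike_times):
    """Turn sorted integer spike times into gaps between consecutive spikes."""
    gaps, previous = [], 0
    for t in spike_times:
        gaps.append(t - previous)
        previous = t
    return gaps


def isi_to_spikes(gaps):
    """Inverse of spikes_to_isi: a running sum of the gaps."""
    times, current = [], 0
    for gap in gaps:
        current += gap
        times.append(current)
    return times


def encode_varint(value):
    """Unsigned LEB128: 7 payload bits per byte, high bit means 'more follows'."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Decode a concatenation of unsigned LEB128 values."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def encode_channel(spike_times):
    return b"".join(encode_varint(gap) for gap in spikes_to_isi(spike_times))


def decode_channel(data):
    return isi_to_spikes(decode_varints(data))


spike_times = [3, 10, 11, 400, 1999]  # one sparse channel, 2000 steps
encoded = encode_channel(spike_times)
assert decode_channel(encoded) == spike_times  # lossless round trip
print(len(encoded), "bytes for", len(spike_times), "spikes")
```

Short gaps cost one byte and longer ones two or more, so the encoded size tracks the spike count instead of the raster length. This is the intuition behind choosing varint for sparse data; a Huffman stage for dense data is not sketched here.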
Context Predictor on Structured Data¶
Periodic bursting (32ch, 5-spike bursts every 50 steps):
| Predictor | Ratio | Accuracy |
|---|---|---|
| ISI (no prediction) | 8.6x | — |
| EMA | 8.5x | 90.0% |
| Context (Markov) | 25.5x | 97.8% |
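A context predictor benefits from periodic bursts because the recent bit history determines the next bit almost everywhere. The sketch below is a toy order-k Markov predictor run on a single synthetic channel with the same burst shape (5 spikes every 50 steps). It is an assumption-laden illustration, not the library's predictor. The context order, the counting rule, and the single-channel setup are all hypothetical, so its accuracy should not be read as reproducing the table.

```python
from collections import defaultdict


def burst_train(n_steps, period=50, burst_len=5):
    """Binary spike train: burst_len consecutive spikes at the start of each period."""
    return [1 if (t % period) < burst_len else 0 for t in range(n_steps)]


def context_predictor_accuracy(bits, order=8):
    """Predict each bit from the previous `order` bits using online counts.

    For every context (tuple of the last `order` bits) we keep how often a 0
    or a 1 followed it, and predict the majority outcome (0 on ties).
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [zeros seen, ones seen]
    context = (0,) * order
    correct = 0
    for bit in bits:
        zeros, ones = counts[context]
        prediction = 1 if ones > zeros else 0
        correct += int(prediction == bit)
        counts[context][bit] += 1          # learn after predicting
        context = context[1:] + (bit,)     # slide the history window
    return correct / len(bits)


bits = burst_train(2000)
accuracy = context_predictor_accuracy(bits, order=8)
print(f"toy context predictor accuracy: {accuracy:.4f}")
```

In this toy setup the remaining errors come mainly from burst onsets, where an all-zero history cannot tell the predictor that a burst is about to start. A well-predicted stream leaves only those residual events to be encoded.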
Realistic SpikeInterface Benchmarks¶
SpikeInterface ground-truth recordings with physiological ISI distributions:
| Scenario | Channels | Units | Firing Rate | Best Ratio |
|---|---|---|---|---|
| Neuropixels-like | 96 | 10 | 1-5 Hz | 457x |
| BCI-scale | 256 | 50 | 0.5-3 Hz | 756x |
| High-density | 384 | 100 | 1-10 Hz | 317x |
All three scenarios exceed the Neuralink 200x compression target.
Yosys Synthesis (gate counts)¶
Generic gate-level synthesis via Yosys 0.63:
| Verilog Module | Cells | Function |
|---|---|---|
| sc_bitstream_encoder.v | 115 | LFSR predictor (bit-true with Python/Rust) |
| sc_cordiv.v | 2 | Stochastic division |
| sc_dotproduct_to_current.v | 448 | AND accumulation + popcount |
| sc_aer_encoder.v | 1,423 | Priority encoder for AER |
| sc_event_neuron.v | 2,135 | Event-driven LIF |
| sc_lif_neuron.v | 3,134 | Q8.8 fixed-point LIF |
1024-channel codec estimate: ~406K gates, ~0.02 mm^2 at 7nm.
WaveformCodec: Raw Electrode Compression¶
End-to-end pipeline: raw 10-bit ADC -> spike detect -> template match -> compress. Measured on synthetic 1024-channel, 1 second at 20 kHz:
| Metric | Value |
|---|---|
| Raw data | 40,960,000 bytes (328 Mbit/s) |
| Compressed (q=4) | 1,703,435 bytes (13.6 Mbit/s) |
| Compression ratio | 24x |
| Spikes detected | 3,087 |
| Templates learned | 16 |
| Bluetooth capacity | 15 Mbit/s |
| Fits in uplink | YES |
Scaling (4-bit background quantization):
| Channels | Raw Mbit/s | Compressed Mbit/s | Fits BT |
|---|---|---|---|
| 128 | 26 | 1.0 | YES |
| 256 | 51 | 2.0 | YES |
| 384 | 77 | 3.0 | YES |
| 1024 | 205 | 8.0 | YES |
| 3072 | 614 | 23.9 | NO |
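The bitrate figures in the two tables above can be reproduced with simple arithmetic. The sketch below is a worked check, assuming 20 kHz sampling and a 15 Mbit/s Bluetooth budget as stated. It takes the compressed rates from the scaling table as given and does not model the codec. One observation, hedged: the 40,960,000-byte raw figure equals 1024 channels x 20,000 samples x 2 bytes, which is consistent with samples stored as two bytes each, while the scaling table's raw column matches 10 bits per sample.

```python
SAMPLE_RATE_HZ = 20_000
ADC_BITS = 10
BLUETOOTH_MBIT_S = 15.0


def raw_mbit_s(channels, bits_per_sample=ADC_BITS, sample_rate_hz=SAMPLE_RATE_HZ):
    """Raw electrode data rate in Mbit/s (1 Mbit = 1e6 bits)."""
    return channels * sample_rate_hz * bits_per_sample / 1e6


def bytes_per_second_to_mbit_s(n_bytes):
    """Convert a one-second byte count to Mbit/s."""
    return n_bytes * 8 / 1e6


def fits_uplink(compressed_mbit_s, capacity_mbit_s=BLUETOOTH_MBIT_S):
    return compressed_mbit_s <= capacity_mbit_s


# Compressed rates copied from the scaling table (not recomputed here).
COMPRESSED_MBIT_S = {128: 1.0, 256: 2.0, 384: 3.0, 1024: 8.0, 3072: 23.9}

for channels, compressed in COMPRESSED_MBIT_S.items():
    verdict = "YES" if fits_uplink(compressed) else "NO"
    print(f"{channels:5d} ch: raw {raw_mbit_s(channels):6.1f} Mbit/s, "
          f"compressed {compressed:4.1f} Mbit/s, fits BT: {verdict}")

# Single-run figures from the metric table (1 second of data).
raw_bytes, compressed_bytes = 40_960_000, 1_703_435
print(round(bytes_per_second_to_mbit_s(raw_bytes)), "Mbit/s raw")
print(round(bytes_per_second_to_mbit_s(compressed_bytes), 1), "Mbit/s compressed")
print(round(raw_bytes / compressed_bytes), "x compression")
```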
Competitive comparison (raw waveform compression):
| Method | Compression | Notes |
|---|---|---|
| MuSCoRE (2023) | 50-100x | Multi-scale decomposition, academic |
| CREST (2022) | 10-50x | Raw electrode, academic |
| SC-NeuroCore WaveformCodec | 24x | Spike-aware pipeline, open source |
| Delta + arithmetic (standard) | 5-15x | No spike awareness |
Notes¶
- Python benchmarks run in --full mode (10x iterations vs quick).
- Rust benchmarks use Criterion defaults (100 samples, 3s warmup).
- v2 vs v3 comparison shows PyO3 FFI overhead for small payloads; Section 7 reports true Rust throughput without FFI.
- Brian2 was installed with numpy 2.4.2 (its requirement); the benchmarks were then run after downgrading to numpy 1.26.4 for sc-neurocore compatibility.
Timing-aware formal framework (2026-06-04)¶
Artefact: benchmarks/results/local_python_2026-06-04_timing_formal_framework.json.
This benchmark exercises the NEU-C.2 timing formal framework across the active polyglot proof surfaces. The run was executed under runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and runtime_cpuset_shield_claimed=true; the workstation load average during the run was 1.96, 2.40, 3.13. These timings should not be compared against unloaded baselines unless the same isolated-core condition is reproduced.
| Surface | Operation | Result |
|---|---|---|
| SystemVerilog | Dense-layer timing monitors proved through SymbiYosys/cvc5 | pass, 1.476097 s |
| Python | TimingProperty construction and proof orchestration | 16 properties |
| nuXmv | bounded transition-model emission | 16 models, 0.000016258 s, runtime unavailable locally |
| Kind 2 | Lustre bounded-node emission | 16 models, 0.000012825 s, runtime unavailable locally |
Evidence boundary: this is formal proof and model-emitter evidence, not hardware throughput evidence. hardware_measurement_claimed=false remains intentional.
ADC-to-spike quantiser (2026-06-04)¶
Artefact: benchmarks/results/local_python_2026-06-04_adc_to_spike_quantiser.json.
This benchmark exercises the NEU-C.5 ADC-to-spike sensor-ingress contract across the active Python and SystemVerilog surfaces. The run used runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and runtime_cpuset_shield_claimed=true; load average during the run was 3.71, 3.80, 3.48.
| Surface | Operation | Result |
|---|---|---|
| Python | Bit-true ADC decimation and deterministic AER rate-code reference | 3704.696 ns/sample over 409600 samples |
| SystemVerilog | SymbiYosys/cvc5 formal proof | pass, 4.579 s |
| SystemVerilog | Yosys generic synthesis estimate | 7675 cells, 11.861 s |
Evidence boundary: this is local contract, formal, and synthesis-estimate evidence. hardware_measurement_claimed=false remains intentional until board-level isolated hardware evidence is captured.
DCLS Q8.8 RTL contract (2026-06-04)¶
Artefacts: benchmarks/results/local_python_2026-06-04_dcls_q88.json and
benchmarks/results/local_rust_2026-06-04_dcls_q88.json.
This benchmark exercises the NEU-C.6 DCLS Q8.8 scalar layer contract across
Python, PyTorch, Rust, and SystemVerilog. The Python/SystemVerilog run used
runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and
runtime_cpuset_shield_claimed=true; load average during the run was
3.43, 3.66, 3.20. The Rust run used the same isolated unit and recorded CPU
affinity 10-11.
| Surface | Operation | Result |
|---|---|---|
| Python | Bit-true DCLS Q8.8 tent-kernel reference | 6349.497 ns/sample over 409600 samples |
| PyTorch | Quantised deterministic parity reference | 5/5 cases passed, max accumulator diff 0 |
| Rust | Bit-true DCLS Q8.8 reference | 40.184 ns/sample median over 409600 samples x 7 repeats |
| SystemVerilog | SymbiYosys/cvc5 bounded formal check | pass, 1.533 s |
| SystemVerilog | Yosys generic synthesis estimate | 106003 cells, 105.897 s |
Evidence boundary: this is local contract, bounded-formal, and
synthesis-estimate evidence. hardware_measurement_claimed=false remains
intentional. The Vivado ZU3EG WNS/utilisation contract is gated behind
MIF_VIVADO_CI=1 and is not claimed until the self-hosted Vivado runner
archives a passing timing summary.
UltraScale+ target contract (2026-06-04)¶
This benchmark exercises the NEU-C.1 Zynq UltraScale+ target contract across the
Python Vivado-project generator and Rust SystemVerilog emitter/resource model.
Both runs used runtime core isolation: system/user/init slices were moved off
the benchmark cores, the benchmark ran in benchmark.slice, and the raw
artefacts record CPU affinity or cgroup cpuset evidence for CPUs 10-11.
| Surface | Contract | Result |
|---|---|---|
| Python + Vivado Tcl | Manifest validation and deterministic ZU3EG/ZU9EG batch Tcl generation | 122678.065 ns/manifest median over 2 manifests x 2000 iterations x 7 repeats |
| Rust | Target-aware SystemVerilog emission and conservative resource reporting for a 64x32 dense graph | 130835.757 ns/emit median over 2000 iterations x 7 repeats |
The Rust report estimates 2048 DSPs for a one-DSP-per-MAC 64x32 dense graph.
That exceeds the ZU3EG budget of 360, while the BRAM estimate 2 fits the
budget 216. This over-budget DSP result is intentional fail-closed evidence:
SC-NeuroCore must not claim that this unfurled graph fits ZU3EG until a folded
or time-multiplexed dense implementation is added and validated.
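The fail-closed behaviour described above amounts to comparing each resource estimate against the device budget and refusing to proceed on any violation. The sketch below illustrates that idea using the numbers from this section. The function and constant names are hypothetical and do not correspond to the Rust resource model's API.

```python
# Budgets and estimates quoted in this section for ZU3EG.
ZU3EG_BUDGET = {"dsp": 360, "bram": 216}


def unfolded_dense_estimate(n_inputs, n_outputs, bram_blocks):
    """One DSP per MAC for a fully unrolled dense layer (assumed model)."""
    return {"dsp": n_inputs * n_outputs, "bram": bram_blocks}


def budget_violations(estimate, budget):
    """Return the names of resources whose estimate exceeds the budget."""
    return [name for name, used in estimate.items() if used > budget[name]]


def require_fit(estimate, budget):
    """Fail closed: raise instead of returning a partially valid result."""
    violations = budget_violations(estimate, budget)
    if violations:
        details = ", ".join(
            f"{name}: {estimate[name]} > {budget[name]}" for name in violations
        )
        raise ValueError(f"design exceeds target budget ({details})")
    return True


estimate = unfolded_dense_estimate(64, 32, bram_blocks=2)
violations = budget_violations(estimate, ZU3EG_BUDGET)
print(estimate, "violations:", violations)

try:
    require_fit(estimate, ZU3EG_BUDGET)
except ValueError as error:
    print("rejected:", error)
```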
The generated Tcl and Rust target metadata use the Zynq UltraScale+ DSP48E2
primitive baseline. The checked-in XDC files are clock/timing baselines only;
they intentionally avoid PACKAGE_PIN and LOC constraints until a
board-revision pin manifest is verified.
UltraScale+ dense folding (2026-06-04)¶
This benchmark exercises the resource-safe fold plan added after the ZU3EG target benchmark proved that an unfurled 64x32 dense layer would require 2,048 DSPs. The shared Python and Rust planners use row-group folding: five output rows are processed per cycle with all 64 input lanes live, using 320 DSPs per compute cycle and completing the 32 output rows in seven cycles.
| Surface | Contract | Result |
|---|---|---|
| Python + SystemVerilog | Planner parity plus bounded 8x8 HDL elaboration for sc_dense_folded_q88_core | 2447.444 ns/plan median over 20000 iterations x 7 repeats; Yosys reports 240 generic cells |
| Rust | SvTarget::dense_fold_plan(64, 32) | 6.661 ns/plan median over 20000 iterations x 7 repeats |
Both runs used the runtime cpuset shield on CPUs 10-11. The Yosys evidence is a bounded parameterised elaboration check, not a Vivado ZU3EG utilisation report. The folded HDL core implements deterministic Q8.8-weight/Q16.16-MAC dense execution and is covered by Icarus simulation; it must be selected deliberately by deployment code and does not silently replace the existing stochastic dense path.
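The row-group arithmetic behind the fold plan can be sketched as follows. This is a minimal illustration that reproduces the 5-row, 320-DSP, 7-cycle figures quoted in this section. It assumes the planner picks the largest number of output rows whose MACs fit the DSP budget with all input lanes live. That rule is inferred from the numbers, and `sketch_fold_plan` is a hypothetical name, not the shared planner's API.

```python
import math


def sketch_fold_plan(n_inputs, n_outputs, dsp_budget=360):
    """Row-group folding: process as many full output rows per cycle as fit.

    Each output row needs n_inputs DSPs (one per MAC) when all input lanes
    are live, so the group size is floor(dsp_budget / n_inputs).
    """
    rows_per_cycle = min(dsp_budget // n_inputs, n_outputs)
    if rows_per_cycle < 1:
        raise ValueError("a single output row does not fit the DSP budget")
    return {
        "rows_per_cycle": rows_per_cycle,
        "dsps_per_cycle": rows_per_cycle * n_inputs,
        "cycles": math.ceil(n_outputs / rows_per_cycle),
    }


plan = sketch_fold_plan(64, 32, dsp_budget=360)
print(plan)
```

The last cycle handles only two rows (32 = 6 x 5 + 2), so some DSPs sit idle in that cycle. The sketch does not model that imbalance or the Q8.8/Q16.16 datapath.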