SC-NeuroCore Benchmarks¶

Performance measurements for SC-NeuroCore. Historical core engine rows come from v3.13.3, FPGA synthesis additions from v3.14.0, and documentation/release evidence is current for v3.15.8. All Python numbers are CPU-only unless a GPU environment is named. Rust numbers use Criterion and must be read with their recorded SIMD/environment context.

Evidence boundary¶

This page is a curated benchmark report, not an authority for untracked local measurements or marketing estimates. Every quoted performance, power, utilisation, timing, or hardware-efficiency number must remain traceable to a committed raw artefact: JSON or CSV under benchmarks/results/, a named tool report under hdl/ or build/, or a companion paper artefact with command and environment provenance.

When a newer local run exists but its raw output is not committed, cite it only as local exploratory evidence. Do not promote it to README, roadmap, release, or paper claims until the raw artefact and environment record are committed.

The 2026-06-04 local Python/Rust precision benchmark artefacts were captured on a workstation under concurrent load and without an exclusive isolated CPU core set. Treat those medians as contract/regression context, not final throughput claims. Any production performance claim must be rerun on reserved isolated cores, with CPU affinity, host-load, governor, and frequency evidence recorded in the raw artefact.

The 2026-06-05 refreshed mixed-dense, block-floating, and precision-envelope artefacts were executed with process affinity pinned to CPUs 8-9. The raw JSON records host load, affinity, CPU governor, and sampled frequency context, but the workstation still did not expose kernel-reserved isolated cores to this user session. Treat the refreshed medians as local regression evidence and the proof fields as the authoritative contract evidence.

The 2026-06-05 live-control AXI4-Lite/PCIe-MMIO rerun was executed with process affinity pinned to CPUs 8-9, and the raw artefact records that the process affinity matched the requested benchmark cpuset. The workstation did not expose kernel-reserved isolated cores to this user session, so these numbers remain local regression evidence rather than production throughput claims.

The module-level pytest throughput checks use load-tolerant smoke thresholds unless SC_NEUROCORE_STRICT_THROUGHPUT=1 is set. Use strict mode only for isolated-core benchmark captures and commit the raw artefact before treating the numbers as release evidence.

The 2026-06-04 AER priority queue benchmark follows the same boundary. It was run under a temporary runtime cpuset shield: system/user slices were moved off the benchmark cores, the benchmark ran in its own benchmark.slice, and the raw artefact records the process affinity plus cgroup effective CPU set. This is stronger than taskset pinning but still distinct from boot-time isolcpus/nohz_full kernel isolation. The artefact is Python/SystemVerilog contract evidence for strict-priority ordering, FIFO ties, backpressure drops, critical-deadline traps, and Yosys RTL elaboration; it is not an FPGA throughput or latency measurement.

v3.14.0 additions: SHD FPGA synthesis on Zynq XC7Z020 uses 1,317 LUT (2.5%) and 848 FF (0.8%), with WNS +4.048 ns at 100 MHz. See hdl/reports/vivado_util_xc7z020_100mhz.rpt for the full Vivado report.

1. Environment¶

Field	Value
Date	2026-03-15
Git tag	v3.13.3
OS	Windows 11 Pro 10.0.26200
CPU	Intel Core i5-11600K (6C/12T, 3.9 GHz base, AVX-512, DL Boost)
RAM	32 GB DDR4-3200
Python	3.12.5
NumPy	1.26.4
Rust	1.86.0 (stable)
SIMD tier	avx512-vpopcntdq
GPU (NVIDIA)	GeForce GTX 1060 6GB — Pascal sm_61, PyTorch 2.6.0+cu124
GPU (AMD)	Radeon RX 6600 XT — no ROCm on Windows, torch-directml incompatible
CuPy	unavailable — CUDA Toolkit 13.1 dropped Pascal (sm_61) support

2. Scalar Primitives (Python)¶

Operation	Iterations	Latency (µs)	Throughput
LFSR step (16-bit)	1,000,000	0.8	1.33 Mstep/s
Bitstream encoder step	1,000,000	0.9	1.10 Mstep/s
LIF neuron step (Q8.8)	1,000,000	0.9	1.07 Mstep/s

3. Packed Bitstream Operations (Python/NumPy)¶

Operation	Size	Iterations	Latency (µs)	Throughput
pack_bitstream 1-D	1,024	10,000	8.7	0.12 Gbit/s
pack_bitstream 1-D	65,536	2,000	123.1	0.53 Gbit/s
pack_bitstream 2-D	64×1,024	2,000	121.6	0.54 Gbit/s
vec_and	1,024 words	50,000	1.6	41.0 Gbit/s
vec_popcount SWAR	1,024 words	50,000	30.2	2.17 Gbit/s

4. Dense Layer Forward Pass (Python)¶

Configuration	Iterations	Latency (µs)	Throughput
16×8, L=256	500	352.7	0.09 GOP/s (SC)
64×32, L=1,024	100	2,405.8	0.87 GOP/s (SC)

AER Priority Queue Backpressure Contract (2026-06-04)¶

This benchmark covers the NEU-C.4 event-control contract: AER fanout packets with lower numeric priority must overtake best-effort packets while preserving FIFO order within equal priority classes. The committed artefact also records finite-capacity backpressure, sticky drop traps, sticky critical-deadline traps, CPU affinity, cgroup effective CPU set, host load, CPU governor, and Yosys elaboration time.

Path	Workload	Result	Raw evidence
Python reference model	4,096 deterministic events × 100 repeats	4.138 µs/event under runtime cpuset shield `10-11`; `priority_violations=0`, `fifo_tie_violations=0`	`benchmarks/results/local_python_2026-06-04_aer_priority_queue.json`
SystemVerilog `sc_aer_priority_queue`	Yosys synthesis/elaboration	`yosys.exit_code=0`, 6,364 cells in the artefact	`benchmarks/results/local_python_2026-06-04_aer_priority_queue.json`

No Rust, Julia, Go, or Mojo counterpart exists for this HDL-only queue surface as of 2026-06-04. Cross-language comparison therefore means Python reference contract versus SystemVerilog RTL elaboration/simulation for this task.
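
The ordering and backpressure rules above can be illustrated with a minimal Python sketch. It is not the committed reference model: the class name `PriorityAerQueue`, its methods, and the capacity value are hypothetical, and the critical-deadline trap is omitted. It assumes that the lowest numeric priority wins, that ties resolve in arrival order, and that a push into a full queue is dropped and latches a sticky flag.

```python
import heapq
from itertools import count


class PriorityAerQueue:
    """Hypothetical model: strict priority, FIFO ties, finite capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []
        self._seq = count()      # arrival counter breaks priority ties in FIFO order
        self.dropped = 0         # number of packets rejected by backpressure
        self.drop_trap = False   # sticky: stays set once any packet is dropped

    def push(self, priority, payload):
        """Enqueue a packet; return False and latch the trap if the queue is full."""
        if len(self._heap) >= self.capacity:
            self.dropped += 1
            self.drop_trap = True
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), payload))
        return True

    def pop(self):
        """Return the payload with the lowest numeric priority, oldest first."""
        _, _, payload = heapq.heappop(self._heap)
        return payload


q = PriorityAerQueue(capacity=3)
q.push(7, "best-effort-a")
q.push(0, "critical")
q.push(7, "best-effort-b")
accepted = q.push(0, "late-critical")   # queue is full: dropped, trap latched
order = [q.pop() for _ in range(3)]
```

In this run the critical packet overtakes both best-effort packets, the two best-effort packets keep their arrival order, and the fourth push is rejected because the queue is already full.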

Live-control AXI4-Lite / PCIe-MMIO Register Window (2026-06-05)¶

This benchmark covers the live-parameter update contract for hot-swappable weights and Kuramoto coupling parameters. Both protocols use the same CRC32-gated shadow-bank core: AXI4-Lite exposes the core directly, while the PCIe path emits a PCIe-MMIO register-window adapter that expects upstream PCIe hard IP to present decoded single-clock MMIO strobes.

Path	Workload	Result	Raw evidence
Python update-sequence builder	20,000 deterministic staged writes x 7 repeats	AXI4-Lite median `14.139 us/sequence`; PCIe-MMIO median `14.216 us/sequence` under process affinity `8-9`	`benchmarks/results/local_python_2026-06-04_live_control_updates.json`
SystemVerilog AXI4-Lite core	Generated trap-capture simulation	`trap_capture.passed=true`; staged overflow and underflow traps latched without mutating active coefficients	`benchmarks/results/local_python_2026-06-04_live_control_updates.json`
SystemVerilog PCIe-MMIO wrapper	Generated commit simulation	`pcie_mmio_commit_capture.passed=true`; partial write strobes raise sticky `partial_write`, stale CRC32 guard raises sticky `checksum_mismatch`, invalid bank selection and invalid active readback raise sticky `invalid_selection`, read-only bank writes raise sticky `read_only_bank`, and retargeting selection registers after shadow load cannot redirect the committed bank	`benchmarks/results/local_python_2026-06-04_live_control_updates.json`

No Rust, Julia, Go, or Mojo counterpart exists for this HDL bus-adapter surface as of 2026-06-04. Cross-language comparison therefore means Python control contract generation versus SystemVerilog RTL simulation for AXI4-Lite and PCIe-MMIO.
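
As a rough illustration of the CRC32-gated shadow-bank idea, the sketch below stages writes into a shadow copy and swaps it into the active bank only when a caller-supplied CRC32 matches. The class `ShadowBank`, its method names, the 32-bit little-endian word layout, and the use of `zlib.crc32` are assumptions made for this example; the actual register map and CRC framing are defined by the generated RTL, not by this code.

```python
import struct
import zlib


class ShadowBank:
    """Hypothetical model of a CRC32-gated shadow/active coefficient bank."""

    def __init__(self, n_words):
        self.active = [0] * n_words
        self.shadow = [0] * n_words
        self.checksum_mismatch = False   # sticky flag, never cleared in this sketch

    def stage(self, index, value):
        """Write one 32-bit word into the shadow bank only."""
        self.shadow[index] = value & 0xFFFFFFFF

    @staticmethod
    def crc32_of(words):
        """CRC32 over the words packed as little-endian unsigned 32-bit values."""
        return zlib.crc32(struct.pack(f"<{len(words)}I", *words))

    def commit(self, expected_crc):
        """Copy shadow to active only if the CRC matches; otherwise latch the trap."""
        if self.crc32_of(self.shadow) != expected_crc:
            self.checksum_mismatch = True
            return False
        self.active = list(self.shadow)
        return True


bank = ShadowBank(4)
for i, value in enumerate([10, 20, 30, 40]):
    bank.stage(i, value)
good_crc = ShadowBank.crc32_of([10, 20, 30, 40])

rejected = bank.commit(good_crc ^ 0x1)      # stale or corrupted guard value
active_after_reject = list(bank.active)     # active coefficients are untouched
committed = bank.commit(good_crc)           # matching guard value
```

The property worth noting is that a failed commit leaves the active bank unchanged, which mirrors the "latched without mutating active coefficients" wording in the table.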

Mixed Q8.8/Q16.16 Dense Contract (2026-06-04)¶

This benchmark covers the deterministic dense mixed-precision contract: stored Q8.8 weights, Q16.16 inputs and outputs, signed arithmetic product scaling, and explicit saturation/overflow handling. The Python path is the deployment reference and manifest writer; the Rust path is the low-latency integer mirror. The Python path is run with QFormatMixed(scale_per_tensor=False) for this benchmark so the Python, Rust, and HDL surfaces share the same raw Q8.8/Q16.16 arithmetic contract instead of Python-only per-tensor rescaling.
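
A minimal NumPy sketch of the raw arithmetic described above may help. It assumes a 64-bit accumulator, a truncating arithmetic right shift by 8 to return from 24 to 16 fractional bits, and saturation to the signed 32-bit range. The function name `mixed_dense_q88_q1616_sketch` is hypothetical, and the rounding and accumulation details of the committed Python, Rust, and HDL paths may differ.

```python
import numpy as np

Q1616_MAX = 2**31 - 1
Q1616_MIN = -(2**31)


def mixed_dense_q88_q1616_sketch(weights_q88, inputs_q1616):
    """Return (outputs_q1616, overflow_flags) for one dense layer.

    weights_q88:  integer codes, shape (n_out, n_in), value * 2**8.
    inputs_q1616: integer codes, shape (n_in,), value * 2**16.
    """
    w = np.asarray(weights_q88, dtype=np.int64)
    x = np.asarray(inputs_q1616, dtype=np.int64)
    acc = w @ x                 # each product carries 8 + 16 = 24 fractional bits
    scaled = acc >> 8           # arithmetic shift back to 16 fractional bits
    overflow = (scaled > Q1616_MAX) | (scaled < Q1616_MIN)
    outputs = np.clip(scaled, Q1616_MIN, Q1616_MAX)   # saturate instead of wrapping
    return outputs, overflow


# 1.5 * 2.0 + (-0.5) * 4.0 = 1.0, which is 65536 in Q16.16.
safe_out, safe_ovf = mixed_dense_q88_q1616_sketch([[384, -128]], [131072, 262144])

# 127.0 * 30000.0 exceeds the Q16.16 range, so the lane saturates and is flagged.
sat_out, sat_ovf = mixed_dense_q88_q1616_sketch([[32512]], [1966080000])
```

Returning a per-lane flag vector rather than a single Boolean is what allows an exact overflow count to be reported.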

Path	Workload	Median	Raw evidence
Python `CompiledMixedDense.forward_accumulator_codes`	64×32 dense, 2,000 calls × 7 repeats	51.934 µs/call	`benchmarks/results/local_python_2026-06-04_mixed_dense.json`
Python `CompiledMixedDense.forward_with_overflow`	Same deterministic matrix/vector	51.104 µs/call	`benchmarks/results/local_python_2026-06-04_mixed_dense.json`
NumPy float64 dot baseline	Same deterministic matrix/vector	1.656 µs/call	`benchmarks/results/local_python_2026-06-04_mixed_dense.json`
Rust `mixed_dense_q88_q1616`	64×32 dense, 20,000 calls × 7 repeats	2.659 µs/call	`benchmarks/results/local_rust_2026-06-04_mixed_dense.json`
HDL `sc_mixed_precision_dense` Yosys RTLIL stat	Default 64×32 parameters	12,708 cells, 2,048 multipliers	`hdl/reports/yosys_mixed_precision_dense_2026-06-04.json`

The Python mixed path reconstructed the float64 dot product with maximum absolute error 0.0 on the committed deterministic workload. The Python and Rust artefacts both recorded safe-workload overflow count 0 and saturating-probe overflow count 32, matching the lane-level HDL overflow_vector contract. The same artefacts now record conservative precision-envelope telemetry: Python and Rust safe max absolute bound 531400, and saturating-probe max absolute bound 17454214414336; the HDL exports the matching per-output abs_bounds_q1616 vector. The refreshed Python and Rust artefacts also prove the signed symmetric fixed-point width contract: the safe workload requires 21 signed total bits, 5 Q16.16 integer bits, and has 11 bits of headroom; the saturating probe requires 45 signed total bits and 29 Q16.16 integer bits, has -13 bits of headroom, and records saturation_required=true.
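
The width figures quoted above can be reproduced from the bounds alone. The sketch below assumes that "signed total bits" means the bit length of the maximum absolute bound plus one sign bit, that integer bits are what remains after 16 fractional bits, and that headroom is measured against a 32-bit word. The function name and the returned keys are hypothetical; only the numbers are taken from this page.

```python
def signed_width_proof(max_abs_bound, frac_bits=16, word_bits=32):
    """Derive signed symmetric fixed-point width figures from an absolute bound."""
    signed_total_bits = max_abs_bound.bit_length() + 1   # magnitude bits plus sign
    integer_bits = signed_total_bits - frac_bits
    headroom_bits = word_bits - signed_total_bits
    return {
        "signed_total_bits": signed_total_bits,
        "integer_bits": integer_bits,
        "headroom_bits": headroom_bits,
        "saturation_required": headroom_bits < 0,
    }


safe = signed_width_proof(531400)              # mixed dense, safe workload
probe = signed_width_proof(17454214414336)     # mixed dense, saturating probe
```

Under these assumptions the safe bound gives 21 total bits, 5 integer bits, and 11 bits of headroom, and the saturating probe gives 45, 29, and -13, matching the text above.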

Block-Floating Dense Contract (2026-06-04)¶

This benchmark covers dense BFP16E3X32 weights with Q16.16 inputs and saturated Q16.16 outputs. The shared exponent path preserves larger dynamic range per block than canonical Q8.8 weights, at the cost of dynamic shifts in the HDL datapath.

Path	Workload	Median	Raw evidence
Python `CompiledBlockFloatingDense.forward_accumulator_codes`	64×32 dense, 2,000 calls × 7 repeats	38.760 µs/call	`benchmarks/results/local_python_2026-06-04_block_floating_dense.json`
Python `CompiledBlockFloatingDense.forward_with_overflow`	Same deterministic matrix/vector	42.356 µs/call	`benchmarks/results/local_python_2026-06-04_block_floating_dense.json`
NumPy float64 dot baseline	Same deterministic matrix/vector	1.029 µs/call	`benchmarks/results/local_python_2026-06-04_block_floating_dense.json`
Rust `block_floating_dense_q16`	64×32 dense, 20,000 calls × 7 repeats	11.471 µs/call	`benchmarks/results/local_rust_2026-06-04_block_floating_dense.json`
HDL `sc_block_floating_dense` Yosys RTLIL stat	Parameterised 2×2, `BLOCK_SIZE=2` elaboration copy	96 cells, 4 multipliers	`hdl/reports/yosys_block_floating_dense_2026-06-04.json`

The deterministic block-floating workload recorded maximum absolute error 0.22306060791015625 versus float64 dot. This reflects BFP16E3 block-scale quantisation on the committed synthetic workload, not runtime nondeterminism. The Python and Rust artefacts both recorded safe-workload overflow count 0 and saturating-probe overflow count 32, matching the lane-level HDL overflow_vector contract. Both languages now compare the same deterministic BFP contract: mantissa checksum -15, exponent checksum 0, exponent code range [0, 0], safe max absolute bound 610816, and saturating-probe max absolute bound 1125865547104256. The refreshed proof fields record safe width 21 signed total bits, 5 Q16.16 integer bits, and 11 bits of headroom; the saturating probe records 51 signed total bits, 35 Q16.16 integer bits, -19 bits of headroom, and saturation_required=true. The 64×32 payload records parameter_count=2048 and block_exponent_count=64 in both the Python manifest and Rust artefact. The HDL exports per-output abs_bounds_q1616; the full-size 64×32 block-floating Yosys frontend path is documented as toolchain debt because Yosys 0.33 elaborates the default procedural loops during read_verilog before chparam can reduce the dimensions.

The same Python and Rust artefacts now include a seeded BFP16E3X2 exponent edge sweep. Both languages agree on safe exponent codes [0, 7, 0, 7], safe Q16.16 output codes [1056736, -1069024], safe overflow and underflow counts 0, conservative safe bound 1069024, and headroom 2146414623. The max-exponent saturation probe records exponent code [7], saturated output code [2147483647], overflow count 1, underflow count 0, and conservative bound 2251662376828928. The safe edge sweep requires 22 signed total bits and 6 Q16.16 integer bits with 10 bits of headroom; the max-exponent saturation probe requires 52 signed total bits and 36 Q16.16 integer bits with -20 bits of headroom, proving that max shared-exponent payloads trap rather than silently wrapping.

Precision Trap Reports (2026-06-04)¶

This benchmark covers the saturation telemetry path for mixed Q8.8/Q16.16 and block-floating dense outputs. The workload intentionally saturates all 32 output channels so the trap report has to preserve an exact overflow count rather than only a collapsed Boolean.

Path	Workload	Median	Raw evidence
Python mixed `precision_trap_report`	64×32 dense, 2,000 calls × 7 repeats	44.744 µs/call	`benchmarks/results/local_python_2026-06-04_precision_traps.json`
Python BFP `precision_trap_report`	64×32 dense, 2,000 calls × 7 repeats	45.906 µs/call	`benchmarks/results/local_python_2026-06-04_precision_traps.json`
Rust mixed `PrecisionTrapReport`	64×32 dense, 20,000 calls × 7 repeats	2.506 µs/call	`benchmarks/results/local_rust_2026-06-04_precision_traps.json`
Rust BFP `PrecisionTrapReport`	64×32 dense, 20,000 calls × 7 repeats	8.777 µs/call	`benchmarks/results/local_rust_2026-06-04_precision_traps.json`
HDL `sc_precision_overflow_trap` Yosys stat	Default `TRAP_WIDTH=1`	3 cells, 8 wire bits	`hdl/reports/yosys_precision_overflow_trap_2026-06-04.json`

The committed Python and Rust trap workloads both report mixed_overflow_count=32 and bfp_overflow_count=32, matching the number of output channels. The HDL trap primitive synthesises to one $adff, one $mux, and one $or cell at the default width. The 2026-06-05 rerun also records matched sub-LSB underflow probes: mixed_underflow_count=32 and bfp_underflow_count=32 in both Python and Rust, while the saturating overflow workloads retain underflow_count=0.

Precision Envelope Reports (2026-06-04)¶

This benchmark covers the conservative predeployment envelope path for mixed Q8.8/Q16.16 and block-floating dense outputs. The Python and Rust workloads use matched fixed-point codes and report the same maximum absolute bounds: 132850 for the mixed dense workload and 78032768 for the block-floating workload.

Path	Workload	Median	Raw evidence
Python mixed `precision_envelope_report`	64×32 dense, 2,000 calls × 7 repeats	87.578 µs/call	`benchmarks/results/local_python_2026-06-04_precision_envelopes.json`
Python BFP `precision_envelope_report`	64×32 dense, 2,000 calls × 7 repeats	90.475 µs/call	`benchmarks/results/local_python_2026-06-04_precision_envelopes.json`
Rust mixed `PrecisionEnvelopeReport`	64×32 dense, 20,000 calls × 7 repeats	2.991 µs/call	`benchmarks/results/local_rust_2026-06-04_precision_envelopes.json`
Rust BFP `PrecisionEnvelopeReport`	64×32 dense, 20,000 calls × 7 repeats	9.874 µs/call	`benchmarks/results/local_rust_2026-06-04_precision_envelopes.json`
HDL `sc_precision_envelope_guard` Yosys stat	Default `N_OUTPUTS=32`	67 cells, 1,701 wire bits	`hdl/reports/yosys_precision_envelope_guard_2026-06-04.json`

Both Python and Rust envelope reports returned conservative_overflow_free=true for the committed safe workload. The HDL guard synthesises to two $adff, thirty-two $gt, thirty-two $mux, and one $reduce_or cell at the default width. The raw artefacts now include observed_underflow_free=true for the safe workload and matched underflow probes with underflow_count=32 for mixed and BFP dense paths in both Python and Rust. The refreshed manifests expose proof_kind=signed_symmetric_fixed_point_width: mixed dense requires 19 signed total bits, 3 Q16.16 integer bits, and has 13 bits of headroom, while block-floating dense requires 28 signed total bits, 12 Q16.16 integer bits, and has 4 bits of headroom.

5. Full Pipeline (Python)¶

encode → AND synapse → popcount → LIF neuron

Configuration	Iterations	Latency (µs)	Throughput
4 synapses, 256 steps	200	1,830	139.9 Kstep/s
16 synapses, 256 steps	50	8,679	29.5 Kstep/s
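
The four pipeline stages can be sketched in a few lines of NumPy. This is an illustrative model rather than the benchmarked code: it assumes one Bernoulli bit per synapse per step, a floating-point leaky integrator instead of the Q8.8 neuron, and hypothetical parameter names and defaults (`leak`, `gain`, `v_threshold`).

```python
import numpy as np


def sc_pipeline_sketch(input_probs, weight_probs, n_steps=256,
                       leak=0.9, gain=1.0, v_threshold=1.0, seed=0):
    """Count output spikes for one neuron driven by AND-gated bitstreams."""
    rng = np.random.default_rng(seed)
    input_probs = np.asarray(input_probs, dtype=float)
    weight_probs = np.asarray(weight_probs, dtype=float)
    v = 0.0
    spikes = 0
    for _ in range(n_steps):
        # Encode: one Bernoulli bit per synapse for the input and for the weight.
        x_bits = rng.random(input_probs.shape) < input_probs
        w_bits = rng.random(weight_probs.shape) < weight_probs
        # AND synapse: for independent streams, P(bit = 1) is p_x * p_w.
        # Popcount: number of synapses whose AND output is 1 on this step.
        active = int(np.count_nonzero(x_bits & w_bits))
        # Leaky integrate-and-fire update with reset to zero on a spike.
        v = leak * v + gain * active
        if v >= v_threshold:
            spikes += 1
            v = 0.0
    return spikes


silent = sc_pipeline_sketch([0.0] * 4, [1.0] * 4)                   # no input bits
saturated = sc_pipeline_sketch([1.0] * 4, [1.0] * 4, n_steps=10)    # every bit set
```

The two calls cover the extremes: with zero input probability the neuron never fires, and with all probabilities at 1 it fires on every step.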

6. GPU Backend¶

6a. Local CPU fallback (NumPy, no CuPy)¶

Operation	Iterations	Latency (µs)	Throughput
gpu_pack_bitstream (65,536)	2,000	375.9	0.17 Gbit/s
gpu_vec_mac (64×32×16w)	1,000	736.4	2.85 GOP/s

6b. Cloud GPU — NVIDIA RTX A6000 (48 GB, CUDA 12.6)¶

Environment: JarvisLabs A6000, Xeon Silver 4216 (64 vCPU), PyTorch 2.6.0+cu124. 1000 ms simulation, 3 runs, AI regime (conn_prob=0.1).

Neurons	Synapses	Wall (s)	Rate (Hz)	Syn events/s	Peak RSS
1,000	100K	1.55	99.0	3.2 M	12 MB
2,000	400K	1.80	85.5	9.5 M	24 MB
5,000	2.5M	2.74	63.6	29.0 M	104 MB
20,000	40M	8.80	26.1	59.2 M	775 MB
50,000	250M	35.4	14.7	51.9 M	4,793 MB

Source: benchmarks/results/jarvislabs_a6000/gpu_large_scale.json, benchmarks/results/jarvislabs_a6000/scaling_4regime.json.

7. Rust Engine — Criterion Results¶

All benchmarks run with cargo bench --manifest-path engine/Cargo.toml on AVX-512 hardware. Times are Criterion medians.

Updated 2026-04-05 with UpCloud EPYC 9575F measurements. Previous values (25.4 µs, 446 µs) were from an unidentified earlier run.

Bitstream Packing (1M bits = 1,048,576 bits)¶

Variant	Time	Throughput	vs. Python
`pack` (scalar)	897 µs	1.17 Gbit/s	2.2×
`pack_fast` (u64 chunks)	286 µs	3.67 Gbit/s	7×
`pack_dispatch` (AVX-512)	8.85 µs	113 Gbit/s	46.6×

Popcount (16,384 u64 words = 1M bits)¶

Variant	Time	Throughput	vs. Python
`popcount_portable`	12.1 µs	86.6 Gbit/s	2.5×
`popcount_simd` (AVX-512)	2.86 µs	366 Gbit/s	10.6×

Fused AND+Popcount (16 words)¶

Variant	Time
Scalar (iter + count_ones)	19.1 ns
SIMD dispatch (AVX-512)	9.58 ns

Encoder / Neuron¶

Operation	Time	Throughput
LFSR encoder (64K steps)	131 µs	500 Mstep/s
LIF neuron (10K steps)	47.9 µs	209 Mstep/s
LIF neuron (100K steps)	219 µs	456 Mstep/s

Bernoulli Encoding (1,024-bit packed streams)¶

Variant	Time	Notes
`bernoulli_stream` (unrolled)	3.99 µs	generate bits then pack
`bernoulli_stream` + pack	4.79 µs	two-pass
`bernoulli_packed` (ChaCha8)	4.14 µs	direct packed generation
`bernoulli_packed_fast` (ChaCha8)	1.72 µs	optimized threshold loop
`bernoulli_packed_simd` (ChaCha8)	779 ns	SIMD comparison
`bernoulli_packed_simd` (Xoshiro)	398 ns	fastest: SIMD + fast PRNG
`encode_and_popcount` (Xoshiro)	285 ns	fused encode+AND+popcount

Dense Layer (64 inputs → 32 neurons, L=1024)¶

Variant	Time	vs. Python
`forward` (baseline)	1.22 ms	2.0×
`forward_fast` (packed)	337 µs	7.1×
`forward_fused` (encode+AND+pop)	1.67 ms	1.4×
`forward_prepacked` (pre-encoded)	54.9 µs	43.8×
`forward_batch` (100 samples)	13.7 ms	17.6× per sample

PRNG Fill (1,024 bytes)¶

Generator	Time	Throughput
ChaCha8	320 ns	3.13 GB/s
Xoshiro256++	191 ns	5.24 GB/s

Domain-Specific¶

Operation	Time
Kuramoto solver (100 osc, 1000 steps)	199 ms
Stochastic attention (10×16 → 20×32)	138 µs
Graph layer (20 nodes, 8 features)	253 µs

8. v2 (Python) vs v3 (Rust) Speedup¶

SIMD tier: avx512-vpopcntdq. The v3 engine wraps Rust via PyO3.

Operation	v2 (ms)	v3 (ms)	Speedup
pack_bitstream (1M bits)	8.26	52.66	0.2×
popcount (1M bits)	0.12	0.44	0.3×
LIF neuron (10K steps)	8.58	4.48	1.9×
Dense forward (16→8, L=1024)	0.49	0.18	2.7×
Dense forward (64→32, L=1024)	4.48	4.05	1.1×
Dense forward (128→64, L=1024)	8.02	1.10	7.3×
Attention (10×16 → 20×32)	0.03	0.28	0.1×
Attention (50×32 → 100×64)	0.10	2.19	0.05×

Geometric mean speedup: 0.5× — across all operations, the Rust FFI path is slower than pure Python on average. PyO3 call overhead (argument marshalling, GIL release/acquire) adds ~50–200 µs per invocation, which dominates when the payload is small.

The Rust engine amortises FFI cost above ~64K bits per call. On payloads >1M bits the SIMD kernel wins decisively (Dense 128→64 at 7.3×). For small networks (<64 neurons), pure Python is faster. The Rust engine targets large-payload inference (>=128 neurons, L>=1024).

The pure-Rust Criterion numbers in Section 7 show true engine throughput without FFI overhead.
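
For readers who want to check the aggregate, the geometric mean can be recomputed from the rounded table entries. The sketch below uses only the v2 and v3 columns of the table in this section; because those values are rounded to two decimals, the result is approximate.

```python
import math

# (operation, v2 ms, v3 ms) copied from the table in this section.
rows = [
    ("pack_bitstream (1M bits)", 8.26, 52.66),
    ("popcount (1M bits)", 0.12, 0.44),
    ("LIF neuron (10K steps)", 8.58, 4.48),
    ("Dense forward (16->8)", 0.49, 0.18),
    ("Dense forward (64->32)", 4.48, 4.05),
    ("Dense forward (128->64)", 8.02, 1.10),
    ("Attention (10x16 -> 20x32)", 0.03, 0.28),
    ("Attention (50x32 -> 100x64)", 0.10, 2.19),
]

speedups = [v2 / v3 for _, v2, v3 in rows]

# Geometric mean: exponential of the mean log-speedup.
geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Arithmetic mean, shown only for contrast.
arith_mean = sum(speedups) / len(speedups)
```

From these rounded inputs the geometric mean comes out at roughly 0.55, while the arithmetic mean is about 1.7 because the single 7.3× row dominates it. That difference is why a geometric mean is the more conservative summary for ratios.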

9. NeuroBench-Aligned Metrics¶

Aligned with the NeuroBench methodology (Yik et al., 2023; arXiv:2304.04640).

Model	Neurons	SynOps	Latency (µs)	Throughput (MOP/s)	Memory (B)
SCDenseLayer(8×4, L=256)	4	409,600	1,293	6.3	256
SCDenseLayer(16×8, L=512)	8	1,966,080	2,446	26.8	1,024
VectorizedSCLayer(16×8, L=512)	8	3,276,800	348	188.1	1,024
VectorizedSCLayer(64×32, L=1024)	32	41,943,040	2,476	847.0	16,384

Activation sparsity is 0.00 because SC outputs are graded probabilities, not binary spikes — every neuron produces a non-zero output on every step.
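
The throughput column appears consistent with inputs × neurons × L operations per inference divided by the latency, rather than with the SynOps column. The sketch below checks that reading against the four rows. It is an observation about the table, not a description of how the harness computes the metric.

```python
# (model, n_inputs, n_neurons, L, latency_us, quoted_mop_s) from the table above.
rows = [
    ("SCDenseLayer(8x4, L=256)", 8, 4, 256, 1293, 6.3),
    ("SCDenseLayer(16x8, L=512)", 16, 8, 512, 2446, 26.8),
    ("VectorizedSCLayer(16x8, L=512)", 16, 8, 512, 348, 188.1),
    ("VectorizedSCLayer(64x32, L=1024)", 64, 32, 1024, 2476, 847.0),
]

recomputed = []
for model, n_in, n_out, length, latency_us, quoted in rows:
    ops_per_inference = n_in * n_out * length
    mop_per_s = ops_per_inference / latency_us   # operations per microsecond = MOP/s
    recomputed.append((model, mop_per_s, quoted))
```

The recomputed values agree with the quoted ones to within rounding of the latency column.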

10. SNN Comparison: Brunel Balanced Network¶

4-Variant Translator Benchmark¶

1000 neurons (800E/200I), conn_prob=0.1, adapted params (weight_exc=5.0 mV, external_rate=200 Hz), 1000 ms simulation. Delta-PSC semantics: synaptic events applied as instantaneous voltage jumps (v += w), matching Brian2's on_pre="v_post += w".

Variant	Spikes	Rate (Hz)	Brian2 Ratio	Wall (s)
Brian2 reference	1,057,908	1057.9	1.00	1.11
V1 StochasticLIF	1,725,955	1726.0	1.63	30.23
V2 RateMatched	N/A	0.049 (prob)	—	51.81
V3 FixedPoint Q8.8	1,722,195	1722.2	1.63	15.41
V4 Hybrid SC+LIF	1,888,351	1888.4	1.78	46.23

Variant descriptions¶

V1 StochasticLIF: Bug-fixed delta-PSC wiring. Previous benchmark passed input through R * I * dt (diluted by dt=0.1) and omitted v_reset. Fixed: synaptic events as neuron.v += weight, Poisson drive as voltage kicks, v_reset=10.0 passed correctly.
V2 RateMatched: VectorizedSCLayer in probability domain. Weights mapped to p = w / v_threshold. 100-neuron subset, bitstream_length=1024. Not spike-comparable; mean output probability = 0.0488.
V3 FixedPoint Q8.8: Hardware-faithful FixedPointLIFNeuron. Params mapped to Q8.8 integers (scale=256). Rate 1.63x Brian2 (higher due to different noise model).
V4 Hybrid SC+LIF: BitstreamSynapse AND gates → popcount → voltage → StochasticLIFNeuron. Higher rate due to stochastic amplification in the bitstream encoding.

Historical note (v3.9.0, resolved)¶

Prior to v3.10.0, three wiring bugs prevented the Brunel network from firing. All three were fixed in v3.10.0:

1. v_reset was never passed (it defaulted to 0.0 instead of 10.0).
2. Delta-PSC input was diluted through R * I * dt instead of being applied directly as v += w.
3. Poisson drive was fed as a steady current instead of as voltage kicks.

10b. 20-Variant Brunel Translator Suite¶

Adapted Brunel parameters (weight_exc=5.0, ext_rate=200 Hz), 1000 neurons, 1000 ms simulation. Brian2 2.10.1 reference: 748,777 spikes, 748.8 Hz.

#	Variant	Spikes	Rate (Hz)	Brian2 Ratio	Wall (s)	Note
—	Brian2 reference	748,777	748.8	1.00	1.60
V1	StochasticLIF	1,725,955	1726.0	2.31	49.33	delta-PSC baseline
V2	RateMatched	—	0.0488 (prob)	—	80.98	probability domain
V3	FixedPoint Q8.8	1,722,195	1722.2	2.30	20.79	hardware-faithful
V4	Hybrid SC+LIF	1,571,994	1572.0	2.10	42.46	bitstream synapse
V5	Izhikevich	15,331	15.3	0.02	11.49	burst dynamics
V6	Homeostatic LIF	1,727,113	1727.1	2.31	39.36	adaptive threshold
V7	Noisy LIF	1,714,361	1714.4	2.29	44.62	noise_std=1.0
V8	Refractory LIF	114,317	114.3	0.15	26.56	5-step refractory
V9	Post-kick LIF	1,671,636	1671.6	2.23	36.16	Brian2 timing
V10	Exact-leak LIF	1,713,399	1713.4	2.29	32.78	exp(-dt/tau)
V11	Q16.12 FixedPoint	464,644	464.6	0.62	18.19	32-bit, 12 frac
V12	STDP LIF	1,689,552	1689.6	2.26	758.21	2000 STDP synapses
V13	DotProduct LIF	497,647	9952.9	—	261.40	n=50, bl=256
V14	Sobol bitstream	780,390	780.4	1.04	220.46	low-discrepancy
V15	JAX vectorized	—	—	—	—	skipped: JAX not installed
V16	Recurrent reservoir	—	0.9997 (prob)	—	16.21	probability domain
V17	Memristive defects	—	48.7	—	51.62	stuck=1%, var=5%
V18	Numba JIT	1,685,521	1685.5	2.25	5.20	9.5× vs V1
V19	PyTorch CUDA	1,725,955	1726.0	2.31	5.70	GTX 1060 6GB
V20	Vectorized NumPy	1,725,955	1726.0	2.31	10.27	batch update
V21	Sparse Numba (CSR)	1,685,521	1685.5	2.25	0.49	10% connectivity

Acceleration comparison (1000 neurons, 1000 ms)¶

Backend	Wall (s)	Speedup vs V1
V1 per-neuron Python	49.33	1.0×
V20 vectorized NumPy	10.27	4.8×
V18 Numba JIT	5.20	9.5×
V19 PyTorch CUDA (GTX 1060)	5.70	8.7×
V21 Sparse Numba (CSR)	0.49	100.7×
Brian2 (Cython)	1.60	30.8×

10K neuron scaling¶

Backend	Wall (s)	Memory
V18 Numba JIT (dense)	15.3	800 MB (N²×8)
V21 Sparse Numba (CSR)	22.6	80 MB (10% nnz)
Brian2 (C++ codegen)	9.6	sparse (internal)

At 10K, Brian2's compiled C++ sparse codegen wins. V21 CSR reduces memory 10× but scattered index access prevents SIMD vectorization. The Rust SIMD CSR engine (planned) targets this gap.
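
The memory column follows from simple arithmetic. The sketch below assumes float64 weights and, for the index arrays, 32-bit integers; the index width is an assumption, and this page does not state whether the 80 MB figure includes the indices.

```python
def dense_bytes(n_neurons, bytes_per_weight=8):
    """Bytes for a full N x N weight matrix."""
    return n_neurons * n_neurons * bytes_per_weight


def csr_bytes(n_neurons, density, bytes_per_weight=8, bytes_per_index=4):
    """Return (value_bytes, index_bytes) for an N x N CSR matrix."""
    nnz = round(n_neurons * n_neurons * density)
    value_bytes = nnz * bytes_per_weight
    # CSR stores one column index per non-zero plus N + 1 row pointers.
    index_bytes = nnz * bytes_per_index + (n_neurons + 1) * bytes_per_index
    return value_bytes, index_bytes


dense = dense_bytes(10_000)                     # 800,000,000 bytes
values, indices = csr_bytes(10_000, 0.10)       # 80,000,000 bytes of values
```

The 10× reduction quoted above matches the value array alone; under the 32-bit index assumption, the column indices and row pointers would add roughly another 40 MB.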

Variant notes¶

V5 Izhikevich: Low spike rate expected — Izhikevich dynamics (quadratic nonlinearity, v range -65 to +30) respond differently to delta-PSC drive. Tonic baseline current of 5.0 added for sub-threshold depolarization.
V8 Refractory: 5-step (0.5 ms) dead time reduces max firing rate to ~2000 Hz, cutting observed rate by 15×.
V11 Q16.12: Higher precision fixed-point produces fewer spikes than Q8.8 due to more accurate leak computation (less rounding-induced depolarization).
V12 STDP: Online weight learning with 2000 STDP synapses. 15× slower due to per-synapse process_step() calls.
V14 Sobol: Low-discrepancy bitstream achieves 1.04× Brian2 ratio — closest match to reference among all spiking variants.
V18/V19/V20: Acceleration variants show 5–10× speedup over per-neuron Python loop. Numba JIT and PyTorch CUDA achieve similar wall times on this workload (1000 neurons); GPU advantage grows with N.
V21 Sparse Numba: scipy.sparse CSR connectivity. At 1K (10% connectivity): 100× faster than V1 and about 10.6× faster than V18 dense (0.49 s vs 5.20 s). At 10K: 1.5× slower than V18 due to scattered CSR index access, but uses 10× less memory (80 MB vs 800 MB).

11. Advanced Module Performance¶

Module	Configuration	Latency (100 runs)	Per-run	Key metric
Quantum-Classical Hybrid	64 qubits, L=1024	76.8 ms	0.77 ms	cos²(θ/2) error < 0.03
Event-Based GNN	100 nodes, 5% density	6.6 ms	0.07 ms	17× sparse reduction
Stochastic Transformer	d=64, 4 heads, L=512	1,691 ms	16.9 ms	196× energy vs FP32 MAC
BCI Decoder	64 ch, 1s signal	19.5 ms	0.20 ms	Native bitstream encoding
DVS Input Layer	128×128, 1000 events	1,249 ms	12.5 ms	492× data reduction
Chaotic RNG	100K samples	13.5 ms	—	7.42 Msample/s
Predictive World Model	32-dim state, 50-step	34.5 ms	0.34 ms	1000× sample efficiency

12. FPGA Resource Utilization¶

Synthesis tooling (tools/yosys_synth.py) targets Xilinx 7-series via Yosys synth_xilinx. Yosys is not installed on this machine; run when available:

Bash

python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown

CI runs this command with --allow-skips so timeout-limited hosted runners still publish benchmarks/results/yosys_synth.json as evidence. A skipped module is not a hardware timing claim; it records that the bounded CI runner could not complete that module inside the configured synthesis timeout. Local or release-grade FPGA evidence should rerun without --allow-skips on an isolated synthesis host.

Target modules: sc_bitstream_encoder, sc_lif_neuron, sc_bitstream_synapse, sc_dotproduct_to_current, sc_firing_rate_bank, sc_dense_layer_core, sc_neurocore_top.

Estimated: sc_bitstream_encoder < 100 LUTs (pending Yosys validation).

13. Bitstream Length Scaling (32x16 Dense)¶

Fixed network: 32 inputs, 16 neurons. Mean of 5 runs per length. Expected: roughly linear scaling (2x L = 2x time).

L	Mean Time (ms)	Throughput (Mbit/s)
128	0.43	151
256	0.66	197
512	1.05	250
1024	1.14	459
2048	1.36	773
4096	4.22	497

Scaling is sub-linear up to L=2048 due to NumPy vectorization amortizing fixed overhead. At L=4096, packed array allocation begins to dominate.
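
The scaling claim can be checked directly from the table by looking at how much the mean time grows each time L doubles. The sketch below uses only the tabulated values and assumes the throughput column counts inputs × neurons × L bits per forward pass.

```python
lengths = [128, 256, 512, 1024, 2048, 4096]
mean_ms = [0.43, 0.66, 1.05, 1.14, 1.36, 4.22]
n_in, n_out = 32, 16

# Bits processed per forward pass divided by the mean time, in Mbit/s.
throughput_mbit_s = [
    n_in * n_out * length / (t_ms * 1e-3) / 1e6
    for length, t_ms in zip(lengths, mean_ms)
]

# Time growth per doubling of L; linear scaling would give 2.0 at every step.
doubling_ratio = [mean_ms[i + 1] / mean_ms[i] for i in range(len(mean_ms) - 1)]
```

Every doubling up to L=2048 costs less than 2× in time, while the step from 2048 to 4096 costs about 3.1×. That final step is the super-linear jump the paragraph above attributes to allocation.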

14. Memory Footprint (L=1024)¶

Peak allocation measured via tracemalloc (includes layer construction and one forward pass). Weight matrix size is the float64 weight array only.

Config	Weight Matrix (MB)	Peak Alloc (MB)	Forward Time (ms)
32x16 (tiny)	0.004	4.63	3.1
64x32 (small)	0.016	18.33	7.5
128x64 (medium)	0.062	73.13	13.2
256x128 (large)	0.250	292.31	26.2

Peak allocation scales as O(N_neurons * N_inputs * L / 8) bytes for the packed bitstream arrays, which dominate the weight matrix by ~1000x.
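
The ~1000× ratio can be recomputed from the table. The sketch below assumes the MB columns are binary megabytes (2**20 bytes), which is what makes the 128x64 weight matrix come out at 0.062. The per-copy packed size is the formula from the text, and the number of packed copies alive at peak is not stated on this page.

```python
MIB = 2**20
L = 1024

# (config, n_inputs, n_neurons, peak_alloc_mib) from the table above.
rows = [
    ("32x16", 32, 16, 4.63),
    ("64x32", 64, 32, 18.33),
    ("128x64", 128, 64, 73.13),
    ("256x128", 256, 128, 292.31),
]

report = []
for config, n_in, n_out, peak_mib in rows:
    weight_mib = n_in * n_out * 8 / MIB          # float64 weight matrix
    packed_mib = n_in * n_out * L / 8 / MIB      # one packed bitstream copy
    report.append((config, weight_mib, packed_mib, peak_mib / weight_mib))
```

The peak-to-weight ratio stays close to 1,170 to 1,185 across all four configurations, which is consistent with the peak scaling in proportion to N_neurons * N_inputs at fixed L.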

15. Reproducing¶

Bash

# Python benchmark suite (quick ~15s, full ~120s)
python benchmarks/benchmark_suite.py --full --markdown

# Rust Criterion benchmarks (~5 min)
cargo bench --manifest-path engine/Cargo.toml

# v2 vs v3 comparison (requires Rust wheel)
PYTHONPATH=src python benchmarks/bench_v2_vs_v3.py

# NeuroBench-aligned metrics
python benchmarks/neurobench_harness.py --json benchmarks/results/neurobench.json --markdown

# SNN comparison — 20 variants (requires brian2: pip install brian2)
python benchmarks/snn_comparison.py --all --adapted --sim-ms 1000 \
    --json benchmarks/results/snn_translator_20v.json --markdown

# Advanced modules
python benchmarks/benchmark_advanced_modules.py

# FPGA synthesis (requires yosys in PATH)
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown

# CI artifact mode: record timeout skips without treating hosted-runner
# synthesis incompletion as a design failure.
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown --allow-skips

12. snnTorch Head-to-Head Comparison¶

Artifact: benchmarks/results/snntorch_vs_sc_microbench.json

Three-way comparison: SC-NeuroCore (NumPy), SC-NeuroCore (Rust SIMD), snnTorch 0.9.4.

Test	SC NumPy (µs/step)	SC Rust SIMD (µs/step)	snnTorch (µs/step)
Single neuron (1000 steps)	3.7	—	876
Dense 100->50 (500 steps)	2,280	1,059	1,103
Scale 500->500 (100 steps)	158,741	17,473	35,998
Scale 1000->1000 (50 steps)	602,730	28,882	9,421

Paradigm difference: SC-NeuroCore performs bit-true stochastic computation (uint64 popcount on packed bitstreams, L=256-512 bits per value). snnTorch does float32 matrix multiply. SC-NeuroCore is hardware-faithful (maps directly to Verilog RTL); snnTorch is GPU-optimized but not synthesizable.

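For a single neuron, SC-NeuroCore's NumPy step is 237x faster than snnTorch (3.7 vs 876 µs/step), because PyTorch dispatch overhead dominates at this size. At 100->50, snnTorch (1,103 µs/step) is already about 2x faster than SC NumPy (2,280 µs/step) and on par with Rust SIMD (1,059 µs/step).
At medium scale (500 neurons), the Rust SIMD engine is 2x faster than snnTorch.
At large scale (1000+), snnTorch's O(n^2) float matmul beats bitstream packing at O(n^2 * L).
The Rust engine is 2.2x faster than Python SC at 100->50, 9.1x at 500->500, and 20.9x at 1000->1000.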

Bash

python benchmarks/snntorch_vs_sc_microbench.py --runs 5 --scales 100 500 1000

16. Spike Codec Library (2026-03-25)¶

Compression ratios for the spike codec library. All codecs lossless. Measured on (2000 x 64) rasters at various firing rates.

ISI Codec vs General-Purpose Compressors¶

Auto entropy selection (varint for sparse, Huffman for dense):

Firing Rate	ISI (auto)	zlib-9	lzma	ISI Advantage
0.1%	401x	359x	194x	+12% over zlib
1%	78x	65x	48x	+20% over zlib
5%	24x	19x	20x	+28% over zlib
10%	16x	12x	13x	+30% over zlib
30%	8.8x	7.0x	7.8x	+24% over zlib
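
To make the ISI-plus-varint idea concrete, here is a minimal lossless round trip for a single channel. It is a generic sketch, not the library's codec: the function names are hypothetical, spike times are assumed to be sorted non-negative integer step indices, and the varint is a plain LEB128-style encoding with no Huffman stage or automatic selection.

```python
def encode_isi_varint(spike_times):
    """Encode sorted spike times as inter-spike intervals in 7-bit varints."""
    out = bytearray()
    prev = 0
    for t in spike_times:
        gap = t - prev
        prev = t
        # Emit 7 bits per byte, low bits first; the high bit marks continuation.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_isi_varint(data):
    """Invert encode_isi_varint and return the original spike times."""
    times = []
    t = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # more bytes belong to this interval
        else:
            t += gap            # interval complete: accumulate into absolute time
            times.append(t)
            gap = 0
            shift = 0
    return times


spikes = [3, 10, 300, 301, 5000]
encoded = encode_isi_varint(spikes)
decoded = decode_isi_varint(encoded)
```

Short intervals cost one byte and longer ones two or more, so sparse but clustered trains compress well. The five spikes above encode to 7 bytes and decode back exactly.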

Context Predictor on Structured Data¶

Periodic bursting (32ch, 5-spike bursts every 50 steps):

Predictor	Ratio	Accuracy
ISI (no prediction)	8.6x	—
EMA	8.5x	90.0%
Context (Markov)	25.5x	97.8%

Realistic SpikeInterface Benchmarks¶

SpikeInterface ground-truth recordings with physiological ISI distributions:

Scenario	Channels	Units	Firing Rate	Best Ratio
Neuropixels-like	96	10	1-5 Hz	457x
BCI-scale	256	50	0.5-3 Hz	756x
High-density	384	100	1-10 Hz	317x

All three scenarios exceed the Neuralink 200x compression target.

Yosys Synthesis (gate counts)¶

Generic gate-level synthesis via Yosys 0.63:

Verilog Module	Cells	Function
`sc_bitstream_encoder.v`	115	LFSR predictor (bit-true with Python/Rust)
`sc_cordiv.v`	2	Stochastic division
`sc_dotproduct_to_current.v`	448	AND accumulation + popcount
`sc_aer_encoder.v`	1,423	Priority encoder for AER
`sc_event_neuron.v`	2,135	Event-driven LIF
`sc_lif_neuron.v`	3,134	Q8.8 fixed-point LIF

1024-channel codec estimate: ~406K gates, ~0.02 mm^2 at 7nm.

WaveformCodec: Raw Electrode Compression¶

End-to-end pipeline: raw 10-bit ADC -> spike detect -> template match -> compress. Measured on synthetic 1024-channel, 1 second at 20 kHz:

Metric	Value
Raw data	40,960,000 bytes (328 Mbit/s)
Compressed (q=4)	1,703,435 bytes (13.6 Mbit/s)
Compression ratio	24x
Spikes detected	3,087
Templates learned	16
Bluetooth capacity	15 Mbit/s
Fits in uplink	YES

Scaling (4-bit background quantization):

Channels	Raw Mbit/s	Compressed Mbit/s	Fits BT
128	26	1.0	YES
256	51	2.0	YES
384	77	3.0	YES
1024	205	8.0	YES
3072	614	23.9	NO
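
The raw-rate column can be reproduced from the stated 20 kHz sample rate and 10-bit ADC width. The sketch below is a plain arithmetic check, not project code: the compressed rates are copied from the table rather than recomputed, and the function name is hypothetical. Note that the 40,960,000-byte raw figure in the metric table equals 1024 x 20,000 x 2 bytes, so it is consistent with samples stored as 16-bit words; that reading is an inference and is not stated in the source.

```python
SAMPLE_RATE_HZ = 20_000   # from the text: 20 kHz
ADC_BITS = 10             # from the text: raw 10-bit ADC
BLUETOOTH_MBIT_S = 15.0   # from the metric table: Bluetooth capacity

def raw_mbit_s(channels, bits=ADC_BITS, rate_hz=SAMPLE_RATE_HZ):
    """Raw electrode data rate in Mbit/s for a given channel count."""
    return channels * rate_hz * bits / 1e6

# Compressed rates in Mbit/s, copied from the scaling table (not recomputed here).
compressed_mbit_s = {128: 1.0, 256: 2.0, 384: 3.0, 1024: 8.0, 3072: 23.9}

fits = {}
for channels, compressed in compressed_mbit_s.items():
    fits[channels] = compressed <= BLUETOOTH_MBIT_S
    print(f"{channels:5d} ch  raw {raw_mbit_s(channels):7.1f} Mbit/s  "
          f"compressed {compressed:5.1f} Mbit/s  fits BT: {fits[channels]}")

# Assumed 16-bit storage reproduces the 328 Mbit/s figure from the metric table.
raw_16bit_1024 = raw_mbit_s(1024, bits=16)
```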

Competitive comparison (raw waveform compression):

Method	Compression	Notes
MuSCoRE (2023)	50-100x	Multi-scale decomposition, academic
CREST (2022)	10-50x	Raw electrode, academic
SC-NeuroCore WaveformCodec	24x	Spike-aware pipeline, open source
Delta + arithmetic (standard)	5-15x	No spike awareness

Notes¶

Python benchmarks run in --full mode (10x iterations vs quick).
Rust benchmarks use Criterion defaults (100 samples, 3s warmup).
v2 vs v3 comparison shows PyO3 FFI overhead for small payloads; Section 7 reports true Rust throughput without FFI.
Brian2 was installed with numpy 2.4.2 (its requirement); the benchmarks were run after downgrading to numpy 1.26.4 for sc-neurocore compatibility.

Timing-aware formal framework (2026-06-04)¶

Artefact: benchmarks/results/local_python_2026-06-04_timing_formal_framework.json.

This benchmark exercises the NEU-C.2 timing formal framework across the active polyglot proof surfaces. The run was executed under runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and runtime_cpuset_shield_claimed=true; the workstation load average during the run was 1.96, 2.40, 3.13. These timings should not be compared against unloaded baselines unless the same isolated-core condition is reproduced.

Surface	Operation	Result
SystemVerilog	Dense-layer timing monitors proved through SymbiYosys/cvc5	pass, `1.476097` s
Python	TimingProperty construction and proof orchestration	16 properties
nuXmv	bounded transition-model emission	16 models, `0.000016258` s, runtime unavailable locally
Kind 2	Lustre bounded-node emission	16 models, `0.000012825` s, runtime unavailable locally

Evidence boundary: this is formal proof and model-emitter evidence, not hardware throughput evidence. hardware_measurement_claimed=false remains intentional.

ADC-to-spike quantiser (2026-06-04)¶

Artefact: benchmarks/results/local_python_2026-06-04_adc_to_spike_quantiser.json.

This benchmark exercises the NEU-C.5 ADC-to-spike sensor-ingress contract across the active Python and SystemVerilog surfaces. The run used runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and runtime_cpuset_shield_claimed=true; load average during the run was 3.71, 3.80, 3.48.

Surface	Operation	Result
Python	Bit-true ADC decimation and deterministic AER rate-code reference	`3704.696` ns/sample over `409600` samples
SystemVerilog	SymbiYosys/cvc5 formal proof	pass, `4.579` s
SystemVerilog	Yosys generic synthesis estimate	`7675` cells, `11.861` s

Evidence boundary: this is local contract, formal, and synthesis-estimate evidence. hardware_measurement_claimed=false remains intentional until board-level isolated hardware evidence is captured.

DCLS Q8.8 RTL contract (2026-06-04)¶

Artefacts: benchmarks/results/local_python_2026-06-04_dcls_q88.json and benchmarks/results/local_rust_2026-06-04_dcls_q88.json.

This benchmark exercises the NEU-C.6 DCLS Q8.8 scalar layer contract across Python, PyTorch, Rust, and SystemVerilog. The Python/SystemVerilog run used runtime core isolation with host_context.cgroup_effective_cpuset=10-11 and runtime_cpuset_shield_claimed=true; load average during the run was 3.43, 3.66, 3.20. The Rust run used the same isolated unit and recorded CPU affinity 10-11.

Surface	Operation	Result
Python	Bit-true DCLS Q8.8 tent-kernel reference	`6349.497` ns/sample over `409600` samples
PyTorch	Quantised deterministic parity reference	5/5 cases passed, max accumulator diff `0`
Rust	Bit-true DCLS Q8.8 reference	`40.184` ns/sample median over `409600` samples x 7 repeats
SystemVerilog	SymbiYosys/cvc5 bounded formal check	pass, `1.533` s
SystemVerilog	Yosys generic synthesis estimate	`106003` cells, `105.897` s

Evidence boundary: this is local contract, bounded-formal, and synthesis-estimate evidence. hardware_measurement_claimed=false remains intentional. The Vivado ZU3EG WNS/utilisation contract is gated behind MIF_VIVADO_CI=1 and is not claimed until the self-hosted Vivado runner archives a passing timing summary.

UltraScale+ target contract (2026-06-04)¶

This benchmark exercises the NEU-C.1 Zynq UltraScale+ target contract across the Python Vivado-project generator and Rust SystemVerilog emitter/resource model. Both runs used runtime core isolation: system/user/init slices were moved off the benchmark cores, the benchmark ran in benchmark.slice, and the raw artefacts record CPU affinity or cgroup cpuset evidence for CPUs 10-11.

Surface	Contract	Result
Python + Vivado Tcl	Manifest validation and deterministic ZU3EG/ZU9EG batch Tcl generation	`122678.065` ns/manifest median over 2 manifests x 2000 iterations x 7 repeats
Rust	Target-aware SystemVerilog emission and conservative resource reporting for a 64x32 dense graph	`130835.757` ns/emit median over 2000 iterations x 7 repeats

The Rust report estimates 2,048 DSPs for a one-DSP-per-MAC 64x32 dense graph. That exceeds the ZU3EG DSP budget of 360, while the BRAM estimate of 2 fits within the budget of 216. This over-budget DSP result is intentional fail-closed evidence: SC-NeuroCore must not claim that this unfurled graph fits ZU3EG until a folded or time-multiplexed dense implementation is added and validated.

The generated Tcl and Rust target metadata use the Zynq UltraScale+ DSP48E2 primitive baseline. The checked-in XDC files are clock/timing baselines only; they intentionally avoid PACKAGE_PIN and LOC constraints until a board-revision pin manifest is verified.

UltraScale+ dense folding (2026-06-04)¶

This benchmark exercises the resource-safe fold plan added after the ZU3EG target benchmark proved that an unfurled 64x32 dense layer would require 2,048 DSPs. The shared Python and Rust planners use row-group folding: five output rows are processed per cycle with all 64 input lanes live, using 320 DSPs per compute cycle and completing the 32 output rows in seven cycles.

Surface	Contract	Result
Python + SystemVerilog	Planner parity plus bounded 8x8 HDL elaboration for `sc_dense_folded_q88_core`	`2447.444` ns/plan median over 20000 iterations x 7 repeats; Yosys reports 240 generic cells
Rust	`SvTarget::dense_fold_plan(64, 32)`	`6.661` ns/plan median over 20000 iterations x 7 repeats

Both runs used the runtime cpuset shield on CPUs 10-11. The Yosys evidence is a bounded parameterised elaboration check, not a Vivado ZU3EG utilisation report. The folded HDL core implements deterministic Q8.8-weight/Q16.16-MAC dense execution and is covered by Icarus simulation; it must be selected deliberately by deployment code and does not silently replace the existing stochastic dense path.
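
The row-group folding arithmetic described above can be checked with a short sketch. This is not the shared Python/Rust planner: `sketch_fold_plan` is a hypothetical name, and the rule it applies (fit as many whole output rows per cycle as the DSP budget allows, with one DSP per live input lane) is an assumption chosen because it reproduces the quoted figures for a 64x32 layer against the ZU3EG budget of 360 DSPs.

```python
import math

def sketch_fold_plan(inputs, outputs, dsp_budget):
    """Hypothetical row-group fold: whole output rows per cycle, one DSP per input lane."""
    rows_per_cycle = dsp_budget // inputs
    if rows_per_cycle < 1:
        # Fail closed: a single row does not fit, so row-group folding cannot help.
        raise ValueError("DSP budget is smaller than one output row")
    rows_per_cycle = min(rows_per_cycle, outputs)
    return {
        "rows_per_cycle": rows_per_cycle,
        "dsps_per_cycle": rows_per_cycle * inputs,
        "cycles": math.ceil(outputs / rows_per_cycle),
    }

plan = sketch_fold_plan(inputs=64, outputs=32, dsp_budget=360)
print(plan)  # {'rows_per_cycle': 5, 'dsps_per_cycle': 320, 'cycles': 7}

unfolded_dsps = 64 * 32  # one DSP per MAC, the over-budget case: 2048
```

The final cycle processes only two rows (32 = 6 x 5 + 2), so the seven-cycle plan leaves some DSPs idle in that cycle; a real planner may handle the remainder differently.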