SC-NeuroCore Benchmarks

Performance measurements for sc-neurocore v3.13.3. All Python numbers are CPU-only (NumPy backend). Rust numbers use Criterion with AVX-512 SIMD.


1. Environment

| Field | Value |
|---|---|
| Date | 2026-03-15 |
| Git tag | v3.13.3 |
| OS | Windows 11 Pro 10.0.26200 |
| CPU | Intel Core i5-11600K (6C/12T, 3.9 GHz base, AVX-512, DL Boost) |
| RAM | 32 GB DDR4-3200 |
| Python | 3.12.5 |
| NumPy | 1.26.4 |
| Rust | 1.86.0 (stable) |
| SIMD tier | avx512-vpopcntdq |
| GPU (NVIDIA) | GeForce GTX 1060 6GB (Pascal sm_61), PyTorch 2.6.0+cu124 |
| GPU (AMD) | Radeon RX 6600 XT (no ROCm on Windows; torch-directml incompatible) |
| CuPy | unavailable (CUDA Toolkit 13.1 dropped Pascal sm_61 support) |

2. Scalar Primitives (Python)

| Operation | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| LFSR step (16-bit) | 1,000,000 | 0.8 | 1.33 Mstep/s |
| Bitstream encoder step | 1,000,000 | 0.9 | 1.10 Mstep/s |
| LIF neuron step (Q8.8) | 1,000,000 | 0.9 | 1.07 Mstep/s |
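For reference, the LFSR step benchmarked above can be sketched as a textbook 16-bit Fibonacci LFSR. The tap set (bits 16, 14, 13, 11) is the standard maximal-length polynomial and is an assumption here; the library's taps may differ.

```python
# Minimal 16-bit Fibonacci LFSR step (illustrative, not the sc-neurocore API).
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit LFSR one step (Fibonacci form, right shift)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << 15)) & 0xFFFF

# A maximal-length 16-bit LFSR cycles through all 65,535 non-zero states.
s, period = 0xACE1, 0
while True:
    s = lfsr16_step(s)
    period += 1
    if s == 0xACE1:
        break
# period == 65535
```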

3. Packed Bitstream Operations (Python/NumPy)

| Operation | Size | Iterations | Latency (µs) | Throughput |
|---|---|---|---|---|
| pack_bitstream 1-D | 1,024 | 10,000 | 8.7 | 0.12 Gbit/s |
| pack_bitstream 1-D | 65,536 | 2,000 | 123.1 | 0.53 Gbit/s |
| pack_bitstream 2-D | 64×1,024 | 2,000 | 121.6 | 0.54 Gbit/s |
| vec_and | 1,024 words | 50,000 | 1.6 | 41.0 Gbit/s |
| vec_popcount SWAR | 1,024 words | 50,000 | 30.2 | 2.17 Gbit/s |
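The packed operations above can be sketched in NumPy, assuming 0/1 arrays packed into uint64 words. Function names mirror the table but are illustrative stand-ins, not the sc-neurocore API.

```python
import numpy as np

def pack_bitstream(bits: np.ndarray) -> np.ndarray:
    """Pack a 0/1 array into uint64 words (bit order is an implementation detail)."""
    packed = np.packbits(bits.astype(np.uint8))
    packed = np.pad(packed, (0, (-len(packed)) % 8))  # pad to whole 64-bit words
    return packed.view(np.uint64)

def vec_popcount(words: np.ndarray) -> int:
    """Population count over packed words via a byte-level unpack."""
    return int(np.unpackbits(words.view(np.uint8)).sum())

a = pack_bitstream(np.random.randint(0, 2, 1024))
b = pack_bitstream(np.random.randint(0, 2, 1024))
overlap = vec_popcount(a & b)  # fused AND + popcount, as in Section 7
```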

4. Dense Layer Forward Pass (Python)

| Configuration | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| 16×8, L=256 | 500 | 352.7 | 0.09 GOP/s (SC) |
| 64×32, L=1,024 | 100 | 2,405.8 | 0.87 GOP/s (SC) |

5. Full Pipeline (Python)

encode → AND synapse → popcount → LIF neuron

| Configuration | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| 4 synapses, 256 steps | 200 | 1,830 | 139.9 Kstep/s |
| 16 synapses, 256 steps | 50 | 8,679 | 29.5 Kstep/s |
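One step of this pipeline can be sketched as below. All parameter names and values (v_threshold, leak, the probability vectors) are illustrative, not the library's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 256
inputs_p  = np.array([0.6, 0.7, 0.2, 0.5])  # input activations as probabilities
weights_p = np.array([0.8, 0.5, 0.3, 0.9])  # synaptic weights as probabilities

# Encode: Bernoulli-sample both operands into (4, L) bitstreams.
in_bits = rng.random((4, L)) < inputs_p[:, None]
w_bits  = rng.random((4, L)) < weights_p[:, None]

# AND synapse multiplies in the probability domain; popcount / L decodes it.
products = (in_bits & w_bits).sum(axis=1) / L  # ~= inputs_p * weights_p

# Leaky integrate-and-fire update on the summed synaptic drive.
v, v_threshold, v_reset, leak = 0.0, 1.0, 0.0, 0.9
v = leak * v + products.sum()
spiked = v >= v_threshold
if spiked:
    v = v_reset
```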

6. GPU Backend

6a. Local CPU fallback (NumPy, no CuPy)

| Operation | Iterations | Latency (µs) | Throughput |
|---|---|---|---|
| gpu_pack_bitstream (65,536) | 2,000 | 375.9 | 0.17 Gbit/s |
| gpu_vec_mac (64×32×16w) | 1,000 | 736.4 | 2.85 GOP/s |

6b. Cloud GPU — NVIDIA RTX A6000 (48 GB, CUDA 12.6)

Environment: JarvisLabs A6000, Xeon Silver 4216 (64 vCPU), PyTorch 2.6.0+cu124. 1000 ms simulation, 3 runs, AI regime (conn_prob=0.1).

| Neurons | Synapses | Wall (s) | Rate (Hz) | Syn events/s | Peak RSS |
|---|---|---|---|---|---|
| 1,000 | 100K | 1.55 | 99.0 | 3.2 M | 12 MB |
| 2,000 | 400K | 1.80 | 85.5 | 9.5 M | 24 MB |
| 5,000 | 2.5M | 2.74 | 63.6 | 29.0 M | 104 MB |
| 20,000 | 40M | 8.80 | 26.1 | 59.2 M | 775 MB |
| 50,000 | 250M | 35.4 | 14.7 | 51.9 M | 4,793 MB |

Source: benchmarks/results/jarvislabs_a6000/gpu_large_scale.json, benchmarks/results/jarvislabs_a6000/scaling_4regime.json.


7. Rust Engine — Criterion Results

All benchmarks run with cargo bench --manifest-path engine/Cargo.toml on AVX-512 hardware. Times are Criterion medians.

Bitstream Packing (1M bits = 1,048,576 bits)

| Variant | Time | Throughput | vs. Python |
|---|---|---|---|
| pack (scalar) | 897 µs | 1.17 Gbit/s | 2.2× |
| pack_fast (u64 chunks) | 286 µs | 3.67 Gbit/s | |
| pack_dispatch (AVX-512) | 25.4 µs | 41.3 Gbit/s | 79× |

Popcount (16,384 u64 words = 1M bits)

| Variant | Time | Throughput | vs. Python |
|---|---|---|---|
| popcount_portable | 12.1 µs | 86.6 Gbit/s | 2.5× |
| popcount_simd (AVX-512) | 2.86 µs | 366 Gbit/s | 10.6× |

Fused AND+Popcount (16 words)

| Variant | Time |
|---|---|
| Scalar (iter + count_ones) | 19.1 ns |
| SIMD dispatch (AVX-512) | 9.58 ns |

Encoder / Neuron

| Operation | Time | Throughput |
|---|---|---|
| LFSR encoder (64K steps) | 131 µs | 500 Mstep/s |
| LIF neuron (10K steps) | 47.9 µs | 209 Mstep/s |
| LIF neuron (100K steps) | 446 µs | 224 Mstep/s |

Bernoulli Encoding (1,024-bit packed streams)

| Variant | Time | Notes |
|---|---|---|
| bernoulli_stream (unrolled) | 3.99 µs | generate bits then pack |
| bernoulli_stream + pack | 4.79 µs | two-pass |
| bernoulli_packed (ChaCha8) | 4.14 µs | direct packed generation |
| bernoulli_packed_fast (ChaCha8) | 1.72 µs | optimized threshold loop |
| bernoulli_packed_simd (ChaCha8) | 779 ns | SIMD comparison |
| bernoulli_packed_simd (Xoshiro) | 398 ns | fastest: SIMD + fast PRNG |
| encode_and_popcount (Xoshiro) | 285 ns | fused encode+AND+popcount |

Dense Layer (64 inputs → 32 neurons, L=1024)

| Variant | Time | vs. Python |
|---|---|---|
| forward (baseline) | 1.22 ms | 2.0× |
| forward_fast (packed) | 337 µs | 7.1× |
| forward_fused (encode+AND+pop) | 1.67 ms | 1.4× |
| forward_prepacked (pre-encoded) | 54.9 µs | 43.8× |
| forward_batch (100 samples) | 13.7 ms | 17.6× per sample |

PRNG Fill (1,024 bytes)

| Generator | Time | Throughput |
|---|---|---|
| ChaCha8 | 320 ns | 3.13 GB/s |
| Xoshiro256++ | 191 ns | 5.24 GB/s |

Domain-Specific

| Operation | Time |
|---|---|
| Kuramoto solver (100 osc, 1000 steps) | 199 ms |
| Stochastic attention (10×16 → 20×32) | 138 µs |
| Graph layer (20 nodes, 8 features) | 253 µs |

8. v2 (Python) vs v3 (Rust) Speedup

SIMD tier: avx512-vpopcntdq. The v3 engine wraps Rust via PyO3.

| Operation | v2 (ms) | v3 (ms) | Speedup |
|---|---|---|---|
| pack_bitstream (1M bits) | 8.26 | 52.66 | 0.2× |
| popcount (1M bits) | 0.12 | 0.44 | 0.3× |
| LIF neuron (10K steps) | 8.58 | 4.48 | 1.9× |
| Dense forward (16→8, L=1024) | 0.49 | 0.18 | 2.7× |
| Dense forward (64→32, L=1024) | 4.48 | 4.05 | 1.1× |
| Dense forward (128→64, L=1024) | 8.02 | 1.10 | 7.3× |
| Attention (10×16 → 20×32) | 0.03 | 0.28 | 0.1× |
| Attention (50×32 → 100×64) | 0.10 | 2.19 | 0.0× |

Geometric mean speedup: 0.5×. Across all operations, the Rust FFI path is slower than pure Python on average: PyO3 call overhead (argument marshalling, GIL release/acquire) adds ~50–200 µs per invocation, which dominates when the payload is small.

The Rust engine amortizes FFI cost above ~64K bits per call. On payloads >1M bits the SIMD kernel wins decisively (Dense 128→64 at 7.3×); for small networks (<64 neurons), pure Python is faster. The Rust engine therefore targets large-payload inference (≥128 neurons, L≥1024).

The pure-Rust Criterion numbers in Section 7 show true engine throughput without FFI overhead.
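A back-of-envelope crossover for the amortization claim can be derived from this document's own numbers (v2 pack at ~8.26 ms per 1M bits, Rust AVX-512 pack at ~25.4 µs per 1M bits, ~100 µs mid-range PyO3 overhead). This is an illustration of the reasoning, not a new measurement.

```python
# Crossover payload where the Rust kernel's per-bit advantage pays back a
# fixed ~100 us FFI cost. Numbers are taken from Sections 7 and 8 above.
overhead_ns = 100_000                   # ~100 us per FFI call (mid-range)
py_ns_per_bit   = 8.26e6 / 1_048_576    # v2 pack_bitstream, 1M bits
rust_ns_per_bit = 25.4e3 / 1_048_576    # pack_dispatch (AVX-512), 1M bits
crossover_bits = overhead_ns / (py_ns_per_bit - rust_ns_per_bit)
# ~12.7K bits of raw kernel advantage; the observed ~64K crossover is higher
# because marshalling cost also grows with payload size.
```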


9. NeuroBench-Aligned Metrics

Aligned with the NeuroBench methodology (Yik et al., 2023; arXiv:2304.04640).

| Model | Neurons | SynOps | Act. Sparsity | Latency (µs) | Throughput (MOP/s) | Memory (B) |
|---|---|---|---|---|---|---|
| SCDenseLayer(8×4, L=256) | 4 | 409,600 | 0.00 | 1,293 | 6.3 | 256 |
| SCDenseLayer(16×8, L=512) | 8 | 1,966,080 | 0.00 | 2,446 | 26.8 | 1,024 |
| VectorizedSCLayer(16×8, L=512) | 8 | 3,276,800 | 0.00 | 348 | 188.1 | 1,024 |
| VectorizedSCLayer(64×32, L=1024) | 32 | 41,943,040 | 0.00 | 2,476 | 847.0 | 16,384 |

Activation sparsity is 0.00 because SC outputs are graded probabilities, not binary spikes — every neuron produces a non-zero output on every step.


10. SNN Comparison: Brunel Balanced Network

4-Variant Translator Benchmark

1000 neurons (800E/200I), conn_prob=0.1, adapted params (weight_exc=5.0 mV, external_rate=200 Hz), 1000 ms simulation. Delta-PSC semantics: synaptic events applied as instantaneous voltage jumps (v += w), matching Brian2's on_pre="v_post += w".
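The delta-PSC semantics above can be sketched as a single LIF update in which each incoming spike bumps the voltage by its weight. Parameter names and values are illustrative (they follow the adapted Brunel parameters quoted above), not the translator's exact code.

```python
import numpy as np

dt, tau_m = 0.1, 20.0                            # ms
v_rest, v_threshold, v_reset = 0.0, 20.0, 10.0   # mV (adapted Brunel params)

def lif_step(v, spiking_pre, weights):
    """One LIF step with delta-PSC synapses: v jumps by w per input spike,
    matching Brian2's on_pre="v_post += w"."""
    v = v + (dt / tau_m) * (v_rest - v)   # Euler leak toward rest
    v = v + weights[spiking_pre].sum()    # instantaneous voltage jumps
    fired = v >= v_threshold
    if fired:
        v = v_reset                       # reset to 10 mV, not 0
    return v, fired

weights = np.full(100, 5.0)               # weight_exc = 5.0 mV
v, fired = lif_step(15.0, np.array([0, 1]), weights)
```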

| Variant | Spikes | Rate (Hz) | Brian2 Ratio | Wall (s) |
|---|---|---|---|---|
| Brian2 reference | 1,057,908 | 1057.9 | 1.00 | 1.11 |
| V1 StochasticLIF | 1,725,955 | 1726.0 | 1.63 | 30.23 |
| V2 RateMatched | N/A | 0.049 (prob) | | 51.81 |
| V3 FixedPoint Q8.8 | 1,722,195 | 1722.2 | 1.63 | 15.41 |
| V4 Hybrid SC+LIF | 1,888,351 | 1888.4 | 1.78 | 46.23 |

Variant descriptions

  • V1 StochasticLIF: Bug-fixed delta-PSC wiring. Previous benchmark passed input through R * I * dt (diluted by dt=0.1) and omitted v_reset. Fixed: synaptic events as neuron.v += weight, Poisson drive as voltage kicks, v_reset=10.0 passed correctly.
  • V2 RateMatched: VectorizedSCLayer in probability domain. Weights mapped to p = w / v_threshold. 100-neuron subset, bitstream_length=1024. Not spike-comparable; mean output probability = 0.0488.
  • V3 FixedPoint Q8.8: Hardware-faithful FixedPointLIFNeuron. Params mapped to Q8.8 integers (scale=256). Rate 1.63x Brian2 (higher due to different noise model).
  • V4 Hybrid SC+LIF: BitstreamSynapse AND gates → popcount → voltage → StochasticLIFNeuron. Higher rate due to stochastic amplification in the bitstream encoding.

Historical note (v3.9.0, resolved)

Prior to v3.10.0, three wiring bugs prevented the Brunel network from firing. All three were fixed in v3.10.0:

  1. v_reset was never passed (defaulted to 0.0 instead of 10.0)
  2. Delta-PSC was diluted through R * I * dt instead of applied directly as v += w
  3. Poisson drive was fed as a steady current instead of voltage kicks

10b. 20-Variant Brunel Translator Suite

Adapted Brunel parameters (weight_exc=5.0, ext_rate=200 Hz), 1000 neurons, 1000 ms simulation. Brian2 2.10.1 reference: 748,777 spikes, 748.8 Hz.

| Variant | Spikes | Rate (Hz) | Brian2 Ratio | Wall (s) | Note |
|---|---|---|---|---|---|
| Brian2 reference | 748,777 | 748.8 | 1.00 | 1.60 | |
| V1 StochasticLIF | 1,725,955 | 1726.0 | 2.31 | 49.33 | delta-PSC baseline |
| V2 RateMatched | | 0.0488 (prob) | | 80.98 | probability domain |
| V3 FixedPoint Q8.8 | 1,722,195 | 1722.2 | 2.30 | 20.79 | hardware-faithful |
| V4 Hybrid SC+LIF | 1,571,994 | 1572.0 | 2.10 | 42.46 | bitstream synapse |
| V5 Izhikevich | 15,331 | 15.3 | 0.02 | 11.49 | burst dynamics |
| V6 Homeostatic LIF | 1,727,113 | 1727.1 | 2.31 | 39.36 | adaptive threshold |
| V7 Noisy LIF | 1,714,361 | 1714.4 | 2.29 | 44.62 | noise_std=1.0 |
| V8 Refractory LIF | 114,317 | 114.3 | 0.15 | 26.56 | 5-step refractory |
| V9 Post-kick LIF | 1,671,636 | 1671.6 | 2.23 | 36.16 | Brian2 timing |
| V10 Exact-leak LIF | 1,713,399 | 1713.4 | 2.29 | 32.78 | exp(-dt/tau) |
| V11 Q16.12 FixedPoint | 464,644 | 464.6 | 0.62 | 18.19 | 32-bit, 12 frac |
| V12 STDP LIF | 1,689,552 | 1689.6 | 2.26 | 758.21 | 2000 STDP synapses |
| V13 DotProduct LIF | 497,647 | 9952.9 | | 261.40 | n=50, bl=256 |
| V14 Sobol bitstream | 780,390 | 780.4 | 1.04 | 220.46 | low-discrepancy |
| V15 JAX vectorized | | | | | skipped: JAX not installed |
| V16 Recurrent reservoir | | 0.9997 (prob) | | 16.21 | probability domain |
| V17 Memristive defects | | 48.7 | | 51.62 | stuck=1%, var=5% |
| V18 Numba JIT | 1,685,521 | 1685.5 | 2.25 | 5.20 | 9.5× vs V1 |
| V19 PyTorch CUDA | 1,725,955 | 1726.0 | 2.31 | 5.70 | GTX 1060 6GB |
| V20 Vectorized NumPy | 1,725,955 | 1726.0 | 2.31 | 10.27 | batch update |
| V21 Sparse Numba (CSR) | 1,685,521 | 1685.5 | 2.25 | 0.49 | 10% connectivity |

Acceleration comparison (1000 neurons, 1000 ms)

| Backend | Wall (s) | Speedup vs V1 |
|---|---|---|
| V1 per-neuron Python | 49.33 | 1.0× |
| V20 vectorized NumPy | 10.27 | 4.8× |
| V18 Numba JIT | 5.20 | 9.5× |
| V19 PyTorch CUDA (GTX 1060) | 5.70 | 8.7× |
| V21 Sparse Numba (CSR) | 0.49 | 100.7× |
| Brian2 (Cython) | 1.60 | 30.8× |

10K neuron scaling

| Backend | Wall (s) | Memory |
|---|---|---|
| V18 Numba JIT (dense) | 15.3 | 800 MB (N²×8) |
| V21 Sparse Numba (CSR) | 22.6 | 80 MB (10% nnz) |
| Brian2 (C++ codegen) | 9.6 | sparse (internal) |

At 10K, Brian2's compiled C++ sparse codegen wins. V21 CSR reduces memory 10× but scattered index access prevents SIMD vectorization. The Rust SIMD CSR engine (planned) targets this gap.
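The scattered-access pattern that blocks SIMD can be sketched as below. The arrays stand in for `scipy.sparse.csr_matrix` internals (indptr/indices/data); the sizes and weights are illustrative.

```python
import numpy as np

# CSR spike propagation, as in V21: for each spiking presynaptic neuron,
# scatter its weight row into postsynaptic voltages. The indirect `indices`
# loads/stores are the part that resists vectorization.
indptr  = np.array([0, 2, 3, 3, 5, 6])                   # row offsets (pre)
indices = np.array([1, 4, 2, 0, 3, 1])                   # postsynaptic targets
data    = np.array([5.0, 5.0, -12.5, 5.0, 5.0, -12.5])   # synaptic weights

def propagate(v, spiked):
    """Apply delta-PSC jumps for every spiking presynaptic neuron."""
    for pre in np.flatnonzero(spiked):
        lo, hi = indptr[pre], indptr[pre + 1]
        np.add.at(v, indices[lo:hi], data[lo:hi])  # scattered accumulation
    return v

v = propagate(np.zeros(5), np.array([1, 0, 0, 1, 0], dtype=bool))
```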

Variant notes

  • V5 Izhikevich: Low spike rate expected — Izhikevich dynamics (quadratic nonlinearity, v range -65 to +30) respond differently to delta-PSC drive. Tonic baseline current of 5.0 added for sub-threshold depolarization.
  • V8 Refractory: 5-step (0.5 ms) dead time reduces max firing rate to ~2000 Hz, cutting observed rate by 15×.
  • V11 Q16.12: Higher precision fixed-point produces fewer spikes than Q8.8 due to more accurate leak computation (less rounding-induced depolarization).
  • V12 STDP: Online weight learning with 2000 STDP synapses. 15× slower due to per-synapse process_step() calls.
  • V14 Sobol: Low-discrepancy bitstream achieves 1.04× Brian2 ratio — closest match to reference among all spiking variants.
  • V18/V19/V20: Acceleration variants show 5–10× speedup over per-neuron Python loop. Numba JIT and PyTorch CUDA achieve similar wall times on this workload (1000 neurons); GPU advantage grows with N.
  • V21 Sparse Numba: scipy.sparse CSR connectivity. At 1K (10% connectivity): 100× faster than V1, 3× faster than V18 dense. At 10K: 1.5× slower than V18 due to scattered CSR index access, but uses 10× less memory (80 MB vs 800 MB).

11. Advanced Module Performance

| Module | Configuration | Latency (100 runs) | Per-run | Key metric |
|---|---|---|---|---|
| Quantum-Classical Hybrid | 64 qubits, L=1024 | 76.8 ms | 0.77 ms | cos²(θ/2) error < 0.03 |
| Event-Based GNN | 100 nodes, 5% density | 6.6 ms | 0.07 ms | 17× sparse reduction |
| Stochastic Transformer | d=64, 4 heads, L=512 | 1,691 ms | 16.9 ms | 196× energy vs FP32 MAC |
| BCI Decoder | 64 ch, 1 s signal | 19.5 ms | 0.20 ms | Native bitstream encoding |
| DVS Input Layer | 128×128, 1000 events | 1,249 ms | 12.5 ms | 492× data reduction |
| Chaotic RNG | 100K samples | 13.5 ms | | 7.42 Msample/s |
| Predictive World Model | 32-dim state, 50-step | 34.5 ms | 0.34 ms | 1000× sample efficiency |

12. FPGA Resource Utilization

Synthesis tooling (tools/yosys_synth.py) targets Xilinx 7-series via Yosys synth_xilinx. Yosys is not installed on this machine; run when available:

```shell
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown
```

Target modules: sc_bitstream_encoder, sc_lif_neuron, sc_bitstream_synapse, sc_dotproduct_to_current, sc_firing_rate_bank, sc_dense_layer_core, sc_neurocore_top.

Estimated: sc_bitstream_encoder < 100 LUTs (pending Yosys validation).


13. Bitstream Length Scaling (32x16 Dense)

Fixed network: 32 inputs, 16 neurons. Mean of 5 runs per length. Expected: roughly linear scaling (2x L = 2x time).

| L | Mean Time (ms) | Throughput (Mbit/s) |
|---|---|---|
| 128 | 0.43 | 151 |
| 256 | 0.66 | 197 |
| 512 | 1.05 | 250 |
| 1024 | 1.14 | 459 |
| 2048 | 1.36 | 773 |
| 4096 | 4.22 | 497 |

Scaling is sub-linear up to L=2048 due to NumPy vectorization amortizing fixed overhead. At L=4096, packed array allocation begins to dominate.
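The throughput column can be reproduced from the timings (this derivation is an assumption that happens to match the table): total bitstream bits = inputs × neurons × L, divided by the mean forward time.

```python
# Reconstruct the Mbit/s column for three rows of the 32x16 scaling table.
inputs, neurons = 32, 16
rows = [(128, 0.43), (1024, 1.14), (4096, 4.22)]  # (L, mean time in ms)
mbit_per_s = [inputs * neurons * L / (ms * 1e-3) / 1e6 for L, ms in rows]
# -> approximately [152, 460, 497], matching the 151 / 459 / 497 entries
```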


14. Memory Footprint (L=1024)

Peak allocation measured via tracemalloc (includes layer construction and one forward pass). Weight matrix size is the float64 weight array only.

| Config | Weight Matrix (MB) | Peak Alloc (MB) | Forward Time (ms) |
|---|---|---|---|
| 32x16 (tiny) | 0.004 | 4.63 | 3.1 |
| 64x32 (small) | 0.016 | 18.33 | 7.5 |
| 128x64 (medium) | 0.062 | 73.13 | 13.2 |
| 256x128 (large) | 0.250 | 292.31 | 26.2 |

Peak allocation scales as O(N_neurons * N_inputs * L) elements: the per-bit intermediate arrays generated before packing dominate, and at 8 B per float64 element they exceed the weight matrix by roughly L ≈ 1000x. The packed bitstreams themselves are 64× smaller (1 bit per element).
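The largest row can be checked against this scaling (an interpretation of the tracemalloc numbers, not a measured breakdown):

```python
# Consistency check for the 256x128 (L=1024) row: the dominant allocations
# behave like one float64 value per bitstream bit.
n_inputs, n_neurons, L = 256, 128, 1024
dominant_mb = n_inputs * n_neurons * L * 8 / 2**20   # float64 per bit
weight_mb   = n_inputs * n_neurons * 8 / 2**20       # float64 weight matrix
packed_mb   = n_inputs * n_neurons * L / 8 / 2**20   # packed form, for contrast
# dominant_mb = 256 MiB (cf. the 292 MB peak), weight_mb = 0.25 MiB -> ~1000x gap
```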


15. Reproducing

```shell
# Python benchmark suite (quick ~15s, full ~120s)
python benchmarks/benchmark_suite.py --full --markdown

# Rust Criterion benchmarks (~5 min)
cargo bench --manifest-path engine/Cargo.toml

# v2 vs v3 comparison (requires Rust wheel)
PYTHONPATH=src python benchmarks/bench_v2_vs_v3.py

# NeuroBench-aligned metrics
python benchmarks/neurobench_harness.py --json benchmarks/results/neurobench.json --markdown

# SNN comparison — 20 variants (requires brian2: pip install brian2)
python benchmarks/snn_comparison.py --all --adapted --sim-ms 1000 \
    --json benchmarks/results/snn_translator_20v.json --markdown

# Advanced modules
python benchmarks/benchmark_advanced_modules.py

# FPGA synthesis (requires yosys in PATH)
python tools/yosys_synth.py --json benchmarks/results/yosys_synth.json --markdown
```

16. snnTorch Head-to-Head Comparison

Artifact: benchmarks/results/snntorch_vs_sc_microbench.json

Three-way comparison: SC-NeuroCore (NumPy), SC-NeuroCore (Rust SIMD), snnTorch 0.9.4.

| Test | SC NumPy (µs/step) | SC Rust SIMD (µs/step) | snnTorch (µs/step) |
|---|---|---|---|
| Single neuron (1000 steps) | 3.7 | | 876 |
| Dense 100→50 (500 steps) | 2,280 | 1,059 | 1,103 |
| Scale 500→500 (100 steps) | 158,741 | 17,473 | 35,998 |
| Scale 1000→1000 (50 steps) | 602,730 | 28,882 | 9,421 |

Paradigm difference: SC-NeuroCore performs bit-true stochastic computation (uint64 popcount on packed bitstreams, L=256-512 bits per value). snnTorch does float32 matrix multiply. SC-NeuroCore is hardware-faithful (maps directly to Verilog RTL); snnTorch is GPU-optimized but not synthesizable.

  • At small scale (1-100 neurons), SC-NeuroCore's zero-overhead Python step is 237x faster than snnTorch's PyTorch dispatch overhead.
  • At medium scale (500 neurons), Rust SIMD engine is 2x faster than snnTorch.
  • At large scale (1000+), snnTorch's O(n^2) float matmul beats bitstream packing at O(n^2 * L).
  • Rust engine provides 9-21x speedup over Python SC at the 500- and 1000-neuron scales (about 2.2x on the 100->50 dense case).
Reproduce with:

```shell
python benchmarks/snntorch_vs_sc_microbench.py --runs 5 --scales 100 500 1000
```
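The crossover described above follows from a rough operation-count model (an illustration, not a measurement): SC evaluates n×n packed streams of L bits, while snnTorch performs one n×n float32 multiply-accumulate pass.

```python
# Operation counts behind the SC-vs-matmul crossover. Word ops assume one AND
# plus one popcount per 64-bit word of each of the n*n bitstreams.
def sc_word_ops(n, L):
    return n * n * (L // 64) * 2   # AND + popcount per uint64 word

def float_macs(n):
    return n * n                   # one MAC per weight

# At n=1000, L=512: 16M word ops vs 1M MACs. The L-fold extra work is why
# float matmul wins at scale despite SC's small-scale dispatch advantage.
```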

17. Spike Codec Library (2026-03-25)

Compression ratios for the spike codec library. All codecs lossless. Measured on (2000 x 64) rasters at various firing rates.

ISI Codec vs General-Purpose Compressors

Auto entropy selection (varint for sparse, Huffman for dense):

Firing Rate ISI (auto) zlib-9 lzma ISI Advantage
0.1% 401x 359x 194x +12% over zlib
1% 78x 65x 48x +20% over zlib
5% 24x 19x 20x +28% over zlib
10% 16x 12x 13x +30% over zlib
30% 8.8x 7.0x 7.8x +24% over zlib
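The varint branch of the auto entropy selection can be sketched as LEB128-style coding of inter-spike gaps, 7 data bits per byte with a continuation flag; dense rasters switch to a Huffman coder instead. This is an illustrative codec, not the library's exact bit format.

```python
def encode_isi(spike_times):
    """Delta-encode sorted spike times, then varint-pack each gap."""
    out, prev = bytearray(), 0
    for t in spike_times:
        gap, prev = t - prev, t
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # continuation byte
            gap >>= 7
        out.append(gap)                      # final byte, high bit clear
    return bytes(out)

def decode_isi(buf):
    times, t, gap, shift = [], 0, 0, 0
    for b in buf:
        gap |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            t += gap
            times.append(t)
            gap, shift = 0, 0
    return times

spikes = [3, 300, 301, 4096]   # sparse raster: 4 spikes in 4096 steps
encoded = encode_isi(spikes)   # 6 bytes, vs. 512 bytes for a raw bitmap
```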

Context Predictor on Structured Data

Periodic bursting (32ch, 5-spike bursts every 50 steps):

| Predictor | Ratio | Accuracy |
|---|---|---|
| ISI (no prediction) | 8.6x | |
| EMA | 8.5x | 90.0% |
| Context (Markov) | 25.5x | 97.8% |

Realistic SpikeInterface Benchmarks

SpikeInterface ground-truth recordings with physiological ISI distributions:

| Scenario | Channels | Units | Firing Rate | Best Ratio |
|---|---|---|---|---|
| Neuropixels-like | 96 | 10 | 1-5 Hz | 457x |
| BCI-scale | 256 | 50 | 0.5-3 Hz | 756x |
| High-density | 384 | 100 | 1-10 Hz | 317x |

All three scenarios exceed the Neuralink 200x compression target.

Yosys Synthesis (gate counts)

Generic gate-level synthesis via Yosys 0.63:

| Verilog Module | Cells | Function |
|---|---|---|
| sc_bitstream_encoder.v | 115 | LFSR predictor (bit-true with Python/Rust) |
| sc_cordiv.v | 2 | Stochastic division |
| sc_dotproduct_to_current.v | 448 | AND accumulation + popcount |
| sc_aer_encoder.v | 1,423 | Priority encoder for AER |
| sc_event_neuron.v | 2,135 | Event-driven LIF |
| sc_lif_neuron.v | 3,134 | Q8.8 fixed-point LIF |

1024-channel codec estimate: ~406K gates, ~0.02 mm^2 at 7nm.

WaveformCodec: Raw Electrode Compression

End-to-end pipeline: raw 10-bit ADC -> spike detect -> template match -> compress. Measured on synthetic 1024-channel, 1 second at 20 kHz:

| Metric | Value |
|---|---|
| Raw data | 40,960,000 bytes (328 Mbit/s) |
| Compressed (q=4) | 1,703,435 bytes (13.6 Mbit/s) |
| Compression ratio | 24x |
| Spikes detected | 3,087 |
| Templates learned | 16 |
| Bluetooth capacity | 15 Mbit/s |
| Fits in uplink | YES |

Scaling (4-bit background quantization):

| Channels | Raw Mbit/s | Compressed Mbit/s | Fits BT |
|---|---|---|---|
| 128 | 26 | 1.0 | YES |
| 256 | 51 | 2.0 | YES |
| 384 | 77 | 3.0 | YES |
| 1024 | 205 | 8.0 | YES |
| 3072 | 614 | 23.9 | NO |
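The two raw-rate figures above can be reconciled as follows (an interpretation, since the tables don't state it): the 328 Mbit/s in the metric table counts 16-bit storage containers for the 10-bit samples, while the scaling table's 205 Mbit/s counts only the 10 payload bits per sample.

```python
# Back-of-envelope check of the 1024-channel, 20 kHz, 1 s figures.
channels, fs = 1024, 20_000
raw_bytes = channels * fs * 2                 # 2-byte container per sample
container_mbit = raw_bytes * 8 / 1e6          # ~327.7 Mbit/s ("328")
payload_mbit   = channels * fs * 10 / 1e6     # 204.8 Mbit/s ("205")
compressed_mbit = 1_703_435 * 8 / 1e6         # ~13.6 Mbit/s, under 15 Mbit/s BT
```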

Competitive comparison (raw waveform compression):

| Method | Compression | Notes |
|---|---|---|
| MuSCoRE (2023) | 50-100x | Multi-scale decomposition, academic |
| CREST (2022) | 10-50x | Raw electrode, academic |
| SC-NeuroCore WaveformCodec | 24x | Spike-aware pipeline, open source |
| Delta + arithmetic (standard) | 5-15x | No spike awareness |

Notes

  • Python benchmarks run in --full mode (10x iterations vs quick).
  • Rust benchmarks use Criterion defaults (100 samples, 3s warmup).
  • v2 vs v3 comparison shows PyO3 FFI overhead for small payloads; Section 7 reports true Rust throughput without FFI.
  • Brian2 installs numpy 2.4.2 as a dependency; its benchmarks were run after downgrading to numpy 1.26.4 for sc-neurocore compatibility.