© 1998–2026 Miroslav Šotek. All rights reserved. Contact: www.anulum.li | protoscience@anulum.li ORCID: https://orcid.org/0009-0009-3560-0851 License: GNU AFFERO GENERAL PUBLIC LICENSE v3 Commercial Licensing: Available
SC-NeuroCore v3 Benchmark Report
Version: 3.14.0
Date: 2026-04-13 (core engine benchmarks from 2026-03-15, FPGA added)
Previous: 3.6.0 (2026-02-10)
SIMD Tier: avx512-vpopcntdq
Baseline Definition and Routing Note
- `v2` in this report means the SC-NeuroCore v2 Python reference path measured by the same benchmark harness.
- External framework baselines (Norse/Sinabs/Lava CPU) are not yet included in this file and must be added before claiming ecosystem-level superiority.
- For low-latency use (single sample or micro-batch), prefer `DenseLayer.forward_fast`.
- For throughput use (batch >= 10), prefer `DenseLayer.forward_batch_numpy`.
Fused Dense, Fast PRNG, and Batch Forward Results
Measured via examples/03_benchmark_report.py on this machine.
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 11.538 | 35.448 | 0.3x | 6x |
| pack (numpy, 1000K) | 11.538 | 0.129 | 89.3x | 6x |
| popcount (list, 1000K) | 109.023 | 151.322 | 0.7x | 20x |
| popcount (numpy, 1000K) | 109.023 | 1.989 | 54.8x | 20x |
| dense forward (64->32, L=1024) | 3.728 | 1.598 | 2.3x | 70x |
| dense fast (64->32, L=1024) | 3.728 | 0.299 | 12.4x | 70x |
| dense prepacked (64->32, L=1024) | 3.728 | 0.282 | 13.2x | 70x |
| dense prepacked numpy (64->32, L=1024) | 3.728 | 0.110 | 33.9x | 70x |
| dense numpy (64->32, L=1024) | 3.728 | 0.647 | 5.8x | 70x |
| dense fused (64->32, L=1024) | 4.664 | 0.380 | 12.3x | 70x |
| dense batch (100x64->32, L=1024) | 289.305 | 6.893 | 42.0x | 70x |
| LIF (per-call, 100K) | 126.313 | 25.525 | 4.9x | 400x |
| LIF (batch, 100K) | 126.313 | 0.905 | 139.6x | 400x |
| LIF multi (100x100K) | 12911.296 | 25.196 | 512.4x | 400x |
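The Speedup column is consistent with the v2 time divided by the v3 time. The sketch below recomputes it for a few rows and compares it with the Target column, which is read here as a minimum speedup (an assumption; the report does not define it formally). `Row` and `ROWS` are names made up for this example. Because the table shows rounded millisecond values, a recomputed speedup can differ from the printed one in the last digit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    name: str
    v2_ms: float
    v3_ms: float
    target: float  # assumed to be the minimum speedup required

    @property
    def speedup(self):
        # Speedup is baseline time over new time.
        return self.v2_ms / self.v3_ms

    @property
    def meets_target(self):
        return self.speedup >= self.target


# Values copied from the table above.
ROWS = [
    Row("pack (numpy, 1000K)", 11.538, 0.129, 6.0),
    Row("dense batch (100x64->32, L=1024)", 289.305, 6.893, 70.0),
    Row("LIF (batch, 100K)", 126.313, 0.905, 400.0),
    Row("LIF multi (100x100K)", 12911.296, 25.196, 400.0),
]

for row in ROWS:
    status = "meets" if row.meets_target else "below"
    print(f"{row.name}: {row.speedup:.1f}x ({status} {row.target:.0f}x target)")
```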
Criterion Diagnosis for Fused Dense and Fast PRNG
Measured via targeted commands:

```powershell
cargo bench --bench full_bench dense_forward_fused
cargo bench --bench full_bench encode_and_popcount
cargo bench --bench full_bench dense_forward_batch
cargo bench --bench full_bench prng_xoshiro
```
| Benchmark | Time (95% CI) |
|---|---|
| dense_forward_fused_64x32 | 1.1268 ms - 1.9825 ms |
| bernoulli_encode_and_popcount_1024 | 342.59 ns - 408.10 ns |
| dense_forward_batch_64x32_x100 | 21.842 ms - 28.753 ms |
| prng_xoshiro_fill_1024 | 1.5879 us - 1.7596 us |
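The intervals in this table use mixed units, which makes their spread hard to compare at a glance. The helper below is a minimal sketch that converts each interval to nanoseconds and reports its width relative to its midpoint. `parse_ci` and `relative_width` are hypothetical names, `us` is read as microseconds, and the input strings are copied from the table rather than parsed from Criterion output.

```python
UNIT_TO_NS = {"ns": 1.0, "us": 1e3, "ms": 1e6, "s": 1e9}


def parse_ci(text):
    """Parse 'LOW UNIT - HIGH UNIT' into (low_ns, high_ns)."""
    low_part, high_part = (part.split() for part in text.split(" - "))
    low = float(low_part[0]) * UNIT_TO_NS[low_part[1]]
    high = float(high_part[0]) * UNIT_TO_NS[high_part[1]]
    return low, high


def relative_width(low, high):
    """Interval width as a fraction of the interval midpoint."""
    midpoint = (low + high) / 2.0
    return (high - low) / midpoint


INTERVALS = {
    "dense_forward_fused_64x32": "1.1268 ms - 1.9825 ms",
    "bernoulli_encode_and_popcount_1024": "342.59 ns - 408.10 ns",
    "dense_forward_batch_64x32_x100": "21.842 ms - 28.753 ms",
    "prng_xoshiro_fill_1024": "1.5879 us - 1.7596 us",
}

for name, text in INTERVALS.items():
    low, high = parse_ci(text)
    print(f"{name}: {low:.1f} ns to {high:.1f} ns, "
          f"relative width {relative_width(low, high):.0%}")
```

A wide relative interval, such as the one for `dense_forward_fused_64x32`, suggests a noisy measurement that may be worth repeating before drawing conclusions from it.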
Interpretation:

- The fused encode+AND+popcount path is functionally correct and benchmarked end-to-end.
- The batched dense API substantially reduces Python-level overhead compared with per-sample loops.
- Multi-neuron LIF (512.4x in the table above) remains above the Blueprint 400x target on this host.
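For readers new to the encode+AND+popcount sequence named above, the following NumPy sketch shows what such a path computes in stochastic computing: two values in [0, 1] are encoded as Bernoulli bitstreams, ANDed bitwise, and the fraction of set bits approximates their product. This is not the SC-NeuroCore implementation, and it is not fused. All function names are hypothetical, and the stream length of 1024 is borrowed from the `L=1024` benchmark rows.

```python
import numpy as np

# Popcount lookup table: number of set bits for every byte value.
_POPCOUNT_TABLE = np.array([bin(i).count("1") for i in range(256)],
                           dtype=np.uint8)


def bernoulli_encode(p, length, rng):
    """Draw `length` Bernoulli(p) bits and pack them into uint8 words."""
    bits = rng.random(length) < p
    return np.packbits(bits)


def popcount(packed):
    """Count the set bits in a packed uint8 array."""
    return int(_POPCOUNT_TABLE[packed].sum())


def sc_multiply(p, w, length=1024, seed=0):
    """Estimate p * w from two independent Bernoulli bitstreams."""
    rng = np.random.default_rng(seed)
    x_bits = bernoulli_encode(p, length, rng)
    w_bits = bernoulli_encode(w, length, rng)
    # A bit of the AND stream is 1 with probability p * w.
    return popcount(x_bits & w_bits) / length


estimate = sc_multiply(0.5, 0.5)
print(f"estimate={estimate:.3f}, exact={0.5 * 0.5:.3f}")
```

The estimate is noisy by construction: its standard deviation shrinks roughly with the square root of the stream length, so longer streams trade time for accuracy.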
SIMD Dense Inner Loop Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 10.337 | 37.799 | 0.3x | 6x |
| pack (numpy, 1000K) | 10.337 | 0.069 | 149.3x | 6x |
| popcount (list, 1000K) | 96.956 | 135.444 | 0.7x | 20x |
| popcount (numpy, 1000K) | 96.956 | 1.563 | 62.0x | 20x |
| dense forward (64->32, L=1024) | 2.953 | 0.683 | 4.3x | 70x |
| dense fast (64->32, L=1024) | 2.953 | 0.171 | 17.3x | 70x |
| dense prepacked (64->32, L=1024) | 2.953 | 0.092 | 31.9x | 70x |
| dense prepacked numpy (64->32, L=1024) | 2.953 | 0.033 | 90.2x | 70x |
| dense numpy (64->32, L=1024) | 2.953 | 0.118 | 25.1x | 70x |
| LIF (per-call, 100K) | 106.451 | 23.925 | 4.4x | 400x |
| LIF (batch, 100K) | 106.451 | 0.897 | 118.7x | 400x |
| LIF multi (100x100K) | 13349.151 | 31.783 | 420.0x | 400x |
SIMD Pack Dispatch Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 16.918 | 45.964 | 0.4x | 6x |
| pack (numpy, 1000K) | 16.918 | 0.133 | 127.0x | 6x |
| popcount (list, 1000K) | 94.333 | 138.951 | 0.7x | 20x |
| popcount (numpy, 1000K) | 94.333 | 1.303 | 72.4x | 20x |
| dense forward (64->32, L=1024) | 7.077 | 19.442 | 0.4x | 70x |
| dense fast (64->32, L=1024) | 7.077 | 17.781 | 0.4x | 70x |
| dense prepacked (64->32, L=1024) | 7.077 | 5.453 | 1.3x | 70x |
| dense prepacked numpy (64->32, L=1024) | 7.077 | 6.125 | 1.2x | 70x |
| dense numpy (64->32, L=1024) | 7.077 | 6.727 | 1.1x | 70x |
| LIF (per-call, 100K) | 139.417 | 27.015 | 5.2x | 400x |
| LIF (batch, 100K) | 139.417 | 0.992 | 140.5x | 400x |
| LIF multi (100x100K) | 15442.319 | 90.480 | 170.7x | 400x |
Fast Bernoulli Encoding Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 10.807 | 62.841 | 0.2x | 6x |
| pack (numpy, 1000K) | 10.807 | 9.415 | 1.1x | 6x |
| popcount (list, 1000K) | 118.885 | 144.767 | 0.8x | 20x |
| popcount (numpy, 1000K) | 118.885 | 1.866 | 63.7x | 20x |
| dense forward (64->32, L=1024) | 6.971 | 8.034 | 0.9x | 70x |
| dense fast (64->32, L=1024) | 6.971 | 6.125 | 1.1x | 70x |
| dense prepacked (64->32, L=1024) | 6.971 | 3.599 | 1.9x | 70x |
| dense prepacked numpy (64->32, L=1024) | 6.971 | 0.085 | 81.6x | 70x |
| dense numpy (64->32, L=1024) | 6.971 | 4.908 | 1.4x | 70x |
| LIF (per-call, 100K) | 143.202 | 35.008 | 4.1x | 400x |
| LIF (batch, 100K) | 143.202 | 1.404 | 102.0x | 400x |
Dense Forward Optimization Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 15.208 | 54.526 | 0.3x | 6x |
| pack (numpy, 1000K) | 15.208 | 10.315 | 1.5x | 6x |
| popcount (list, 1000K) | 108.495 | 316.783 | 0.3x | 20x |
| popcount (numpy, 1000K) | 108.495 | 1.242 | 87.4x | 20x |
| dense forward (64->32, L=1024) | 4.173 | 20.570 | 0.2x | 70x |
| dense fast (64->32, L=1024) | 4.173 | 4.318 | 1.0x | 70x |
| dense prepacked (64->32, L=1024) | 4.173 | 0.562 | 7.4x | 70x |
| LIF (per-call, 100K) | 240.266 | 61.585 | 3.9x | 400x |
| LIF (batch, 100K) | 240.266 | 1.496 | 160.6x | 400x |