© 1998–2026 Miroslav Šotek. All rights reserved. Contact: www.anulum.li | protoscience@anulum.li ORCID: https://orcid.org/0009-0009-3560-0851 License: GNU AFFERO GENERAL PUBLIC LICENSE v3 Commercial Licensing: Available
SC-NeuroCore v3 Benchmark Report
Version: 3.15.8
Date: 2026-04-13 (core engine benchmarks from 2026-03-15, FPGA added)
Previous: 3.6.0 (2026-02-10)
SIMD Tier: avx512-vpopcntdq
Baseline Definition and Routing Note
- `v2` in this report means the SC-NeuroCore v2 Python reference path measured by the same benchmark harness.
- External framework baselines (Norse/Sinabs/Lava CPU) are not yet included in this file and must be added before making ecosystem-level claims.
- For low-latency use (single sample or micro-batch), prefer `DenseLayer.forward_fast`.
- For throughput use (batch >= 10), prefer `DenseLayer.forward_batch_numpy`.
- This report is release evidence only for rows backed by committed benchmark artefacts or named tool reports. New local benchmark numbers must not be promoted into public claims until the raw JSON, CSV, or companion paper artefact, together with its environment provenance, is committed.
Fused Dense, Fast PRNG, and Batch Forward Results
Measured via examples/03_benchmark_report.py on this machine.
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 11.538 | 35.448 | 0.3x | 6x |
| pack (numpy, 1000K) | 11.538 | 0.129 | 89.3x | 6x |
| popcount (list, 1000K) | 109.023 | 151.322 | 0.7x | 20x |
| popcount (numpy, 1000K) | 109.023 | 1.989 | 54.8x | 20x |
| dense forward (64->32, L=1024) | 3.728 | 1.598 | 2.3x | 70x |
| dense fast (64->32, L=1024) | 3.728 | 0.299 | 12.4x | 70x |
| dense prepacked (64->32, L=1024) | 3.728 | 0.282 | 13.2x | 70x |
| dense prepacked numpy (64->32, L=1024) | 3.728 | 0.110 | 33.9x | 70x |
| dense numpy (64->32, L=1024) | 3.728 | 0.647 | 5.8x | 70x |
| dense fused (64->32, L=1024) | 4.664 | 0.380 | 12.3x | 70x |
| dense batch (100x64->32, L=1024) | 289.305 | 6.893 | 42.0x | 70x |
| LIF (per-call, 100K) | 126.313 | 25.525 | 4.9x | 400x |
| LIF (batch, 100K) | 126.313 | 0.905 | 139.6x | 400x |
| LIF multi (100x100K) | 12911.296 | 25.196 | 512.4x | 400x |
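As a cross-check on how the Speedup column relates to the two timing columns, the sketch below recomputes speedup as v2 time divided by v3 time for a few rows copied from the table above and compares each result with its target. The helper names are illustrative. That the report script derives the column exactly this way is an assumption, and last-digit differences are possible where the table was computed from unrounded timings.

```python
# Rows copied from the table above: (operation, v2_ms, v3_ms, target_speedup).
ROWS = [
    ("pack (list, 1000K)", 11.538, 35.448, 6.0),
    ("dense batch (100x64->32, L=1024)", 289.305, 6.893, 70.0),
    ("LIF (batch, 100K)", 126.313, 0.905, 400.0),
    ("LIF multi (100x100K)", 12911.296, 25.196, 400.0),
]


def speedup(v2_ms: float, v3_ms: float) -> float:
    """Speedup of v3 over v2; values below 1.0 mean v3 is slower."""
    return v2_ms / v3_ms


def summarize(rows):
    """Return (operation, speedup rounded to 1 decimal, meets_target) per row."""
    out = []
    for name, v2_ms, v3_ms, target in rows:
        s = speedup(v2_ms, v3_ms)
        out.append((name, round(s, 1), s >= target))
    return out


if __name__ == "__main__":
    for name, s, ok in summarize(ROWS):
        status = "meets" if ok else "below"
        print(f"{name}: {s:.1f}x ({status} target)")
```

Of the rows sampled here, only the multi-neuron LIF row reaches its target.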
Criterion Diagnosis for Fused Dense and Fast PRNG
Measured via targeted commands (PowerShell):
cargo bench --bench full_bench dense_forward_fused
cargo bench --bench full_bench encode_and_popcount
cargo bench --bench full_bench dense_forward_batch
cargo bench --bench full_bench prng_xoshiro
| Benchmark | Time (95% CI) |
|---|---|
| dense_forward_fused_64x32 | 1.1268 ms - 1.9825 ms |
| bernoulli_encode_and_popcount_1024 | 342.59 ns - 408.10 ns |
| dense_forward_batch_64x32_x100 | 21.842 ms - 28.753 ms |
| prng_xoshiro_fill_1024 | 1.5879 us - 1.7596 us |
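The intervals above mix ns, us, and ms, while the other tables in this report use milliseconds. A minimal sketch for normalising a `Time (95% CI)` cell to milliseconds follows; `parse_ci_ms` is a hypothetical helper, and the cell format is assumed to match the rows above exactly.

```python
# Conversion factors from the unit spellings used in the table to milliseconds.
_UNIT_TO_MS = {"ns": 1e-6, "us": 1e-3, "ms": 1.0, "s": 1e3}


def parse_ci_ms(cell: str):
    """Parse 'LOW UNIT - HIGH UNIT' into a (low_ms, high_ms) tuple."""
    low_text, high_text = cell.split(" - ")
    bounds = []
    for text in (low_text, high_text):
        value, unit = text.split()
        bounds.append(float(value) * _UNIT_TO_MS[unit])
    return tuple(bounds)


if __name__ == "__main__":
    low, high = parse_ci_ms("21.842 ms - 28.753 ms")
    # Assumption: the "_x100" suffix means 100 samples per iteration.
    print(f"per-sample: {low / 100:.3f} ms - {high / 100:.3f} ms")
```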
Interpretation:
- Fused encode+AND+popcount path is functionally correct and benchmarked end-to-end.
- Batched dense API reduces Python-level overhead substantially vs per-sample loops.
- Multi-neuron LIF remains above the Blueprint 400x target on this host.
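For readers unfamiliar with the encode+AND+popcount pattern named above, the following pure-Python sketch shows the arithmetic it relies on: two probabilities are encoded as independent Bernoulli bitstreams, the streams are ANDed, and the fraction of set bits estimates the product. It is not the benchmarked Rust kernel. The function names, the use of Python's `random` module instead of the xoshiro PRNG, and the single-int bit packing are all assumptions made for illustration.

```python
import random


def bernoulli_encode(p: float, length: int, rng: random.Random) -> int:
    """Pack `length` Bernoulli(p) samples into one Python int (bit i = sample i)."""
    word = 0
    for i in range(length):
        if rng.random() < p:
            word |= 1 << i
    return word


def popcount(word: int) -> int:
    """Count set bits; a portable stand-in for a hardware popcount."""
    return bin(word).count("1")


def sc_multiply(p: float, q: float, length: int = 1024, seed: int = 0) -> float:
    """Estimate p * q by ANDing two independent bitstreams of `length` bits."""
    rng = random.Random(seed)
    a = bernoulli_encode(p, length, rng)
    b = bernoulli_encode(q, length, rng)
    return popcount(a & b) / length


if __name__ == "__main__":
    print(sc_multiply(0.5, 0.5, length=1024))
```

The estimate is noisy at L=1024 and tightens as the stream length grows, which is why stream length appears as a parameter in the dense rows of the tables.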
SIMD Dense Inner Loop Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 10.337 | 37.799 | 0.3x | 6x |
| pack (numpy, 1000K) | 10.337 | 0.069 | 149.3x | 6x |
| popcount (list, 1000K) | 96.956 | 135.444 | 0.7x | 20x |
| popcount (numpy, 1000K) | 96.956 | 1.563 | 62.0x | 20x |
| dense forward (64->32, L=1024) | 2.953 | 0.683 | 4.3x | 70x |
| dense fast (64->32, L=1024) | 2.953 | 0.171 | 17.3x | 70x |
| dense prepacked (64->32, L=1024) | 2.953 | 0.092 | 31.9x | 70x |
| dense prepacked numpy (64->32, L=1024) | 2.953 | 0.033 | 90.2x | 70x |
| dense numpy (64->32, L=1024) | 2.953 | 0.118 | 25.1x | 70x |
| LIF (per-call, 100K) | 106.451 | 23.925 | 4.4x | 400x |
| LIF (batch, 100K) | 106.451 | 0.897 | 118.7x | 400x |
| LIF multi (100x100K) | 13349.151 | 31.783 | 420.0x | 400x |
SIMD Pack Dispatch Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 16.918 | 45.964 | 0.4x | 6x |
| pack (numpy, 1000K) | 16.918 | 0.133 | 127.0x | 6x |
| popcount (list, 1000K) | 94.333 | 138.951 | 0.7x | 20x |
| popcount (numpy, 1000K) | 94.333 | 1.303 | 72.4x | 20x |
| dense forward (64->32, L=1024) | 7.077 | 19.442 | 0.4x | 70x |
| dense fast (64->32, L=1024) | 7.077 | 17.781 | 0.4x | 70x |
| dense prepacked (64->32, L=1024) | 7.077 | 5.453 | 1.3x | 70x |
| dense prepacked numpy (64->32, L=1024) | 7.077 | 6.125 | 1.2x | 70x |
| dense numpy (64->32, L=1024) | 7.077 | 6.727 | 1.1x | 70x |
| LIF (per-call, 100K) | 139.417 | 27.015 | 5.2x | 400x |
| LIF (batch, 100K) | 139.417 | 0.992 | 140.5x | 400x |
| LIF multi (100x100K) | 15442.319 | 90.480 | 170.7x | 400x |
Fast Bernoulli Encoding Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 10.807 | 62.841 | 0.2x | 6x |
| pack (numpy, 1000K) | 10.807 | 9.415 | 1.1x | 6x |
| popcount (list, 1000K) | 118.885 | 144.767 | 0.8x | 20x |
| popcount (numpy, 1000K) | 118.885 | 1.866 | 63.7x | 20x |
| dense forward (64->32, L=1024) | 6.971 | 8.034 | 0.9x | 70x |
| dense fast (64->32, L=1024) | 6.971 | 6.125 | 1.1x | 70x |
| dense prepacked (64->32, L=1024) | 6.971 | 3.599 | 1.9x | 70x |
| dense prepacked numpy (64->32, L=1024) | 6.971 | 0.085 | 81.6x | 70x |
| dense numpy (64->32, L=1024) | 6.971 | 4.908 | 1.4x | 70x |
| LIF (per-call, 100K) | 143.202 | 35.008 | 4.1x | 400x |
| LIF (batch, 100K) | 143.202 | 1.404 | 102.0x | 400x |
Dense Forward Optimization Results (Reference)
| Operation | v2 (ms) | v3 (ms) | Speedup | Target |
|---|---|---|---|---|
| pack (list, 1000K) | 15.208 | 54.526 | 0.3x | 6x |
| pack (numpy, 1000K) | 15.208 | 10.315 | 1.5x | 6x |
| popcount (list, 1000K) | 108.495 | 316.783 | 0.3x | 20x |
| popcount (numpy, 1000K) | 108.495 | 1.242 | 87.4x | 20x |
| dense forward (64->32, L=1024) | 4.173 | 20.570 | 0.2x | 70x |
| dense fast (64->32, L=1024) | 4.173 | 4.318 | 1.0x | 70x |
| dense prepacked (64->32, L=1024) | 4.173 | 0.562 | 7.4x | 70x |
| LIF (per-call, 100K) | 240.266 | 61.585 | 3.9x | 400x |
| LIF (batch, 100K) | 240.266 | 1.496 | 160.6x | 400x |