Rust Engine and Performance Optimization¶
SC-NeuroCore ships a Rust backend (sc_neurocore_engine) with SIMD-accelerated
popcount kernels and Rayon parallelism across neurons and inputs.
1. Installation and Verification¶
Requires Rust 1.75+ and maturin 1.x.
cd engine && maturin develop --release
sc-neurocore info
sc-neurocore 3.13.3
Python 3.12.x
Rust engine: 3.13.3 (avx2)
NumPy: 2.x.x
The --release flag enables fat LTO and a single codegen unit (see [profile.release] in engine/Cargo.toml). Debug builds are 10-50x slower.
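A minimal runtime check can mirror what sc-neurocore info reports (a sketch; it assumes only that the compiled module, when present, exposes simd_tier() as documented below):

```python
# Probe for the compiled Rust engine and report its SIMD tier;
# fall back cleanly when the wheel was not built with maturin.
try:
    import sc_neurocore_engine as engine
    tier = engine.simd_tier()        # e.g. "avx2"
    print(f"Rust engine active, SIMD tier: {tier}")
except ImportError:
    tier = None
    print("Rust engine not found; the pure-NumPy fallback will be used")
```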
2. SIMD Tiers¶
simd_tier() probes CPU features at runtime and returns the highest tier:
| Tier | Accelerated Operations |
|---|---|
| avx512-vpopcntdq | 512-bit pack, native VPOPCNTDQ, fused AND+popcount, f64 FMA |
| avx512bw | 512-bit pack, lookup-table popcount, f64 ops |
| avx512f | f64 dot/sum/max/scale only; popcount falls back to the AVX2 path |
| avx2 | 256-bit Harley-Seal popcount, FMA f64 dot |
| popcnt | hardware POPCNT per u64, scalar everything else |
| neon | AArch64 128-bit pack/popcount/f64 |
| portable | Rust count_ones() intrinsic, scalar loops |
import sc_neurocore_engine as engine
engine.simd_tier() # "avx2"
engine.set_num_threads(4) # pin rayon pool; 0 = all cores (call before first parallel op)
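Conceptually, the probe walks a fixed precedence order and returns the first tier the CPU supports. The helper below is a hypothetical pure-Python model of that selection (the real probe runs in Rust against CPUID); feature detection is reduced to set membership for illustration:

```python
# Tier precedence, highest first, matching the table above.
TIER_ORDER = [
    "avx512-vpopcntdq", "avx512bw", "avx512f",
    "avx2", "popcnt", "neon", "portable",
]

def pick_tier(detected_features: set) -> str:
    """Return the first (highest) tier present in the detected feature set;
    'portable' always matches as the scalar fallback."""
    for tier in TIER_ORDER:
        if tier == "portable" or tier in detected_features:
            return tier
    return "portable"

print(pick_tier({"avx2", "popcnt"}))   # -> "avx2"
```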
3. Forward-Pass Hierarchy¶
The Rust VectorizedSCLayer exposes six forward methods. All compute the same
stochastic dot product; they differ only in FFI overhead and parallelism
strategy.
from sc_neurocore_engine.layers import VectorizedSCLayer
import sc_neurocore_engine as engine
import numpy as np
layer = VectorizedSCLayer(n_inputs=64, n_neurons=32, length=2048)
x = np.random.uniform(0, 1, 64)
forward(x.tolist()) -- sequential encode, parallel accumulate (Rayon
when n_neurons >= 8). Incurs two list copies across the FFI boundary.
forward_fast(x.tolist(), seed=42) -- parallel encode (Rayon when
n_inputs >= 128) plus parallel accumulate. The fastest single-sample path for list input.
forward_numpy(x, seed=42) -- zero-copy float64 input via
PyReadonlyArray1. Preferred for single samples already held as NumPy arrays.
forward_prepacked(packed) / forward_prepacked_numpy(packed) --
skip encoding entirely. Pre-encode once with batch_encode_numpy, reuse
across layers:
packed = engine.batch_encode_numpy(x, length=2048, seed=42)
out = layer.forward_prepacked_numpy(packed) # zero-copy variant
forward_batch_numpy(batch, seed=42) -- (N, n_inputs) float64 array,
single FFI call, Rayon across samples and neurons:
batch = np.random.uniform(0, 1, (100, 64))
out = layer.forward_batch_numpy(batch, seed=42) # (100, 32)
Selection guide:
| Scenario | Method |
|---|---|
| Single sample, list input | forward_fast |
| Single sample, NumPy input | forward_numpy |
| Same input, multiple layers | forward_prepacked_numpy |
| Batch of N samples | forward_batch_numpy |
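The guide can be folded into a small convenience wrapper. This is a hypothetical helper, not part of the library's API; it assumes the method names listed above and ignores the prepacked path:

```python
import numpy as np

def forward_any(layer, x, seed=42):
    """Route to the cheapest forward variant based on input type/shape.
    Hypothetical wrapper; not part of the library's API."""
    if isinstance(x, np.ndarray):
        if x.ndim == 2:                                # (N, n_inputs) batch
            return layer.forward_batch_numpy(x, seed=seed)
        return layer.forward_numpy(x, seed=seed)       # single NumPy sample
    return layer.forward_fast(list(x), seed=seed)      # list/iterable input
```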
4. Python Fallback and GPU Path¶
Without the Rust engine, sc_neurocore.layers.VectorizedSCLayer uses packed
NumPy uint64 SWAR popcount (10-100x slower than Rust).
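For reference, the classic 64-bit SWAR popcount that the fallback relies on can be written in a few vectorized NumPy lines (a sketch of the technique, not necessarily the library's exact implementation):

```python
import numpy as np

def swar_popcount_u64(words: np.ndarray) -> np.ndarray:
    """Vectorized 64-bit SWAR popcount over a uint64 array: pairwise
    bit sums, then nibble sums, then a multiply to gather byte sums
    into the top byte (classic bit-twiddling algorithm)."""
    w = words.astype(np.uint64, copy=True)
    w -= (w >> np.uint64(1)) & np.uint64(0x5555555555555555)
    w = (w & np.uint64(0x3333333333333333)) + \
        ((w >> np.uint64(2)) & np.uint64(0x3333333333333333))
    w = (w + (w >> np.uint64(4))) & np.uint64(0x0F0F0F0F0F0F0F0F)
    return (w * np.uint64(0x0101010101010101)) >> np.uint64(56)
```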
With CuPy installed, set use_gpu=True for GPU offload via gpu_vec_mac
(broadcast AND + SWAR popcount on CUDA). Effective for large layers (256+
neurons, 4096+ bit streams); for small networks Rust SIMD beats GPU round-trip.
from sc_neurocore import VectorizedSCLayer
from sc_neurocore.accel import to_device, to_host, HAS_CUPY
layer = VectorizedSCLayer(n_inputs=256, n_neurons=128, length=4096, use_gpu=True)
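The math being offloaded is a bitwise AND followed by a popcount: for unipolar streams, AND multiplies probabilities, so popcount(x AND w) / L estimates the product. A CPU-side NumPy sketch of that estimate (the real kernel operates on uint64-packed words on CUDA):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4096                 # bitstream length
p, q = 0.75, 0.4         # the two probabilities to multiply

# Unipolar stochastic encoding: bit i is 1 with probability p (resp. q).
x = rng.random(L) < p
w = rng.random(L) < q

# AND + popcount is the whole MAC: popcount(x & w) / L estimates p * q.
est = np.count_nonzero(x & w) / L
print(f"estimate {est:.3f} vs exact {p * q:.3f}")
```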
5. Benchmarking¶
Python v2-only suite (pack, popcount, dense, GPU, scaling, memory):
python benchmarks/benchmark_suite.py --markdown # writes benchmarks/results/BENCHMARKS.md
python benchmarks/benchmark_suite.py --full # 10x iterations
v2 vs v3 head-to-head (all six forward variants, per-op speedup):
python examples/10_benchmark_report.py
Rust-native Criterion (no Python overhead):
cd engine
cargo bench --bench bitstream_bench
cargo bench --bench full_bench
cargo bench --bench scaling_bench
Bitstream length scaling¶
The suite sweeps a 32x16 dense layer at L = 128, 256, 512, 1024, 2048, 4096: wall time grows linearly with L, so throughput (Mbit/s) stays constant.
Batch size effects (forward_batch_numpy, 64->32, L=1024)¶
| Batch | Time | Regime |
|---|---|---|
| 1 | ~0.1 ms | FFI-dominated |
| 10 | ~0.3 ms | Rayon saturates |
| 100 | ~1.5 ms | linear |
| 1000 | ~15 ms | memory-bandwidth bound |
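Numbers like these are machine-dependent; a minimal perf_counter harness (a generic sketch, not a script shipped with the project) is enough to reproduce the regimes on local hardware:

```python
import time

def time_op(fn, *args, iters=50):
    """Median wall-clock time of fn(*args) over `iters` runs, in ms."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[len(samples) // 2]

# Usage against any forward variant, e.g.:
#   batch = np.random.uniform(0, 1, (100, 64))
#   print(time_op(layer.forward_batch_numpy, batch))
```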
6. Rayon Thresholds¶
| Constant | Value | Effect |
|---|---|---|
| RAYON_ENCODE_THRESHOLD | 128 | parallel input encoding when n_inputs >= 128 |
| RAYON_NEURON_THRESHOLD | 8 | parallel accumulation when n_neurons >= 8 |
Below these, sequential Rust loops avoid thread-pool synchronization cost.
7. Representative Speedups¶
AMD Zen 4 (AVX-512), Rust 1.82, Python 3.12:
| Operation | Python (ms) | Rust (ms) | Speedup |
|---|---|---|---|
| pack 1M bits | 12.3 | 0.18 | 68x |
| popcount 1M words | 8.7 | 0.09 | 97x |
| dense 64->32, L=1024 (list) | 45.2 | 0.52 | 87x |
| dense 64->32, L=1024 (numpy) | 45.2 | 0.31 | 146x |
| dense batch 100x64->32 | 4520 | 18.1 | 250x |
| LIF 100K steps (batch) | 312 | 0.74 | 422x |
On AVX2, expect 30-60% of AVX-512 throughput for popcount-dominated workloads. ARM NEON provides 2-4x over portable scalar.