Rust Engine and Performance Optimization

SC-NeuroCore ships a Rust backend (sc_neurocore_engine) with SIMD-accelerated popcount kernels and Rayon parallelism across neurons and inputs.

1. Installation and Verification

Requires Rust 1.75+ and maturin 1.x.

```console
$ cd engine && maturin develop --release
$ sc-neurocore info
sc-neurocore 3.13.3
Python 3.12.x
Rust engine: 3.13.3 (avx2)
NumPy: 2.x.x
```

`--release` enables fat LTO and a single codegen unit (see `[profile.release]` in engine/Cargo.toml). Debug builds are 10-50x slower.

2. SIMD Tiers

simd_tier() probes CPU features at runtime and returns the highest tier:

| Tier | Accelerated operations |
| --- | --- |
| `avx512-vpopcntdq` | 512-bit pack, native VPOPCNTDQ, fused AND+popcount, f64 FMA |
| `avx512bw` | 512-bit pack, lookup-table popcount, f64 ops |
| `avx512f` | f64 dot/sum/max/scale only; popcount falls back to the AVX2 path |
| `avx2` | 256-bit Harley-Seal popcount, FMA f64 dot |
| `popcnt` | hardware POPCNT per u64, scalar everything else |
| `neon` | AArch64 128-bit pack/popcount/f64 |
| `portable` | Rust `count_ones()`, scalar loops |

```python
import sc_neurocore_engine as engine

engine.simd_tier()          # "avx2"
engine.set_num_threads(4)   # pin the Rayon pool; 0 = all cores (call before the first parallel op)
```
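Conceptually, the probe walks a fixed tier order and returns the first tier whose CPU features are present. A minimal Python sketch of that selection, illustrative only: the real probing happens in Rust via CPUID, and the `pick_tier` helper and `detected` set here are assumptions.

```python
# Tier names as listed above, best first; the selection logic is an
# illustrative stand-in for the engine's Rust-side feature probing.
TIER_ORDER = [
    "avx512-vpopcntdq", "avx512bw", "avx512f",
    "avx2", "popcnt", "neon", "portable",
]

def pick_tier(detected):
    """Return the highest tier whose feature set was detected."""
    for tier in TIER_ORDER:
        if tier in detected:
            return tier
    return "portable"

print(pick_tier({"popcnt", "avx2"}))  # -> avx2
```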

3. Forward-Pass Hierarchy

The Rust VectorizedSCLayer exposes six forward methods. All produce identical stochastic dot-product results; they differ in FFI overhead and parallelism strategy.

```python
from sc_neurocore_engine.layers import VectorizedSCLayer
import sc_neurocore_engine as engine
import numpy as np

layer = VectorizedSCLayer(n_inputs=64, n_neurons=32, length=2048)
x = np.random.uniform(0, 1, 64)
```

`forward(x.tolist())` -- sequential encode, parallel accumulate (Rayon when `n_neurons >= 8`). Two list copies across the FFI boundary.

`forward_fast(x.tolist(), seed=42)` -- parallel encode (Rayon when `n_inputs >= 128`) plus parallel accumulate. Best list-input single-sample path.

`forward_numpy(x, seed=42)` -- zero-copy float64 input via `PyReadonlyArray1`. Preferred for single samples from NumPy.

`forward_prepacked(packed)` / `forward_prepacked_numpy(packed)` -- skip encoding entirely. Pre-encode once with `batch_encode_numpy`, then reuse across layers:

```python
packed = engine.batch_encode_numpy(x, length=2048, seed=42)
out = layer.forward_prepacked_numpy(packed)  # zero-copy variant
```
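Conceptually, pre-packing Bernoulli-samples each input value (its probability of a 1-bit) into a bitstream and packs the bits into uint64 words. A minimal NumPy sketch of that step, assuming nothing about the engine's actual RNG or bit layout (`encode_prepacked` is a hypothetical stand-in for `batch_encode_numpy`):

```python
import numpy as np

def encode_prepacked(x, length, seed):
    """Bernoulli-encode each value in x as a bitstream of `length` bits,
    then pack the bits into uint64 words. Illustrative stand-in for
    engine.batch_encode_numpy; the engine's RNG and layout differ."""
    rng = np.random.default_rng(seed)
    bits = rng.random((len(x), length)) < np.asarray(x)[:, None]  # (n, L) bool
    packed8 = np.packbits(bits, axis=1)                           # (n, L/8) uint8
    return packed8.view(np.uint64)                                # (n, L/64)

packed = encode_prepacked([0.25, 0.75], length=2048, seed=42)
print(packed.shape)  # (2, 32): 2 inputs, 2048/64 words each
```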

`forward_batch_numpy(batch, seed=42)` -- accepts an (N, n_inputs) float64 array in a single FFI call, with Rayon across samples and neurons:

```python
batch = np.random.uniform(0, 1, (100, 64))
out = layer.forward_batch_numpy(batch, seed=42)  # (100, 32)
```

Selection guide:

| Scenario | Method |
| --- | --- |
| Single sample, list input | `forward_fast` |
| Single sample, NumPy input | `forward_numpy` |
| Same input, multiple layers | `forward_prepacked_numpy` |
| Batch of N samples | `forward_batch_numpy` |
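All six methods reduce to the same core computation: broadcast-AND the input bitstreams against each neuron's weight bitstreams, then popcount the surviving bits. A pure-NumPy sketch of that reduction (the helper names are hypothetical, and the engine's weight encoding and output normalization are omitted):

```python
import numpy as np

def popcount_words(words):
    """Popcount per row via a uint8 view + unpackbits (portable NumPy)."""
    return np.unpackbits(words.view(np.uint8), axis=-1).sum(axis=-1)

def sc_dot(x_packed, w_packed):
    """Stochastic dot-product core: AND, then popcount, summed per neuron.
    x_packed: (n_inputs, words); w_packed: (n_neurons, n_inputs, words).
    Returns (n_neurons,) raw matching-bit counts."""
    matched = w_packed & x_packed[None, :, :]  # broadcast AND per neuron
    return popcount_words(matched.reshape(w_packed.shape[0], -1))
```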

4. Python Fallback and GPU Path

Without the Rust engine, `sc_neurocore.layers.VectorizedSCLayer` falls back to a packed NumPy uint64 SWAR popcount implementation, typically 10-100x slower than the Rust engine.

With CuPy installed, set `use_gpu=True` to offload via `gpu_vec_mac` (broadcast AND + SWAR popcount on CUDA). This pays off for large layers (256+ neurons, 4096+ bit streams); for small networks the Rust SIMD path beats the GPU round-trip.

```python
from sc_neurocore import VectorizedSCLayer
from sc_neurocore.accel import to_device, to_host, HAS_CUPY

layer = VectorizedSCLayer(n_inputs=256, n_neurons=128, length=4096, use_gpu=True)
```

5. Benchmarking

Python v2-only suite (pack, popcount, dense, GPU, scaling, memory):

```shell
python benchmarks/benchmark_suite.py --markdown   # writes benchmarks/results/BENCHMARKS.md
python benchmarks/benchmark_suite.py --full       # 10x iterations
```

v2 vs v3 head-to-head (all six forward variants, per-op speedup):

```shell
python examples/10_benchmark_report.py
```

Rust-native Criterion (no Python overhead):

```shell
cd engine
cargo bench --bench bitstream_bench
cargo bench --bench full_bench
cargo bench --bench scaling_bench
```

Bitstream length scaling

The suite sweeps a 32x16 dense layer at L = 128, 256, 512, 1024, 2048, and 4096: throughput (Mbit/s) stays roughly constant across L, so wall time grows linearly with stream length.

Batch size effects (forward_batch_numpy, 64->32, L=1024)

| Batch | Time | Regime |
| --- | --- | --- |
| 1 | ~0.1 ms | FFI-dominated |
| 10 | ~0.3 ms | Rayon saturates |
| 100 | ~1.5 ms | linear scaling |
| 1000 | ~15 ms | memory-bandwidth bound |
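These regimes reflect a fixed per-call FFI cost that amortizes with batch size; dividing the table's timings by N gives the per-sample latency:

```python
# Per-sample latency implied by the batch timings above (table values).
timings_ms = {1: 0.1, 10: 0.3, 100: 1.5, 1000: 15.0}
for n, t in timings_ms.items():
    print(f"batch {n:4d}: {t / n * 1000:6.1f} us/sample")
```

Per-sample cost drops from ~100 us at batch 1 to ~15 us at batch 100 and beyond, which is why the batched path wins for any N > 1.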

6. Rayon Thresholds

| Constant | Value | Effect |
| --- | --- | --- |
| `RAYON_ENCODE_THRESHOLD` | 128 | parallel input encoding when `n_inputs >= 128` |
| `RAYON_NEURON_THRESHOLD` | 8 | parallel accumulation when `n_neurons >= 8` |

Below these, sequential Rust loops avoid thread-pool synchronization cost.
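The resulting dispatch can be sketched as a pair of comparisons (the `dispatch` helper is hypothetical; the constants match the table above):

```python
# Mirror of the two Rayon thresholds; below a threshold the engine runs a
# plain sequential loop instead of paying thread-pool synchronization.
RAYON_ENCODE_THRESHOLD = 128
RAYON_NEURON_THRESHOLD = 8

def dispatch(n_inputs, n_neurons):
    return {
        "encode": "parallel" if n_inputs >= RAYON_ENCODE_THRESHOLD else "sequential",
        "accumulate": "parallel" if n_neurons >= RAYON_NEURON_THRESHOLD else "sequential",
    }

print(dispatch(64, 32))  # encode sequential, accumulate parallel
```

The 64->32 example layer from section 3 therefore encodes sequentially but accumulates in parallel.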

7. Representative Speedups

AMD Zen 4 (AVX-512), Rust 1.82, Python 3.12:

| Operation | Python (ms) | Rust (ms) | Speedup |
| --- | --- | --- | --- |
| pack 1M bits | 12.3 | 0.18 | 68x |
| popcount 1M words | 8.7 | 0.09 | 97x |
| dense 64->32, L=1024 (list) | 45.2 | 0.52 | 87x |
| dense 64->32, L=1024 (numpy) | 45.2 | 0.31 | 146x |
| dense batch 100x64->32 | 4520 | 18.1 | 250x |
| LIF 100K steps (batch) | 312 | 0.74 | 422x |

On AVX2, expect 30-60% of AVX-512 throughput for popcount-dominated workloads. ARM NEON provides 2-4x over portable scalar.