© 1998–2026 Miroslav Šotek. All rights reserved. Contact: www.anulum.li | protoscience@anulum.li ORCID: https://orcid.org/0009-0009-3560-0851 License: GNU AFFERO GENERAL PUBLIC LICENSE v3 Commercial Licensing: Available
SC-NeuroCore v3 Migration Guide¶
Status¶
Phase 1 scaffolding is in place:
- Rust engine crate in engine/
- Python bridge package in bridge/sc_neurocore_engine/
- v2-vs-v3 equivalence tests in tests/equivalence/
Build (Local)¶
cd 03_CODE/sc-neurocore/engine
maturin develop --release
Quick Sanity Check¶
python -c "import sc_neurocore_engine; print(sc_neurocore_engine.__version__); print(sc_neurocore_engine.simd_tier())"
Equivalence Tests¶
cd 03_CODE/sc-neurocore
$env:PYTHONPATH="src;bridge"
python -m pytest tests/equivalence -v --tb=short
Notes¶
- v2 package under src/sc_neurocore/ remains untouched.
- v3 bridge is a drop-in import path for hot kernels and fixed-point neuron APIs.
- Encoder and LIF in v3 currently follow strict blueprint operation ordering (step-then-compare encoder, refractory override after threshold evaluation).
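The refractory-override ordering can be illustrated with a minimal pure-Python sketch. This is illustrative only: the parameter names and reset behavior here are assumptions, not the engine's actual fixed-point API.

```python
def lif_step(v, i_t, leak, threshold, refractory_left):
    """One LIF step in strict blueprint order: integrate, then compare
    against the threshold, then apply the refractory override last."""
    # 1. Integrate: apply leak, then input current
    v = v - leak + i_t
    # 2. Threshold comparison
    spike = v >= threshold
    if spike:
        v = 0  # reset on spike
    # 3. Refractory override AFTER threshold evaluation: a refractory
    #    neuron never emits a spike, even if v crossed the threshold
    if refractory_left > 0:
        spike = False
        v = 0
        refractory_left -= 1
    return v, spike, refractory_left
```

The point of the ordering is step 3: the threshold is evaluated first, and the refractory state then suppresses the spike, rather than the integration being skipped entirely.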
Phase 2 Features (February 2026)¶
Surrogate Gradients¶
SC-NeuroCore v3 introduces backpropagation support for stochastic computing layers via surrogate gradients:
- SurrogateLif: LIF neuron with a differentiable backward pass
- DifferentiableDenseLayer: SC layer with weight gradient computation
- Supported surrogates: FastSigmoid, SuperSpike, ArcTan, StraightThrough
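The idea behind surrogate gradients can be sketched in a few lines of numpy. This shows the standard fast-sigmoid form (hard spike forward, smooth derivative backward); the class names come from the list above, but the exact formula and default slope are assumptions, not necessarily what the engine implements.

```python
import numpy as np

def fast_sigmoid_surrogate(v, threshold=1.0, slope=10.0):
    """Forward: hard spike (Heaviside step), which has zero gradient
    almost everywhere. Backward: the smooth fast-sigmoid surrogate
    derivative 1 / (slope*|v - threshold| + 1)^2 is used instead."""
    spike = (v >= threshold).astype(np.float64)               # non-differentiable forward
    grad = 1.0 / (slope * np.abs(v - threshold) + 1.0) ** 2   # surrogate gradient
    return spike, grad
```

During backpropagation the surrogate `grad` stands in for the spike's true (zero or undefined) derivative, which is what makes SC layers trainable at all.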
Stochastic Attention¶
- Rate-mode: bit-exact match with v2 (atol < 1e-12)
- SC-mode: bitstream-based matrix multiply (new v3 capability)
- Multi-head support (Phase 3)
Graph Neural Network¶
- Rate-mode: bit-exact match with v2 (atol < 1e-12)
- SC-mode: bitstream-based message passing (Phase 3)
Kuramoto Oscillator Solver¶
- High-performance phase-difference coupling
- SSGF-compatible extended solver with geometry + PGBO terms
- Pre-allocated scratch arrays, rayon parallelism
- Box-Muller noise generation with ChaCha8Rng
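For reference, phase-difference coupling in the classic Kuramoto model looks like the sketch below (explicit Euler, all-to-all coupling). This is the textbook form, not the solver's optimized or SSGF-extended implementation, and the function name is ours.

```python
import numpy as np

def kuramoto_step(theta, omega, k, dt):
    """One explicit-Euler Kuramoto step:
    dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = theta.size
    diff = theta[None, :] - theta[:, None]          # pairwise phase differences
    coupling = (k / n) * np.sin(diff).sum(axis=1)   # all-to-all coupling term
    return theta + dt * (omega + coupling)
```

With positive coupling K, phases are pulled toward each other; an already-synchronized population (identical phases) stays synchronized because every sin term vanishes.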
Phase 3 Features (February 2026)¶
SSGF Integration¶
- step_ssgf(): extended Kuramoto with geometry (W), PGBO (h_munu), and field pressure (F*cos) coupling terms
- Direct integration with the SSGF MicroCycleEngine pipeline
- Single sin_diff computation shared across all coupling terms
Property-Based Testing¶
- proptest coverage for all numeric modules
- Catches edge cases: overflows, NaN, extreme values
Phase 4 Features (February 2026)¶
SC Compute Graph IR¶
A Rust-native intermediate representation for SC pipelines:
- ScGraph: directed acyclic graph of SC operations
- ScGraphBuilder: fluent API for graph construction
- verify(): static verification (SSA, type checking, acyclicity)
- print()/parse(): stable text format with round-trip fidelity
- 11 operation types mapping to HDL primitives
SystemVerilog Emitter¶
Compile IR graphs to synthesizable RTL:
- Direct instantiation of existing hdl/ modules
- Automatic clock/reset distribution
- Constant folding for Q8.8 fixed-point parameters
Co-Simulation Harness¶
Verify generated HDL against Rust golden model:
- LFSR full-cycle equivalence
- LIF neuron bit-exact comparison
- Encoder probability convergence
- Synapse AND operation verification
Phase 5 Features (February 2026)¶
IR Python Bridge¶
Construct SC compute graphs from Python and compile to SystemVerilog:
from sc_neurocore_engine.ir import ScGraphBuilder
b = ScGraphBuilder("my_synapse")
x = b.input("x_prob", "rate")
w = b.input("w_prob", "rate")
x_enc = b.encode(x, length=1024, seed=0xACE1)
w_enc = b.encode(w, length=1024, seed=0xBEEF)
product = b.bitwise_and(x_enc, w_enc)
count = b.popcount(product)
rate = b.div_const(count, 1024)
b.output("firing_rate", rate)
graph = b.build()
assert graph.verify() is None
sv_code = graph.emit_sv()
Co-Simulation¶
When Verilator is installed, co-sim tests compile HDL and compare against the Rust golden model bit-by-bit. Without Verilator, tests skip gracefully.
Distributable Wheels¶
Pre-built wheels available for:
- Linux (x86_64, aarch64)
- macOS (x86_64, arm64)
- Windows (x86_64)
- Python 3.10, 3.11, 3.12, 3.13, 3.14
Phase 6 Features (February 2026)¶
NumPy Zero-Copy Interop¶
For maximum performance, use the numpy-native variants:
import numpy as np
import sc_neurocore_engine as v3
bits = np.random.randint(0, 2, 1_000_000, dtype=np.uint8)
packed = v3.pack_bitstream_numpy(bits) # Zero-copy input
count = v3.popcount_numpy(packed) # Zero-copy input
recovered = v3.unpack_bitstream_numpy(packed, len(bits))
The original pack_bitstream() and popcount() functions still work
with Python lists for backward compatibility.
Batch Operations¶
Process entire arrays in single FFI calls:
# 100K LIF steps in one call (vs 100K per-step calls)
spikes, voltages = v3.batch_lif_run(
100_000, leak_k=20, gain_k=256, i_t=128
)
# Per-step varying current
currents = np.array([128, 200, 150], dtype=np.int16)
spikes, voltages = v3.batch_lif_run_varying(
leak_k=20, gain_k=256, currents=currents
)
CI/CD¶
- Verilator co-simulation runs automatically on every push (Ubuntu)
- Wheel builds on 3 OS x 4 Python versions
Phase 7 Features (February 2026)¶
Dense Forward Optimization¶
Three performance tiers for dense layer inference:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, 64)
# Original (sequential encoding)
out = layer.forward(inputs.tolist())
# Fast (parallel encoding)
out = layer.forward_fast(inputs.tolist())
# Pre-packed (skip encoding)
packed = v3.batch_encode_numpy(inputs, length=1024, seed=42)
out = layer.forward_prepacked(packed)
batch_encode_numpy¶
Returns a 2-D numpy uint64 array instead of nested Python lists:
probs = np.array([0.3, 0.5, 0.7, 0.9])
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
# packed.shape == (4, 16) # 4 inputs x ceil(1024/64) words
# packed.dtype == np.uint64
Phase 8 Features (February 2026)¶
Single-Call Dense Forward with NumPy¶
The recommended high-performance inference API:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, 64)
# Single FFI call: numpy in -> parallel encode -> parallel compute -> numpy out
out = layer.forward_numpy(inputs)
# out is a numpy float64 array of shape (32,)
Parallel Batch Encoding¶
batch_encode_numpy now uses rayon-parallel encoding:
probs = np.random.uniform(0, 1, 1000)
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
# Each probability encoded on its own thread
Note: batch_encode_numpy uses per-index seeding (seed + index) for
parallelism. Use batch_encode for sequential single-RNG seeding.
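The per-index seeding scheme can be modeled in pure Python with numpy. This is a sketch of the determinism property, not the engine's rayon implementation or its actual RNG: row i depends only on seed + i, so results are independent of which thread processes which input.

```python
import numpy as np

def batch_encode_per_index(probs, length, seed):
    """Encode each probability with its own RNG seeded as seed + index,
    so the output for row i never depends on processing order."""
    out = np.empty((len(probs), length), dtype=np.uint8)
    for i, p in enumerate(probs):
        rng = np.random.default_rng(seed + i)   # independent stream per input
        out[i] = (rng.random(length) < p).astype(np.uint8)
    return out
```

Encoding a single probability with seed + i in isolation reproduces the same row as the batched call, which is exactly the property that makes parallel encoding deterministic.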
Phase 9 Features (February 2026)¶
Fast Bernoulli Encoding¶
forward_fast and batch_encode_numpy now use byte-threshold Bernoulli
encoding with 8x less random number generation overhead. This provides
1/256 probability granularity, which is negligible compared to the
statistical noise of 1024-bit bitstreams.
The original forward() and batch_encode() retain f64-precision
encoding for backward compatibility.
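Byte-threshold Bernoulli encoding can be sketched as follows: draw one random byte per output bit and compare it against a quantized threshold. This is a conceptual model with numpy, not the engine's RNG or SIMD path.

```python
import numpy as np

def bernoulli_byte_threshold(p, length, seed):
    """One random byte per bit, compared to round(p * 256): this gives
    1/256 probability granularity using a single byte of randomness
    per sample instead of a full f64 draw."""
    threshold = int(round(p * 256))   # probability quantized to 8 bits
    rng = np.random.default_rng(seed)
    raw = rng.integers(0, 256, size=length, dtype=np.uint8)
    return (raw < threshold).astype(np.uint8)
```

A uniform byte is below the threshold with probability threshold/256, so the realized firing rate converges to the quantized probability, with quantization error far below the sampling noise of a 1024-bit stream.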
Zero-Copy Prepacked Forward¶
For maximum throughput with pre-encoded inputs:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
probs = np.random.uniform(0, 1, 64)
# Encode once, forward many times (zero-copy)
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
out = layer.forward_prepacked_numpy(packed)
# out is a numpy float64 array, packed was never copied
Thread Pool Tuning¶
Control rayon's parallel thread count:
v3.set_num_threads(4)  # use 4 threads for all parallel ops
This must be called before any parallel operation takes place. Pass 0 for automatic selection (one thread per CPU core).
Phase 10 Features (February 2026)¶
SIMD Pack Dispatch¶
pack_bitstream_numpy now uses runtime SIMD dispatch (AVX-512BW, AVX2, or portable fallback):
import numpy as np
import sc_neurocore_engine as v3
bits = np.random.randint(0, 2, 1_000_000, dtype=np.uint8)
packed = v3.pack_bitstream_numpy(bits)
This keeps API compatibility while accelerating large pack workloads.
Parallel Multi-Neuron LIF Batch¶
Run many independent neurons in parallel:
import numpy as np
import sc_neurocore_engine as v3
currents = np.full(100, 128, dtype=np.int16)
spikes, voltages = v3.batch_lif_run_multi(
100, 100_000, leak_k=20, gain_k=256, currents=currents
)
# spikes.shape == (100, 100000)
# voltages.shape == (100, 100000)
Rayon Work Threshold in Dense Fast Path¶
DenseLayer.forward_fast now selects sequential input encoding for small
input counts and rayon encoding for larger inputs. This avoids thread-pool
overhead on small workloads without changing numerical outputs (per-index
RNG seeding remains identical).
Phase 11 Features (February 2026)¶
SIMD Dense Inner Loop¶
Dense accumulation now uses SIMD-dispatched fused AND+popcount:
- AVX-512 (avx512vpopcntdq) path for 8-word vector chunks
- AVX2 path for vectorized AND + scalar lane popcount
- Portable fallback for non-SIMD targets
This path is used across dense forward variants without API changes.
SIMD Bernoulli Encode¶
forward_fast and batch_encode_numpy now use bernoulli_packed_simd:
- AVX-512BW compare path (64 bytes -> 64-bit mask)
- AVX2 compare path (2 x 32-byte compares)
- Scalar fallback for partial words and non-SIMD systems
Sampling semantics remain statistically equivalent to Phase 10 fast encoding.
Flat Packed Weight Storage¶
DenseLayer packed weights are now stored in one contiguous buffer:
- layout: [neuron][input][word]
- accessed via weight_slice(neuron, input) accessors
This removes nested Vec<Vec<Vec<u64>>> indirection and improves cache locality.
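The flat layout reduces to a single offset computation. The sketch below models the indexing scheme described above; the function name mirrors the accessor but is our illustration, not the Rust signature.

```python
def weight_slice_offset(neuron, inp, n_inputs, words_per_stream):
    """Offset of the first packed word of (neuron, input) in the flat
    [neuron][input][word] buffer: row-major over (neuron, input)."""
    return (neuron * n_inputs + inp) * words_per_stream
```

Because consecutive inputs of one neuron are adjacent in memory, the dense inner loop walks the buffer linearly, which is where the cache-locality win comes from.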
Zero-Allocation LIF Batch Outputs¶
batch_lif_run, batch_lif_run_multi, and batch_lif_run_varying now:
- pre-allocate numpy output arrays
- write directly into contiguous buffers
- avoid temporary Vec allocations and flatten copies
Outputs and public signatures are unchanged.
Phase 12 Features (February 2026)¶
Fused Dense Forward Kernel¶
DenseLayer.forward_fast() now routes through a fused encode+AND+popcount path:
- no intermediate Vec<Vec<u64>> for encoded inputs
- per-input Bernoulli words are generated and consumed immediately
- deterministic seeding is preserved (seed + input_idx)
Fast PRNG for Encode Paths¶
Fast encode paths now use xoshiro256++ (seeded deterministically) for improved throughput in:
- DenseLayer.forward()
- DenseLayer.forward_fast() / fused path
- batch_encode_numpy()
Weight packing remains ChaCha8-based to preserve existing model weight compatibility.
Batched Dense Forward API¶
Process many dense samples in one call:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, (100, 64)).astype(np.float64)
outputs = layer.forward_batch_numpy(inputs, seed=42)
# outputs.shape == (100, 32)
This amortizes Python↔Rust FFI overhead and enables parallel execution over samples.