© 1998–2026 Miroslav Šotek. All rights reserved. Contact: www.anulum.li | protoscience@anulum.li ORCID: https://orcid.org/0009-0009-3560-0851 License: GNU AFFERO GENERAL PUBLIC LICENSE v3 Commercial Licensing: Available
SC-NeuroCore v3 Migration Guide¶
Status¶
Phase 1 scaffolding is in place:
- Rust engine crate in engine/
- Python bridge package in bridge/sc_neurocore_engine/
- v2-vs-v3 equivalence tests in tests/equivalence/
Build (Local)¶
cd 03_CODE/sc-neurocore/engine
maturin develop --release
Quick Sanity Check¶
python -c "import sc_neurocore_engine; print(sc_neurocore_engine.__version__); print(sc_neurocore_engine.simd_tier())"
Equivalence Tests¶
cd 03_CODE/sc-neurocore
$env:PYTHONPATH="src;bridge"
python -m pytest tests/equivalence -v --tb=short
Notes¶
- v2 package under src/sc_neurocore/ remains untouched.
- v3 bridge is a drop-in import path for hot kernels and fixed-point neuron APIs.
- Encoder and LIF in v3 currently follow strict blueprint operation ordering (step-then-compare encoder, refractory override after threshold evaluation).
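The refractory-override ordering can be illustrated with a minimal pure-Python sketch. This is illustrative only: the parameter names and reset behavior here are assumptions, not the engine's actual fixed-point API.

```python
def lif_step(v, i_t, leak, threshold, refractory_left):
    """One LIF step in strict blueprint order: integrate, then compare
    against the threshold, then apply the refractory override last."""
    # 1. Integrate: apply leak, then input current
    v = v - leak + i_t
    # 2. Threshold comparison
    spike = v >= threshold
    if spike:
        v = 0  # reset on spike
    # 3. Refractory override AFTER threshold evaluation: a refractory
    #    neuron never emits a spike, even if v crossed the threshold
    if refractory_left > 0:
        spike = False
        v = 0
        refractory_left -= 1
    return v, spike, refractory_left
```

The point of the ordering is step 3: the threshold is evaluated first, and the refractory state then suppresses the spike, rather than the integration being skipped entirely.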
Phase 2 Features (February 2026)¶
Surrogate Gradients¶
SC-NeuroCore v3 introduces backpropagation support for stochastic computing layers via surrogate gradients:
- SurrogateLif: LIF neuron with a differentiable backward pass
- DifferentiableDenseLayer: SC layer with weight gradient computation
- Supported surrogates: FastSigmoid, SuperSpike, ArcTan, StraightThrough
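The idea behind surrogate gradients can be sketched in a few lines of numpy. This shows the standard fast-sigmoid form (hard spike forward, smooth derivative backward); the class names come from the list above, but the exact formula and default slope are assumptions, not necessarily what the engine implements.

```python
import numpy as np

def fast_sigmoid_surrogate(v, threshold=1.0, slope=10.0):
    """Forward: hard spike (Heaviside step), which has zero gradient
    almost everywhere. Backward: the smooth fast-sigmoid surrogate
    derivative 1 / (slope*|v - threshold| + 1)^2 is used instead."""
    spike = (v >= threshold).astype(np.float64)               # non-differentiable forward
    grad = 1.0 / (slope * np.abs(v - threshold) + 1.0) ** 2   # surrogate gradient
    return spike, grad
```

During backpropagation the surrogate `grad` stands in for the spike's true (zero or undefined) derivative, which is what makes SC layers trainable at all.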
Stochastic Attention¶
- Rate-mode: bit-exact match with v2 (atol < 1e-12)
- SC-mode: bitstream-based matrix multiply (new v3 capability)
- Multi-head support (Phase 3)
Graph Neural Network¶
- Rate-mode: bit-exact match with v2 (atol < 1e-12)
- SC-mode: bitstream-based message passing (Phase 3)
Kuramoto Oscillator Solver¶
- High-performance phase-difference coupling
- SSGF-compatible extended solver with geometry + PGBO terms
- Pre-allocated scratch arrays, rayon parallelism
- Box-Muller noise generation with ChaCha8Rng
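For reference, phase-difference coupling in the classic Kuramoto model looks like the sketch below (explicit Euler, all-to-all coupling). This is the textbook form, not the solver's optimized or SSGF-extended implementation, and the function name is ours.

```python
import numpy as np

def kuramoto_step(theta, omega, k, dt):
    """One explicit-Euler Kuramoto step:
    dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = theta.size
    diff = theta[None, :] - theta[:, None]          # pairwise phase differences
    coupling = (k / n) * np.sin(diff).sum(axis=1)   # all-to-all coupling term
    return theta + dt * (omega + coupling)
```

With positive coupling K, phases are pulled toward each other; an already-synchronized population (identical phases) stays synchronized because every sin term vanishes.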
Phase 3 Features (February 2026)¶
SSGF Integration¶
- step_ssgf(): extended Kuramoto with geometry (W), PGBO (h_munu), and field pressure (F*cos) coupling terms
- Direct integration with the SSGF MicroCycleEngine pipeline
- Single sin_diff computation shared across all coupling terms
Property-Based Testing¶
- proptest coverage for all numeric modules
- Catches edge cases: overflows, NaN, extreme values
Phase 4 Features (February 2026)¶
SC Compute Graph IR¶
A Rust-native intermediate representation for SC pipelines:
- ScGraph: directed acyclic graph of SC operations
- ScGraphBuilder: fluent API for graph construction
- verify(): static verification (SSA, type checking, acyclicity)
- print()/parse(): stable text format with round-trip fidelity
- 11 operation types mapping to HDL primitives
SystemVerilog Emitter¶
Compile IR graphs to synthesizable RTL:
- Direct instantiation of existing hdl/ modules
- Automatic clock/reset distribution
- Constant folding for Q8.8 fixed-point parameters
Co-Simulation Harness¶
Verify generated HDL against Rust golden model:
- LFSR full-cycle equivalence
- LIF neuron bit-exact comparison
- Encoder probability convergence
- Synapse AND operation verification
Phase 5 Features (February 2026)¶
IR Python Bridge¶
Construct SC compute graphs from Python and compile to SystemVerilog:
from sc_neurocore_engine.ir import ScGraphBuilder
b = ScGraphBuilder("my_synapse")
x = b.input("x_prob", "rate")
w = b.input("w_prob", "rate")
x_enc = b.encode(x, length=1024, seed=0xACE1)
w_enc = b.encode(w, length=1024, seed=0xBEEF)
product = b.bitwise_and(x_enc, w_enc)
count = b.popcount(product)
rate = b.div_const(count, 1024)
b.output("firing_rate", rate)
graph = b.build()
assert graph.verify() is None
sv_code = graph.emit_sv()
Co-Simulation¶
When Verilator is installed, co-sim tests compile HDL and compare against the Rust golden model bit-by-bit. Without Verilator, tests skip gracefully.
Distributable Wheels¶
Pre-built wheels available for:
- Linux (x86_64, aarch64)
- macOS (x86_64, arm64)
- Windows (x86_64)
- Python 3.10, 3.11, 3.12, 3.13, 3.14
Phase 6 Features (February 2026)¶
NumPy Zero-Copy Interop¶
For maximum performance, use the numpy-native variants:
import numpy as np
import sc_neurocore_engine as v3
bits = np.random.randint(0, 2, 1_000_000, dtype=np.uint8)
packed = v3.pack_bitstream_numpy(bits) # Zero-copy input
count = v3.popcount_numpy(packed) # Zero-copy input
recovered = v3.unpack_bitstream_numpy(packed, len(bits))
The original pack_bitstream() and popcount() functions still work
with Python lists for backward compatibility.
Batch Operations¶
Process entire arrays in single FFI calls:
# 100K LIF steps in one call (vs 100K per-step calls)
spikes, voltages = v3.batch_lif_run(
100_000, leak_k=20, gain_k=256, i_t=128
)
# Per-step varying current
currents = np.array([128, 200, 150], dtype=np.int16)
spikes, voltages = v3.batch_lif_run_varying(
leak_k=20, gain_k=256, currents=currents
)
CI/CD¶
- Verilator co-simulation runs automatically on every push (Ubuntu)
- Wheel builds on 3 OS x 4 Python versions
Phase 7 Features (February 2026)¶
Dense Forward Optimization¶
Three performance tiers for dense layer inference:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, 64)
# Original (sequential encoding)
out = layer.forward(inputs.tolist())
# Fast (parallel encoding)
out = layer.forward_fast(inputs.tolist())
# Pre-packed (skip encoding)
packed = v3.batch_encode_numpy(inputs, length=1024, seed=42)
out = layer.forward_prepacked(packed)
batch_encode_numpy¶
Returns a 2-D numpy uint64 array instead of nested Python lists:
probs = np.array([0.3, 0.5, 0.7, 0.9])
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
# packed.shape == (4, 16) # 4 inputs x ceil(1024/64) words
# packed.dtype == np.uint64
Phase 8 Features (February 2026)¶
Single-Call Dense Forward with NumPy¶
The recommended high-performance inference API:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, 64)
# Single FFI call: numpy in -> parallel encode -> parallel compute -> numpy out
out = layer.forward_numpy(inputs)
# out is a numpy float64 array of shape (32,)
Parallel Batch Encoding¶
batch_encode_numpy now uses rayon-parallel encoding:
probs = np.random.uniform(0, 1, 1000)
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
# Each probability encoded on its own thread
Note: batch_encode_numpy uses per-index seeding (seed + index) for
parallelism. Use batch_encode for sequential single-RNG seeding.
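The per-index seeding scheme can be modeled in pure Python with numpy. This is a sketch of the determinism property, not the engine's rayon implementation or its actual RNG: row i depends only on seed + i, so results are independent of which thread processes which input.

```python
import numpy as np

def batch_encode_per_index(probs, length, seed):
    """Encode each probability with its own RNG seeded as seed + index,
    so the output for row i never depends on processing order."""
    out = np.empty((len(probs), length), dtype=np.uint8)
    for i, p in enumerate(probs):
        rng = np.random.default_rng(seed + i)   # independent stream per input
        out[i] = (rng.random(length) < p).astype(np.uint8)
    return out
```

Encoding a single probability with seed + i in isolation reproduces the same row as the batched call, which is exactly the property that makes parallel encoding deterministic.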
Phase 9 Features (February 2026)¶
Fast Bernoulli Encoding¶
forward_fast and batch_encode_numpy now use byte-threshold Bernoulli
encoding with 8x less random number generation overhead. This provides
1/256 probability granularity, which is negligible compared to the
statistical noise of 1024-bit bitstreams.
The original forward() and batch_encode() retain f64-precision
encoding for backward compatibility.
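Byte-threshold Bernoulli encoding can be sketched as follows: draw one random byte per output bit and compare it against a quantized threshold. This is a conceptual model with numpy, not the engine's RNG or SIMD path.

```python
import numpy as np

def bernoulli_byte_threshold(p, length, seed):
    """One random byte per bit, compared to round(p * 256): this gives
    1/256 probability granularity using a single byte of randomness
    per sample instead of a full f64 draw."""
    threshold = int(round(p * 256))   # probability quantized to 8 bits
    rng = np.random.default_rng(seed)
    raw = rng.integers(0, 256, size=length, dtype=np.uint8)
    return (raw < threshold).astype(np.uint8)
```

A uniform byte is below the threshold with probability threshold/256, so the realized firing rate converges to the quantized probability, with quantization error far below the sampling noise of a 1024-bit stream.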
Zero-Copy Prepacked Forward¶
For maximum throughput with pre-encoded inputs:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
probs = np.random.uniform(0, 1, 64)
# Encode once, forward many times (zero-copy)
packed = v3.batch_encode_numpy(probs, length=1024, seed=42)
out = layer.forward_prepacked_numpy(packed)
# out is a numpy float64 array, packed was never copied
Thread Pool Tuning¶
Control rayon's parallel thread count:
v3.set_num_threads(4)  # use 4 threads for all parallel ops
This must be called before any parallel operation takes place. Pass 0 for automatic selection (one thread per CPU core).
Phase 10 Features (February 2026)¶
SIMD Pack Dispatch¶
pack_bitstream_numpy now uses runtime SIMD dispatch (AVX-512BW, AVX2, or portable fallback):
import numpy as np
import sc_neurocore_engine as v3
bits = np.random.randint(0, 2, 1_000_000, dtype=np.uint8)
packed = v3.pack_bitstream_numpy(bits)
This keeps API compatibility while accelerating large pack workloads.
Parallel Multi-Neuron LIF Batch¶
Run many independent neurons in parallel:
import numpy as np
import sc_neurocore_engine as v3
currents = np.full(100, 128, dtype=np.int16)
spikes, voltages = v3.batch_lif_run_multi(
100, 100_000, leak_k=20, gain_k=256, currents=currents
)
# spikes.shape == (100, 100000)
# voltages.shape == (100, 100000)
Rayon Work Threshold in Dense Fast Path¶
DenseLayer.forward_fast now selects sequential input encoding for small
input counts and rayon encoding for larger inputs. This avoids thread-pool
overhead on small workloads without changing numerical outputs (per-index
RNG seeding remains identical).
Phase 11 Features (February 2026)¶
SIMD Dense Inner Loop¶
Dense accumulation now uses SIMD-dispatched fused AND+popcount:
- AVX-512 (avx512vpopcntdq) path for 8-word vector chunks
- AVX2 path for vectorized AND + scalar lane popcount
- Portable fallback for non-SIMD targets
This path is used across dense forward variants without API changes.
SIMD Bernoulli Encode¶
forward_fast and batch_encode_numpy now use bernoulli_packed_simd:
- AVX-512BW compare path (64 bytes -> 64-bit mask)
- AVX2 compare path (2 x 32-byte compares)
- Scalar fallback for partial words and non-SIMD systems
Sampling semantics remain statistically equivalent to Phase 10 fast encoding.
Flat Packed Weight Storage¶
DenseLayer packed weights are now stored in one contiguous buffer:
- layout: [neuron][input][word]
- accessed via weight_slice(neuron, input) accessors
This removes nested Vec<Vec<Vec<u64>>> indirection and improves cache locality.
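The flat layout reduces to a single offset computation. The sketch below models the indexing scheme described above; the function name mirrors the accessor but is our illustration, not the Rust signature.

```python
def weight_slice_offset(neuron, inp, n_inputs, words_per_stream):
    """Offset of the first packed word of (neuron, input) in the flat
    [neuron][input][word] buffer: row-major over (neuron, input)."""
    return (neuron * n_inputs + inp) * words_per_stream
```

Because consecutive inputs of one neuron are adjacent in memory, the dense inner loop walks the buffer linearly, which is where the cache-locality win comes from.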
Zero-Allocation LIF Batch Outputs¶
batch_lif_run, batch_lif_run_multi, and batch_lif_run_varying now:
- pre-allocate numpy output arrays
- write directly into contiguous buffers
- avoid temporary Vec allocations and flatten copies
Outputs and public signatures are unchanged.
Phase 12 Features (February 2026)¶
Fused Dense Forward Kernel¶
DenseLayer.forward_fast() now routes through a fused encode+AND+popcount path:
- no intermediate Vec<Vec<u64>> for encoded inputs
- per-input Bernoulli words are generated and consumed immediately
- deterministic seeding is preserved (seed + input_idx)
Fast PRNG for Encode Paths¶
Fast encode paths now use xoshiro256++ (seeded deterministically) for improved throughput in:
- DenseLayer.forward()
- DenseLayer.forward_fast() / fused path
- batch_encode_numpy()
Weight packing remains ChaCha8-based to preserve existing model weight compatibility.
Batched Dense Forward API¶
Process many dense samples in one call:
import numpy as np
import sc_neurocore_engine as v3
layer = v3.DenseLayer(64, 32, 1024)
inputs = np.random.uniform(0, 1, (100, 64)).astype(np.float64)
outputs = layer.forward_batch_numpy(inputs, seed=42)
# outputs.shape == (100, 32)
This amortizes Python↔Rust FFI overhead and enables parallel execution over samples.