Skip to content

Acceleration

Backend modules for high-performance SC operations.

| Module | Purpose |
| --- | --- |
| vector_ops | Packed uint64 bitwise AND, popcount, pack/unpack |
| gpu_backend | CuPy GPU dispatch (transparent NumPy fallback) |
| jax_backend | JAX JIT-compiled LIF step for TPU/GPU scaling |
| jit_kernels | Numba-accelerated inner loops |
| mpi_driver | MPI-based distributed simulation |

Vector Operations

sc_neurocore.accel.vector_ops

pack_bitstream(bitstream)

Packs a uint8 bitstream (0s and 1s) into uint64 integers. This allows processing 64 time steps in parallel.

Parameters:

Name Type Description Default
bitstream ndarray[Any, Any]

Shape (N,) or (Batch, N) of uint8 {0,1}

required

Returns:

Name Type Description
packed ndarray[Any, Any]

Shape (ceil(N/64),) or (Batch, ceil(N/64)) of uint64

Source code in src/sc_neurocore/accel/vector_ops.py
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
def pack_bitstream(bitstream: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
    """
    Packs a uint8 bitstream (0s and 1s) into uint64 integers.
    This allows processing 64 time steps in parallel.

    Args:
        bitstream: Shape (N,) or (Batch, N) of uint8 {0,1}

    Returns:
        packed: Shape (ceil(N/64),) or (Batch, ceil(N/64)) of uint64
    """
    bitstream = np.asarray(bitstream, dtype=np.uint8)

    if bitstream.ndim == 1:
        # 1D case: single bitstream
        length = bitstream.size
        pad_len = (64 - (length % 64)) % 64
        if pad_len > 0:
            bitstream = np.append(bitstream, np.zeros(pad_len, dtype=np.uint8))

        chunks = bitstream.reshape(-1, 64)
        powers = 1 << np.arange(64, dtype=np.uint64)
        packed = (chunks * powers).sum(axis=1, dtype=np.uint64)
        return packed

    elif bitstream.ndim == 2:
        # 2D case: batch of bitstreams
        batch_size, length = bitstream.shape
        pad_len = (64 - (length % 64)) % 64

        if pad_len > 0:
            padding = np.zeros((batch_size, pad_len), dtype=np.uint8)
            bitstream = np.concatenate([bitstream, padding], axis=1)

        # Reshape to (batch, num_chunks, 64)
        num_chunks = bitstream.shape[1] // 64
        chunks = bitstream.reshape(batch_size, num_chunks, 64)

        powers = 1 << np.arange(64, dtype=np.uint64)
        packed = (chunks * powers).sum(axis=2, dtype=np.uint64)
        return packed

    else:
        raise ValueError(f"Expected 1D or 2D array, got {bitstream.ndim}D")

unpack_bitstream(packed, original_length, original_shape=None)

Unpacks uint64 array back to uint8 bitstream.

Parameters:

Name Type Description Default
packed ndarray[Any, Any]

Packed uint64 array (1D or 2D)

required
original_length int

Total number of bits to extract

required
original_shape Optional[tuple[Any, ...]]

Optional tuple for reshaping output (batch, length)

None

Returns:

Type Description
ndarray[Any, Any]

Unpacked bitstream of shape (original_length,) or original_shape

Source code in src/sc_neurocore/accel/vector_ops.py
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
def unpack_bitstream(
    packed: np.ndarray[Any, Any],
    original_length: int,
    original_shape: Optional[tuple[Any, ...]] = None,
) -> np.ndarray[Any, Any]:
    """
    Unpack a uint64 word array back into a uint8 bitstream (LSB-first).

    Args:
        packed: Packed uint64 array (1D or 2D)
        original_length: Total number of bits to extract
        original_shape: Optional tuple for reshaping output (batch, length)

    Returns:
        Unpacked bitstream of shape (original_length,) or original_shape
    """
    masks = 1 << np.arange(64, dtype=np.uint64)

    if packed.ndim == 1:
        bits = ((packed[:, None] & masks) != 0).astype(np.uint8)
        return bits.ravel()[:original_length]

    if packed.ndim == 2:
        n_rows = packed.shape[0]
        bits = ((packed[..., None] & masks) != 0).astype(np.uint8)
        flat = bits.reshape(n_rows, -1)
        if original_shape is not None:
            # Caller supplied (batch, length): trim each row to that length.
            return flat[:, : original_shape[1]]
        # Otherwise original_length is the total bit count across the batch.
        return flat[:, : original_length // n_rows]

    raise ValueError(f"Expected 1D or 2D packed array, got {packed.ndim}D")

vec_and(a_packed, b_packed)

Bitwise AND on packed arrays. Simulates SC Multiplication.

Source code in src/sc_neurocore/accel/vector_ops.py
 99
100
101
102
103
def vec_and(a_packed: np.ndarray[Any, Any], b_packed: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
    """Element-wise AND of packed words — simulates SC multiplication."""
    return a_packed & b_packed

vec_xnor(a_packed, b_packed)

Bitwise XNOR on packed arrays. SC bipolar multiplication: P(A XNOR B) = P(A)P(B) + (1-P(A))(1-P(B)).

Source code in src/sc_neurocore/accel/vector_ops.py
106
107
108
109
110
def vec_xnor(
    a_packed: np.ndarray[Any, Any], b_packed: np.ndarray[Any, Any]
) -> np.ndarray[Any, Any]:
    """Bitwise XNOR on packed arrays. SC bipolar multiplication: P(A XNOR B) = P(A)*P(B) + (1-P(A))*(1-P(B))."""
    # XNOR = complement of XOR.
    return np.invert(a_packed ^ b_packed)

vec_not(packed)

Bitwise NOT on packed arrays. SC complement: P(NOT A) = 1 - P(A).

Source code in src/sc_neurocore/accel/vector_ops.py
113
114
115
def vec_not(packed: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
    """Bitwise NOT on packed arrays. SC complement: P(NOT A) = 1 - P(A)."""
    return np.invert(packed)

vec_mux(select_packed, a_packed, b_packed)

Bitwise MUX on packed arrays. SC scaled addition: P(out) = P(sel)P(A) + (1-P(sel))P(B).

When sel is a Bernoulli(0.5) stream, this computes the average (A+B)/2.

Source code in src/sc_neurocore/accel/vector_ops.py
118
119
120
121
122
123
124
125
126
127
def vec_mux(
    select_packed: np.ndarray[Any, Any],
    a_packed: np.ndarray[Any, Any],
    b_packed: np.ndarray[Any, Any],
) -> np.ndarray[Any, Any]:
    """Bitwise MUX on packed arrays. SC scaled addition: P(out) = P(sel)*P(A) + (1-P(sel))*P(B).

    When sel is a Bernoulli(0.5) stream, this computes the average (A+B)/2.
    """
    # Bit-select identity: b ^ ((a ^ b) & sel) picks a where sel=1, b where sel=0.
    return b_packed ^ ((a_packed ^ b_packed) & select_packed)

vec_popcount(packed)

Count total set bits (1s) in the packed array. Used for integration/accumulation.

Source code in src/sc_neurocore/accel/vector_ops.py
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
def vec_popcount(packed: np.ndarray[Any, Any]) -> int:
    """
    Count total set bits (1s) in the packed array.
    Used for integration/accumulation.

    Args:
        packed: Packed word array; values are interpreted as uint64.

    Returns:
        Total number of set bits across all elements, as a Python int.
    """
    # Vectorized SWAR popcount. Cast to uint64 up front so the shifts and
    # masks below are well-defined for any input dtype (mirrors the explicit
    # casts in gpu_backend.gpu_popcount and jax_backend.jax_popcount); the
    # copy keeps the in-place subtraction from mutating the caller's array.
    x = np.asarray(packed, dtype=np.uint64).copy()
    x -= (x >> 1) & 0x5555555555555555  # 2-bit partial sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)  # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F  # 8-bit sums
    x = (x * 0x0101010101010101) >> 56  # fold byte sums into the top byte
    # int() so the declared return type (a Python int) actually holds.
    return int(np.sum(x))

GPU Backend

sc_neurocore.accel.gpu_backend

to_device(arr)

Move a NumPy array to the active backend (GPU copy or no-op).

Source code in src/sc_neurocore/accel/gpu_backend.py
66
67
68
69
70
def to_device(arr: np.ndarray[Any, Any]) -> xp.ndarray:  # type: ignore
    """Copy a NumPy array onto the active backend device (no-op on CPU)."""
    if not HAS_CUPY:
        return arr
    return cp.asarray(arr)  # pragma: no cover

to_host(arr)

Bring an array back to host RAM as a NumPy array.

Source code in src/sc_neurocore/accel/gpu_backend.py
73
74
75
76
77
def to_host(arr) -> np.ndarray[Any, Any]:  # type: ignore
    """Return *arr* as a host-resident NumPy array."""
    if HAS_CUPY and isinstance(arr, cp.ndarray):  # pragma: no cover
        arr = cp.asnumpy(arr)
    return np.asarray(arr)

gpu_pack_bitstream(bits)

Pack uint8 {0,1} array into uint64 words.

Works on both CuPy and NumPy arrays.

Parameters:

Name Type Description Default
bits ndarray

Shape (N,) or (B, N) of uint8.

required

Returns:

Type Description
ndarray

Packed uint64 array, shape (ceil(N/64),) or (B, ceil(N/64)).

Source code in src/sc_neurocore/accel/gpu_backend.py
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
def gpu_pack_bitstream(bits: xp.ndarray) -> xp.ndarray:  # type: ignore
    """
    Pack uint8 {0,1} array into uint64 words (LSB-first).

    Works on both CuPy and NumPy arrays.

    Args:
        bits: Shape ``(N,)`` or ``(B, N)`` of uint8.

    Returns:
        Packed uint64 array, shape ``(ceil(N/64),)`` or ``(B, ceil(N/64))``.
    """
    _warn_cpu_fallback()
    bits = xp.asarray(bits, dtype=xp.uint8)
    # Bit j of each word carries sample j of its 64-sample chunk.
    weights = xp.uint64(1) << xp.arange(64, dtype=xp.uint64)

    if bits.ndim == 1:
        tail = (64 - bits.size % 64) % 64
        if tail:
            bits = xp.concatenate([bits, xp.zeros(tail, dtype=xp.uint8)])
        return (bits.reshape(-1, 64).astype(xp.uint64) * weights).sum(axis=1)

    if bits.ndim == 2:
        rows = bits.shape[0]
        tail = (64 - bits.shape[1] % 64) % 64
        if tail:
            bits = xp.concatenate(
                [bits, xp.zeros((rows, tail), dtype=xp.uint8)], axis=1
            )
        grouped = bits.reshape(rows, -1, 64).astype(xp.uint64)
        return (grouped * weights).sum(axis=2)

    raise ValueError(f"Expected 1-D or 2-D, got {bits.ndim}-D")

gpu_vec_and(a, b)

Bitwise AND on packed uint64 arrays (SC multiplication).

Source code in src/sc_neurocore/accel/gpu_backend.py
122
123
124
125
def gpu_vec_and(a: xp.ndarray, b: xp.ndarray) -> xp.ndarray:  # type: ignore
    """Bitwise AND on packed uint64 arrays (SC multiplication)."""
    _warn_cpu_fallback()
    return a & b

gpu_popcount(packed)

Vectorised SWAR popcount on uint64 arrays — returns per-element counts.

On CuPy this runs as a fused GPU kernel; on NumPy it uses the same SWAR bit-trick as vector_ops.vec_popcount but returns an array instead of a scalar.

Source code in src/sc_neurocore/accel/gpu_backend.py
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
def gpu_popcount(packed: xp.ndarray) -> xp.ndarray:  # type: ignore
    """
    Vectorised SWAR popcount on uint64 arrays — returns per-element counts.

    On CuPy this runs as a fused GPU kernel; on NumPy it uses the same
    SWAR bit-trick as ``vector_ops.vec_popcount`` but returns an array
    instead of a scalar.
    """
    _warn_cpu_fallback()
    # SWAR masks kept as uint64 scalars so no promotion happens on device.
    pair_mask = xp.uint64(0x5555555555555555)
    nibble_mask = xp.uint64(0x3333333333333333)
    byte_mask = xp.uint64(0x0F0F0F0F0F0F0F0F)
    byte_fold = xp.uint64(0x0101010101010101)

    x = packed.astype(xp.uint64).copy()
    x -= (x >> xp.uint64(1)) & pair_mask  # 2-bit partial sums
    x = (x & nibble_mask) + ((x >> xp.uint64(2)) & nibble_mask)  # 4-bit sums
    x = (x + (x >> xp.uint64(4))) & byte_mask  # 8-bit sums
    return (x * byte_fold) >> xp.uint64(56)  # fold into top byte

gpu_vec_mac(packed_weights, packed_inputs)

GPU-accelerated multiply-accumulate for a dense SC layer.

Parameters:

Name Type Description Default
packed_weights ndarray

(n_neurons, n_inputs, n_words) uint64

required
packed_inputs ndarray

(n_inputs, n_words) uint64

required

Returns:

Type Description
ndarray

(n_neurons,) total bit counts (= SC dot products).

Source code in src/sc_neurocore/accel/gpu_backend.py
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
def gpu_vec_mac(
    packed_weights: xp.ndarray,  # type: ignore
    packed_inputs: xp.ndarray,  # type: ignore
) -> xp.ndarray:  # type: ignore
    """
    GPU-accelerated multiply-accumulate for a dense SC layer.

    Args:
        packed_weights: ``(n_neurons, n_inputs, n_words)`` uint64
        packed_inputs:  ``(n_inputs, n_words)`` uint64

    Returns:
        ``(n_neurons,)`` total bit counts (= SC dot products).
    """
    _warn_cpu_fallback()
    # SC multiply: AND each neuron's weight streams against the shared
    # inputs via broadcasting, (N, I, W) & (1, I, W) -> (N, I, W).
    anded = xp.bitwise_and(packed_weights, xp.expand_dims(packed_inputs, 0))
    # Accumulate: per-word popcount, reduced over inputs and words.
    return gpu_popcount(anded).sum(axis=(1, 2))

JAX Backend

sc_neurocore.accel.jax_backend

JAX backend for SC-NeuroCore.

Provides JAX-accelerated primitives for stochastic computing, unlocking automatic differentiation, JIT compilation (XLA), and native TPU/GPU scaling.

Usage::

from sc_neurocore.accel.jax_backend import jnp, HAS_JAX, to_jax, to_host
from sc_neurocore.accel.jax_backend import jax_pack_bitstream, jax_vec_mac

if HAS_JAX:
    bits = jnp.array([1, 0, 1, 1], dtype=jnp.uint8)
    packed = jax_pack_bitstream(bits)

to_jax(arr)

Move a NumPy array to the JAX device.

Source code in src/sc_neurocore/accel/jax_backend.py
53
54
55
56
57
def to_jax(arr: Any) -> Any:
    """Move a NumPy array to the JAX device (identity when JAX is absent)."""
    return jnp.asarray(arr) if HAS_JAX else arr

to_host(arr)

Bring a JAX array back to host RAM as a NumPy array.

Source code in src/sc_neurocore/accel/jax_backend.py
60
61
62
63
64
def to_host(arr: Any) -> np.ndarray[Any, Any]:
    """Bring a JAX array back to host RAM as a NumPy array.

    ``np.asarray`` handles JAX arrays (via the ``__array__`` protocol) and
    plain array-likes alike, so no backend check is needed — the previous
    ``HAS_JAX``/``isinstance`` branch returned the identical expression on
    both paths and has been removed.
    """
    return np.asarray(arr)

jax_pack_bitstream(bits)

Pack uint8 {0,1} array into uint64 words using JAX.

Source code in src/sc_neurocore/accel/jax_backend.py
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
def jax_pack_bitstream(bits: Any) -> Any:
    """
    Pack uint8 {0,1} array into uint64 words using JAX.
    """
    if not HAS_JAX:
        from sc_neurocore.exceptions import SCDependencyError

        raise SCDependencyError("JAX is not available.")

    bits = jnp.asarray(bits, dtype=jnp.uint8)

    # Dispatch on dimensionality to the specialised JIT-compiled packers.
    packers = {1: _jax_pack_1d, 2: _jax_pack_2d}
    packer = packers.get(bits.ndim)
    if packer is None:
        from sc_neurocore.exceptions import SCEncodingError

        raise SCEncodingError(f"Expected 1-D or 2-D, got {bits.ndim}-D")
    return packer(bits)

jax_vec_and(a, b)

Bitwise AND on packed uint64 arrays (SC multiplication).

Source code in src/sc_neurocore/accel/jax_backend.py
118
119
120
121
@jax.jit
def jax_vec_and(a: jax.Array, b: jax.Array) -> jax.Array:
    """Bitwise AND on packed uint64 arrays (SC multiplication)."""
    return a & b

jax_popcount(packed)

Vectorised SWAR popcount on uint64 arrays using JAX.

Source code in src/sc_neurocore/accel/jax_backend.py
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
@jax.jit
def jax_popcount(packed: jax.Array) -> jax.Array:
    """
    Vectorised SWAR popcount on uint64 arrays using JAX.
    """
    # SWAR masks as uint64 scalars (requires jax x64 mode for full width).
    pair_mask = jnp.uint64(0x5555555555555555)
    nibble_mask = jnp.uint64(0x3333333333333333)
    byte_mask = jnp.uint64(0x0F0F0F0F0F0F0F0F)
    byte_fold = jnp.uint64(0x0101010101010101)

    x = packed.astype(jnp.uint64)
    x = x - ((x >> jnp.uint64(1)) & pair_mask)  # 2-bit partial sums
    x = (x & nibble_mask) + ((x >> jnp.uint64(2)) & nibble_mask)  # 4-bit sums
    x = (x + (x >> jnp.uint64(4))) & byte_mask  # 8-bit sums
    counts: jax.Array = (x * byte_fold) >> jnp.uint64(56)  # fold into top byte
    return counts

jax_vec_mac(packed_weights, packed_inputs)

JAX-accelerated multiply-accumulate for a dense SC layer.

Source code in src/sc_neurocore/accel/jax_backend.py
140
141
142
143
144
145
146
147
148
@jax.jit
def jax_vec_mac(packed_weights: jax.Array, packed_inputs: jax.Array) -> jax.Array:
    """
    JAX-accelerated multiply-accumulate for a dense SC layer.
    """
    # Broadcast AND over the neuron axis, then popcount and reduce.
    anded = jnp.bitwise_and(packed_weights, packed_inputs[jnp.newaxis, :, :])
    per_word = jax_popcount(anded)
    totals: jax.Array = jnp.sum(per_word, axis=(1, 2))
    return totals

jax_lif_step(v, I_t, v_rest, v_reset, v_threshold, alpha, resistance, noise)

Vectorized LIF step using JAX.

dv = (v_rest - v) * alpha + I_t * resistance + noise

Source code in src/sc_neurocore/accel/jax_backend.py
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
@jax.jit
def jax_lif_step(
    v: jax.Array,
    I_t: jax.Array,
    v_rest: float,
    v_reset: float,
    v_threshold: float,
    alpha: float,
    resistance: float,
    noise: jax.Array,
) -> tuple[jax.Array, jax.Array]:
    """
    Vectorized LIF step using JAX.

    dv = (v_rest - v) * alpha + I_t * resistance + noise
    """
    # Integrate, then hard-reset wherever the threshold was crossed.
    dv = alpha * (v_rest - v) + resistance * I_t + noise
    candidate = v + dv
    fired = candidate >= v_threshold
    return jnp.where(fired, v_reset, candidate), fired.astype(jnp.uint8)

jax_forward_pass(weights, x, n_steps, v_rest=0.0, v_reset=0.0, v_threshold=1.0, alpha=0.9)

Multi-layer SNN forward pass with LIF neurons.

Returns (spike_trains_per_layer, final_membrane_potentials). Each layer: s = Heaviside(v - threshold), v = alpha * v * (1-s) + W @ s_prev

Source code in src/sc_neurocore/accel/jax_backend.py
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
def jax_forward_pass(
    weights: list[jax.Array],
    x: jax.Array,
    n_steps: int,
    v_rest: float = 0.0,
    v_reset: float = 0.0,
    v_threshold: float = 1.0,
    alpha: float = 0.9,
) -> tuple[list[jax.Array], jax.Array]:
    """
    Multi-layer SNN forward pass with LIF neurons.

    Each layer integrates for ``n_steps``::

        v <- alpha * v + spikes_in @ W.T
        s  = Heaviside(v - v_threshold)
        v <- v_reset where s fired   (hard reset — this is the "(1-s)" term)

    The input passed to the next layer is the mean firing rate over time.

    Args:
        weights: Per-layer weight matrices, each ``(n_out, n_in)``.
        x: Input batch, ``(batch, n_in)``.
        n_steps: Integration time steps per layer.
        v_rest: Initial membrane potential.
        v_reset: Post-spike reset potential.
        v_threshold: Firing threshold.
        alpha: Leak/decay factor per step.

    Returns:
        (spike_trains_per_layer, final_membrane_potentials): each spike
        train has shape ``(n_steps, batch, n_out)``; the potentials are
        those of the last layer after the final step.
    """
    batch = x.shape[0]
    spikes = x
    all_spikes = []

    for W in weights:
        n_out = W.shape[0]
        v = jnp.full((batch, n_out), v_rest)
        layer_spikes = []

        for _t in range(n_steps):
            current = spikes @ W.T
            # Leaky integration. (Previously the decay was scaled by
            # (1 - v_reset), mixing the reset *value* into the leak factor;
            # the jnp.where hard reset below is what implements the reset.
            # Unchanged for the default v_reset=0.0.)
            v = alpha * v + current
            s = (v >= v_threshold).astype(jnp.float32)
            v = jnp.where(s > 0.5, v_reset, v)
            layer_spikes.append(s)

        # Output spikes = mean firing rate over time
        spikes = jnp.stack(layer_spikes, axis=0).mean(axis=0)
        all_spikes.append(jnp.stack(layer_spikes, axis=0))

    return all_spikes, v

jax_surrogate_gradient_step(weights, x, targets, n_steps=25, lr=0.001, beta=10.0)

One training step with surrogate gradient (fast sigmoid).

Uses jax.grad on a cross-entropy loss over mean output spike rates. Returns (updated_weights, loss_value).

Source code in src/sc_neurocore/accel/jax_backend.py
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
def jax_surrogate_gradient_step(
    weights: list[jax.Array],
    x: jax.Array,
    targets: jax.Array,
    n_steps: int = 25,
    lr: float = 1e-3,
    beta: float = 10.0,
) -> tuple[list[jax.Array], float]:
    """
    One training step with surrogate gradient (fast sigmoid).

    Uses jax.grad on a cross-entropy loss over mean output spike rates.
    Returns (updated_weights, loss_value).

    Args:
        weights: Per-layer weight matrices, each (n_out, n_in).
        x: Input batch — presumably (batch, n_in) rates; confirm against caller.
        targets: Target distribution over classes, (batch, n_classes);
            assumed one-hot or soft labels — TODO confirm.
        n_steps: LIF integration steps per layer.
        lr: SGD learning rate.
        beta: Sharpness of the fast-sigmoid surrogate.
    """

    def loss_fn(ws):
        # Full multi-layer SNN forward pass lives inside the loss so
        # jax.value_and_grad differentiates through every time step.
        batch = x.shape[0]
        spikes_in = x
        for W in ws:
            n_out = W.shape[0]
            v = jnp.zeros((batch, n_out))
            spike_sum = jnp.zeros((batch, n_out))
            for _t in range(n_steps):
                current = spikes_in @ W.T
                # Leaky integration with fixed decay 0.9 and threshold 1.0
                # (hard-coded here, unlike jax_forward_pass's parameters).
                v = 0.9 * v + current
                # Fast sigmoid surrogate: σ(β(v-θ)) / β
                sg = 1.0 / (1.0 + jnp.abs(beta * (v - 1.0)))
                # Accumulates the *surrogate* spike values, so the rate
                # below is differentiable.
                spike_sum = spike_sum + sg
                # Straight-through estimator: hard reset forward, surrogate backward
                spike_hard = (v >= 1.0).astype(v.dtype)
                spike_st = sg + jax.lax.stop_gradient(spike_hard - sg)
                # Soft reset: scale v down by (1 - spike); with a hard spike
                # of 1 this zeroes the potential on the forward pass.
                v = v * (1.0 - spike_st)
            spikes_in = spike_sum / n_steps
        logits = spikes_in
        # Stable log-softmax via logsumexp, then mean cross-entropy.
        log_softmax = logits - jax.nn.logsumexp(logits, axis=-1, keepdims=True)
        ce = -jnp.sum(targets * log_softmax) / batch
        return ce

    loss_val, grads = jax.value_and_grad(loss_fn)(weights)
    # Plain SGD step; float() pulls the loss scalar back to the host.
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return updated, float(loss_val)

JIT Kernels

sc_neurocore.accel.jit_kernels

jit_pack_bits(bitstream, packed_arr)

Packs a uint8 bitstream into uint64 array. bitstream: (N,) uint8 {0, 1} packed_arr: (N//64,) uint64

Source code in src/sc_neurocore/accel/jit_kernels.py
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
@jit(nopython=True)
def jit_pack_bits(
    bitstream: np.ndarray[Any, Any], packed_arr: np.ndarray[Any, Any]
) -> None:  # pragma: no cover
    """
    Pack a uint8 {0,1} bitstream into a preallocated uint64 array, in place.

    bitstream: (N,) uint8 {0, 1}
    packed_arr: (N//64,) uint64
    """
    one = np.uint64(1)
    for word_idx in range(bitstream.size // 64):
        word = np.uint64(0)
        offset = word_idx * 64
        for bit in range(64):
            # Set bit `bit` of the current word for each nonzero sample.
            if bitstream[offset + bit] != 0:
                word |= one << np.uint64(bit)
        packed_arr[word_idx] = word

jit_vec_mac(packed_weights, packed_inputs, outputs)

Vectorized Multiply-Accumulate (MAC). Simulates: Output[i] = Sum(Weights[i] AND Inputs) weights: (n_neurons, n_inputs, n_words) inputs: (n_inputs, n_words) outputs: (n_neurons,)

Source code in src/sc_neurocore/accel/jit_kernels.py
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
@jit(nopython=True)
def jit_vec_mac(  # type: ignore
    packed_weights: np.ndarray[Any, Any],
    packed_inputs: np.ndarray[Any, Any],
    outputs: np.ndarray[Any, Any],
):  # pragma: no cover
    """
    Vectorized Multiply-Accumulate (MAC) for a dense SC layer.

    Simulates: Output[i] = Sum(Weights[i] AND Inputs)
    weights: (n_neurons, n_inputs, n_words)
    inputs: (n_inputs, n_words)
    outputs: (n_neurons,)
    """
    n_neurons = packed_weights.shape[0]
    n_inputs = packed_weights.shape[1]
    n_words = packed_weights.shape[2]

    # SWAR popcount constants, hoisted as uint64 (Numba nopython-safe).
    m1 = np.uint64(0x5555555555555555)
    m2 = np.uint64(0x3333333333333333)
    m4 = np.uint64(0x0F0F0F0F0F0F0F0F)
    h01 = np.uint64(0x0101010101010101)
    s1 = np.uint64(1)
    s2 = np.uint64(2)
    s4 = np.uint64(4)
    s56 = np.uint64(56)

    for neuron in range(n_neurons):
        acc = 0
        for inp in range(n_inputs):
            for word in range(n_words):
                # Bitwise AND = SC multiplication
                x = packed_weights[neuron, inp, word] & packed_inputs[inp, word]
                # SWAR popcount (Hamming weight) of the 64-bit word
                x = x - ((x >> s1) & m1)
                x = (x & m2) + ((x >> s2) & m2)
                x = (x + (x >> s4)) & m4
                x = (x * h01) >> s56
                acc += x
        outputs[neuron] = acc

MPI Driver

sc_neurocore.accel.mpi_driver

MPIDriver

Distributed SC-NeuroCore Driver using MPI. Handles partitioning and synchronization of bitstreams across cluster nodes.

Source code in src/sc_neurocore/accel/mpi_driver.py
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
class MPIDriver:
    """
    Distributed SC-NeuroCore Driver using MPI.
    Handles partitioning and synchronization of bitstreams across cluster nodes.

    When mpi4py is unavailable (or only one rank is running), every method
    degrades to a single-node no-op so callers need no special-casing.
    """

    def __init__(self) -> None:
        # comm: MPI communicator (None without MPI); rank/size default to
        # a single-node layout so the fallbacks below work uniformly.
        if HAS_MPI:  # pragma: no cover
            self.comm = MPI.COMM_WORLD
            self.rank = self.comm.Get_rank()
            self.size = self.comm.Get_size()
        else:
            self.comm = None
            self.rank = 0
            self.size = 1

    def scatter_workload(self, global_inputs: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
        """
        Distributes a large input array across nodes.
        Splits along axis 0 (Batch or Neurons).

        Note: len(global_inputs) should be divisible by the node count;
        remainder rows are dropped by the integer division below.
        """
        if not HAS_MPI or self.size == 1:
            return global_inputs

        # MPI multi-node path  # pragma: no cover
        total_len = len(global_inputs)  # pragma: no cover
        chunk_size = total_len // self.size  # pragma: no cover
        # Per-rank receive buffer for this node's slice of the root array.
        local_input = np.zeros(chunk_size, dtype=global_inputs.dtype)  # pragma: no cover
        self.comm.Scatter(global_inputs, local_input, root=0)  # pragma: no cover
        return local_input  # pragma: no cover

    def gather_results(self, local_results: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
        """
        Collects results from all nodes to Root.

        Rank 0 returns the concatenated results; other ranks return an
        empty array with the same dtype as their local results.
        """
        if not HAS_MPI or self.size == 1:
            return local_results

        # MPI multi-node path  # pragma: no cover
        total_len = len(local_results) * self.size  # pragma: no cover
        global_results = None  # pragma: no cover
        if self.rank == 0:  # pragma: no cover
            global_results = np.zeros(total_len, dtype=local_results.dtype)  # pragma: no cover
        self.comm.Gather(local_results, global_results, root=0)  # pragma: no cover
        if global_results is None:
            # Non-root ranks: empty result matching the input dtype
            # (previously this silently defaulted to float64).
            return np.zeros(0, dtype=local_results.dtype)
        return global_results

    def barrier(self) -> None:
        """Synchronize all nodes (no-op when MPI is unavailable)."""
        if HAS_MPI:  # pragma: no cover
            self.comm.Barrier()

scatter_workload(global_inputs)

Distributes a large input array across nodes. Splits along axis 0 (Batch or Neurons).

Source code in src/sc_neurocore/accel/mpi_driver.py
37
38
39
40
41
42
43
44
45
46
47
48
49
50
def scatter_workload(self, global_inputs: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
    """
    Distributes a large input array across nodes.
    Splits along axis 0 (Batch or Neurons).

    Single-node path (no MPI, or world size 1) returns the input unchanged.
    NOTE(review): lengths not divisible by self.size lose the remainder
    rows via the integer division — confirm callers pad accordingly.
    """
    if not HAS_MPI or self.size == 1:
        return global_inputs

    # MPI multi-node path  # pragma: no cover
    total_len = len(global_inputs)  # pragma: no cover
    chunk_size = total_len // self.size  # pragma: no cover
    # Per-rank receive buffer for this node's slice of the root array.
    local_input = np.zeros(chunk_size, dtype=global_inputs.dtype)  # pragma: no cover
    self.comm.Scatter(global_inputs, local_input, root=0)  # pragma: no cover
    return local_input  # pragma: no cover

gather_results(local_results)

Collects results from all nodes to Root.

Source code in src/sc_neurocore/accel/mpi_driver.py
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
def gather_results(self, local_results: np.ndarray[Any, Any]) -> np.ndarray[Any, Any]:
    """
    Collects results from all nodes to Root.

    Rank 0 receives the concatenated array; non-root ranks get an empty
    array. NOTE(review): that empty array uses np.zeros(0)'s default
    float64 dtype rather than local_results.dtype — confirm intended.
    """
    if not HAS_MPI or self.size == 1:
        return local_results

    # MPI multi-node path  # pragma: no cover
    total_len = len(local_results) * self.size  # pragma: no cover
    global_results = None  # pragma: no cover
    if self.rank == 0:  # pragma: no cover
        # Root-only receive buffer; other ranks pass None to Gather.
        global_results = np.zeros(total_len, dtype=local_results.dtype)  # pragma: no cover
    self.comm.Gather(local_results, global_results, root=0)  # pragma: no cover
    if global_results is None:
        return np.zeros(0)
    return global_results

barrier()

Synchronize all nodes.

Source code in src/sc_neurocore/accel/mpi_driver.py
69
70
71
72
def barrier(self) -> None:
    """Synchronize all nodes (no-op when MPI is unavailable)."""
    if HAS_MPI:  # pragma: no cover
        self.comm.Barrier()