SNN Model Compression

Weight pruning, structural pruning, stochastic-aware pruning, and quantization for FPGA cost reduction.

Pruning

Three pruning strategies:

  • prune_weights — Magnitude-based: zero out weights with |w| below threshold. Standard approach.
  • prune_neurons — Structural: remove entire neurons with low firing rates, reducing layer width (not just sparsity).
  • prune_stochastic — SC-specific: score weights by bitstream contribution. Weights near 0 or 1 produce near-deterministic bitstreams (low entropy) and can be replaced with constant gates. Importance = min(p, 1-p) * bitstream_length.

from sc_neurocore.compression import prune_stochastic

# Prune weights contributing <1 popcount bit per inference
pruned, report = prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)
print(f"Sparsity: {report.sparsity:.1%}")

sc_neurocore.compression.pruning

Weight, structural, and stochastic-aware pruning for SNN model compression.

Weight pruning: zero out weights below a magnitude threshold. Structural pruning: remove entire neurons that fire below an activity threshold, reducing layer width. Stochastic pruning (SC-specific): score weights by bitstream contribution, i.e. how many popcount bits they contribute per inference.

All methods reduce FPGA resource usage when combined with Projection(weight_threshold=) for runtime sparsity exploitation.
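The methods compose naturally: prune first, then quantize the survivors. A minimal standalone sketch of that order using plain numpy (it mirrors the magnitude-pruning and symmetric-quantization formulas shown in the source below rather than calling the package):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(8, 8))

# 1. Magnitude pruning: zero weights with |w| <= threshold
threshold = 0.05
w_pruned = np.where(np.abs(w) <= threshold, 0.0, w)
sparsity = float((w_pruned == 0).mean())

# 2. Symmetric 8-bit quantization of the survivors
bits = 8
abs_max = max(np.abs(w_pruned).max(), 1e-8)
scale = abs_max / (2**bits // 2 - 1)
w_q = np.clip(np.round(w_pruned / scale) * scale, -abs_max, abs_max)
```

Pruned zeros survive quantization unchanged (round(0 / scale) is 0), so the sparsity the runtime exploits is preserved through the quantization step.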

PruningReport dataclass

Results of a pruning operation.

Source code in src/sc_neurocore/compression/pruning.py
@dataclass
class PruningReport:
    """Results of a pruning operation."""

    original_params: int
    pruned_params: int
    remaining_params: int
    sparsity: float
    original_neurons: int = 0
    pruned_neurons: int = 0

prune_weights(weights, threshold=0.01, method='magnitude')

Prune small weights from layer weight matrices.

Parameters

weights : list of ndarray
    Weight matrices for each layer.
threshold : float
    Pruning threshold. Weights with |w| <= threshold are zeroed.
method : str
    'magnitude' (default): prune by absolute value. 'percentile': treat threshold as percentile (0-100) of weight magnitudes to prune.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_weights(
    weights: list[np.ndarray],
    threshold: float = 0.01,
    method: str = "magnitude",
) -> tuple[list[np.ndarray], PruningReport]:
    """Prune small weights from layer weight matrices.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices for each layer.
    threshold : float
        Pruning threshold. Weights with |w| <= threshold are zeroed.
    method : str
        'magnitude' (default): prune by absolute value.
        'percentile': treat threshold as percentile (0-100) of weight
        magnitudes to prune.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        if method == "percentile":
            abs_w = np.abs(w_copy)
            cutoff = np.percentile(abs_w[abs_w > 0], threshold) if np.any(abs_w > 0) else 0.0
            mask = abs_w <= cutoff
        else:
            mask = np.abs(w_copy) <= threshold

        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
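The two modes interpret threshold differently: 'magnitude' treats it as an absolute cutoff on |w|, while 'percentile' treats it as the fraction of nonzero magnitudes to prune. A quick standalone numpy check of that distinction (mirroring the logic above, not calling the package):

```python
import numpy as np

w = np.array([0.001, 0.02, 0.2, 0.5, 0.9])

# method='magnitude': threshold is an absolute cutoff on |w|
mag_mask = np.abs(w) <= 0.01                 # prunes only 0.001

# method='percentile': threshold is a percentile of the nonzero magnitudes
cutoff = np.percentile(np.abs(w)[np.abs(w) > 0], 40)
pct_mask = np.abs(w) <= cutoff               # prunes roughly the smallest 40%
```

Here the magnitude mask removes one weight while the percentile mask removes two, even though both were called with a "small" threshold; choose percentile mode when you want a target sparsity rather than a fixed cutoff.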

prune_neurons(weights, firing_rates=None, activity_threshold=0.001)

Structural pruning: remove neurons with low firing rates.

Removes entire rows from weight matrices (output neurons) and corresponding columns from the next layer's weight matrix (input connections). Reduces layer width, not just sparsity.

Parameters

weights : list of ndarray
    Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
firing_rates : list of ndarray, optional
    Per-neuron firing rates for each layer. If None, uses output weight magnitude as a proxy for importance.
activity_threshold : float
    Neurons with firing rate (or weight norm) below this are pruned.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_neurons(
    weights: list[np.ndarray],
    firing_rates: list[np.ndarray] | None = None,
    activity_threshold: float = 0.001,
) -> tuple[list[np.ndarray], PruningReport]:
    """Structural pruning: remove neurons with low firing rates.

    Removes entire rows from weight matrices (output neurons) and
    corresponding columns from the next layer's weight matrix (input
    connections). Reduces layer width, not just sparsity.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
    firing_rates : list of ndarray, optional
        Per-neuron firing rates for each layer. If None, uses output
        weight magnitude as a proxy for importance.
    activity_threshold : float
        Neurons with firing rate (or weight norm) below this are pruned.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    n_layers = len(weights)
    pruned_weights = [w.copy() for w in weights]
    total_neurons = sum(w.shape[0] for w in weights)
    neurons_pruned = 0

    for i in range(n_layers):
        w = pruned_weights[i]
        n_out = w.shape[0]

        if firing_rates is not None and i < len(firing_rates):
            importance = firing_rates[i]
        else:
            importance = np.linalg.norm(w, axis=1)

        keep_mask = importance > activity_threshold
        if keep_mask.all():
            continue

        n_removed = int((~keep_mask).sum())
        neurons_pruned += n_removed

        pruned_weights[i] = w[keep_mask]

        if i + 1 < n_layers:
            pruned_weights[i + 1] = pruned_weights[i + 1][:, keep_mask]

    total_remaining = total_neurons - neurons_pruned

    original_params = sum(w.size for w in weights)
    remaining_params = sum(w.size for w in pruned_weights)

    return pruned_weights, PruningReport(
        original_params=original_params,
        pruned_params=original_params - remaining_params,
        remaining_params=remaining_params,
        sparsity=(original_params - remaining_params) / max(original_params, 1),
        original_neurons=total_neurons,
        pruned_neurons=neurons_pruned,
    )
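Unlike weight pruning, structural pruning changes matrix shapes rather than just zeroing entries. A standalone numpy sketch of the row/column removal pattern used above:

```python
import numpy as np

# Two layers: W1 maps 3 -> 4 neurons, W2 maps 4 -> 2
W1 = np.ones((4, 3))
W2 = np.ones((2, 4))

# Suppose layer-1 neurons 1 and 3 fall below the activity threshold
keep = np.array([True, False, True, False])

W1_p = W1[keep]       # drop their output rows -> shape (2, 3)
W2_p = W2[:, keep]    # drop the matching input columns -> shape (2, 2)
```

The paired row/column removal is what keeps the layers composable: W2_p still has exactly one input column per surviving layer-1 neuron.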

prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)

Stochastic-aware pruning: score weights by bitstream contribution.

In SC networks, weight w encodes probability p = clip(|w|, 0, 1). The expected popcount contribution per inference is:

    contribution = min(p, 1-p) * bitstream_length

Weights that produce nearly-deterministic bitstreams (p near 0 or 1) contribute almost nothing to computation — they can be replaced with constant 0/1 gates, saving AND+popcount hardware.
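As a worked example of the scoring formula (plain Python arithmetic, not the package API): with L = 256, a weight encoding p = 0.99 contributes only about 2.56 expected "unpredictable" bits, while p = 0.5 contributes the maximum 128.

```python
L = 256

def contribution(p: float) -> float:
    # Expected "unpredictable" popcount bits per inference
    return min(p, 1.0 - p) * L

near_deterministic = contribution(0.99)  # ~2.56 bits -> candidate for a constant gate
maximally_random = contribution(0.5)     # 128 bits -> keep
```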

Parameters

weights : list of ndarray
    Weight matrices (values in [0, 1] for unipolar SC).
bitstream_length : int
    Bitstream length (L). Longer streams = more bits per weight.
min_popcount_bits : float
    Minimum expected popcount contribution to keep a weight. Weights contributing fewer bits than this are zeroed.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_stochastic(
    weights: list[np.ndarray],
    bitstream_length: int = 256,
    min_popcount_bits: float = 1.0,
) -> tuple[list[np.ndarray], PruningReport]:
    """Stochastic-aware pruning: score weights by bitstream contribution.

    In SC networks, weight w encodes probability p = clip(|w|, 0, 1).
    The expected popcount contribution per inference is:
        contribution = min(p, 1-p) * bitstream_length

    Weights that produce nearly-deterministic bitstreams (p near 0 or 1)
    contribute almost nothing to computation — they can be replaced with
    constant 0/1 gates, saving AND+popcount hardware.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices (values in [0, 1] for unipolar SC).
    bitstream_length : int
        Bitstream length (L). Longer streams = more bits per weight.
    min_popcount_bits : float
        Minimum expected popcount contribution to keep a weight.
        Weights contributing fewer bits than this are zeroed.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        # SC probability: clip to [0, 1]
        p = np.clip(np.abs(w_copy), 0.0, 1.0)
        # Expected popcount contribution: min(p, 1-p) * L
        # This is the "unpredictable" fraction of the bitstream
        contribution = np.minimum(p, 1.0 - p) * bitstream_length

        mask = contribution < min_popcount_bits
        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
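With the defaults (bitstream_length=256, min_popcount_bits=1.0), only weights whose encoded probability falls within 1/256 of fully-deterministic 0 or 1 get pruned. A standalone numpy sweep confirming that keep-range (mirroring the mask computation above, not calling the package):

```python
import numpy as np

L, min_bits = 256, 1.0
p = np.linspace(0.0, 1.0, 1001)            # sweep of encoded probabilities
contribution = np.minimum(p, 1.0 - p) * L
pruned = contribution < min_bits

# Only the extreme tails fall below 1 expected bit: p < 1/256 or p > 255/256
n_pruned = int(pruned.sum())
```

So at the defaults this method is deliberately conservative; raise min_popcount_bits to widen the pruned tails.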

Quantization

sc_neurocore.compression.quantization

Quantize weights and delays for reduced hardware precision.

Weight quantization: reduce from float64 to fixed-point with configurable bit width. Fewer bits = smaller BRAM and simpler multiplier circuits.

Delay quantization: round continuous delays to integer steps or coarser grids. Fewer delay levels = smaller delay buffers on FPGA.

quantize_weights(weights, bits=8, symmetric=True)

Quantize weight matrices to fixed-point with given bit width.

Parameters

weights : list of ndarray
    Float weight matrices.
bits : int
    Target bit width (default 8). Range: [2, 16].
symmetric : bool
    Symmetric quantization around zero (default True).

Returns

list of ndarray
    Quantized weights (still float dtype but with discrete values).

Source code in src/sc_neurocore/compression/quantization.py
def quantize_weights(
    weights: list[np.ndarray],
    bits: int = 8,
    symmetric: bool = True,
) -> list[np.ndarray]:
    """Quantize weight matrices to fixed-point with given bit width.

    Parameters
    ----------
    weights : list of ndarray
        Float weight matrices.
    bits : int
        Target bit width (default 8). Range: [2, 16].
    symmetric : bool
        Symmetric quantization around zero (default True).

    Returns
    -------
    list of ndarray
        Quantized weights (still float dtype but with discrete values).
    """
    bits = max(2, min(bits, 16))
    n_levels = 2**bits

    quantized = []
    for w in weights:
        if symmetric:
            abs_max = max(np.abs(w).max(), 1e-8)
            scale = abs_max / (n_levels // 2 - 1)
            q = np.round(w / scale) * scale
            q = np.clip(q, -abs_max, abs_max)
        else:
            w_min, w_max = w.min(), w.max()
            w_range = max(w_max - w_min, 1e-8)
            scale = w_range / (n_levels - 1)
            q = np.round((w - w_min) / scale) * scale + w_min

        quantized.append(q)

    return quantized
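Note that symmetric mode admits fewer distinct values than 2**bits: codes run from -(2**bits // 2 - 1) to +(2**bits // 2 - 1) plus zero, so bits=3 yields 7 levels, not 8. A standalone numpy check of that level count (reproducing the symmetric branch above):

```python
import numpy as np

bits = 3
w = np.linspace(-1.0, 1.0, 100)

abs_max = max(np.abs(w).max(), 1e-8)
scale = abs_max / (2**bits // 2 - 1)          # integer codes -3..3
q = np.clip(np.round(w / scale) * scale, -abs_max, abs_max)

n_levels = len(np.unique(q))                  # 2*(2**3 // 2 - 1) + 1 = 7
```

The "lost" code is the price of an exact zero level centered in the grid, which matters when quantization follows pruning.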

quantize_delays(delays, resolution=1, max_delay=None)

Quantize continuous delays to integer grid.

Parameters

delays : ndarray
    Continuous delay values.
resolution : int
    Delay step size (default 1). Resolution=2 means delays are rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
max_delay : int, optional
    Clamp delays to this maximum.

Returns

ndarray of int
    Quantized integer delays.

Source code in src/sc_neurocore/compression/quantization.py
def quantize_delays(
    delays: np.ndarray,
    resolution: int = 1,
    max_delay: int | None = None,
) -> np.ndarray:
    """Quantize continuous delays to integer grid.

    Parameters
    ----------
    delays : ndarray
        Continuous delay values.
    resolution : int
        Delay step size (default 1). Resolution=2 means delays are
        rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
    max_delay : int, optional
        Clamp delays to this maximum.

    Returns
    -------
    ndarray of int
        Quantized integer delays.
    """
    q = np.round(delays / resolution).astype(np.int64) * resolution
    q = np.clip(q, 0, None)
    if max_delay is not None:
        q = np.clip(q, 0, max_delay)
    return q
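A standalone numpy sketch of the rounding and clamping steps above, with resolution=2 and a hypothetical max_delay of 6:

```python
import numpy as np

delays = np.array([0.4, 1.6, 3.1, 7.9])

# resolution=2: round to the nearest even step, halving the buffer depth
q = np.round(delays / 2).astype(np.int64) * 2
q = np.clip(q, 0, 6)                          # max_delay=6 clamp
```

The result lands on the coarse grid {0, 2, 4, 6}, so the FPGA delay buffer only needs 4 addressable slots instead of 8.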