SNN Model Compression

Weight pruning, structural pruning, stochastic-aware pruning, and quantization for FPGA cost reduction.

Pruning

Three pruning strategies:

  • prune_weights — Magnitude-based: zero out weights with |w| below threshold. Standard approach.
  • prune_neurons — Structural: remove entire neurons with low firing rates, reducing layer width (not just sparsity).
  • prune_stochastic — SC-specific: score weights by bitstream contribution. Weights near 0 or 1 produce near-deterministic bitstreams (low entropy) and can be replaced with constant gates. Importance = min(p, 1-p) * bitstream_length.

from sc_neurocore.compression import prune_stochastic

# Prune weights contributing <1 popcount bit per inference
pruned, report = prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)
print(f"Sparsity: {report.sparsity:.1%}")

sc_neurocore.compression.pruning

Weight, structural, and stochastic-aware pruning for SNN model compression.

Weight pruning: zero out weights below a magnitude threshold. Structural pruning: remove entire neurons that fire below an activity threshold, reducing layer width. Stochastic pruning (SC-specific): score weights by bitstream contribution, i.e. how many popcount bits they contribute per inference.

All methods reduce FPGA resource usage when combined with Projection(weight_threshold=) for runtime sparsity exploitation.
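The methods compose naturally: prune first, then quantize the survivors. A minimal standalone sketch of that order using plain numpy (it mirrors the magnitude-pruning and symmetric-quantization formulas shown in the source below rather than calling the package):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(8, 8))

# 1. Magnitude pruning: zero weights with |w| <= threshold
threshold = 0.05
w_pruned = np.where(np.abs(w) <= threshold, 0.0, w)
sparsity = float((w_pruned == 0).mean())

# 2. Symmetric 8-bit quantization of the survivors
bits = 8
abs_max = max(np.abs(w_pruned).max(), 1e-8)
scale = abs_max / (2**bits // 2 - 1)
w_q = np.clip(np.round(w_pruned / scale) * scale, -abs_max, abs_max)
```

Pruned zeros survive quantization unchanged (round(0 / scale) is 0), so the sparsity the runtime exploits is preserved through the quantization step.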

PruningReport dataclass

Results of a pruning operation.

Source code in src/sc_neurocore/compression/pruning.py
@dataclass
class PruningReport:
    """Results of a pruning operation."""

    original_params: int
    pruned_params: int
    remaining_params: int
    sparsity: float
    original_neurons: int = 0
    pruned_neurons: int = 0

prune_weights(weights, threshold=0.01, method='magnitude')

Prune small weights from layer weight matrices.

Parameters

weights : list of ndarray
    Weight matrices for each layer.
threshold : float
    Pruning threshold. Weights with |w| <= threshold are zeroed.
method : str
    'magnitude' (default): prune by absolute value. 'percentile': treat threshold as percentile (0-100) of weight magnitudes to prune.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_weights(
    weights: list[np.ndarray],
    threshold: float = 0.01,
    method: str = "magnitude",
) -> tuple[list[np.ndarray], PruningReport]:
    """Prune small weights from layer weight matrices.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices for each layer.
    threshold : float
        Pruning threshold. Weights with |w| <= threshold are zeroed.
    method : str
        'magnitude' (default): prune by absolute value.
        'percentile': treat threshold as percentile (0-100) of weight
        magnitudes to prune.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        if method == "percentile":
            abs_w = np.abs(w_copy)
            cutoff = np.percentile(abs_w[abs_w > 0], threshold) if np.any(abs_w > 0) else 0.0
            mask = abs_w <= cutoff
        else:
            mask = np.abs(w_copy) <= threshold

        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
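The two modes interpret threshold differently: 'magnitude' treats it as an absolute cutoff on |w|, while 'percentile' treats it as the fraction of nonzero magnitudes to prune. A quick standalone numpy check of that distinction (mirroring the logic above, not calling the package):

```python
import numpy as np

w = np.array([0.001, 0.02, 0.2, 0.5, 0.9])

# method='magnitude': threshold is an absolute cutoff on |w|
mag_mask = np.abs(w) <= 0.01                 # prunes only 0.001

# method='percentile': threshold is a percentile of the nonzero magnitudes
cutoff = np.percentile(np.abs(w)[np.abs(w) > 0], 40)
pct_mask = np.abs(w) <= cutoff               # prunes roughly the smallest 40%
```

Here the magnitude mask removes one weight while the percentile mask removes two, even though both were called with a "small" threshold; choose percentile mode when you want a target sparsity rather than a fixed cutoff.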

prune_neurons(weights, firing_rates=None, activity_threshold=0.001)

Structural pruning: remove neurons with low firing rates.

Removes entire rows from weight matrices (output neurons) and corresponding columns from the next layer's weight matrix (input connections). Reduces layer width, not just sparsity.

Parameters

weights : list of ndarray
    Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
firing_rates : list of ndarray, optional
    Per-neuron firing rates for each layer. If None, uses output weight magnitude as a proxy for importance.
activity_threshold : float
    Neurons with firing rate (or weight norm) below this are pruned.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_neurons(
    weights: list[np.ndarray],
    firing_rates: list[np.ndarray] | None = None,
    activity_threshold: float = 0.001,
) -> tuple[list[np.ndarray], PruningReport]:
    """Structural pruning: remove neurons with low firing rates.

    Removes entire rows from weight matrices (output neurons) and
    corresponding columns from the next layer's weight matrix (input
    connections). Reduces layer width, not just sparsity.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
    firing_rates : list of ndarray, optional
        Per-neuron firing rates for each layer. If None, uses output
        weight magnitude as a proxy for importance.
    activity_threshold : float
        Neurons with firing rate (or weight norm) below this are pruned.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    n_layers = len(weights)
    pruned_weights = [w.copy() for w in weights]
    total_neurons = sum(w.shape[0] for w in weights)
    neurons_pruned = 0

    for i in range(n_layers):
        w = pruned_weights[i]
        n_out = w.shape[0]

        if firing_rates is not None and i < len(firing_rates):
            importance = firing_rates[i]
        else:
            importance = np.linalg.norm(w, axis=1)

        keep_mask = importance > activity_threshold
        if keep_mask.all():
            continue

        n_removed = int((~keep_mask).sum())
        neurons_pruned += n_removed

        pruned_weights[i] = w[keep_mask]

        if i + 1 < n_layers:
            pruned_weights[i + 1] = pruned_weights[i + 1][:, keep_mask]

    total_remaining = total_neurons - neurons_pruned

    original_params = sum(w.size for w in weights)
    remaining_params = sum(w.size for w in pruned_weights)

    return pruned_weights, PruningReport(
        original_params=original_params,
        pruned_params=original_params - remaining_params,
        remaining_params=remaining_params,
        sparsity=(original_params - remaining_params) / max(original_params, 1),
        original_neurons=total_neurons,
        pruned_neurons=neurons_pruned,
    )
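Unlike weight pruning, structural pruning changes matrix shapes rather than just zeroing entries. A standalone numpy sketch of the row/column removal pattern used above:

```python
import numpy as np

# Two layers: W1 maps 3 -> 4 neurons, W2 maps 4 -> 2
W1 = np.ones((4, 3))
W2 = np.ones((2, 4))

# Suppose layer-1 neurons 1 and 3 fall below the activity threshold
keep = np.array([True, False, True, False])

W1_p = W1[keep]       # drop their output rows -> shape (2, 3)
W2_p = W2[:, keep]    # drop the matching input columns -> shape (2, 2)
```

The paired row/column removal is what keeps the layers composable: W2_p still has exactly one input column per surviving layer-1 neuron.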

prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)

Stochastic-aware pruning: score weights by bitstream contribution.

In SC networks, weight w encodes probability p = clip(|w|, 0, 1). The expected popcount contribution per inference is:

    contribution = min(p, 1-p) * bitstream_length

Weights that produce nearly-deterministic bitstreams (p near 0 or 1) contribute almost nothing to computation — they can be replaced with constant 0/1 gates, saving AND+popcount hardware.
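As a worked example of the scoring formula (plain Python arithmetic, not the package API): with L = 256, a weight encoding p = 0.99 contributes only about 2.56 expected "unpredictable" bits, while p = 0.5 contributes the maximum 128.

```python
L = 256

def contribution(p: float) -> float:
    # Expected "unpredictable" popcount bits per inference
    return min(p, 1.0 - p) * L

near_deterministic = contribution(0.99)  # ~2.56 bits -> candidate for a constant gate
maximally_random = contribution(0.5)     # 128 bits -> keep
```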

Parameters

weights : list of ndarray
    Weight matrices (values in [0, 1] for unipolar SC).
bitstream_length : int
    Bitstream length (L). Longer streams = more bits per weight.
min_popcount_bits : float
    Minimum expected popcount contribution to keep a weight. Weights contributing fewer bits than this are zeroed.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
def prune_stochastic(
    weights: list[np.ndarray],
    bitstream_length: int = 256,
    min_popcount_bits: float = 1.0,
) -> tuple[list[np.ndarray], PruningReport]:
    """Stochastic-aware pruning: score weights by bitstream contribution.

    In SC networks, weight w encodes probability p = clip(|w|, 0, 1).
    The expected popcount contribution per inference is:
        contribution = min(p, 1-p) * bitstream_length

    Weights that produce nearly-deterministic bitstreams (p near 0 or 1)
    contribute almost nothing to computation — they can be replaced with
    constant 0/1 gates, saving AND+popcount hardware.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices (values in [0, 1] for unipolar SC).
    bitstream_length : int
        Bitstream length (L). Longer streams = more bits per weight.
    min_popcount_bits : float
        Minimum expected popcount contribution to keep a weight.
        Weights contributing fewer bits than this are zeroed.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        # SC probability: clip to [0, 1]
        p = np.clip(np.abs(w_copy), 0.0, 1.0)
        # Expected popcount contribution: min(p, 1-p) * L
        # This is the "unpredictable" fraction of the bitstream
        contribution = np.minimum(p, 1.0 - p) * bitstream_length

        mask = contribution < min_popcount_bits
        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
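With the defaults (bitstream_length=256, min_popcount_bits=1.0), only weights whose encoded probability falls within 1/256 of fully-deterministic 0 or 1 get pruned. A standalone numpy sweep confirming that keep-range (mirroring the mask computation above, not calling the package):

```python
import numpy as np

L, min_bits = 256, 1.0
p = np.linspace(0.0, 1.0, 1001)            # sweep of encoded probabilities
contribution = np.minimum(p, 1.0 - p) * L
pruned = contribution < min_bits

# Only the extreme tails fall below 1 expected bit: p < 1/256 or p > 255/256
n_pruned = int(pruned.sum())
```

So at the defaults this method is deliberately conservative; raise min_popcount_bits to widen the pruned tails.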

Quantization

sc_neurocore.compression.quantization

Quantize weights and delays for reduced hardware precision.

Weight quantization: reduce from float64 to fixed-point with configurable bit width. Fewer bits = smaller BRAM and simpler multiplier circuits.

Delay quantization: round continuous delays to integer steps or coarser grids. Fewer delay levels = smaller delay buffers on FPGA.

quantize_weights(weights, bits=8, symmetric=True)

Quantize weight matrices to fixed-point with given bit width.

Parameters

weights : list of ndarray
    Float weight matrices.
bits : int
    Target bit width (default 8). Range: [2, 16].
symmetric : bool
    Symmetric quantization around zero (default True).

Returns

list of ndarray
    Quantized weights (still float dtype but with discrete values).

Source code in src/sc_neurocore/compression/quantization.py
def quantize_weights(
    weights: list[np.ndarray],
    bits: int = 8,
    symmetric: bool = True,
) -> list[np.ndarray]:
    """Quantize weight matrices to fixed-point with given bit width.

    Parameters
    ----------
    weights : list of ndarray
        Float weight matrices.
    bits : int
        Target bit width (default 8). Range: [2, 16].
    symmetric : bool
        Symmetric quantization around zero (default True).

    Returns
    -------
    list of ndarray
        Quantized weights (still float dtype but with discrete values).
    """
    bits = max(2, min(bits, 16))
    n_levels = 2**bits

    quantized = []
    for w in weights:
        if symmetric:
            abs_max = max(np.abs(w).max(), 1e-8)
            scale = abs_max / (n_levels // 2 - 1)
            q = np.round(w / scale) * scale
            q = np.clip(q, -abs_max, abs_max)
        else:
            w_min, w_max = w.min(), w.max()
            w_range = max(w_max - w_min, 1e-8)
            scale = w_range / (n_levels - 1)
            q = np.round((w - w_min) / scale) * scale + w_min

        quantized.append(q)

    return quantized
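Note that symmetric mode admits fewer distinct values than 2**bits: codes run from -(2**bits // 2 - 1) to +(2**bits // 2 - 1) plus zero, so bits=3 yields 7 levels, not 8. A standalone numpy check of that level count (reproducing the symmetric branch above):

```python
import numpy as np

bits = 3
w = np.linspace(-1.0, 1.0, 100)

abs_max = max(np.abs(w).max(), 1e-8)
scale = abs_max / (2**bits // 2 - 1)          # integer codes -3..3
q = np.clip(np.round(w / scale) * scale, -abs_max, abs_max)

n_levels = len(np.unique(q))                  # 2*(2**3 // 2 - 1) + 1 = 7
```

The "lost" code is the price of an exact zero level centered in the grid, which matters when quantization follows pruning.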

quantize_delays(delays, resolution=1, max_delay=None)

Quantize continuous delays to integer grid.

Parameters

delays : ndarray
    Continuous delay values.
resolution : int
    Delay step size (default 1). Resolution=2 means delays are rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
max_delay : int, optional
    Clamp delays to this maximum.

Returns

ndarray of int
    Quantized integer delays.

Source code in src/sc_neurocore/compression/quantization.py
def quantize_delays(
    delays: np.ndarray,
    resolution: int = 1,
    max_delay: int | None = None,
) -> np.ndarray:
    """Quantize continuous delays to integer grid.

    Parameters
    ----------
    delays : ndarray
        Continuous delay values.
    resolution : int
        Delay step size (default 1). Resolution=2 means delays are
        rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
    max_delay : int, optional
        Clamp delays to this maximum.

    Returns
    -------
    ndarray of int
        Quantized integer delays.
    """
    q = np.round(delays / resolution).astype(np.int64) * resolution
    q = np.clip(q, 0, None)
    if max_delay is not None:
        q = np.clip(q, 0, max_delay)
    return q
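A standalone numpy sketch of the rounding and clamping steps above, with resolution=2 and a hypothetical max_delay of 6:

```python
import numpy as np

delays = np.array([0.4, 1.6, 3.1, 7.9])

# resolution=2: round to the nearest even step, halving the buffer depth
q = np.round(delays / 2).astype(np.int64) * 2
q = np.clip(q, 0, 6)                          # max_delay=6 clamp
```

The result lands on the coarse grid {0, 2, 4, 6}, so the FPGA delay buffer only needs 4 addressable slots instead of 8.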