
SNN Model Compression

Weight pruning, structural pruning, stochastic-aware pruning, and quantization for FPGA cost reduction.

Pruning

Three pruning strategies:

  • prune_weights — Magnitude-based: zero out weights with |w| below threshold. Standard approach.
  • prune_neurons — Structural: remove entire neurons with low firing rates, reducing layer width (not just sparsity).
  • prune_stochastic — SC-specific: score weights by bitstream contribution. Weights near 0 or 1 produce near-deterministic bitstreams (low entropy) and can be replaced with constant gates. Importance = min(p, 1-p) * bitstream_length.
Python
from sc_neurocore.compression import prune_stochastic

# Prune weights contributing <1 popcount bit per inference
pruned, report = prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)
print(f"Sparsity: {report.sparsity:.1%}")

sc_neurocore.compression.pruning

Weight, structural, and stochastic-aware pruning for SNN model compression.

Weight pruning: zero out weights below a magnitude threshold. Structural pruning: remove entire neurons that fire below an activity threshold, reducing layer width. Stochastic pruning (SC-specific): score weights by their bitstream contribution, i.e. how many popcount bits they contribute per inference.

All methods reduce FPGA resource usage when combined with Projection(weight_threshold=) for runtime sparsity exploitation.

PruningReport dataclass

Results of a pruning operation.

Source code in src/sc_neurocore/compression/pruning.py
Python
@dataclass
class PruningReport:
    """Results of a pruning operation."""

    original_params: int
    pruned_params: int
    remaining_params: int
    sparsity: float
    original_neurons: int = 0
    pruned_neurons: int = 0

prune_weights(weights, threshold=0.01, method='magnitude')

Prune small weights from layer weight matrices.

Parameters

weights : list of ndarray
    Weight matrices for each layer.
threshold : float
    Pruning threshold. Weights with |w| <= threshold are zeroed.
method : str
    'magnitude' (default): prune by absolute value. 'percentile': treat threshold as percentile (0-100) of weight magnitudes to prune.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
Python
def prune_weights(
    weights: list[np.ndarray],
    threshold: float = 0.01,
    method: str = "magnitude",
) -> tuple[list[np.ndarray], PruningReport]:
    """Prune small weights from layer weight matrices.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices for each layer.
    threshold : float
        Pruning threshold. Weights with |w| <= threshold are zeroed.
    method : str
        'magnitude' (default): prune by absolute value.
        'percentile': treat threshold as percentile (0-100) of weight
        magnitudes to prune.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        if method == "percentile":
            abs_w = np.abs(w_copy)
            cutoff = np.percentile(abs_w[abs_w > 0], threshold) if np.any(abs_w > 0) else 0.0
            mask = abs_w <= cutoff
        else:
            mask = np.abs(w_copy) <= threshold

        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
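The percentile branch is easy to misread: the threshold is interpreted as a percentage of the *nonzero* weight magnitudes, not as an absolute cutoff. A minimal standalone numpy sketch of that branch (not calling the library itself):

```python
import numpy as np

# Percentile mode: threshold=50 prunes weights at or below the median
# magnitude of the nonzero entries.
w = np.array([[0.005, 0.2], [-0.03, 0.9]])
abs_w = np.abs(w)
cutoff = np.percentile(abs_w[abs_w > 0], 50)  # median of {0.005, 0.03, 0.2, 0.9}
pruned = np.where(abs_w <= cutoff, 0.0, w)    # zeros the two smallest weights
sparsity = (pruned == 0).sum() / pruned.size  # 0.5
```

With threshold=50, exactly half of the nonzero weights are removed regardless of their absolute scale, which makes the percentile mode useful when layers have very different weight magnitudes.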

prune_neurons(weights, firing_rates=None, activity_threshold=0.001)

Structural pruning: remove neurons with low firing rates.

Removes entire rows from weight matrices (output neurons) and corresponding columns from the next layer's weight matrix (input connections). Reduces layer width, not just sparsity.

Parameters

weights : list of ndarray
    Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
firing_rates : list of ndarray, optional
    Per-neuron firing rates for each layer. If None, uses output weight magnitude as a proxy for importance.
activity_threshold : float
    Neurons with firing rate (or weight norm) below this are pruned.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
Python
def prune_neurons(
    weights: list[np.ndarray],
    firing_rates: list[np.ndarray] | None = None,
    activity_threshold: float = 0.001,
) -> tuple[list[np.ndarray], PruningReport]:
    """Structural pruning: remove neurons with low firing rates.

    Removes entire rows from weight matrices (output neurons) and
    corresponding columns from the next layer's weight matrix (input
    connections). Reduces layer width, not just sparsity.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices [W1, W2, ...] where W_i has shape (n_out, n_in).
    firing_rates : list of ndarray, optional
        Per-neuron firing rates for each layer. If None, uses output
        weight magnitude as a proxy for importance.
    activity_threshold : float
        Neurons with firing rate (or weight norm) below this are pruned.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    n_layers = len(weights)
    pruned_weights = [w.copy() for w in weights]
    total_neurons = sum(w.shape[0] for w in weights)
    neurons_pruned = 0

    for i in range(n_layers):
        w = pruned_weights[i]
        n_out = w.shape[0]

        if firing_rates is not None and i < len(firing_rates):
            importance = firing_rates[i]
        else:
            importance = np.linalg.norm(w, axis=1)

        keep_mask = importance > activity_threshold
        if keep_mask.all():
            continue

        n_removed = int((~keep_mask).sum())
        neurons_pruned += n_removed

        pruned_weights[i] = w[keep_mask]

        if i + 1 < n_layers:
            pruned_weights[i + 1] = pruned_weights[i + 1][:, keep_mask]

    total_remaining = total_neurons - neurons_pruned

    original_params = sum(w.size for w in weights)
    remaining_params = sum(w.size for w in pruned_weights)

    return pruned_weights, PruningReport(
        original_params=original_params,
        pruned_params=original_params - remaining_params,
        remaining_params=remaining_params,
        sparsity=(original_params - remaining_params) / max(original_params, 1),
        original_neurons=total_neurons,
        pruned_neurons=neurons_pruned,
    )
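To see the shape bookkeeping concretely, here is a hedged two-layer sketch of the same row/column removal, using the weight-norm proxy (no firing rates supplied):

```python
import numpy as np

# W1: 3 output neurons, 2 inputs; neuron 1 is dead (all-zero weights).
W1 = np.array([[0.5, 0.4],
               [0.0, 0.0],
               [0.3, 0.2]])
W2 = np.ones((2, 3))  # next layer consumes all 3 neurons

importance = np.linalg.norm(W1, axis=1)  # row norms as activity proxy
keep = importance > 0.001                # [True, False, True]
W1p = W1[keep]                           # drop the dead neuron's output row
W2p = W2[:, keep]                        # drop its input column downstream
```

Both matrices shrink from 6 to 4 parameters: the dead neuron's outgoing row in W1 and its incoming column in W2 disappear together, which is what keeps the layer shapes consistent.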

prune_stochastic(weights, bitstream_length=256, min_popcount_bits=1.0)

Stochastic-aware pruning: score weights by bitstream contribution.

In SC networks, weight w encodes probability p = clip(|w|, 0, 1). The expected popcount contribution per inference is: contribution = min(p, 1-p) * bitstream_length

Weights that produce nearly-deterministic bitstreams (p near 0 or 1) contribute almost nothing to computation — they can be replaced with constant 0/1 gates, saving AND+popcount hardware.

Parameters

weights : list of ndarray
    Weight matrices (values in [0, 1] for unipolar SC).
bitstream_length : int
    Bitstream length (L). Longer streams = more bits per weight.
min_popcount_bits : float
    Minimum expected popcount contribution to keep a weight. Weights contributing fewer bits than this are zeroed.

Returns

(pruned_weights, PruningReport)

Source code in src/sc_neurocore/compression/pruning.py
Python
def prune_stochastic(
    weights: list[np.ndarray],
    bitstream_length: int = 256,
    min_popcount_bits: float = 1.0,
) -> tuple[list[np.ndarray], PruningReport]:
    """Stochastic-aware pruning: score weights by bitstream contribution.

    In SC networks, weight w encodes probability p = clip(|w|, 0, 1).
    The expected popcount contribution per inference is:
        contribution = min(p, 1-p) * bitstream_length

    Weights that produce nearly-deterministic bitstreams (p near 0 or 1)
    contribute almost nothing to computation — they can be replaced with
    constant 0/1 gates, saving AND+popcount hardware.

    Parameters
    ----------
    weights : list of ndarray
        Weight matrices (values in [0, 1] for unipolar SC).
    bitstream_length : int
        Bitstream length (L). Longer streams = more bits per weight.
    min_popcount_bits : float
        Minimum expected popcount contribution to keep a weight.
        Weights contributing fewer bits than this are zeroed.

    Returns
    -------
    (pruned_weights, PruningReport)
    """
    pruned = []
    total_original = 0
    total_pruned = 0

    for w in weights:
        total_original += w.size
        w_copy = w.copy()

        # SC probability: clip to [0, 1]
        p = np.clip(np.abs(w_copy), 0.0, 1.0)
        # Expected popcount contribution: min(p, 1-p) * L
        # This is the "unpredictable" fraction of the bitstream
        contribution = np.minimum(p, 1.0 - p) * bitstream_length

        mask = contribution < min_popcount_bits
        w_copy[mask] = 0.0
        total_pruned += int(mask.sum())
        pruned.append(w_copy)

    remaining = total_original - total_pruned
    sparsity = total_pruned / max(total_original, 1)

    return pruned, PruningReport(
        original_params=total_original,
        pruned_params=total_pruned,
        remaining_params=remaining,
        sparsity=sparsity,
    )
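With the defaults (L=256, min_popcount_bits=1.0), any weight whose probability falls below 1/256 or above 255/256 is pruned. A standalone numpy sketch of the scoring step:

```python
import numpy as np

L = 256                                    # bitstream length
w = np.array([0.001, 0.5, 0.999])
p = np.clip(np.abs(w), 0.0, 1.0)
contribution = np.minimum(p, 1.0 - p) * L  # expected "unpredictable" bits
prune = contribution < 1.0                 # default min_popcount_bits=1.0
# 0.001 and 0.999 each contribute ~0.256 bits -> pruned;
# 0.5 contributes 128 bits -> kept (maximum-entropy bitstream)
```

Note the asymmetry with magnitude pruning: a weight of 0.999 is *large* but still prunable here, because its bitstream is nearly all ones and can be replaced by a constant-1 gate.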

Quantization

sc_neurocore.compression.quantization

Quantize weights and delays for reduced hardware precision.

Weight quantization: reduce from float64 to fixed-point with configurable bit width. Fewer bits = smaller BRAM and simpler multiplier circuits.

Delay quantization: round continuous delays to integer steps or coarser grids. Fewer delay levels = smaller delay buffers on FPGA.

quantize_weights(weights, bits=8, symmetric=True)

Quantize weight matrices to fixed-point with given bit width.

Parameters

weights : list of ndarray
    Float weight matrices.
bits : int
    Target bit width (default 8). Range: [2, 16].
symmetric : bool
    Symmetric quantization around zero (default True).

Returns

list of ndarray
    Quantized weights (still float dtype but with discrete values).

Source code in src/sc_neurocore/compression/quantization.py
Python
def quantize_weights(
    weights: list[np.ndarray],
    bits: int = 8,
    symmetric: bool = True,
) -> list[np.ndarray]:
    """Quantize weight matrices to fixed-point with given bit width.

    Parameters
    ----------
    weights : list of ndarray
        Float weight matrices.
    bits : int
        Target bit width (default 8). Range: [2, 16].
    symmetric : bool
        Symmetric quantization around zero (default True).

    Returns
    -------
    list of ndarray
        Quantized weights (still float dtype but with discrete values).
    """
    bits = max(2, min(bits, 16))
    n_levels = 2**bits

    quantized = []
    for w in weights:
        if symmetric:
            abs_max = max(np.abs(w).max(), 1e-8)
            scale = abs_max / (n_levels // 2 - 1)
            q = np.round(w / scale) * scale
            q = np.clip(q, -abs_max, abs_max)
        else:
            w_min, w_max = w.min(), w.max()
            w_range = max(w_max - w_min, 1e-8)
            scale = w_range / (n_levels - 1)
            q = np.round((w - w_min) / scale) * scale + w_min

        quantized.append(q)

    return quantized
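For intuition, the symmetric path snaps every weight onto a grid of integer multiples of abs_max / 127 at 8 bits. A standalone sketch of that arithmetic:

```python
import numpy as np

bits = 8
n_levels = 2 ** bits
w = np.array([-1.0, 0.3, 1.0])
abs_max = max(np.abs(w).max(), 1e-8)    # 1.0
scale = abs_max / (n_levels // 2 - 1)   # step size: 1/127
q = np.clip(np.round(w / scale) * scale, -abs_max, abs_max)
# each value now lies on an integer multiple of the step,
# so the worst-case rounding error is scale / 2
```

The extremes survive quantization essentially unchanged, while 0.3 moves to the nearest grid point (38/127 ≈ 0.2992), within half a step of the original.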

quantize_delays(delays, resolution=1, max_delay=None)

Quantize continuous delays to integer grid.

Parameters

delays : ndarray
    Continuous delay values.
resolution : int
    Delay step size (default 1). Resolution=2 means delays are rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
max_delay : int, optional
    Clamp delays to this maximum.

Returns

ndarray of int
    Quantized integer delays.

Source code in src/sc_neurocore/compression/quantization.py
Python
def quantize_delays(
    delays: np.ndarray,
    resolution: int = 1,
    max_delay: int | None = None,
) -> np.ndarray:
    """Quantize continuous delays to integer grid.

    Parameters
    ----------
    delays : ndarray
        Continuous delay values.
    resolution : int
        Delay step size (default 1). Resolution=2 means delays are
        rounded to {0, 2, 4, 6, ...}, halving the buffer depth.
    max_delay : int, optional
        Clamp delays to this maximum.

    Returns
    -------
    ndarray of int
        Quantized integer delays.
    """
    q = np.round(delays / resolution).astype(np.int64) * resolution
    q = np.clip(q, 0, None)
    if max_delay is not None:
        q = np.clip(q, 0, max_delay)
    return q
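A quick worked example of the rounding and clamping, assuming resolution=2 and max_delay=8:

```python
import numpy as np

delays = np.array([0.4, 1.6, 5.3, 9.9])
resolution = 2
q = np.round(delays / resolution).astype(np.int64) * resolution
q = np.clip(q, 0, 8)  # max_delay=8 clamps the last entry from 10 down to 8
# -> [0, 2, 6, 8]: only even delay levels remain, halving the buffer depth
```

Every output is an even integer, so an FPGA delay buffer only needs entries at even tap positions rather than one per timestep.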