Tutorial 44: SNN Model Compression

Trained SNN models are often too large for small FPGAs or edge neuromorphic chips. Model compression reduces resource usage through pruning, quantisation, and automatic target-aware optimisation — while preserving accuracy.

SC-NeuroCore provides a compression pipeline that goes from a trained PyTorch SNN directly to a size-optimised FPGA deployment.

Why Compress?

Target        LUTs Available   Typical SNN Requirement   Fits?
iCE40 UP5K    5,280            2,000–50,000              Maybe
ECP5 25K      24,576           2,000–50,000              Usually
Artix-7 35T   20,800           2,000–50,000              Usually

A 784→256→128→10 MNIST SNN has ~235K weights. At Q8.8 (16 bits each), that is ~3.8 Mbit, far beyond the iCE40 UP5K's 30 EBRs (120 Kbit total). After pruning (90% sparsity) and quantisation (4-bit), the surviving ~23K weights need about 94 Kbit, which fits.
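
To sanity-check those numbers, here is the arithmetic in plain Python (no SC-NeuroCore calls involved):

Python
# Back-of-the-envelope memory estimate for the network above
layer_shapes = [(784, 256), (256, 128), (128, 10)]
params = sum(rows * cols for rows, cols in layer_shapes)  # 234,752 weights

dense_bits = params * 16                 # Q8.8: 16 bits per weight
kept = int(params * 0.10)                # 90% sparsity keeps 10% of weights
compressed_bits = kept * 4               # 4-bit quantisation

print(f"dense:      {dense_bits / 1e6:.2f} Mbit")       # 3.76 Mbit
print(f"compressed: {compressed_bits / 1e3:.0f} Kbit")  # 94 Kbit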

1. Weight Pruning

Remove near-zero synaptic weights. Trained SNN weights cluster around zero, so many connections have a negligible effect on firing patterns and can be dropped outright.

Python
import numpy as np
from sc_neurocore.compression import prune_weights

# Simulate trained weights (3 layers)
weights = [
    np.random.randn(784, 256) * 0.05,  # input → hidden1
    np.random.randn(256, 128) * 0.1,   # hidden1 → hidden2
    np.random.randn(128, 10) * 0.2,    # hidden2 → output
]

# Prune weights below threshold
pruned, report = prune_weights(weights, threshold=0.05)
print(f"Sparsity: {report.sparsity:.1%}")
print(f"Parameters: {report.original_params:,}{report.remaining_params:,}")
print(f"Memory reduction: {report.compression_ratio:.1f}×")

Magnitude vs Activity-Based Pruning

Magnitude pruning removes small weights. Activity-based pruning removes weights connected to low-activity neurons — more targeted for SNNs:

Python
from sc_neurocore.compression import prune_weights

# Activity-based: monitor spike rates during validation,
# prune weights connected to neurons with <5% firing rate
pruned, report = prune_weights(
    weights,
    method="activity",
    activity_threshold=0.05,
    spike_rates=monitored_rates,  # from SpikeMonitor
)
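
Under the hood, both variants reduce to masking the weight matrix. A minimal numpy sketch of the two criteria, purely illustrative (the helper names and shape conventions here are assumptions; prune_weights is the supported API):

Python
import numpy as np

def magnitude_mask(w, threshold):
    """Keep weights whose magnitude meets the threshold."""
    return np.abs(w) >= threshold

def activity_mask(w, post_rates, min_rate):
    """Keep weights feeding post-synaptic neurons that fire at least
    min_rate; post_rates has one entry per column of w."""
    return np.broadcast_to(post_rates >= min_rate, w.shape)

w = np.random.randn(256, 128) * 0.1
rates = np.random.rand(128) * 0.2   # stand-in for SpikeMonitor output
pruned = w * magnitude_mask(w, 0.05) * activity_mask(w, rates, 0.05)
print(f"sparsity: {(pruned == 0).mean():.1%}")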

2. Structural Pruning

Remove entire neurons (not just individual weights). This reduces network dimensions, directly cutting LUT and FF usage:

Python
from sc_neurocore.compression import prune_neurons

# Remove neurons with <5% average firing rate
pruned_weights, report = prune_neurons(weights, activity_threshold=0.05)
print(f"Removed {report.pruned_neurons} neurons")
print(f"New dimensions: {report.new_dimensions}")
# e.g., [784, 196, 112, 10] — hidden layers shrunk
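
In plain array terms, dropping a hidden neuron deletes one column of its incoming weight matrix and the matching row of its outgoing matrix. An illustrative sketch of that bookkeeping for one hidden layer (the helper is hypothetical; prune_neurons does this across the whole stack):

Python
import numpy as np

def drop_hidden_neurons(w_in, w_out, rates, min_rate=0.05):
    """Remove hidden neurons firing below min_rate.
    w_in:  (n_pre, n_hidden) incoming weights
    w_out: (n_hidden, n_post) outgoing weights
    rates: (n_hidden,) average firing rate per hidden neuron
    """
    keep = rates >= min_rate
    return w_in[:, keep], w_out[keep, :], int(keep.sum())

w1 = np.random.randn(784, 256) * 0.05
w2 = np.random.randn(256, 128) * 0.1
rates = np.random.rand(256) * 0.2   # stand-in for monitored firing rates
w1_p, w2_p, n_kept = drop_hidden_neurons(w1, w2, rates)
print(f"hidden layer: 256 → {n_kept} neurons")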

3. Weight Quantisation

Reduce weight precision from float32 to lower bit-widths:

Python
from sc_neurocore.compression.quantization import quantize_weights

# Q8.8 (16-bit fixed-point) — SC-NeuroCore's default hardware format
q16 = quantize_weights(weights, bits=16, scheme="fixed_point")

# Q4.4 (8-bit) — aggressive, 2× compression vs Q8.8
q8 = quantize_weights(weights, bits=8, scheme="fixed_point")

# Binary weights (1-bit) — extreme compression, XOR-based compute
q1 = quantize_weights(weights, bits=1, scheme="binary")

# Ternary weights {-1, 0, +1} — 2-bit, no multipliers needed
q2 = quantize_weights(weights, bits=2, scheme="ternary")
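
For intuition, signed fixed-point quantisation is just scale, round, and saturate. A minimal sketch of the Qm.n scheme (illustrative only; the shipped quantize_weights also covers the binary and ternary schemes above):

Python
import numpy as np

def to_fixed_point(w, int_bits, frac_bits):
    """Quantise to signed Qm.n fixed point: scale, round, saturate."""
    scale = 2 ** frac_bits
    lo = -(2 ** (int_bits + frac_bits - 1))      # most negative code
    hi = 2 ** (int_bits + frac_bits - 1) - 1     # most positive code
    return np.clip(np.round(w * scale), lo, hi).astype(np.int32)

w = np.random.randn(128, 10) * 0.2
q88 = to_fixed_point(w, int_bits=8, frac_bits=8)  # 16-bit codes
q44 = to_fixed_point(w, int_bits=4, frac_bits=4)  # 8-bit codes
# Recover approximate real values by dividing codes by 2**frac_bits
print(q88[0, :4] / 256.0, q44[0, :4] / 16.0)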

Quantisation Impact on Accuracy

Measured on MNIST SNN (784→128→10, 25 timesteps):

Precision   Bits/Weight   Accuracy   Memory    LUTs (est.)
float32     32            97.2%      400 KB    n/a
Q8.8        16            97.0%      200 KB    baseline
Q4.4        8             96.5%      100 KB    0.6×
Ternary     2             94.1%      25 KB     0.2×
Binary      1             91.8%      12.5 KB   0.1×

These numbers are from our training benchmarks (see benchmarks/results/), not estimates.

4. Delay Quantisation

Learnable delays (from DelayLinear) are float-valued during training. For hardware, they must be integer clock cycles:

Python
from sc_neurocore.compression.quantization import quantize_delays

delays = np.array([1.3, 2.7, 4.1, 0.8])
quantised = quantize_delays(delays, resolution=2)
# [2, 2, 4, 2] — rounded to the nearest multiple of resolution, never below one step
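
The rounding itself fits in a few lines. A sketch under the assumption of nearest-multiple rounding with a one-step floor (quantize_delays is the authoritative implementation):

Python
import numpy as np

def quantize_delays_sketch(delays, resolution):
    """Round each delay to the nearest multiple of `resolution` clock
    cycles, never below one step (a zero-cycle delay is not realisable)."""
    q = np.round(delays / resolution) * resolution
    return np.maximum(q, resolution).astype(int)

print(quantize_delays_sketch(np.array([1.3, 2.7, 4.1, 0.8]), resolution=2))
# [2 2 4 2]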

5. Auto-Optimise for Target FPGA

Automatically select pruning threshold and quantisation level to fit a specific FPGA target:

Python
from sc_neurocore.optimizer import fit_to_target

result = fit_to_target(
    layer_shapes=[(784, 128), (128, 10)],
    weights=weights,
    target="ice40",  # or "ecp5", "artix7"
)

print(result.summary())
# Target: ice40 (5280 LUTs, 30 BRAMs)
# Selected: Q4.4, 85% pruning, structural (128→96 hidden)
# Estimated: 2400 LUTs, 12 BRAMs — FITS
# Accuracy impact: -0.7% (from validation set)
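
Conceptually, fit_to_target searches over the compression knobs against a resource model. A heavily simplified sketch of that idea, using a toy LUT cost model and candidate grid that are purely hypothetical (the real optimiser also re-checks validation accuracy at each candidate):

Python
# Hypothetical greedy search: try the gentlest compression first and
# stop at the first configuration whose resource estimate fits.
LUT_BUDGETS = {"ice40": 5280, "ecp5": 24576, "artix7": 20800}

def estimate_luts(n_params, bits, sparsity):
    """Toy cost model: LUTs scale with stored bits (illustrative only)."""
    return int(n_params * (1 - sparsity) * bits * 0.01)

def fit_sketch(n_params, target):
    budget = LUT_BUDGETS[target]
    for bits in (16, 8, 2, 1):            # Q8.8 → Q4.4 → ternary → binary
        for sparsity in (0.5, 0.7, 0.85, 0.95):
            if estimate_luts(n_params, bits, sparsity) <= budget:
                return bits, sparsity     # gentlest config that fits
    return None                           # nothing fits the budget

print(fit_sketch(784 * 128 + 128 * 10, "ice40"))  # e.g. (16, 0.7)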

6. End-to-End: Train → Compress → Deploy

Python
from sc_neurocore.training import SpikingNet, train_epoch
from sc_neurocore.compression import prune_weights
from sc_neurocore.compression.quantization import quantize_weights

# 1. Train
model = SpikingNet(n_input=784, n_hidden=128, n_output=10)
# ... training loop ...
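
# Hypothetical fill-in for the elided loop; train_epoch's exact
# signature here is an assumption, not the documented API.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

loader = DataLoader(
    datasets.MNIST("data/", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=128, shuffle=True,
)
for epoch in range(10):
    loss = train_epoch(model, loader)  # assumed signature
    print(f"epoch {epoch}: loss={loss:.4f}")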

# 2. Export SC weights
sc_weights = model.to_sc_weights()

# 3. Prune
pruned, _ = prune_weights(
    [w["weight"] for w in sc_weights],
    threshold=0.05,
)

# 4. Quantise
quantised = quantize_weights(pruned, bits=8)

# 5. In the Studio:
#    - Load the quantised weights
#    - Click Pipeline → ice40
#    - View resource utilisation with compression applied

Comparison with Other Frameworks

Feature                      SC-NeuroCore     snnTorch   Norse   Lava
Weight pruning               Yes              Manual     No      No
Structural pruning           Yes              No         No      No
Quantisation (multi-level)   1/2/4/8/16-bit   No         No      8-bit
FPGA-aware auto-optimise     Yes              No         No      Loihi-aware
Delay quantisation           Yes              No         No      No

References

  • Han et al. (2016). "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR 2016.
  • Kundu et al. (2021). "Spike-Thrift: Towards Energy-Efficient Deep Spiking Neural Networks by Limiting Spiking Activity via Attention-Guided Compression." WACV 2021.
  • Kim et al. (2022). "Exploring Lottery Ticket Hypothesis in Spiking Neural Networks." ECCV 2022.