# Deploy an SNN on FPGA in 20 Minutes
This tutorial walks through the full SC-NeuroCore pipeline: train a digit classifier in Python, simulate it with stochastic bitstreams (bit-exact match to RTL), synthesise to FPGA, and read the resource report.
Prerequisites:

- Python 3.10+, with `pip install sc-neurocore scikit-learn`
- Yosys (open-source synthesis, for resource reports)
- Optional: Xilinx Vivado Design Suite (for Artix-7/Zynq bitstream generation) or Lattice iCEcube2 (for iCE40)
- Optional: an FPGA board (Artix-7 100T, Zynq-7020, or iCE40) for physical deployment
## 1. Train and quantise (5 min)

    python examples/mnist_fpga/demo.py

This loads sklearn's 8×8 digits dataset (1,797 images, 10 classes), applies PCA to reduce 64 features → 16, trains a logistic regression classifier, and quantises weights to Q8.8 fixed-point (the format used by the Verilog RTL).
Expected output:

    Float accuracy: 94.2%
    Q8.8 accuracy: 94.2%
    Max quantization error: 0.0019

The quantisation error is bounded by one LSB of Q8.8, 1/256 ≈ 0.0039; the observed 0.0019 matches round-to-nearest, which halves the worst case to 1/512. Classification accuracy is preserved because the weight magnitudes are small relative to the representable range (−128 to +127.996).
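The step-1 pipeline can be reproduced in a few lines. A minimal sketch assuming round-to-nearest quantisation; `quantise_q8_8` is an illustrative helper, not the package API:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def quantise_q8_8(w):
    """Round-to-nearest Q8.8: signed, 8 integer bits, 8 fractional bits."""
    return np.clip(np.round(w * 256), -32768, 32767) / 256.0

X, y = load_digits(return_X_y=True)            # 1,797 images, 10 classes
X16 = PCA(n_components=16).fit_transform(X)    # 64 features -> 16
Xtr, Xte, ytr, yte = train_test_split(X16, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
wq = quantise_q8_8(clf.coef_)                  # 10 x 16 weight matrix
print("max quantisation error:", np.abs(wq - clf.coef_).max())
print("float accuracy:", clf.score(Xte, yte))
```

The round-to-nearest bound (≤ 1/512) is visible directly in the printed error.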
## 2. Stochastic computing simulation (5 min)
The same script runs inference using LFSR-based bitstream encoding that
matches hdl/sc_bitstream_encoder.v cycle-exactly:

    SC accuracy: 94.0% (100 samples, L=1024)

SC accuracy matches Q8.8 at this task complexity. Stochastic multiplication has variance proportional to 1/L, so estimation error falls as 1/√L; increasing `--stream-length 4096` can improve accuracy on harder tasks at the cost of proportionally more clock cycles per inference on the FPGA.
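The 1/L variance scaling is easy to check numerically. A minimal sketch using independent Bernoulli streams as a stand-in for the LFSR-generated bitstreams (the AND of two unipolar streams estimates the product of their probabilities):

```python
import numpy as np

def sc_multiply(a, b, L, rng):
    """Estimate a*b as the mean of the AND of two unipolar bitstreams of length L."""
    return np.mean((rng.random(L) < a) & (rng.random(L) < b))

rng = np.random.default_rng(42)
for L in (256, 1024, 4096):
    trials = [sc_multiply(0.5, 0.5, L, rng) for _ in range(200)]
    print(f"L={L:5d}  mean={np.mean(trials):.4f}  std={np.std(trials):.4f}")
# std roughly halves for every 4x increase in L: variance ~ 1/L
```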
The key insight: what you simulate in Python is what the FPGA executes.
Same LFSR polynomial (x^16 + x^14 + x^13 + x^11 + 1), same seed
assignment strategy (input: 0xACE1 + i*7, weight: 0xBEEF + j*N*13 + i*13),
same Q8.8 comparator logic.
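That encoder can be modelled in a few lines, assuming a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, and 11 (matching the polynomial above) and a simplified unipolar comparator; the exact Q8.8 comparator in `hdl/sc_bitstream_encoder.v` remains the reference:

```python
def lfsr16_step(state):
    """One step of the x^16 + x^14 + x^13 + x^11 + 1 Fibonacci LFSR."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def encode(value, seed, length):
    """Unipolar comparator encoding: emit 1 when the LFSR state is below value * 2^16."""
    thresh = int(value * 65536)
    state, bits = seed, []
    for _ in range(length):
        bits.append(1 if state < thresh else 0)
        state = lfsr16_step(state)
    return bits

# An AND gate multiplies unipolar streams: P(a AND b) ~ P(a) * P(b)
# (phases of the same LFSR sequence are only approximately independent)
a = encode(0.5, 0xACE1, 1024)   # input-style seed
b = encode(0.5, 0xBEEF, 1024)   # weight-style seed
prod = sum(x & y for x, y in zip(a, b)) / 1024
print("0.5 * 0.5 ~", prod)
```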
## 3. Export Verilog weights

    python examples/mnist_fpga/demo.py --export-verilog hdl/generated/mnist_weights.vh

This generates a .vh include file with localparam declarations for each
weight in the 10×16 matrix (160 values). The weights are synthesised into
LUT constants — no BRAM needed at this scale.
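What the exporter produces, sketched under the assumption that each weight becomes a signed 16-bit localparam; the actual identifier names and formatting are defined by demo.py's exporter:

```python
import os
import tempfile
import numpy as np

def export_q8_8_vh(weights, path):
    """Emit one localparam per Q8.8 weight; raw value is a signed 16-bit pattern."""
    lines = []
    for j, row in enumerate(weights):      # j: neuron index
        for i, w in enumerate(row):        # i: input index
            raw = int(np.clip(np.round(w * 256), -32768, 32767)) & 0xFFFF
            lines.append(f"localparam [15:0] W_{j}_{i} = 16'h{raw:04X};")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

weights = np.random.default_rng(0).normal(scale=0.5, size=(10, 16))
lines = export_q8_8_vh(weights, os.path.join(tempfile.gettempdir(), "mnist_weights_demo.vh"))
print(len(lines), "localparams;", lines[0])
```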
## 4. Synthesise with Yosys (3 min)
The sc_dense_matrix_layer module in hdl/sc_dense_matrix_layer.v implements
the full per-neuron dense layer: N_INPUTS shared encoders, N_NEURONS × N_INPUTS
weight encoders and AND synapses, N_NEURONS dot-product units and LIF neurons.

    python tools/yosys_synth.py --module sc_dense_matrix_layer --markdown

For the reference 3-input × 7-neuron configuration:
| Module | LUTs | FFs |
|---|---|---|
| sc_neurocore_top | 3,673 | 1,221 |
The 16→10 MNIST classifier is estimated at ~28K LUTs (the 3-input × 7-neuron reference at 3,673 LUTs scaled by synapse count, 160/21), fitting comfortably on an Artix-7 100T (63,400 LUTs).
## 5. Vivado implementation (optional, 5 min)
If you have Vivado installed, run the full implementation flow for timing and power numbers:

    vivado -mode batch -source tools/vivado_impl.tcl \
        -tclargs -top sc_dense_matrix_layer -part xc7a100tcsg324-1 -clk 250

Parse the reports:

    python tools/vivado_report.py vivado_reports/

This gives Fmax (expected ~200–300 MHz on 7-series), LUT/FF/BRAM utilisation, and dynamic power (expected <0.5 W for the 16→10 configuration).
Latency per inference: stream_len clock cycles. At 250 MHz with L=1024:
1024 / 250 MHz = 4.1 µs per classification.
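The latency arithmetic as a quick sanity check (pure arithmetic, no project API assumed):

```python
def sc_latency_us(stream_len, fclk_hz):
    """One inference takes stream_len clock cycles."""
    return stream_len / fclk_hz * 1e6

for L in (256, 1024, 4096):
    us = sc_latency_us(L, 250e6)
    print(f"L={L:5d}: {us:6.2f} us/inference ({1e6 / us:,.0f} inferences/s)")
```

Halving the stream length halves latency but doubles the variance of every stochastic multiply, so L is the direct accuracy/latency knob.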
## 6. Scaling guide
| Configuration | Est. LUTs | Target FPGA | Accuracy |
|---|---|---|---|
| 16→10 (PCA) | ~28K | Artix-7 100T | ~94% |
| 8→10 (PCA) | ~14K | Artix-7 35T | ~88% |
| 64→10 (raw pixels) | ~112K | Kintex-7 325T | ~96% |
| 16→64→10 (2-layer) | ~290K | Virtex-7 | ~97% |
Resource usage scales as O(N_INPUTS × N_NEURONS) for a single layer. Time-multiplexed designs (1 encoder cycling through N inputs) reduce LUTs by N× at the cost of N× more clock cycles.
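That scaling rule can be captured as a back-of-envelope estimator anchored at the 3×7 reference point. A linear model only; real utilisation also depends on synthesis options and per-neuron overhead:

```python
REF_LUTS, REF_SYNAPSES = 3673, 3 * 7     # Yosys default sc_neurocore_top reference

def est_luts(n_inputs, n_neurons):
    """LUT estimate linear in synapse count, scaled from the reference design."""
    return round(REF_LUTS * (n_inputs * n_neurons) / REF_SYNAPSES)

print(est_luts(16, 10))                      # PCA MNIST classifier, ~28K
print(est_luts(16, 64) + est_luts(64, 10))   # two-layer network estimate
```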
## What you proved
- Bit-exact simulation: Python SC model matches Verilog RTL (same LFSR seeds, Q8.8 arithmetic, overflow semantics)
- Quantisation preserves accuracy: Q8.8 introduces <0.004 max error, no accuracy loss for this task
- SC overhead is bounded: the accuracy gap scales with 1/L and is controllable via bitstream length
- FPGA-ready: synthesisable to Xilinx 7-series, ~4 µs inference latency at 250 MHz, <0.5 W power budget