Stochastic Computing for Hardware Engineers
A guide for FPGA/ASIC designers who want to understand SC-NeuroCore's hardware architecture, synthesis flow, and design trade-offs.
SC arithmetic on silicon
Stochastic computing replaces complex arithmetic circuits with trivial logic gates:
| Operation | Conventional (16-bit) | SC | Gate reduction |
|---|---|---|---|
| Multiply | 256 LUTs (array mult) | 1 LUT (AND) | 256× |
| Add (scaled) | 16 LUTs (ripple adder) | 2 LUTs (MUX) | 8× |
| Integrate | ~50 LUTs (accumulator) | ~4 LUTs (counter) | 12× |
| Square | 256 LUTs | 1 LUT + 1 FF (AND with a one-cycle-delayed copy) | ~256× |
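The trade-off: SC requires L clock cycles per operation (L = bitstream length). For L=256, an SC multiplier takes 256 cycles × 1 LUT, versus 1 cycle × 256 LUTs for the array multiplier. The area-time product is the same, so a single SC multiplier delivers 1/256 of the throughput in roughly 1/256 of the area. The gain comes from parallelism: far more SC units fit within a given area and power budget.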
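The gate-level operations in the table can be checked with a small software model. The sketch below is illustrative only: it uses Python's `random` module in place of hardware LFSRs, and the helper names (`encode`, `decode`, `sc_multiply`, `sc_scaled_add`) are hypothetical, not part of the SC-NeuroCore API.

```python
import random

def encode(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(bits):
    """Recover the encoded value as the fraction of 1s."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """One AND gate per clock: P(a AND b) = P(a) * P(b) for independent streams."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def sc_scaled_add(a_bits, b_bits, select_bits):
    """2-input MUX driven by a p=0.5 select stream: output encodes (a + b) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, select_bits)]

rng = random.Random(1)   # fixed seed so the run is repeatable
L = 4096                 # long stream to keep sampling noise small
a = encode(0.5, L, rng)
b = encode(0.25, L, rng)
sel = encode(0.5, L, rng)

product = decode(sc_multiply(a, b))             # close to 0.5 * 0.25 = 0.125
scaled_sum = decode(sc_scaled_add(a, b, sel))   # close to (0.5 + 0.25) / 2 = 0.375
print(f"multiply: {product:.3f}, scaled add: {scaled_sum:.3f}")
```

Note that `sc_multiply(a, a)` returns `a` unchanged rather than its square, which is why the input streams must be statistically independent.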
SC-NeuroCore HDL modules
All Verilog modules are in hdl/ and have been synthesised with
Yosys for Lattice iCE40 and ECP5:
| Module | Function | LUTs (iCE40) | FFs |
|---|---|---|---|
| `sc_lif_neuron.v` | Q8.8 LIF neuron | ~120 | ~48 |
| `sc_bitstream_encoder.v` | LFSR + threshold → bitstream | ~40 | ~16 |
| `sc_bitstream_synapse.v` | AND-gate multiply | 1 | 0 |
| `sc_mux_add.v` | 2-input MUX (scaled add) | 2 | 0 |
| `sc_cordiv.v` | CORDIV stochastic divider | ~10 | ~2 |
| `sc_dotproduct_to_current.v` | Popcount → membrane current | ~N×4 | ~16 |
| `sc_firing_rate_bank.v` | Population rate measurement | ~M×8 | ~M×8 |
| `sc_dense_layer_core.v` | Dense layer datapath FSM | ~N×M×2 | ~M×48 |
| `sc_dense_layer_top.v` | Top-level with AXI config | ~N×M×2+200 | ~M×48+64 |
| `sc_dense_matrix_layer.v` | N×M weight matrix layer | ~N×M×2 | ~M×48 |
| `sc_axil_cfg.v` | AXI-Lite register file | ~200 | ~64 |
| `sc_axil_cfg_param.v` | Parameterized AXI-Lite register file | ~250 | ~80 |
| `sc_axis_interface.v` | AXI-Stream bulk bitstream I/O | ~150 | ~48 |
| `sc_dma_controller.v` | DMA for weight upload and output readback | ~300 | ~96 |
| `sc_cdc_primitives.v` | Clock domain crossing (2-FF sync, Gray counter, async FIFO) | ~60 | ~32 |
| `sc_neurocore_top.v` | System top (DMA + AXI + layers) | varies | varies |
Architecture: sc_dense_layer_top.v
The core building block is a fully-connected layer:
             ┌─────────────────────────────────┐
N inputs ──► │ N×M AND gates (weight multiply) │
             │ M MUX trees (weighted sum)      │
             │ M LIF neurons (integrate-fire)  │
             │ M spike outputs                 │
             └─────────────────────────────────┘
Port map
module sc_dense_layer_top #(
    parameter N_INPUTS  = 50,
    parameter N_NEURONS = 128,
    parameter BIT_WIDTH = 16   // Q8.8
)(
    input  wire                 clk,
    input  wire                 rst_n,
    input  wire [BIT_WIDTH-1:0] I_in      [0:N_INPUTS-1],
    input  wire [BIT_WIDTH-1:0] weights   [0:N_NEURONS*N_INPUTS-1],
    output wire                 spike_out [0:N_NEURONS-1],
    output wire [BIT_WIDTH-1:0] V_out     [0:N_NEURONS-1]
);
Timing
- Latency: 2 clock cycles (1 for multiply+MUX, 1 for LIF update)
- Throughput: 1 result per L clocks (L = bitstream length)
- Clock rate: 100-200 MHz on iCE40, up to 400 MHz on ECP5
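The figures above translate into wall-clock numbers with simple arithmetic. The sketch below is a back-of-the-envelope helper, not part of SC-NeuroCore; the function names are hypothetical, and it assumes the 2-cycle pipeline latency simply adds to the L cycles needed for the first result.

```python
def results_per_second(bitstream_length, clock_mhz):
    """Steady-state throughput: one result every L clock cycles."""
    return clock_mhz * 1e6 / bitstream_length

def first_result_latency_us(bitstream_length, clock_mhz, pipeline_cycles=2):
    """Time until the first complete L-bit result, in microseconds.

    Assumes the pipeline latency adds to the L bitstream cycles.
    """
    return (bitstream_length + pipeline_cycles) / clock_mhz

print(results_per_second(256, 100))        # 390625.0
print(first_result_latency_us(256, 100))   # 2.58
```

At L=256 and 100 MHz this gives roughly 390,000 results per second per layer, with the first result available after about 2.6 µs.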
Synthesis flow
SC-NeuroCore uses Yosys (open-source) for synthesis:
# Step 1: Synthesise for iCE40 and write the JSON netlist that nextpnr reads
yosys -p "read_verilog hdl/sc_lif_neuron.v; synth_ice40 -top sc_lif_neuron -json sc_lif_neuron.json; stat"
# Step 2: Place and route
nextpnr-ice40 --hx8k --json sc_lif_neuron.json --asc sc_lif_neuron.asc
# Step 3: Generate bitstream
icepack sc_lif_neuron.asc sc_lif_neuron.bin
# Step 4: Program FPGA
iceprog sc_lif_neuron.bin
For ECP5:
yosys -p "read_verilog hdl/sc_dense_layer_top.v; synth_ecp5 -top sc_dense_layer_top -json sc_dense_layer_top.json; stat"
nextpnr-ecp5 --85k --json sc_dense_layer_top.json --lpf constraints.lpf --textcfg sc_dense_layer_top.config
ecppack sc_dense_layer_top.config sc_dense_layer_top.bit
Co-simulation verification
Before deploying to hardware, verify the Verilog against the Python bit-true model using Verilator:
# Build testbench
verilator -Wall --cc hdl/sc_lif_neuron.v --exe tb_cosim.cpp --build
# Generate test vectors from Python
python -c "from sc_neurocore import FixedPointLIFNeuron; ..."
# Run and compare
./obj_dir/Vsc_lif_neuron
python compare_outputs.py # cycle-by-cycle comparison
The Python FixedPointLIFNeuron is bit-exact to the Verilog. Every
intermediate value (multiply, shift, clamp) matches. If co-simulation
fails, the hardware will not reproduce the software.
See Tutorial 09 for the complete co-simulation flow.
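The cycle-by-cycle comparison can be pictured as follows. The contents of `compare_outputs.py` are not shown here, so this is only a sketch of the idea: it assumes each trace is a list of integers, one per clock cycle, and the function name `first_mismatch` and the sample values are hypothetical.

```python
def first_mismatch(python_trace, verilog_trace):
    """Return (cycle, expected, actual) for the first differing cycle, or None."""
    for cycle, (expected, actual) in enumerate(zip(python_trace, verilog_trace)):
        if expected != actual:
            return cycle, expected, actual
    if len(python_trace) != len(verilog_trace):
        # One trace ended early: report the first cycle missing from the shorter one.
        cycle = min(len(python_trace), len(verilog_trace))
        expected = python_trace[cycle] if cycle < len(python_trace) else None
        actual = verilog_trace[cycle] if cycle < len(verilog_trace) else None
        return cycle, expected, actual
    return None

golden = [0, 64, 120, 169, 212]   # made-up membrane voltages, one per cycle
dut = [0, 64, 120, 168, 212]      # same run with an off-by-one at cycle 3
print(first_mismatch(golden, dut))   # (3, 169, 168)
```

Reporting the first failing cycle, rather than only pass or fail, makes it easier to find the arithmetic step (multiply, shift, or clamp) where the two models diverge.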
Q8.8 fixed-point design
All hardware arithmetic uses Q8.8 (16-bit signed, 8 fractional bits):
Multiply: (A × B) >> 8
- A, B are int16 (Q8.8)
- Product is int32
- Right-shift by 8 realigns the binary point
- Saturate to int16 range [-32768, 32767]
Add: A + B
- Direct int16 addition
- Overflow handled by saturation
Compare: A >= THRESHOLD
- Simple signed comparison
- THRESHOLD = 256 = 1.0 in Q8.8
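The three operations above can be modelled in a few lines of Python. This is a minimal sketch written for illustration, not the project's `FixedPointLIFNeuron`; the helper names (`to_q88`, `q88_mul`, `q88_add`, `fires`) are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767
FRAC_BITS = 8
THRESHOLD = 256   # 1.0 in Q8.8

def saturate(x):
    """Clamp an integer to the int16 range."""
    return max(INT16_MIN, min(INT16_MAX, x))

def to_q88(value):
    """Convert a real number to Q8.8, saturating at the int16 limits."""
    return saturate(round(value * (1 << FRAC_BITS)))

def q88_mul(a, b):
    """(A x B) >> 8 with saturation; Python's >> is an arithmetic shift."""
    return saturate((a * b) >> FRAC_BITS)

def q88_add(a, b):
    """A + B with saturation instead of wrap-around."""
    return saturate(a + b)

def fires(v):
    """Threshold compare: spike when the voltage reaches 1.0."""
    return v >= THRESHOLD

print(q88_mul(to_q88(1.5), to_q88(0.5)))   # 192, i.e. 0.75
print(q88_mul(to_q88(100), to_q88(100)))   # 32767 (saturated)
print(q88_add(32000, 1000))                # 32767 (saturated)
```

Python integers never overflow, so the explicit `saturate` call is what keeps the model in step with 16-bit hardware.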
Overflow prevention
The LIF neuron's voltage can grow without bound under sustained input. The hardware clamps voltage to the int16 range:
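// Saturation logic in sc_lif_neuron.v
// Assumes leak_k and V_mem are declared as signed 16-bit (Q8.8) values.
wire signed [31:0] product = leak_k * V_mem;
// Drop 8 fractional bits (>> 8) and clamp the result to the int16 range.
wire signed [15:0] leak_term =
    (product >  32'sh007F_FFFF) ? 16'sh7FFF :   // above +32767 after the shift
    (product < -32'sh0080_0000) ? 16'sh8000 :   // below -32768 after the shift
                                  product[23:8];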
LFSR random number generation
SC requires random bitstreams. The sc_lfsr_16bit.v module generates
pseudo-random sequences using a 16-bit linear feedback shift register:
Polynomial: x¹⁶ + x¹⁴ + x¹³ + x¹¹ + 1
Period: 65,535 (2¹⁶ - 1)
Taps: bits 15, 13, 12, 10
Each encoder needs its own LFSR with a different seed to ensure independence. Correlated random numbers cause systematic errors in SC arithmetic.
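The period claim can be checked with a short software model. The sketch below assumes a Fibonacci LFSR that shifts left and feeds the XOR of the tap bits into bit 0; the actual shift direction in `sc_lfsr_16bit.v` is not stated here, and the names `lfsr16_step`, `lfsr16_period` and `lfsr_encode` are hypothetical. The comparator encoder is likewise a guess at how "LFSR + threshold" produces a bitstream.

```python
TAPS = (15, 13, 12, 10)   # tap bit positions from the text (0 = LSB)

def lfsr16_step(state):
    """Advance the 16-bit LFSR by one clock (left shift, feedback into bit 0)."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    return ((state << 1) | feedback) & 0xFFFF

def lfsr16_period(seed):
    """Count clocks until the register returns to its seed."""
    state, count = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

def lfsr_encode(threshold, length, seed):
    """Comparator encoder: emit 1 whenever the LFSR state is below the threshold."""
    bits, state = [], seed
    for _ in range(length):
        bits.append(1 if state < threshold else 0)
        state = lfsr16_step(state)
    return bits

print(lfsr16_period(0xACE1))                    # 65535
print(sum(lfsr_encode(16384, 65535, 0xACE1)))   # 16383
```

The seed must be non-zero: an all-zero register produces zero feedback and never leaves that state. Over one full period every non-zero state appears exactly once, so the encoder output is within one count of the ideal value.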
Resource scaling
For an N-input, M-neuron network:
| Resource | Formula | 50→128 example |
|---|---|---|
| LUTs (AND gates) | N × M | 6,400 |
| LUTs (MUX trees) | M × ⌈log₂(N)⌉ | 768 |
| LUTs (LIF neurons) | M × ~120 | ~15,360 |
| FFs (LIF state) | M × 48 | 6,144 |
| LFSRs | N + M | 178 |
| Total LUTs | sum of the three LUT rows | ~22,500 |
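An iCE40 HX8K has 7,680 LUTs, enough for a 20→32 layer (about 4,600 LUTs by the formulas above) but not for the 50→128 example. An ECP5-85K has 83,640 LUTs, enough for a 100→256 layer (about 58,000 LUTs).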
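The formulas in the table are easy to script when sizing a layer for a given device. The sketch below is a rough estimator written for this guide, not a project tool: the function names are hypothetical, the MUX term is rounded up to whole tree levels, and LUTs used by the LFSRs and encoders are left out, as they are in the table.

```python
import math

LUTS_PER_LIF = 120   # approximate per-neuron cost from the module table
FFS_PER_LIF = 48

def estimate_resources(n_inputs, n_neurons):
    """Apply the table's scaling formulas to an N-input, M-neuron layer."""
    and_luts = n_inputs * n_neurons
    mux_luts = n_neurons * math.ceil(math.log2(n_inputs))
    lif_luts = n_neurons * LUTS_PER_LIF
    return {
        "and_luts": and_luts,
        "mux_luts": mux_luts,
        "lif_luts": lif_luts,
        "total_luts": and_luts + mux_luts + lif_luts,
        "ffs": n_neurons * FFS_PER_LIF,
        "lfsrs": n_inputs + n_neurons,
    }

def fits(n_inputs, n_neurons, device_luts):
    """True if the estimated LUT count is within the device's capacity."""
    return estimate_resources(n_inputs, n_neurons)["total_luts"] <= device_luts

print(estimate_resources(50, 128)["total_luts"])   # 22528
print(fits(20, 32, 7_680))                         # True  (iCE40 HX8K)
print(fits(100, 256, 83_640))                      # True  (ECP5-85K)
```

Place-and-route rarely reaches 100% utilisation, so treat a result close to the device limit as a warning rather than a guarantee.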
Power estimation
SC power is dominated by switching activity (toggle rate ≈ 50%):
P = C_eff × V² × f × N_gates × activity
For iCE40 at 100 MHz, 1.2V:
C_eff ≈ 5 fF per LUT
Activity ≈ 0.5 (stochastic bitstreams toggle ~50%)
N_gates = 22,500
P ≈ 5e-15 × 1.44 × 1e8 × 22,500 × 0.5
P ≈ 8.1 mW
Compare to a conventional 16-bit neural network on the same FPGA: ~50-100 mW. SC saves 6-12× on dynamic power.
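The estimate above is a single product of five terms, so it is simple to re-run for other clock rates or layer sizes. The helper below is a sketch with a hypothetical name; it reproduces the worked numbers and inherits their assumptions (5 fF per LUT, 50% activity).

```python
def dynamic_power_watts(c_eff_farads, vdd_volts, clock_hz, n_gates, activity):
    """P = C_eff x V^2 x f x N_gates x activity."""
    return c_eff_farads * vdd_volts ** 2 * clock_hz * n_gates * activity

p = dynamic_power_watts(5e-15, 1.2, 100e6, 22_500, 0.5)
print(f"{p * 1e3:.1f} mW")   # 8.1 mW
```

The result scales linearly with clock frequency and gate count, so halving the clock to 50 MHz halves the estimate to about 4 mW.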
Design rules for SC hardware
- One LFSR per encoder: shared LFSRs create correlation
- Minimum L=64 for meaningful computation: below this, quantisation noise dominates (effective precision < 4 bits)
- L=256 is the sweet spot: 8-bit effective precision, reasonable latency
- Clock gating on idle neurons: neurons that haven't spiked recently can be clock-gated to save power
- Weight memory: store Q8.8 weights in BRAM, not LUT-based registers, for networks larger than ~1K weights
Further reading
- Tutorial 09: Hardware Co-simulation
- Tutorial 13: Fixed-Point Arithmetic
- Tutorial 14: Network Export & Deployment
- Hardware Guide: docs/hardware/HARDWARE_GUIDE.md
- FPGA Toolchain: docs/hardware/FPGA_TOOLCHAIN_GUIDE.md