Stochastic Computing for Hardware Engineers

A guide for FPGA/ASIC designers who want to understand SC-NeuroCore's hardware architecture, synthesis flow, and design trade-offs.

SC arithmetic on silicon

Stochastic computing replaces complex arithmetic circuits with trivial logic gates:

Operation	Conventional (16-bit)	SC	Gate reduction
Multiply	256 LUTs (array mult)	1 LUT (AND)	256×
Add (scaled)	16 LUTs (ripple adder)	2 LUTs (MUX)	8×
Integrate	~50 LUTs (accumulator)	~4 LUTs (counter)	12×
Square	256 LUTs	1 LUT + 1 FF (AND with a 1-cycle-delayed copy)	~256×

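The trade-off: SC needs L clock cycles per operation (L = bitstream length). For L=256, an SC multiplier takes 256 cycles × 1 LUT versus 1 cycle × 256 LUTs for the array multiplier. The area × time product is the same, so SC does not improve throughput per LUT; it buys a 256× smaller multiplier, with correspondingly lower instantaneous power, at the cost of 256× more cycles per result.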
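The AND-gate multiply in the table can be checked in software. The sketch below is a minimal model, assuming unipolar encoding (each bit is 1 with probability equal to the value) and Python's `random` module in place of the hardware LFSRs; the helper names `encode`, `decode` and `sc_multiply` are illustrative and not part of SC-NeuroCore.

```python
import random

def encode(p, length, rng):
    """Unipolar SC encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(bits):
    """Estimate the encoded value as the fraction of 1s."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Bitwise AND of two independent streams estimates the product."""
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(1)   # fixed seed so the run is repeatable
L = 256                  # bitstream length
a = encode(0.5, L, rng)
b = encode(0.75, L, rng)
estimate = decode(sc_multiply(a, b))
print(f"SC estimate {estimate:.3f} vs exact {0.5 * 0.75:.3f}")
```

The estimate fluctuates around 0.375 from run to run; lengthening the streams narrows the spread, which is the cycles-for-area trade described above.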

SC-NeuroCore HDL modules

All Verilog modules are in hdl/ and have been synthesised with Yosys for Lattice iCE40 and ECP5:

Module	Function	LUTs (iCE40)	FFs
`sc_lif_neuron.v`	Q8.8 LIF neuron	~120	~48
`sc_bitstream_encoder.v`	LFSR + threshold → bitstream	~40	~16
`sc_bitstream_synapse.v`	AND-gate multiply	1	0
`sc_mux_add.v`	2-input MUX (scaled add)	2	0
`sc_cordiv.v`	CORDIV stochastic divider	~10	~2
`sc_dotproduct_to_current.v`	Popcount → membrane current	~N×4	~16
`sc_firing_rate_bank.v`	Population rate measurement	~M×8	~M×8
`sc_dense_layer_core.v`	Dense layer datapath FSM	~N×M×2	~M×48
`sc_dense_layer_top.v`	Top-level with AXI config	~N×M×2+200	~M×48+64
`sc_dense_matrix_layer.v`	N×M weight matrix layer	~N×M×2	~M×48
`sc_axil_cfg.v`	AXI-Lite register file	~200	~64
`sc_axil_cfg_param.v`	Parameterized AXI-Lite register file	~250	~80
`sc_axis_interface.v`	AXI-Stream bulk bitstream I/O	~150	~48
`sc_dma_controller.v`	DMA for weight upload and output readback	~300	~96
`sc_cdc_primitives.v`	Clock domain crossing (2-FF sync, Gray counter, async FIFO)	~60	~32
`sc_neurocore_top.v`	System top (DMA + AXI + layers)	varies	varies

Architecture: `sc_dense_layer_top.v`

The core building block is a fully-connected layer:

                ┌─────────────────────────────────┐
  N inputs ──►  │  N×M AND gates (weight multiply) │
                │  M MUX trees (weighted sum)       │
                │  M LIF neurons (integrate-fire)   │
                │  M spike outputs                  │
                └─────────────────────────────────┘
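The MUX stage in the diagram can be modelled in a few lines. This is a sketch under the assumption that each MUX tree picks one of its N inputs uniformly at random every cycle, which makes the output stream encode the sum scaled by 1/N; how the real select lines are driven is not shown here, and `mux_add` is a hypothetical helper name.

```python
import random

def mux_add(streams, rng):
    """Scaled add: each cycle, forward the bit of one randomly chosen input.

    With a uniform select, the output encodes (sum of inputs) / N.
    """
    n = len(streams)
    return [streams[rng.randrange(n)][t] for t in range(len(streams[0]))]

rng = random.Random(2)                     # fixed seed for repeatability
L = 4096
ones, zeros = [1] * L, [0] * L             # streams encoding 1.0 and 0.0
out = mux_add([ones, zeros, zeros, zeros], rng)
print(sum(out) / L)                        # close to (1 + 0 + 0 + 0) / 4 = 0.25
```

The 1/N scaling keeps the result inside [0, 1], which is why the adder is described as "scaled".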

Port map

module sc_dense_layer_top #(
    parameter N_INPUTS  = 50,
    parameter N_NEURONS = 128,
    parameter BIT_WIDTH = 16    // Q8.8
)(
    input  wire                    clk,
    input  wire                    rst_n,
    input  wire [BIT_WIDTH-1:0]    I_in     [0:N_INPUTS-1],
    input  wire [BIT_WIDTH-1:0]    weights  [0:N_NEURONS*N_INPUTS-1],
    output wire                    spike_out[0:N_NEURONS-1],
    output wire [BIT_WIDTH-1:0]    V_out    [0:N_NEURONS-1]
);

Timing

Latency: 2 clock cycles (1 for multiply+MUX, 1 for LIF update)
Throughput: 1 result per L clocks (L = bitstream length)
Clock rate: 100-200 MHz on iCE40, up to 400 MHz on ECP5
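To turn these figures into rates, the arithmetic can be written out as below. This is a rough sketch: it assumes the 2-cycle pipeline latency simply adds to the L cycles needed for the first result, and `layer_timing` is a hypothetical helper, not an SC-NeuroCore function.

```python
def layer_timing(f_clk_hz, bitstream_len, pipeline_latency=2):
    """Return (results per second, time to first result in seconds)."""
    results_per_s = f_clk_hz / bitstream_len
    first_result_s = (bitstream_len + pipeline_latency) / f_clk_hz
    return results_per_s, first_result_s

rate, first = layer_timing(100e6, 256)     # 100 MHz clock, L = 256
print(f"{rate:,.0f} results/s, first result after {first * 1e6:.2f} us")
# → 390,625 results/s, first result after 2.58 us
```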

Synthesis flow

SC-NeuroCore uses Yosys (open-source) for synthesis:

# Step 1: Synthesise for iCE40 and write the JSON netlist that nextpnr reads
yosys -p "read_verilog hdl/sc_lif_neuron.v; synth_ice40 -top sc_lif_neuron -json sc_lif_neuron.json; stat"

# Step 2: Place and route
nextpnr-ice40 --hx8k --json sc_lif_neuron.json --asc sc_lif_neuron.asc

# Step 3: Generate bitstream
icepack sc_lif_neuron.asc sc_lif_neuron.bin

# Step 4: Program FPGA
iceprog sc_lif_neuron.bin

For ECP5:

yosys -p "read_verilog hdl/sc_dense_layer_top.v; synth_ecp5 -top sc_dense_layer_top -json sc_dense_layer_top.json; stat"
nextpnr-ecp5 --85k --json sc_dense_layer_top.json --lpf constraints.lpf --textcfg sc_dense_layer_top.config
ecppack sc_dense_layer_top.config sc_dense_layer_top.bit

Co-simulation verification

Before deploying to hardware, verify the Verilog against the Python bit-true model using Verilator:

# Build testbench
verilator -Wall --cc hdl/sc_lif_neuron.v --exe tb_cosim.cpp --build

# Generate test vectors from Python
python -c "from sc_neurocore import FixedPointLIFNeuron; ..."

# Run and compare
./obj_dir/Vsc_lif_neuron
python compare_outputs.py  # cycle-by-cycle comparison

The Python FixedPointLIFNeuron is bit-exact with the Verilog: every intermediate value (multiply, shift, clamp) matches, cycle for cycle. If co-simulation reports a mismatch, the hardware will not reproduce the software results.

See Tutorial 09 for the complete co-simulation flow.
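The cycle-by-cycle comparison step might look like the sketch below. It is not the project's `compare_outputs.py`; the function name `first_mismatch` and the trace values are made up for illustration, and it assumes both the Python model and the Verilator testbench have already been reduced to one integer per clock cycle.

```python
def first_mismatch(reference, dut):
    """Compare two per-cycle traces; return None if they match exactly.

    Otherwise return (cycle, reference_value, dut_value) for the first
    difference. A sample missing from a shorter trace is reported as None.
    """
    for cycle in range(max(len(reference), len(dut))):
        ref = reference[cycle] if cycle < len(reference) else None
        got = dut[cycle] if cycle < len(dut) else None
        if ref != got:
            return cycle, ref, got
    return None

python_trace = [0, 12, 23, 33, 42]    # illustrative membrane voltages
verilog_trace = [0, 12, 23, 34, 42]   # differs at cycle 3
print(first_mismatch(python_trace, verilog_trace))   # → (3, 33, 34)
```

Reporting the first differing cycle, rather than a pass/fail flag, points directly at the waveform region to inspect.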

Q8.8 fixed-point design

All hardware arithmetic uses Q8.8 (16-bit signed, 8 fractional bits):

Multiply: (A × B) >> 8
  - A, B are int16 (Q8.8)
  - Product is int32
  - Right-shift by 8 realigns the binary point
  - Saturate to int16 range [-32768, 32767]

Add: A + B
  - Direct int16 addition
  - Overflow handled by saturation

Compare: A >= THRESHOLD
  - Simple signed comparison
  - THRESHOLD = 256 = 1.0 in Q8.8
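The three operations above can be mirrored with plain Python integers. This is a minimal sketch, not the `FixedPointLIFNeuron` implementation; the helper names are hypothetical, and it assumes the right shift truncates toward negative infinity (an arithmetic shift), as Python's `>>` does.

```python
INT16_MIN, INT16_MAX = -32768, 32767
FRAC_BITS = 8
THRESHOLD = 256          # 1.0 in Q8.8

def saturate(x):
    """Clamp an integer to the int16 range."""
    return max(INT16_MIN, min(INT16_MAX, x))

def to_q88(value):
    """Convert a real number to its Q8.8 integer representation."""
    return saturate(round(value * (1 << FRAC_BITS)))

def q88_mul(a, b):
    """(A × B) >> 8, then saturate; >> floors negative products."""
    return saturate((a * b) >> FRAC_BITS)

def q88_add(a, b):
    """A + B with saturation instead of wrap-around."""
    return saturate(a + b)

def fires(v):
    """Signed compare against the 1.0 threshold."""
    return v >= THRESHOLD

print(q88_mul(to_q88(1.5), to_q88(2.0)))   # → 768, i.e. 3.0
print(q88_add(32000, 1000))                # → 32767 (saturated)
print(fires(to_q88(1.0)))                  # → True
```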

Overflow prevention

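The LIF neuron's voltage can grow without bound under sustained input, so the hardware saturates results to the int16 range instead of letting them wrap. The leak term, for example, is clamped as follows:

// Saturation logic in sc_lif_neuron.v
// leak_k and V_mem are Q8.8 (assumed declared signed), so the product is Q16.16.
wire signed [31:0] product = leak_k * V_mem;
// Clamp if the value would not fit in 16 bits after the >> 8 realignment;
// otherwise bits [23:8] are the Q8.8 result.
wire signed [15:0] leak_term = (product >  32'sh007F_FFFF) ? 16'sh7FFF :
                               (product < -32'sh0080_0000) ? 16'sh8000 :
                               product[23:8];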
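A quick way to convince yourself that comparing the full 32-bit product and then slicing bits [23:8] is the same as "shift right by 8, then clamp" is to model both in Python and sweep the input range. The function names below are illustrative only.

```python
def leak_term_bits(product):
    """Mirror of the Verilog: compare the full product, then slice [23:8]."""
    if product > 0x7FFFFF:
        return 0x7FFF
    if product < -0x800000:
        return -0x8000
    raw = (product >> 8) & 0xFFFF                   # bits [23:8], two's complement
    return raw - 0x10000 if raw & 0x8000 else raw   # reinterpret as signed 16-bit

def leak_term_ref(product):
    """Reference: arithmetic shift first, then clamp to int16."""
    return max(-32768, min(32767, product >> 8))

# Sparse sweep of the signed 32-bit range plus the four boundary values.
samples = list(range(-2**31, 2**31, 999_983))
samples += [0x7FFFFF, 0x800000, -0x800000, -0x800001]
mismatches = [p for p in samples if leak_term_bits(p) != leak_term_ref(p)]
print(len(mismatches))   # → 0
```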

LFSR random number generation

SC requires random bitstreams. The sc_lfsr_16bit.v module generates pseudo-random sequences using a 16-bit linear feedback shift register:

Polynomial: x¹⁶ + x¹⁴ + x¹³ + x¹¹ + 1
Period: 65,535 (2¹⁶ - 1)
Taps: bits 15, 13, 12, 10

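Each encoder needs its own LFSR with a different seed to keep the bitstreams decorrelated. Different seeds of the same polynomial produce time-shifted copies of one sequence, so choose seeds that are far apart in that sequence. The all-zero state is a lock-up state and must never be used as a seed. Correlated random numbers cause systematic errors in SC arithmetic.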
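The tap positions listed above can be exercised with a small software model. This sketch assumes a left-shifting Fibonacci LFSR whose feedback bit is the XOR of bits 15, 13, 12 and 10; the shift direction used in `sc_lfsr_16bit.v` is not stated here, so treat that as an assumption. The seed `0xACE1` is an arbitrary non-zero value.

```python
def lfsr16_step(state):
    """Advance the 16-bit LFSR by one shift (taps at bits 15, 13, 12, 10)."""
    feedback = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | feedback) & 0xFFFF

def lfsr16_period(seed):
    """Count the steps until the state returns to the seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps

print(lfsr16_period(0xACE1))   # → 65535
print(lfsr16_step(0))          # → 0: the all-zero state never leaves zero
```

Because every non-zero state lies on the same 65,535-step cycle, two LFSRs with different seeds emit the same sequence at different offsets.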

Resource scaling

For an N-input, M-neuron network:

Resource	Formula	50→128 example
LUTs (AND gates)	N × M	6,400
LUTs (MUX trees)	M × ⌈log₂(N)⌉	~768
LUTs (LIF neurons)	M × ~120	~15,360
FFs (LIF state)	M × 48	6,144
LFSRs	N + M	178
Total LUTs		~22,500

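By these formulas the 50→128 example (~22,500 LUTs) does not fit an iCE40 HX8K, which has 7,680 LUTs and holds roughly a 20→32 layer (~4,600 LUTs). An ECP5-85K has 83,640 LUTs and holds a 100→256 layer (~58,000 LUTs).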
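The table's formulas are easy to script when sizing a layer for a given device. The sketch below uses only the LUT terms from the table, with the MUX depth rounded up to a whole number of levels; LFSR, AXI and DMA logic are not counted, and `estimate_luts` is a hypothetical helper.

```python
import math

def estimate_luts(n_inputs, n_neurons, luts_per_lif=120):
    """LUT estimate: AND gates + MUX trees + LIF neurons."""
    and_luts = n_inputs * n_neurons
    mux_luts = n_neurons * math.ceil(math.log2(n_inputs))
    lif_luts = n_neurons * luts_per_lif
    return and_luts + mux_luts + lif_luts

DEVICES = {"iCE40 HX8K": 7_680, "ECP5-85K": 83_640}   # LUT counts quoted above

for n, m in [(20, 32), (50, 128), (100, 256)]:
    luts = estimate_luts(n, m)
    fits = [name for name, cap in DEVICES.items() if luts <= cap]
    print(f"{n}→{m}: {luts} LUTs, fits: {fits}")
# → 20→32: 4640 LUTs, fits: ['iCE40 HX8K', 'ECP5-85K']
# → 50→128: 22528 LUTs, fits: ['ECP5-85K']
# → 100→256: 58112 LUTs, fits: ['ECP5-85K']
```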

Power estimation

SC power is dominated by switching activity (toggle rate ≈ 50%):

P = C_eff × V² × f × N_gates × activity

For iCE40 at 100 MHz, 1.2V:
  C_eff ≈ 5 fF per LUT
  Activity ≈ 0.5 (stochastic bitstreams toggle ~50%)
  N_gates = 22,500
  P ≈ 5e-15 × 1.44 × 1e8 × 22,500 × 0.5
  P ≈ 8.1 mW

Compare to a conventional 16-bit neural network on the same FPGA: ~50-100 mW. SC saves 6-12× on dynamic power.
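The estimate above can be reproduced, and re-run for other clock rates or layer sizes, with a one-line model. It covers dynamic switching power only; static leakage and I/O power are outside this formula. The function name is illustrative.

```python
def dynamic_power_w(c_eff_f, vdd_v, f_clk_hz, n_gates, activity):
    """P = C_eff × V² × f × N_gates × activity, in watts."""
    return c_eff_f * vdd_v ** 2 * f_clk_hz * n_gates * activity

p = dynamic_power_w(c_eff_f=5e-15, vdd_v=1.2, f_clk_hz=100e6,
                    n_gates=22_500, activity=0.5)
print(f"{p * 1e3:.1f} mW")   # → 8.1 mW
```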

Design rules for SC hardware

One LFSR per encoder: shared LFSRs create correlation
Minimum L=64 for meaningful computation: below this, sampling noise dominates (effective precision < 4 bits)
L=256 is the sweet spot: 8-bit resolution (steps of 1/256) at reasonable latency
Clock gating on idle neurons: neurons that have not spiked recently can be clock-gated to save power
Weight memory: store Q8.8 weights in BRAM, not LUT-based registers, for networks larger than ~1K weights
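One way to put numbers on the bitstream-length rules is a simple binomial noise model. The sketch below is an assumption-laden estimate, not a measurement of SC-NeuroCore: it treats each bit as an independent Bernoulli sample, takes the worst case p = 0.5, and reports both the resolution log₂(L) and the number of bits at which one standard deviation of sampling noise equals one step. LFSR-driven streams are not truly independent samples, so real hardware can deviate from this.

```python
import math

def resolution_bits(length):
    """Bits needed to count the distinct values a length-L stream can encode."""
    return math.log2(length)

def noise_limited_bits(length):
    """Bits b where 2**-b equals the worst-case std dev 0.5 / sqrt(L)."""
    sigma = 0.5 / math.sqrt(length)
    return -math.log2(sigma)

for L in (64, 256, 1024):
    print(L, resolution_bits(L), noise_limited_bits(L))
# → 64 6.0 4.0
# → 256 8.0 5.0
# → 1024 10.0 6.0
```

Each extra noise-limited bit costs a 4× longer stream, which is why going much beyond L=256 quickly becomes expensive in latency.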