Stochastic Computing Theory: A Deep Dive¶
The mathematical foundations of stochastic computing, from Gaines (1969) to modern SC neural networks. This tutorial bridges the gap between "it works" and "here's why it works."
Prerequisites: Basic probability theory, familiarity with Tutorial 01.
1. Bernoulli bitstreams as random variables¶
A stochastic bitstream of length L encoding probability p is a sequence of i.i.d. Bernoulli(p) random variables:
B = (b_1, b_2, ..., b_L) where b_i ~ Bernoulli(p)
P(b_i = 1) = p
P(b_i = 0) = 1 - p
Estimator: p̂ = (1/L) Σ b_i
E[p̂] = p (unbiased)
Var(p̂) = p(1-p)/L (variance of a mean of L i.i.d. Bernoulli terms)
σ(p̂) = √(p(1-p)/L) (standard error)
For p=0.5 (worst case), σ = 1/(2√L):
| L | σ | Effective bits |
|---|---|---|
| 64 | 0.063 | ~4 |
| 256 | 0.031 | ~5 |
| 1024 | 0.016 | ~6 |
| 4096 | 0.008 | ~7 |
Effective bits ≈ -log₂(σ). The 1/√L scaling means each extra effective bit (halving σ) costs 4× the bitstream length — diminishing returns.
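The table can be reproduced with a quick Monte Carlo check in plain NumPy (no SC library needed) — generate many independent bitstreams per length and compare the measured spread of p̂ against √(p(1-p)/L):

```python
import numpy as np

# Empirically verify sigma = sqrt(p(1-p)/L) at the worst case p = 0.5
rng = np.random.default_rng(0)
p, n_trials = 0.5, 2000

for L in [64, 256, 1024, 4096]:
    # n_trials independent bitstreams of length L, one estimate p-hat each
    estimates = (rng.random((n_trials, L)) < p).mean(axis=1)
    predicted = np.sqrt(p * (1 - p) / L)
    print(f"L={L:5d}: measured sigma={estimates.std():.4f}, predicted={predicted:.4f}")
```

The measured and predicted columns should agree to within a few percent, confirming the 1/√L law.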
2. Multiplication via AND¶
For independent bitstreams A ~ Bernoulli(p_a) and B ~ Bernoulli(p_b):
C = A AND B
P(C = 1) = P(A=1) · P(B=1) = p_a · p_b
This is exact multiplication with zero error — as long as A and B are independent. Independence is the fundamental requirement.
Correlation destroys accuracy¶
If A and B share the same random source:
P(A AND B) = P(A=1, B=1) = p_a·p_b + ρ·√(p_a(1-p_a)·p_b(1-p_b))
where ρ is the Pearson correlation between the bitstreams.
For ρ=1 (same LFSR): P(A AND A) = p_a ≠ p_a² (unless p_a ∈ {0, 1}). This is squaring, not multiplication. Useful for computing x² but not for general weight × input.
import numpy as np
L = 10000
p_a, p_b = 0.7, 0.4
# Independent
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(137)
A = (rng_a.rand(L) < p_a).astype(int)
B = (rng_b.rand(L) < p_b).astype(int)
print(f"Independent: AND mean = {(A & B).mean():.4f} (expected {p_a*p_b:.4f})")
# Correlated (same RNG)
rng_shared = np.random.RandomState(42)
U = rng_shared.rand(L)
A_corr = (U < p_a).astype(int)
B_corr = (U < p_b).astype(int)
print(f"Correlated: AND mean = {(A_corr & B_corr).mean():.4f} (expected {min(p_a, p_b):.4f})")
With shared randomness, AND(A, B) = min(p_a, p_b) — this is the stochastic MIN operation, not multiplication.
3. Addition via MUX¶
A 2:1 multiplexer with select line S ~ Bernoulli(0.5):
C = S ? A : B
P(C=1) = P(S=1)·P(A=1) + P(S=0)·P(B=1) = 0.5·p_a + 0.5·p_b
This computes the scaled sum (p_a + p_b)/2. For N inputs with random selection:
P(C=1) = (1/N) Σ p_i
Exact arithmetic average — no carry propagation, no overflow.
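The MUX adder is easy to simulate — a sketch using `numpy.where` as the 2:1 multiplexer with a Bernoulli(0.5) select line:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 100_000
p_a, p_b = 0.7, 0.4

A = (rng.random(L) < p_a).astype(int)
B = (rng.random(L) < p_b).astype(int)
S = (rng.random(L) < 0.5).astype(int)   # select line, Bernoulli(0.5)

C = np.where(S == 1, A, B)              # C = S ? A : B
print(f"MUX mean = {C.mean():.4f} (expected {(p_a + p_b) / 2:.4f})")
```

Note the output encodes (p_a + p_b)/2 = 0.55, not the raw sum — the scale factor 1/2 has to be tracked outside the bitstream.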
4. The SC noise floor¶
Every SC operation adds noise. For a chain of K operations:
Total variance ≈ Σ_k Var(output_k) / L_k
In a multi-layer network with layers of sizes N_1, N_2, ..., N_K:
Output σ ≈ √(K / L) (rough approximation)
This is why deep SC networks (K > 5) need longer bitstreams or ensemble averaging — noise compounds through layers.
# Demonstrate noise compounding
import numpy as np
from sc_neurocore import VectorizedSCLayer

def measure_layer_noise(n_layers, L, n_trials=100):
    layers = [VectorizedSCLayer(n_inputs=32, n_neurons=32, length=L)
              for _ in range(n_layers)]
    x = np.full(32, 0.5)  # constant input
    outputs = []
    for _ in range(n_trials):
        h = x.copy()
        for layer in layers:
            h = np.clip(layer.forward(h), 0.01, 0.99)
        outputs.append(h.mean())
    return np.std(outputs)

print("Noise vs depth (L=256):")
for depth in [1, 2, 3, 5, 8]:
    noise = measure_layer_noise(depth, L=256)
    print(f"  {depth} layers: σ = {noise:.4f}")
5. XNOR for bipolar encoding¶
Unipolar SC represents [0, 1]. For signed values [-1, 1], use bipolar encoding:
Value x ∈ [-1, 1]
P(bit=1) = (x + 1) / 2
Multiplication: XNOR(A, B)
P(XNOR = 1) = P(A=1)·P(B=1) + P(A=0)·P(B=0)
= p_a·p_b + (1-p_a)(1-p_b)
= 2·p_a·p_b - p_a - p_b + 1
Converting back to bipolar values:
x_a = 2p_a - 1, x_b = 2p_b - 1
XNOR encodes: x_a · x_b
XNOR is an exact signed multiplier in bipolar SC.
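The derivation above can be checked numerically — encode two signed values, XNOR the bitstreams, and decode (a sketch in plain NumPy, using `1 - (A ^ B)` as the XNOR gate):

```python
import numpy as np

rng = np.random.default_rng(2)
L = 100_000
x_a, x_b = -0.6, 0.8                     # signed values in [-1, 1]
p_a, p_b = (x_a + 1) / 2, (x_b + 1) / 2  # bipolar encoding: P(1) = (x+1)/2

A = (rng.random(L) < p_a).astype(int)
B = (rng.random(L) < p_b).astype(int)
C = 1 - (A ^ B)                          # XNOR: 1 where bits agree

x_c = 2 * C.mean() - 1                   # decode back to [-1, 1]
print(f"XNOR result = {x_c:.4f} (expected {x_a * x_b:.4f})")
```

The decoded value lands near x_a · x_b = -0.48, including the sign, with the usual 1/√L sampling noise on top.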
6. Stochastic number precision theory¶
The Fisher information of a Bernoulli(p) bitstream of length L:
I(p) = L / (p(1-p))
Cramér-Rao lower bound on estimator variance:
Var(p̂) ≥ 1/I(p) = p(1-p)/L
The sample mean achieves this bound — it is the most efficient estimator. No estimator can beat 1/√L precision when the L bits are i.i.d. random; deterministic low-discrepancy encodings (Section 7) sidestep the bound only by giving up that randomness.
Comparison with deterministic binary¶
A deterministic B-bit binary number has precision 2^(-B). An SC bitstream of length L = 2^B has precision ~1/√(2^B) = 2^(-B/2).
SC is exponentially less precise per bit. But an SC multiplier is a single AND gate regardless of precision, while a B-bit binary multiplier needs O(B²) gates. The gate savings offset the longer bitstreams, so the overall "precision × area" product is comparable.
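The per-bit comparison works out as follows (a back-of-envelope sketch of the effective-bits formula from Section 1, applied at L = 2^B):

```python
import math

# Effective precision of an SC bitstream of length L = 2**B at p = 0.5
for B in [6, 8, 10, 12]:
    L = 2 ** B
    sigma = 0.5 / math.sqrt(L)      # worst-case standard error
    eff_bits = -math.log2(sigma)    # effective bits of precision
    print(f"L = 2^{B:2d} = {L:5d}: sigma = {sigma:.4f}, "
          f"~{eff_bits:.1f} effective bits vs {B} binary bits")
```

Each doubling of B buys only B/2 + 1 effective SC bits — the B/2 exponent in 2^(-B/2) made explicit.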
7. Sobol sequences for better precision¶
LFSR bitstreams are not truly random — they are pseudo-random with known correlation structure. Low-discrepancy sequences (Sobol, Halton) can improve SC precision beyond 1/√L:
# Standard LFSR encoding
import numpy as np
from sc_neurocore import BitstreamEncoder

enc_lfsr = BitstreamEncoder(length=256, seed=42)
bits_lfsr = enc_lfsr.encode(0.3)
error_lfsr = abs(bits_lfsr.mean() - 0.3)

# Sobol-like deterministic encoding (for comparison)
# Generate bits by comparing p against evenly spaced thresholds
thresholds = np.linspace(0, 1, 256, endpoint=False)
np.random.shuffle(thresholds)  # reorder bits in time; the mean is unchanged
bits_sobol = (thresholds < 0.3).astype(np.uint8)
error_sobol = abs(bits_sobol.mean() - 0.3)

print(f"LFSR error:  {error_lfsr:.6f}")
print(f"Sobol error: {error_sobol:.6f}")
Sobol encoding achieves O(log(L)/L) error vs O(1/√L) for random — but loses the independence property needed for AND multiplication. Use Sobol only when computing single values, not products.
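The independence failure is easy to demonstrate: two streams built from the same deterministic threshold grid are maximally correlated, so AND collapses to MIN — the same pathology as the shared-LFSR example in Section 2:

```python
import numpy as np

p_a, p_b = 0.7, 0.4
thresholds = np.linspace(0, 1, 256, endpoint=False)

# Both streams derived from one deterministic threshold grid
A = (thresholds < p_a).astype(np.uint8)
B = (thresholds < p_b).astype(np.uint8)

# AND fires exactly where thresholds < min(p_a, p_b)
print(f"AND mean = {(A & B).mean():.4f}")
print(f"product  = {p_a * p_b:.4f}, min = {min(p_a, p_b):.4f}")
```

The AND mean lands at min(p_a, p_b) ≈ 0.40, nowhere near the product 0.28 — precisely why Sobol-style encodings need independent scrambling (or must be avoided) before feeding a multiplier.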
8. Error propagation in SC neural networks¶
For a two-layer SC network with layers W₁ (n×m) and W₂ (m×k):
h = SC_forward(W₁, x) error: σ_h ≈ √(n/(L·m))
y = SC_forward(W₂, h) error: σ_y ≈ √(m/(L·k)) + propagated σ_h
Total output error: σ_y ≈ √(1/L) · (√(n/m) + √(m/k))
This predicts:

- Wider layers (larger m) reduce noise (more averaging in the MUX tree)
- Output error scales as 1/√L (confirming the fundamental limit)
- Deeper networks accumulate errors additively
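The first prediction can be checked without the full library by modeling each neuron's output as an independent length-L Bernoulli estimate and averaging across the layer (an idealized sketch — real layer outputs are only approximately independent):

```python
import numpy as np

rng = np.random.default_rng(3)
L, p, n_trials = 256, 0.5, 2000

for m in [4, 16, 64]:                    # layer width
    # Each neuron's estimate is Binomial(L, p)/L, i.e. a length-L bitstream mean
    neuron_means = rng.binomial(L, p, size=(n_trials, m)) / L
    layer_means = neuron_means.mean(axis=1)      # average across the layer
    predicted = np.sqrt(p * (1 - p) / (L * m))   # sigma shrinks as 1/sqrt(m)
    print(f"m={m:3d}: measured sigma={layer_means.std():.4f}, "
          f"predicted={predicted:.4f}")
```

Quadrupling the width halves the layer-level noise, matching the √(n/(L·m)) scaling above.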
9. Information-theoretic capacity¶
The mutual information between SC input and output:
I(X; Y) ≤ L · C_BSC(p_e)
where C_BSC(p_e) = 1 - H(p_e) is the binary symmetric channel capacity
and p_e is the effective bit error rate.
For L=256 and a well-designed SC circuit (p_e ≈ 0.01):
I(X; Y) ≤ 256 · (1 - H(0.01)) ≈ 256 · 0.92 ≈ 235 bits
This is the theoretical maximum information an SC link of length 256 can carry per encoding — about 29 bytes. A single float32 carries 32 bits. So an SC encoding of L=256 carries ~7× more information than a single float, but spread across 256 clock cycles.
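The capacity figure can be reproduced directly from the binary entropy function (pure-Python sketch):

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits; H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

L, p_e = 256, 0.01
capacity = L * (1 - binary_entropy(p_e))   # BSC capacity per bit, times L
print(f"H({p_e}) = {binary_entropy(p_e):.4f} bits/bit")
print(f"Max information over L={L}: {capacity:.1f} bits (~{capacity / 8:.0f} bytes)")
```

H(0.01) ≈ 0.081, so each physical bit carries ≈ 0.92 bits of information, and the link total comes out near the 235 bits quoted above.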
What you learned¶
- Bernoulli bitstreams are unbiased estimators with σ ∝ 1/√L
- AND gate = exact multiplication (requires independent inputs)
- MUX = exact scaled addition (no overflow possible)
- Correlated bitstreams break multiplication (AND becomes MIN)
- XNOR implements signed (bipolar) multiplication
- Noise compounds through layers: keep networks wide and shallow
- Sobol encoding can beat 1/√L for single values but breaks AND
- SC trades exponential area savings for polynomial precision loss
Further reading¶
- Gaines, B.R. (1969). "Stochastic Computing Systems." Advances in Information Systems Science, Vol. 2.
- Alaghi & Hayes (2013). "Survey of Stochastic Computing." ACM TODAES, 18(2).
- Tutorial 01: SC Fundamentals (practical introduction)
- Tutorial 16: Debugging SC Networks (applied noise diagnosis)