GPU Compute Backend (wgpu)¶
Module: sc_neurocore_engine.GpuDenseLayer
Rust path: sc_neurocore_engine::gpu::GpuDenseLayer
Feature gate: --features gpu
Backends: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows)
Tested on: NVIDIA GTX 1060 6GB, AMD RX 6600 XT 8GB
1. Mathematical Formalism¶
Stochastic computing dense layer¶
The GPU backend accelerates the SC-NeuroCore DenseLayer, which performs stochastic computing matrix-vector multiplication using Bernoulli-encoded bitstreams and AND+popcount accumulation.
Bernoulli encoding:
$$B_i[k] \sim \text{Bernoulli}(p_i), \quad k = 1, \ldots, L$$
Each input probability $p_i$ is encoded as a bitstream of length $L$ where each bit is independently 1 with probability $p_i$.
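For intuition, the encoding can be sketched on the CPU (an illustrative sketch using Python's stdlib RNG; the actual GPU path uses Philox and packs bits into 64-bit words):

```python
import random

def encode_bernoulli(p: float, length: int, rng: random.Random) -> list[int]:
    # Each bit is independently 1 with probability p.
    return [1 if rng.random() < p else 0 for _ in range(length)]

rng = random.Random(42)
bits = encode_bernoulli(0.7, 4096, rng)
mean = sum(bits) / len(bits)
# The empirical bit density converges to p as L grows (error ~ 1/sqrt(L)).
```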
Stochastic AND multiplication:
$$Y_{ij}[k] = W_{ij}[k] \wedge B_i[k]$$
where $W_{ij}$ is the pre-encoded weight bitstream and $B_i$ is the input bitstream. The AND operation implements stochastic multiplication: $\mathbb{E}[Y] = w_{ij} \cdot p_i$.
Popcount accumulation:
$$\text{out}_j = \frac{1}{L} \sum_{i=1}^{N_{in}} \text{popcount}(Y_{ij})$$
The output for neuron $j$ is the normalised sum of bit-counts across all inputs.
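Taken together, the three equations amount to the following reference computation (an unpacked-bit sketch with a hypothetical `sc_forward` helper, not the engine's packed-word implementation):

```python
import random

def sc_forward(weights, probs, length, seed):
    """Stochastic dot product: Bernoulli-encode, AND, popcount, normalise by L.
    weights[j][i] and probs[i] are probabilities in [0, 1]."""
    rng = random.Random(seed)
    n_out, n_in = len(weights), len(probs)
    # Encode each input probability as a bitstream of length L.
    inp = [[rng.random() < probs[i] for _ in range(length)] for i in range(n_in)]
    out = []
    for j in range(n_out):
        total = 0
        for i in range(n_in):
            w = [rng.random() < weights[j][i] for _ in range(length)]
            total += sum(a and b for a, b in zip(w, inp[i]))  # AND + popcount
        out.append(total / length)  # out_j lies in [0, n_in]
    return out

out = sc_forward([[0.5, 0.5]], [1.0, 1.0], 2048, 7)
# E[out_0] = 0.5*1.0 + 0.5*1.0 = 1.0, plus stochastic noise
```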
GPU kernel architecture¶
Two compute shader kernels mirror the CPU hot path:
Kernel 1 — Encode (encode.wgsl):
Each GPU thread generates one packed vec2<u32> word (64 bits) of a Bernoulli bitstream.
The PRNG is Philox 4×32-10, a counter-based RNG that enables massively parallel
independent random streams without state sharing.
Dispatch: (ceil(n_inputs × words_per_input / 256), n_samples, 1)
Kernel 2 — Accumulate (accumulate.wgsl):
Each workgroup processes one (neuron, sample) pair. 256 threads cooperatively AND+popcount all input words for one neuron, then perform workgroup-local tree reduction via shared memory.
Dispatch: (n_neurons, n_samples, 1) — one workgroup per (neuron, sample)
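The dispatch geometry of both kernels follows directly from the formulas above (a small sketch; `wg_size=256` matches the stated workgroup size):

```python
import math

def dispatch_dims(n_inputs, n_neurons, length, n_samples, wg_size=256):
    # W = ceil(L / 64): packed 64-bit words per bitstream.
    words_per_input = math.ceil(length / 64)
    # Encode: one thread per packed word, grouped into workgroups of wg_size.
    encode = (math.ceil(n_inputs * words_per_input / wg_size), n_samples, 1)
    # Accumulate: one workgroup per (neuron, sample) pair.
    accumulate = (n_neurons, n_samples, 1)
    return encode, accumulate

enc, acc = dispatch_dims(256, 128, 1024, 64)
# enc == (16, 64, 1), acc == (128, 64, 1)
```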
Philox 4×32-10 PRNG¶
Counter-based RNG producing 4 random u32 values per round:
Input: (counter[4], key[2])
10 rounds of: {
x0 = mulhi(R0, counter[0]) ^ key[0] ^ counter[3]
x1 = mullo(R0, counter[0])
x2 = mulhi(R1, counter[2]) ^ key[1] ^ counter[1]
x3 = mullo(R1, counter[2])
counter = (x0, x1, x2, x3)
key += (PHILOX_W0, PHILOX_W1)
}
Constants: $R_0 = \text{0xD2511F53}$, $R_1 = \text{0xCD9E8D57}$, $W_0 = \text{0x9E3779B9}$ (golden ratio), $W_1 = \text{0xBB67AE85}$.
The PRNG is implemented in WGSL with 16-bit decomposition for mulhilo since
WGSL has no native 64-bit integer multiplication.
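A direct Python transcription of the round function above (a sketch of the pseudocode as written, using Python big ints for mulhi/mullo; verify against a reference Philox implementation before relying on the exact stream):

```python
MASK32 = 0xFFFFFFFF
R0, R1 = 0xD2511F53, 0xCD9E8D57   # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl key increments

def philox_4x32_10(counter, key):
    """10 rounds over a 4x32 counter and 2x32 key; returns 4 u32 values."""
    c, k = list(counter), list(key)
    for _ in range(10):
        p0 = R0 * c[0]   # 64-bit product: high half = mulhi, low half = mullo
        p1 = R1 * c[2]
        x0 = ((p0 >> 32) ^ k[0] ^ c[3]) & MASK32
        x1 = p0 & MASK32
        x2 = ((p1 >> 32) ^ k[1] ^ c[1]) & MASK32
        x3 = p1 & MASK32
        c = [x0, x1, x2, x3]
        k = [(k[0] + W0) & MASK32, (k[1] + W1) & MASK32]
    return c

out = philox_4x32_10([0, 1, 2, 3], [42, 0])
```

Because the state is just (counter, key), each GPU thread can draw an independent stream by folding its global index into the counter, with no shared mutable state.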
u64 emulation via vec2¶
WGSL has no native u64 type. Bitstreams are packed as vec2<u32> (2 × 32 bits = 64 bits).
AND and popcount operate on both halves independently:
let result = vec2<u32>(a.x & b.x, a.y & b.y);
let count = countOneBits(result.x) + countOneBits(result.y);
This incurs ~15-25% overhead vs native u64 popcount but is memory-bound anyway, so the penalty is negligible at large layer sizes.
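The split-word emulation is exact: ANDing and popcounting the two 32-bit halves independently gives the same count as a native 64-bit AND+popcount, as this quick check illustrates:

```python
def and_popcount_u64_via_u32_pair(a: int, b: int) -> int:
    # Split each 64-bit word into (lo, hi) u32 halves, as vec2<u32> does.
    ax, ay = a & 0xFFFFFFFF, (a >> 32) & 0xFFFFFFFF
    bx, by = b & 0xFFFFFFFF, (b >> 32) & 0xFFFFFFFF
    # AND and popcount per half, then sum (mirrors the WGSL snippet above).
    return bin(ax & bx).count("1") + bin(ay & by).count("1")

a, b = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF
assert and_popcount_u64_via_u32_pair(a, b) == bin(a & b).count("1")
```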
2. Theoretical Context¶
Problem statement¶
SC-NeuroCore's single-neuron Rust engine is best-in-class (3.8 ns/step LIF), but network-level stochastic computing is O(n²) in the number of neurons × inputs, dominated by Bernoulli encoding and AND+popcount. At 1K+ neurons, CPU performance is insufficient for real-time simulation:
| Framework | 1K neurons | 10K neurons |
|---|---|---|
| SC-NeuroCore (CPU) | 11 ms | ~1.1 s |
| Brian2 (CPU) | 2.5 ms | 250 ms |
| snnTorch (GPU) | 0.5 ms | 5 ms |
| GeNN (GPU) | 0.3 ms | 3 ms |
The GPU backend closes this gap by offloading the encoding and accumulation to massively parallel GPU compute shaders.
Why wgpu?¶
| Alternative | Pros | Cons |
|---|---|---|
| CUDA | Fastest on NVIDIA | NVIDIA-only, no AMD support |
| ROCm | AMD support | Linux-only, poor API stability |
| OpenCL | Cross-platform | Stagnant, poor tooling |
| Vulkan Compute | Cross-platform, low-level | Verbose API |
| wgpu | Cross-platform, Rust-native, safe | Slight overhead vs raw Vulkan |
| Metal | Best on macOS | Apple-only |
wgpu was chosen because:
1. Cross-platform: Vulkan (Linux/Windows), Metal (macOS), DX12 (Windows)
2. Rust-native: first-class Rust API, no FFI overhead
3. Production-grade: powers Firefox's WebGPU implementation
4. AMD support: the primary development GPUs are AMD (RX 6600 XT, RDNA2)
5. No CUDA dependency: avoids vendor lock-in
GPU-native vs CPU-upload design¶
Two design alternatives were considered:
Option A: Upload CPU random bytes to GPU
- Simple: generate bitstreams on CPU, upload packed words via PCIe
- Bottleneck: PCIe bandwidth (8-16 GB/s) limits throughput at batch scale
- For 1K inputs × 1024-bit bitstreams × 256 batch = 32 MB/batch → 2 ms upload alone
Option B: GPU-native RNG (chosen)
- Philox PRNG generates random bits directly on GPU
- Only probabilities (f32, small) and output (f32, small) cross PCIe
- For 1K inputs: upload 4 KB probabilities, download 4 KB output
- Eliminates PCIe bottleneck entirely
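The Option A numbers can be reproduced with a quick back-of-the-envelope calculation (a sketch; 16 GB/s approximates PCIe 3.0 x16):

```python
def upload_bytes(n_inputs: int, length: int, batch: int) -> int:
    # Packed bitstreams: length bits -> length/8 bytes per input, per sample.
    return n_inputs * (length // 8) * batch

size = upload_bytes(1000, 1024, 256)
mb = size / 1e6            # ~32.8 MB per batch
t_ms = size / 16e9 * 1e3   # ~2 ms of PCIe transfer per batch at 16 GB/s
```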
SC_GPU_ADAPTER environment variable¶
Multi-GPU systems can select a specific adapter via substring match:
export SC_GPU_ADAPTER="AMD" # selects AMD GPU
export SC_GPU_ADAPTER="1060" # selects GTX 1060
export SC_GPU_ADAPTER="RDNA" # selects RDNA-based GPU
If unset, wgpu's HighPerformance heuristic selects the most powerful adapter.
3. Pipeline Position¶
Python / Rust application
│
▼
┌─────────────────────────────────────────────────────┐
│ GpuDenseLayer │
│ │
│ ┌──────────┐ ┌──────────────────────────────┐ │
│ │ CPU │ │ GPU (wgpu) │ │
│ │ DenseLayer│ │ │ │
│ │ (fallback)│ │ Encode kernel (Philox RNG) │ │
│ │ │ │ │ │ │
│ │ │ │ Accumulate kernel (AND+pop) │ │
│ │ │ │ │ │ │
│ │ │ │ Output buffer → readback │ │
│ └──────────┘ └──────────────────────────────┘ │
│ │
│ Weights: uploaded once, persistent on GPU │
│ Inputs: uploaded per forward() call (f32 probs) │
│ Output: downloaded per forward() call (f32 values) │
└─────────────────────────────────────────────────────┘
Inputs¶
| Input | Type | Range | Description |
|---|---|---|---|
| input_values | Vec<f64> or list[float] | $[0, 1]$ | Input probabilities per neuron |
| seed | u64 | Any | Philox PRNG seed (different seed = different bitstreams) |
| n_samples | usize (batch only) | $\geq 1$ | Batch size for batch forward |
Outputs¶
| Output | Type | Range | Description |
|---|---|---|---|
| Per-neuron values | Vec<f64> or list[float] | $[0, n_{inputs}]$ | Stochastic dot product output |
4. Features¶
| Feature | Description |
|---|---|
| Cross-platform GPU | Vulkan, Metal, DX12 via wgpu |
| Feature-gated | --features gpu, no compile-time cost without |
| Philox 4×32-10 PRNG | GPU-native counter-based RNG |
| Two-kernel architecture | Encode + Accumulate, matching CPU hot path |
| Persistent weights | Uploaded once, reused across forward calls |
| Batch mode | forward_batch() processes multiple samples in one dispatch |
| Single-sample mode | forward_fast() for interactive use |
| SC_GPU_ADAPTER | Environment variable for GPU selection |
| CPU fallback | GpuDenseLayer wraps a CPU DenseLayer internally |
| Pre-allocated buffers | No per-call allocation up to max_batch |
| PyO3 bindings | from sc_neurocore_engine import GpuDenseLayer |
| Workgroup reduction | Shared memory tree reduction for popcount |
| Adapter detection | Reports GPU name and availability |
5. Usage Examples¶
Python: basic GPU forward¶
from sc_neurocore_engine import GpuDenseLayer
# Check GPU availability.
if GpuDenseLayer.is_gpu_available():
layer = GpuDenseLayer(n_inputs=256, n_neurons=128, length=1024, seed=42)
print(f"GPU: {layer.gpu_name()}")
# Single-sample forward.
inputs = [0.5] * 256
output = layer.forward_fast(inputs, seed=42)
print(f"Output shape: {len(output)}, first 5: {output[:5]}")
else:
print("No GPU available — use CPU DenseLayer instead")
Python: batch forward¶
import numpy as np
layer = GpuDenseLayer(256, 128, length=1024, seed=42, max_batch=512)
# Batch of 64 samples.
n_samples = 64
inputs_flat = np.random.uniform(0.0, 1.0, size=n_samples * 256).tolist()
outputs = layer.forward_batch(inputs_flat, n_samples=n_samples, seed=42)
# Reshape to (n_samples, n_neurons).
out_array = np.array(outputs).reshape(n_samples, 128)
print(f"Batch output shape: {out_array.shape}")
Rust: GPU forward¶
use sc_neurocore_engine::gpu::GpuDenseLayer;
if let Some(layer) = GpuDenseLayer::try_new(256, 128, 1024, 42, 256) {
let inputs = vec![0.5f64; 256];
let output = layer.forward_gpu(&inputs, 42);
println!("GPU output: {:?}", &output[..5]);
println!("GPU: {}", layer.gpu_name());
}
GPU adapter selection¶
# List available adapters.
python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"
# Force AMD GPU.
SC_GPU_ADAPTER="AMD" python3 -c "
from sc_neurocore_engine import GpuDenseLayer
layer = GpuDenseLayer(64, 32)
print(f'Selected: {layer.gpu_name()}')
"
Build with GPU support¶
# Compile Rust engine with GPU feature.
cd engine
cargo build --release --features gpu
# Run GPU tests.
cargo test --features gpu
# Run GPU benchmarks.
cargo bench --features gpu --bench gpu_bench
GPU vs CPU comparison¶
import time
from sc_neurocore_engine import GpuDenseLayer
sizes = [(64, 32), (256, 128), (512, 256), (1000, 500)]
for n_in, n_out in sizes:
layer = GpuDenseLayer(n_in, n_out, length=1024, seed=42)
inputs = [0.5] * n_in
# Warm up.
for _ in range(10):
layer.forward_fast(inputs, seed=42)
# Measure.
start = time.perf_counter()
for _ in range(100):
layer.forward_fast(inputs, seed=42)
elapsed = (time.perf_counter() - start) / 100
print(f"{n_in}x{n_out}: {elapsed*1e6:.0f} us/forward")
6. Technical Reference¶
Class: GpuDenseLayer (Python)¶
Exposed via PyO3. Module: sc_neurocore_engine.
Constructor¶
GpuDenseLayer(n_inputs, n_neurons, length=1024, seed=24301, max_batch=256)
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_inputs | int | Required | Number of input features |
| n_neurons | int | Required | Number of output neurons |
| length | int | 1024 | Bitstream length (higher = more accurate, slower) |
| seed | int | 24301 | Weight initialisation seed |
| max_batch | int | 256 | Maximum batch size (pre-allocates buffers) |
Raises RuntimeError if no GPU is available.
Methods¶
| Method | Signature | Description |
|---|---|---|
| forward_fast | (inputs: list[float], seed: int) -> list[float] | Single-sample forward |
| forward_batch | (inputs_flat: list[float], n_samples: int, seed: int) -> list[float] | Batch forward (flat row-major) |
| gpu_name | () -> str | Name of the GPU adapter |
| is_gpu_available | () -> bool (static) | Check GPU availability |
Properties¶
| Property | Type | Description |
|---|---|---|
| n_inputs | int | Number of input features |
| n_neurons | int | Number of output neurons |
| length | int | Bitstream length |
Rust API¶
pub struct GpuDenseLayer {
pub cpu: DenseLayer,
ctx: Arc<GpuContext>,
weight_buf: wgpu::Buffer,
// ... pre-allocated buffers
}
impl GpuDenseLayer {
pub fn try_new(n_inputs, n_neurons, length, seed, max_batch) -> Option<Self>
pub fn forward_gpu(&self, inputs: &[f64], seed: u64) -> Vec<f64>
pub fn forward_batch_gpu(&self, inputs_flat: &[f64], n_samples: usize, seed: u64) -> Vec<f64>
pub fn gpu_name(&self) -> &str
}
GpuContext singleton¶
pub struct GpuContext {
pub device: wgpu::Device,
pub queue: wgpu::Queue,
pub encode_pipeline: wgpu::ComputePipeline,
pub accumulate_pipeline: wgpu::ComputePipeline,
pub adapter_name: String,
}
Initialised once via OnceLock. Shared across all GpuDenseLayer instances.
WGSL shader files¶
| File | Workgroup size | Description |
|---|---|---|
| philox.wgsl | — | Philox 4×32-10 RNG functions (included by encode) |
| encode.wgsl | 256 | Bernoulli encoding: probability → packed bitstream |
| accumulate.wgsl | 256 | AND + popcount + workgroup tree reduction |
Buffer layout¶
| Buffer | Size | Persistence | Description |
|---|---|---|---|
| weight_buf | $N_{in} \times N_{out} \times W \times 8$ bytes | Persistent | Packed weight bitstreams |
| input_prob_buf | $N_{in} \times B \times 4$ bytes | Per-call | Input probabilities (f32) |
| packed_input_buf | $N_{in} \times W \times B \times 8$ bytes | Per-call | Encoded input bitstreams |
| output_buf | $N_{out} \times B \times 4$ bytes | Per-call | Output values (f32) |
| output_staging | Same as output_buf | Per-call | MAP_READ staging buffer |
where $W = \lceil L/64 \rceil$ (words per bitstream) and $B$ = batch size.
Memory limits¶
The maxStorageBufferBindingSize on most GPUs is 128-256 MB. For 1000×1000 layers
with length=1024 (16 words): weight buffer = $1000 \times 1000 \times 16 \times 8$ = 128 MB.
Larger layers may require chunking or reduced bitstream length.
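The binding-size check can be estimated up front (a sketch of the sizing formula from the buffer layout table above):

```python
import math

def weight_buffer_bytes(n_inputs: int, n_neurons: int, length: int) -> int:
    # N_in x N_out x W words x 8 bytes, where W = ceil(L / 64).
    return n_inputs * n_neurons * math.ceil(length / 64) * 8

size = weight_buffer_bytes(1000, 1000, 1024)
# 1000 x 1000 x 16 x 8 = 128,000,000 bytes = 128 MB:
# right at a typical 128 MB maxStorageBufferBindingSize.
```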
7. Performance Benchmarks¶
Local GPU benchmarks (i5-11600K host)¶
Measured with Criterion micro-benchmarks (commit 27299c8b):
| Layer size | CPU (i5-11600K) | GTX 1060 (x16 PCIe) | RX 6600 XT (x1 PCIe) |
|---|---|---|---|
| 64×32 | 58 µs | 137 µs (0.4×) | 1.0 ms (0.06×) |
| 512×256 | 5.6 ms | 263 µs (21.3×) | 1.65 ms (3.4×) |
| 1000×500 | 11.2 ms | 539 µs (20.9×) | 2.8 ms (4.0×) |
Key findings¶
- GPU crossover at ~128 neurons: Below this, dispatch overhead (~50 µs) dominates
- 21× speedup at 512+ neurons on GTX 1060 (x16 PCIe 3.0)
- RX 6600 XT limited by PCIe x1 riser: Only 985 MB/s bandwidth vs 15.75 GB/s for x16
- Batch mode amortises dispatch: 256 samples in one dispatch costs ~2× single-sample
GTX 1060 vs RX 6600 XT¶
The RX 6600 XT has more compute units (32 vs 10) and higher bandwidth (256 GB/s vs 192 GB/s) but is connected via a PCIe x1 mining riser, limiting host-GPU transfer to ~985 MB/s. With a proper x16 connection, the RX 6600 XT would likely outperform the GTX 1060 by 2-3×.
Scaling behaviour¶
| Dimension | Complexity | GPU advantage |
|---|---|---|
| $N < 128$ | O(N²) small | CPU faster (dispatch overhead) |
| $128 \leq N < 512$ | O(N²) medium | GPU ~5-10× faster |
| $N \geq 512$ | O(N²) large | GPU ~20× faster (memory-bound) |
| Batch $B$ | O(B × N²) | GPU ~40× (amortised dispatch) |
Memory bandwidth analysis¶
For a 512×256 layer with length=1024:
- Weight reads: $512 \times 256 \times 16 \times 8$ = 16.7 MB
- Input reads: $512 \times 16 \times 8$ = 65 KB (negligible)
- Output writes: $256 \times 4$ = 1 KB (negligible)
GTX 1060 memory bandwidth: 192 GB/s → theoretical minimum: 16.7 MB / 192 GB/s = 87 µs. Measured: 263 µs → ~33% memory bandwidth utilisation (reasonable for compute shaders with workgroup synchronisation overhead).
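The utilisation figure follows directly from the traffic estimate (a sketch using the numbers above):

```python
def theoretical_min_us(n_inputs, n_neurons, length, bandwidth_gb_s):
    # Weight reads dominate the traffic; inputs and outputs are negligible.
    weight_bytes = n_inputs * n_neurons * (length // 64) * 8
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e6

t_min = theoretical_min_us(512, 256, 1024, 192)  # GTX 1060: 192 GB/s
utilisation = t_min / 263                        # measured 263 µs
# t_min ~ 87 µs, utilisation ~ 0.33
```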
8. Citations¶
- wgpu. "wgpu: Safe and portable GPU abstraction in Rust." https://wgpu.rs/ — Cross-platform GPU compute framework.
- Philox PRNG. Salmon, J. K. et al. "Parallel Random Numbers: As Easy as 1, 2, 3." SC '11: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2011. — Counter-based RNG algorithm used for GPU Bernoulli encoding.
- Stochastic computing. Alaghi, A. & Hayes, J. P. "Survey of Stochastic Computing." ACM Transactions on Embedded Computing Systems 12(2s):92:1-92:19, 2013. — Theoretical foundation for bitstream-based computation.
- WGSL specification. W3C. "WebGPU Shading Language." https://www.w3.org/TR/WGSL/ — Shader language used for compute kernels.
- GPU-accelerated SNNs. Knight, J. C. & Nowotny, T. "GPUs outperform current HPC and neuromorphic solutions in terms of speed and energy when simulating a highly-connected cortical model." Frontiers in Neuroscience 12:941, 2018. — Motivation for GPU acceleration of spiking network simulation.
- GeNN. Yavuz, E. et al. "GeNN: a code generation framework for accelerated brain simulations." Scientific Reports 6:18854, 2016. — GPU-accelerated SNN framework, comparison target.
Validation¶
Test suite results (11 GPU integration tests)¶
All tests passing with cargo test --features gpu:
| Test | What it verifies | Status |
|---|---|---|
| gpu_layer_creation | GpuDenseLayer::try_new succeeds | PASS |
| gpu_output_shape | Output length = n_neurons | PASS |
| gpu_output_non_negative | All outputs ≥ 0 | PASS |
| gpu_output_range | All outputs ≤ n_inputs | PASS |
| gpu_deterministic | Same seed → same output | PASS |
| gpu_cpu_agreement | GPU ≈ CPU within 5% relative | PASS |
| gpu_batch | Batch output shape = n_samples × n_neurons | PASS |
| gpu_zero_inputs | All-zero inputs → ~zero output | PASS |
| gpu_all_ones | All-one inputs → output ≈ sum(weights) | PASS |
| gpu_different_seeds | Different seeds → different output | PASS |
| gpu_adapter_name | Non-empty adapter name string | PASS |
GPU-CPU parity¶
GPU and CPU outputs agree within 5% relative tolerance (at the scale of n_inputs). The remaining difference arises from:
- Different PRNG (Philox on GPU vs rand on CPU)
- f32 accumulation on GPU vs f64 on CPU
- Different operation ordering (parallel vs sequential)
These differences are within the inherent stochastic noise of the bitstream computation.
Design Decisions¶
Why two kernels instead of one fused kernel?¶
Separating encode and accumulate allows:
1. Memory efficiency: encoded bitstreams can be reused across neurons
2. Debugging: each stage can be tested independently
3. Flexibility: different encode strategies (Bernoulli, LFSR, Sobol) can be swapped without changing the accumulate kernel
A fused kernel would reduce global memory traffic but increase register pressure and limit workgroup occupancy.
Why f32 on GPU, f64 on CPU?¶
RDNA2 (RX 6600 XT) executes f64 at only 1/16 of its f32 rate, and Pascal (GTX 1060) at 1/32, so f64 accumulation on GPU would be prohibitively slow. Since the Bernoulli encoding quantises to roughly 8-bit effective precision anyway (p is mapped to a threshold in [0, 255]), f32 accumulation is more than sufficient.
Why persistent weights?¶
Weights are uploaded once at construction time and stored in a GPU buffer. They are reused across all forward calls. This amortises the upload cost (which can be significant for large layers) and keeps the per-call data transfer minimal (only input probabilities and output values).
Why OnceLock singleton for GpuContext?¶
The wgpu Device and Queue are heavyweight objects. Creating them for each GpuDenseLayer would waste GPU memory and initialisation time. The singleton ensures all layers share a single device/queue, and the compiled compute pipelines are also shared.
Why max_batch pre-allocation?¶
GPU buffer allocation is expensive (~100 µs). Pre-allocating for the maximum expected batch size eliminates per-call allocation overhead. If the actual batch size is smaller, only the used portion is dispatched.
Known Limitations¶
- No auto-chunking: Layers exceeding the maxStorageBufferBindingSize (typically 128-256 MB) will fail at creation time. Manual chunking is needed.
- No mixed precision: All GPU computation is f32. No f16 or int8 modes.
- No sparse support: Only dense layers. Sparse connectivity patterns still use the full weight matrix.
- No gradient computation: Forward pass only. Training requires CPU fallback with surrogate gradients.
- No multi-GPU: Only one adapter is used (selected by SC_GPU_ADAPTER or wgpu's heuristic). Multi-GPU data parallelism is not supported.
- Dispatch overhead: ~50 µs per kernel dispatch makes GPU slower than CPU for small layers (<128 neurons).
- PCIe bandwidth dependency: Performance depends heavily on PCIe link speed. x1 risers (mining rigs) significantly reduce throughput.
- No timeline profiling: No built-in GPU profiling. Use external tools (Nsight, RenderDoc) for kernel-level analysis.
SC-NeuroCore v3.14.0 — Stochastic Computing Spiking Neural Network Framework
© 2020–2026 Miroslav Šotek / ANULUM. AGPL-3.0-or-later.