HPC and GPU Acceleration¶
SCPN-Fusion-Core supports high-performance computing through a native Rust backend, a C++ FFI bridge, and a GPU acceleration roadmap.
Rust Workspace¶
The scpn-fusion-rs/ directory contains a 10-crate Rust workspace
that mirrors the Python package structure:
| Crate | Purpose |
|---|---|
| | Shared data types, configuration structs, error types |
| | Linear algebra (SOR, GMRES, multigrid), FFT, interpolation, Chebyshev polynomials, elliptic integrals, tridiagonal solver |
| | Grad-Shafranov kernel, transport, inverse reconstruction, stability, pedestal model, AMR |
| | MHD sawtooth, Hall-MHD, turbulence, FNO, heating, compact reactor optimiser, design scanner, sandpile |
| | Neutronics, divertor, wall interaction, PWI erosion, TEMHD, balance of plant |
| | Blanket engineering, magnet design, tritium systems, plant layout |
| | PID, MPC, SNN controller, SPI mitigation, disruption predictor, digital twin, analytic solver, SOC learning |
| | Sensor models, tomography |
| | Neural equilibrium, neural transport, disruption classifier, polynomial chaos expansion (PCE) UQ |
| fusion-python | PyO3 bindings producing |
Build Configuration¶
The release profile favours runtime speed over compile time, enabling full optimisation, fat link-time optimisation, and a single codegen unit:
# Cargo.toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
Key dependencies: ndarray, nalgebra, rayon (parallelism),
rustfft, serde, pyo3 (Python bindings).
The Rust workspace has no external C or Fortran dependencies – it is pure Rust.
Python-Rust FFI¶
The fusion-python crate provides PyO3 bindings that expose the Rust
solvers as a native Python extension module. The Python package
auto-detects the extension at import time:
try:
    from ._rust_compat import FusionKernel, RUST_BACKEND
except ImportError:
    from .fusion_kernel import FusionKernel
    RUST_BACKEND = False
All API signatures are identical between the Python and Rust paths, ensuring zero code changes when switching backends.
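The same optional-import pattern can be wrapped in a small helper that also reports which path was taken. The following is a minimal sketch, not the package's own code: `load_backend` and both module names are hypothetical, with the standard-library `json` module standing in for the pure-Python fallback.

```python
import importlib
from types import ModuleType
from typing import Tuple


def load_backend(preferred: str, fallback: str) -> Tuple[ModuleType, bool]:
    """Return (module, used_preferred), falling back if the import fails."""
    try:
        return importlib.import_module(preferred), True
    except ImportError:
        # The preferred (e.g. compiled) module is missing: use the fallback.
        return importlib.import_module(fallback), False


# Hypothetical names: a missing native extension and a stand-in fallback module.
backend, accelerated = load_backend("no_such_native_extension", "json")
print(backend.__name__, accelerated)
```

Because callers only see the returned module and a boolean flag, code that uses the backend does not need to know which implementation was loaded.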
C++ FFI Bridge¶
The hpc_bridge module (hpc/hpc_bridge.py) provides a C++ FFI
bridge for interfacing with external HPC solvers:
- solver.cpp – C++ solver implementation using types.h shared data structures
- ctypes-based Python bindings for calling compiled C++ from Python
- Shared-memory data exchange to avoid serialisation overhead
Native library loading is fail-closed. By default, HPCBridge only attempts
package-local solver libraries under scpn_fusion/hpc or
scpn_fusion/hpc/bin. An external library must be provided through an
absolute SCPN_SOLVER_LIB path or an explicit lib_path argument, must be a
regular file, and must carry trust metadata through SCPN_SOLVER_LIB_SHA256,
SCPN_SOLVER_TRUST_MANIFEST, or a .sha256 sidecar before ctypes is
allowed to load it. Relative paths are rejected so that the process's current
directory and the dynamic-loader search path cannot silently select a native
solver.
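The fail-closed checks described above can be illustrated with a short sketch. This is not HPCBridge's implementation: `validate_solver_library` and `try_validate` are hypothetical helpers, and only the SHA-256 form of trust metadata is shown. The stub file is never loaded; in a real loader, `ctypes.CDLL` would be called only after every check passes.

```python
import hashlib
import tempfile
from pathlib import Path


def validate_solver_library(lib_path: str, expected_sha256: str) -> Path:
    """Return the path only if every trust check passes; raise otherwise."""
    path = Path(lib_path)
    if not path.is_absolute():
        raise ValueError("relative library paths are rejected")
    if not path.is_file():
        raise ValueError("library must be a regular file")
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError("SHA-256 mismatch")
    return path  # only now would ctypes.CDLL(str(path)) be reasonable


def try_validate(lib_path: str, sha256: str) -> str:
    try:
        validate_solver_library(lib_path, sha256)
        return "accepted"
    except ValueError as exc:
        return f"rejected: {exc}"


with tempfile.TemporaryDirectory() as tmp:
    payload = b"not a real shared object"
    lib = Path(tmp) / "libsolver_stub.so"  # hypothetical file name
    lib.write_bytes(payload)
    good_hash = hashlib.sha256(payload).hexdigest()
    results = [
        try_validate(str(lib), good_hash),         # absolute path, matching hash
        try_validate(str(lib), "0" * 64),          # absolute path, wrong hash
        try_validate("libsolver_stub.so", good_hash),  # relative path
    ]
print(results)
```

Raising on the first failed check, rather than falling back to a search path, is what makes the loader fail-closed: no input can reach the load step by default.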
This bridge is primarily used for prototyping custom solver kernels before porting them to Rust.
GPU Acceleration Status¶
GPU support is tracked through local-only governance notes and implemented through the public runtime surfaces below:
Production-decomposition evidence¶
Production-scale decomposition is not accepted until distributed MPI or multi-GPU measurements exist. The current public contract is:
python validation/benchmark_production_decomposition_contract.py
It publishes radial/toroidal rank tiling, reciprocal neighbour checks, halo
payload shapes, decomposition-invariant reductions, and local large-grid CPU
timing evidence. The latest tracked local large-grid row executes
9,437,184 5D phase cells over 24 local rank tiles with zero
reconstruction error. This is single-process CPU evidence only; it is not a
cluster scaling or GPU throughput claim.
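To make the terms concrete, the sketch below shows what a reciprocal neighbour check and a decomposition-invariant reduction look like on a toy problem. It is an illustration only and does not reproduce the benchmark script: the 4 x 6 rank layout, the periodic toroidal wrap, and the small integer grid are all assumptions chosen so that 24 tiles appear and the sums are exact.

```python
from itertools import product

N_RADIAL, N_TOROIDAL = 4, 6  # hypothetical layout giving 24 rank tiles
OPPOSITE = {"r-": "r+", "r+": "r-", "t-": "t+", "t+": "t-"}


def rank_of(ir, it):
    return ir * N_TOROIDAL + it


def neighbours(ir, it):
    """Radial direction is bounded; toroidal direction wraps periodically."""
    return {
        "r-": rank_of(ir - 1, it) if ir > 0 else None,
        "r+": rank_of(ir + 1, it) if ir < N_RADIAL - 1 else None,
        "t-": rank_of(ir, (it - 1) % N_TOROIDAL),
        "t+": rank_of(ir, (it + 1) % N_TOROIDAL),
    }


coords = list(product(range(N_RADIAL), range(N_TOROIDAL)))
tiles = {rank_of(ir, it): neighbours(ir, it) for ir, it in coords}

# Reciprocal neighbour check: if B is A's neighbour on one side,
# A must be B's neighbour on the opposite side.
reciprocal = all(
    tiles[nb][OPPOSITE[side]] == rank
    for rank, nbs in tiles.items()
    for side, nb in nbs.items()
    if nb is not None
)

# Decomposition-invariant reduction on a small integer grid (exact arithmetic).
NR, NT = 8, 12
grid = [[(3 * i + 7 * j) % 11 for j in range(NT)] for i in range(NR)]
tr, tt = NR // N_RADIAL, NT // N_TOROIDAL  # cells per tile in each direction


def tile_sum(ir, it):
    return sum(grid[i][j]
               for i in range(ir * tr, (ir + 1) * tr)
               for j in range(it * tt, (it + 1) * tt))


global_sum = sum(map(sum, grid))
tiled_sum = sum(tile_sum(ir, it) for ir, it in coords)

# If the 9,437,184 cells were split evenly, each of 24 tiles would own this many.
cells_per_tile, remainder = divmod(9_437_184, 24)
print(reciprocal, global_sum == tiled_sum, cells_per_tile, remainder)
```

With floating-point data the reduction order matters, so an invariant result needs a fixed summation order or compensated arithmetic; integers are used here to keep the comparison exact.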
- Phase 1: wgpu SOR kernel
  Red-Black SOR stencil implemented as a wgpu compute shader, providing cross-platform GPU acceleration (Vulkan, Metal, D3D12, WebGPU) with deterministic CPU fallback.
  Performance targets:
  - 65x65 grid: 2x–4x speedup
  - 257x257 grid: 5x–12x speedup
- Phase 2: GPU-backed GMRES preconditioning
  CUDA/ROCm adapters for the GMRES linear solver with CPU fallback for environments without GPU drivers.
  Performance target: 2x–6x speedup on inverse solves.
- Phase 3: Full multigrid on device
  Smooth, restrict, prolong, and coupled nonlinear multigrid path running entirely on GPU.
  Performance targets:
  - < 1 ms for control-loop grids
  - 10x–30x speedup for 257x257+ workloads
Acceptance gates for each phase:
Correctness: residual behaviour matches CPU reference within configured tolerance
Performance: measured speedups meet declared minimum floors
Operations: runtime capability detection + automatic CPU fallback
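The Phase 1 stencil and the correctness gate can be illustrated on the CPU in a few lines. The sketch below is a plain-Python Red-Black SOR for Laplace's equation, not the wgpu shader or the project's solver: the grid size, relaxation factor, sweep count, and tolerance are assumptions, and an analytic solution stands in for the CPU reference.

```python
def red_black_sor(u, omega=1.5, sweeps=200):
    """In-place Red-Black SOR for Laplace's equation with fixed boundary values."""
    n = len(u)
    for _ in range(sweeps):
        for colour in (0, 1):  # 0 = red points, 1 = black points
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == colour:
                        avg = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
                        u[i][j] += omega * (avg - u[i][j])
    return u


N = 17  # assumed small grid for illustration
# A field linear in j satisfies the 5-point Laplace stencil exactly,
# so it serves as a known reference solution.
reference = [[j / (N - 1) for j in range(N)] for _ in range(N)]

# Start from the reference on the boundary and zero in the interior.
u = [[reference[i][j] if i in (0, N - 1) or j in (0, N - 1) else 0.0
      for j in range(N)] for i in range(N)]
red_black_sor(u)

# Correctness gate: maximum deviation from the reference within tolerance.
TOLERANCE = 1e-6  # assumed value
max_err = max(abs(u[i][j] - reference[i][j])
              for i in range(N) for j in range(N))
passed = max_err < TOLERANCE
print(passed)
```

Each point of one colour depends only on points of the other colour, so all updates within a colour are independent. That independence is what allows the same sweep to be dispatched as a parallel compute shader.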
The gpu_runtime module (core/gpu_runtime.py) provides the
GPURuntimeBridge class for managing GPU device detection, memory
allocation, and kernel dispatch.
Benchmarking¶
Criterion micro-benchmarks are included in the Rust workspace:
cd scpn-fusion-rs
cargo bench
Available benchmarks:
- sor_bench.rs – Red-Black SOR stencil at 65x65 and 128x128
- inverse_bench.rs – Levenberg-Marquardt inverse reconstruction (FD vs analytical Jacobian comparison)
- neural_transport_bench.rs – Neural transport MLP inference
Python-side profiling is available via:
python profiling/profile_kernel.py --top 50
python profiling/profile_geometry_3d.py --toroidal 48 --poloidal 48 --top 50
Results are written to artifacts/profiling/.
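For readers who want a comparable view of their own code, a "top N" profile can be produced with the standard library alone. This is a minimal sketch under the assumption that a `--top`-style report lists the costliest functions by cumulative time; it does not show how the repository's profiling scripts are implemented, and `profile_top` and `workload` are hypothetical names.

```python
import cProfile
import io
import pstats


def profile_top(func, top=50):
    """Run `func` under cProfile and return a report of the `top` costliest entries."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def workload():
    # Stand-in for a solver call.
    return sum(i * i for i in range(100_000))


report = profile_top(workload, top=5)
print(report)
```

Sorting by cumulative time surfaces the call paths that dominate the run, which is usually the first thing to check before deciding what to port to a native backend.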
Performance Summary¶
| Metric | Rust (release) | Python (NumPy) | Speedup |
|---|---|---|---|
| 65x65 equilibrium | ~100 ms | ~5 s | ~50x |
| 128x128 equilibrium | ~1 s | ~30 s | ~30x |
| SOR step (65x65) | microseconds | milliseconds | ~100x |
| Neural transport MLP | ~5 microseconds/point | ~500 microseconds/point | ~100x |
| Inverse reconstruction | ~4 s (5 LM iters) | ~60 s | ~15x |
Note
These are internal measurements on specific hardware. We encourage
independent reproduction using cargo bench and
benchmarks/collect_results.sh.