HPC and GPU Acceleration

SCPN-Fusion-Core supports high-performance computing through a Rust native backend, C++ FFI bridge, and a GPU acceleration roadmap.

Rust Workspace

The scpn-fusion-rs/ directory contains a 10-crate Rust workspace that mirrors the Python package structure:

| Crate | Purpose |
| --- | --- |
| fusion-types | Shared data types, configuration structs, error types |
| fusion-math | Linear algebra (SOR, GMRES, multigrid), FFT, interpolation, Chebyshev polynomials, elliptic integrals, tridiagonal solver |
| fusion-core | Grad-Shafranov kernel, transport, inverse reconstruction, stability, pedestal model, AMR |
| fusion-physics | MHD sawtooth, Hall-MHD, turbulence, FNO, heating, compact reactor optimiser, design scanner, sandpile |
| fusion-nuclear | Neutronics, divertor, wall interaction, PWI erosion, TEMHD, balance of plant |
| fusion-engineering | Blanket engineering, magnet design, tritium systems, plant layout |
| fusion-control | PID, MPC, SNN controller, SPI mitigation, disruption predictor, digital twin, analytic solver, SOC learning |
| fusion-diagnostics | Sensor models, tomography |
| fusion-ml | Neural equilibrium, neural transport, disruption classifier, polynomial chaos expansion (PCE) UQ |
| fusion-python | PyO3 bindings producing scpn_fusion_rs.pyd / .so |

Build Configuration

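The release profile trades longer build times for runtime speed: maximum optimisation level, whole-program ("fat") link-time optimisation, and a single codegen unit.

# Cargo.toml
[profile.release]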
opt-level = 3
lto = "fat"
codegen-units = 1

Key dependencies: ndarray, nalgebra, rayon (parallelism), rustfft, serde, pyo3 (Python bindings).

The Rust workspace has no external C or Fortran dependencies – it is pure Rust.

Python-Rust FFI

The fusion-python crate provides PyO3 bindings that expose the Rust solvers as a native Python extension module. The Python package auto-detects the extension at import time:

try:
    from ._rust_compat import FusionKernel, RUST_BACKEND
except ImportError:
    from .fusion_kernel import FusionKernel
    RUST_BACKEND = False

The Python and Rust paths expose identical API signatures, so calling code does not change when the backend switches.
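The import-time fallback can be generalised into a small helper. The following is a minimal sketch of the pattern only, not the package's own loader: `load_backend` and the module name `hypothetical_native_ext` are illustrative assumptions, and the standard-library `json` module stands in for the pure-Python fallback so the snippet runs anywhere.

```python
import importlib


def load_backend(candidates):
    """Return (module, name) for the first importable module in `candidates`."""
    for name in candidates:
        try:
            return importlib.import_module(name), name
        except ImportError:
            continue  # try the next, slower, backend
    raise ImportError(f"none of the backends could be imported: {candidates}")


# "hypothetical_native_ext" stands in for a compiled extension that is absent;
# "json" stands in for a pure-Python fallback that is always importable.
module, chosen = load_backend(["hypothetical_native_ext", "json"])
RUST_BACKEND = chosen == "hypothetical_native_ext"
print(chosen, RUST_BACKEND)
```

Callers can branch on the resulting flag for logging or benchmarking while using the same object interface on both paths.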

C++ FFI Bridge

The hpc_bridge module (hpc/hpc_bridge.py) provides a C++ FFI bridge for interfacing with external HPC solvers:

  • solver.cpp – C++ solver implementation using types.h shared data structures

  • ctypes-based Python bindings for calling compiled C++ from Python

  • Shared-memory data exchange to avoid serialisation overhead

Native library loading is fail-closed. By default, HPCBridge only attempts package-local solver libraries under scpn_fusion/hpc or scpn_fusion/hpc/bin. External libraries must be provided through an absolute SCPN_SOLVER_LIB path or explicit lib_path argument, must be a regular file, and must carry trust metadata through SCPN_SOLVER_LIB_SHA256, SCPN_SOLVER_TRUST_MANIFEST, or a .sha256 sidecar before ctypes is allowed to load them. Relative paths are rejected so the process current directory and dynamic-loader search path cannot silently select a native solver.
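The checks described above can be sketched as a pre-load validation step. This is an illustration under stated assumptions, not the HPCBridge implementation: the function name `verify_solver_library` is hypothetical, the digest is passed in directly rather than read from SCPN_SOLVER_LIB_SHA256 or a manifest, and the demo file is a placeholder rather than a real shared library.

```python
import hashlib
import os
import tempfile


def verify_solver_library(lib_path, expected_sha256):
    """Return lib_path if every check passes; raise ValueError otherwise."""
    if not os.path.isabs(lib_path):
        raise ValueError("relative library paths are rejected")
    # os.path.isfile follows symlinks; a stricter policy is possible.
    if not os.path.isfile(lib_path):
        raise ValueError("library must be an existing regular file")
    if not expected_sha256:
        raise ValueError("no trust metadata supplied")
    digest = hashlib.sha256()
    with open(lib_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256.lower():
        raise ValueError("SHA-256 digest mismatch")
    return lib_path  # only now would ctypes.CDLL(lib_path) be attempted


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "libsolver_demo.so")  # placeholder file
    payload = b"not a real library"
    with open(path, "wb") as handle:
        handle.write(payload)
    good = hashlib.sha256(payload).hexdigest()
    accepted = verify_solver_library(path, good) == path
    try:
        verify_solver_library("libsolver_demo.so", good)
        relative_rejected = False
    except ValueError:
        relative_rejected = True
print(accepted, relative_rejected)
```

Failing closed means every branch that cannot prove trust raises before any native code is mapped into the process.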

This bridge is primarily used for prototyping custom solver kernels before porting them to Rust.

GPU Acceleration Status

GPU support is tracked through local-only governance notes and implemented through the public runtime surfaces below:

Production-decomposition evidence

Production-scale decomposition is not accepted until distributed MPI or multi-GPU measurements exist. The current public contract is:

python validation/benchmark_production_decomposition_contract.py

It publishes radial/toroidal rank tiling, reciprocal neighbour checks, halo payload shapes, decomposition-invariant reductions, and local large-grid CPU timing evidence. The latest tracked local large-grid row executes 9,437,184 5D phase cells over 24 local rank tiles with zero reconstruction error. This is single-process CPU evidence only; it is not a cluster scaling or GPU throughput claim.
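To make the contract terms concrete, here is a minimal two-dimensional sketch of rank tiling, reciprocal neighbour checks, a decomposition-invariant reduction, and reconstruction error. It is not the benchmark script: the helper names, the 16 x 18 grid, the 4 x 6 tile layout, and the choice of a bounded radial axis with a periodic toroidal axis are all assumptions made for illustration.

```python
import numpy as np

OPPOSITE = {"rad-": "rad+", "rad+": "rad-", "tor-": "tor+", "tor+": "tor-"}


def tile_slices(n, parts):
    """Split range(n) into `parts` contiguous slices using integer edges."""
    edges = [k * n // parts for k in range(parts + 1)]
    return [slice(edges[k], edges[k + 1]) for k in range(parts)]


def neighbours(i, j, n_rad, n_tor):
    """Neighbour tiles of (i, j): radial is bounded, toroidal is periodic."""
    out = {"tor-": (i, (j - 1) % n_tor), "tor+": (i, (j + 1) % n_tor)}
    if i > 0:
        out["rad-"] = (i - 1, j)
    if i < n_rad - 1:
        out["rad+"] = (i + 1, j)
    return out


n_rad, n_tor = 4, 6  # 24 tiles in total
field = np.arange(16 * 18, dtype=np.int64).reshape(16, 18)
rad, tor = tile_slices(16, n_rad), tile_slices(18, n_tor)

# Reciprocity: if B is A's neighbour in direction d, A is B's in the opposite one.
reciprocal = all(
    neighbours(*nb, n_rad, n_tor)[OPPOSITE[d]] == (i, j)
    for i in range(n_rad)
    for j in range(n_tor)
    for d, nb in neighbours(i, j, n_rad, n_tor).items()
)

# Reduction: the sum of per-tile sums must equal the global sum.
tile_sum = sum(int(field[r, t].sum()) for r in rad for t in tor)

# Reconstruction: reassembling the tiles must reproduce the field exactly.
rebuilt = np.empty_like(field)
for r in rad:
    for t in tor:
        rebuilt[r, t] = field[r, t].copy()
max_error = int(np.abs(rebuilt - field).max())
print(reciprocal, tile_sum == int(field.sum()), max_error)
```

Integer data keeps the reduction exact here; with floating-point fields the summation order matters, so an invariant reduction needs either a fixed order or a tolerance.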

Phase 1: wgpu SOR kernel

Red-Black SOR stencil implemented as a wgpu compute shader, providing cross-platform GPU acceleration (Vulkan, Metal, D3D12, WebGPU) with deterministic CPU fallback.

Performance targets:

  • 65x65 grid: 2x–4x speedup

  • 257x257 grid: 5x–12x speedup
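For readers unfamiliar with the method, a CPU reference for Red-Black SOR can be sketched in NumPy. The grid is coloured like a chessboard so that every point of one colour depends only on points of the other colour, which is what makes each half-sweep safe to run in parallel on a GPU. This sketch is an assumption-laden stand-in: it solves Laplace's equation rather than the Grad-Shafranov operator, and the function name and test problem are illustrative.

```python
import numpy as np


def red_black_sor(u, omega, sweeps):
    """In-place Red-Black SOR for Laplace's equation with fixed boundary values."""
    ii, jj = np.meshgrid(np.arange(u.shape[0]), np.arange(u.shape[1]), indexing="ij")
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    masks = [interior & ((ii + jj) % 2 == colour) for colour in (0, 1)]
    for _ in range(sweeps):
        for mask in masks:
            # Four-point neighbour average, recomputed so the second colour
            # sees the values just written for the first colour.
            avg = np.zeros_like(u)
            avg[1:-1, 1:-1] = 0.25 * (
                u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
            )
            u[mask] += omega * (avg[mask] - u[mask])
    return u


n = 33
x = np.linspace(0.0, 1.0, n)
exact = np.outer(x, x)        # u = x*y satisfies the 5-point Laplace stencil exactly
u = exact.copy()
u[1:-1, 1:-1] = 0.0           # keep the boundary, reset the interior
omega = 2.0 / (1.0 + np.sin(np.pi / (n - 1)))  # classical optimum for this model problem
red_black_sor(u, omega, sweeps=200)
max_error = float(np.abs(u - exact).max())
print(max_error < 1e-8)
```

A reference of this kind is what a GPU kernel's residual history would be compared against when checking correctness.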

Phase 2: GPU-backed GMRES preconditioning

CUDA/ROCm adapters for the GMRES linear solver with CPU fallback for environments without GPU drivers.

Performance target: 2x–6x speedup on inverse solves.

Phase 3: Full multigrid on device

Smooth, restrict, prolong, and coupled nonlinear multigrid path running entirely on GPU.

Performance targets:

  • < 1 ms for control-loop grids

  • 10x–30x speedup for 257x257+ workloads

Acceptance gates for each phase:

  • Correctness: residual behaviour matches CPU reference within configured tolerance

  • Performance: measured speedups meet declared minimum floors

  • Operations: runtime capability detection + automatic CPU fallback
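The three gates can be expressed as a single check. The sketch below is hypothetical: `GateResult`, `evaluate_gate`, and all numeric inputs are illustrative placeholders, not project APIs or measurements.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    correctness: bool
    performance: bool
    operations: bool

    @property
    def passed(self):
        return self.correctness and self.performance and self.operations


def evaluate_gate(cpu_residual, gpu_residual, cpu_seconds, gpu_seconds,
                  tolerance, min_speedup, fallback_available):
    """Apply the correctness, performance, and operations gates."""
    correctness = abs(gpu_residual - cpu_residual) <= tolerance
    performance = gpu_seconds > 0 and cpu_seconds / gpu_seconds >= min_speedup
    return GateResult(correctness, performance, bool(fallback_available))


# Illustrative numbers only; they are not measured results.
meets_floor = evaluate_gate(1.0e-8, 1.2e-8, 0.040, 0.012,
                            tolerance=1e-8, min_speedup=2.0,
                            fallback_available=True)
below_floor = evaluate_gate(1.0e-8, 1.2e-8, 0.040, 0.012,
                            tolerance=1e-8, min_speedup=5.0,
                            fallback_available=True)
print(meets_floor.passed, below_floor.passed)
```

Keeping the three results separate makes it clear which gate blocked a phase when the overall check fails.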

The gpu_runtime module (core/gpu_runtime.py) provides the GPURuntimeBridge class for managing GPU device detection, memory allocation, and kernel dispatch.

Benchmarking

Criterion micro-benchmarks are included in the Rust workspace:

cd scpn-fusion-rs
cargo bench

Available benchmarks:

  • sor_bench.rs – Red-Black SOR stencil at 65x65 and 128x128

  • inverse_bench.rs – Levenberg-Marquardt inverse reconstruction (FD vs analytical Jacobian comparison)

  • neural_transport_bench.rs – Neural transport MLP inference

Python-side profiling is available via:

python profiling/profile_kernel.py --top 50
python profiling/profile_geometry_3d.py --toroidal 48 --poloidal 48 --top 50

Results are written to artifacts/profiling/.
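For ad hoc profiling outside those scripts, a "top N" report can be produced with the standard library alone. This sketch assumes nothing about how the profiling scripts work internally; `profile_top` and `toy_workload` are illustrative names.

```python
import cProfile
import io
import pstats


def profile_top(func, top):
    """Run func under cProfile and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


def toy_workload():
    # Placeholder for a solver call.
    return sum(k * k for k in range(100_000))


report = profile_top(toy_workload, top=5)
print(report)
```

Sorting by cumulative time surfaces the call paths that dominate a run, which is usually the right first view before looking at per-function self time.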

Performance Summary

| Metric | Rust (release) | Python (NumPy) | Speedup |
| --- | --- | --- | --- |
| 65x65 equilibrium | ~100 ms | ~5 s | ~50x |
| 128x128 equilibrium | ~1 s | ~30 s | ~30x |
| SOR step (65x65) | microseconds | milliseconds | ~100x |
| Neural transport MLP | ~5 microseconds/point | ~500 microseconds/point | ~100x |
| Inverse reconstruction | ~4 s (5 LM iters) | ~60 s | ~15x |

Note

These are internal measurements on specific hardware. We encourage independent reproduction using cargo bench and benchmarks/collect_results.sh.