============================================
HPC and GPU Acceleration
============================================

SCPN-Fusion-Core supports high-performance computing through a native
Rust backend, a C++ FFI bridge, and a planned GPU acceleration path.

Rust Workspace
---------------

The ``scpn-fusion-rs/`` directory contains a 10-crate Rust workspace
that mirrors the Python package structure:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Crate
     - Purpose
   * - ``fusion-types``
     - Shared data types, configuration structs, error types
   * - ``fusion-math``
     - Linear algebra (SOR, GMRES, multigrid), FFT, interpolation,
       Chebyshev polynomials, elliptic integrals, tridiagonal solver
   * - ``fusion-core``
     - Grad-Shafranov kernel, transport, inverse reconstruction,
       stability, pedestal model, AMR
   * - ``fusion-physics``
     - MHD sawtooth, Hall-MHD, turbulence, FNO, heating, compact
       reactor optimiser, design scanner, sandpile
   * - ``fusion-nuclear``
     - Neutronics, divertor, wall interaction, PWI erosion, TEMHD,
       balance of plant
   * - ``fusion-engineering``
     - Blanket engineering, magnet design, tritium systems, plant layout
   * - ``fusion-control``
     - PID, MPC, SNN controller, SPI mitigation, disruption predictor,
       digital twin, analytic solver, SOC learning
   * - ``fusion-diagnostics``
     - Sensor models, tomography
   * - ``fusion-ml``
     - Neural equilibrium, neural transport, disruption classifier,
       polynomial chaos expansion (PCE) UQ
   * - ``fusion-python``
     - PyO3 bindings producing ``scpn_fusion_rs.pyd`` / ``.so``

Build Configuration
^^^^^^^^^^^^^^^^^^^^

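The release profile trades longer compile times for runtime speed by
enabling full optimisation, fat link-time optimisation, and a single
codegen unit::

    # Cargo.toml
    [profile.release]
    opt-level = 3
    lto = "fat"
    codegen-units = 1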

Key dependencies: ``ndarray``, ``nalgebra``, ``rayon`` (parallelism),
``rustfft``, ``serde``, ``pyo3`` (Python bindings).

The Rust workspace has no external C or Fortran dependencies -- it is
pure Rust.

Python-Rust FFI
^^^^^^^^^^^^^^^^

The ``fusion-python`` crate provides PyO3 bindings that expose the Rust
solvers as a native Python extension module.  The Python package
auto-detects the extension at import time::

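    try:
        # Prefer the compiled Rust extension when it is installed.
        from ._rust_compat import FusionKernel, RUST_BACKEND
    except ImportError:
        # Otherwise fall back to the pure-Python implementation.
        from .fusion_kernel import FusionKernel
        RUST_BACKEND = False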

All API signatures are identical between the Python and Rust paths, so
switching backends requires no code changes.
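
The same try/except pattern can be applied to any optional native
module.  The following is a minimal sketch of that idea, not the
package's actual loader: ``load_kernel`` and the module names passed to
it are hypothetical and only illustrate the fallback logic::

    import importlib


    def load_kernel(preferred, fallback, attr):
        """Return (object, used_preferred) from the first importable module.

        `preferred`, `fallback` and `attr` are illustrative names, not
        part of the SCPN-Fusion-Core API.
        """
        try:
            module = importlib.import_module(preferred)
            return getattr(module, attr), True
        except ImportError:
            module = importlib.import_module(fallback)
            return getattr(module, attr), False


    # Demo with stdlib modules: the first name does not exist, so the
    # loader falls back to the `math` module.
    sqrt, native = load_kernel("no_such_native_ext", "math", "sqrt")
    print(native, sqrt(9.0))

Returning a flag next to the object lets calling code log or assert
which backend is active without changing how the object is used.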

C++ FFI Bridge
---------------

The ``hpc_bridge`` module (``hpc/hpc_bridge.py``) provides a C++ FFI
bridge for interfacing with external HPC solvers:

- ``solver.cpp`` -- C++ solver implementation using ``types.h`` shared
  data structures
- ``ctypes``-based Python bindings for calling compiled C++ from Python
- Shared-memory data exchange to avoid serialisation overhead

This bridge is primarily used for prototyping custom solver kernels
before porting them to Rust.
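
To illustrate what exchanging data without serialisation can look like
with ``ctypes``, the sketch below creates a ``ctypes`` view that aliases
a Python buffer and writes through a raw pointer.  It is an assumption
about the general technique, not the contents of ``hpc_bridge.py``: the
``scale`` routine is a pure-Python stand-in for compiled code, and its
signature is hypothetical::

    import ctypes
    from array import array

    # Python-owned buffer of doubles, standing in for a solver grid.
    grid = array("d", [1.0, 2.0, 3.0, 4.0])

    # ctypes array that aliases the same memory: no copy, no serialisation.
    view = (ctypes.c_double * len(grid)).from_buffer(grid)

    # Raw pointer of the kind a compiled routine would receive, e.g. for
    # a hypothetical `void scale(double *data, size_t n, double factor)`.
    ptr = ctypes.cast(view, ctypes.POINTER(ctypes.c_double))


    def scale(data, n, factor):
        """Pure-Python stand-in for the native routine (illustrative only)."""
        for i in range(n):
            data[i] *= factor


    scale(ptr, len(grid), 2.0)

    # Writes made through the pointer are visible in the original buffer.
    result = list(grid)
    print(result)

With a real shared library, the stand-in would be replaced by a function
loaded through ``ctypes.CDLL`` with its ``argtypes`` and ``restype``
declared explicitly.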

GPU Acceleration Roadmap
--------------------------

GPU support is planned in three phases (tracked in
``docs/GPU_ACCELERATION_ROADMAP.md``):

**Phase 1: wgpu SOR kernel**
   Red-Black SOR stencil implemented as a ``wgpu`` compute shader,
   providing cross-platform GPU acceleration (Vulkan, Metal, D3D12,
   WebGPU) with deterministic CPU fallback.

   Performance targets:

   - 65x65 grid: 2x--4x speedup
   - 257x257 grid: 5x--12x speedup

**Phase 2: GPU-backed GMRES preconditioning**
   CUDA/ROCm adapters for the GMRES linear solver with CPU fallback
   for environments without GPU drivers.

   Performance target: 2x--6x speedup on inverse solves.

**Phase 3: Full multigrid on device**
   Smooth, restrict, prolong, and coupled nonlinear multigrid path
   running entirely on GPU.

   Performance targets:

   - < 1 ms for control-loop grids
   - 10x--30x speedup for 257x257+ workloads
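
For readers unfamiliar with the Red-Black ordering named in Phase 1, the
following pure-Python sketch relaxes a Cartesian Poisson problem with a
five-point stencil.  It is a reference illustration only: it does not
use the Grad-Shafranov operator, and the function name, default
``omega``, and test problem are assumptions rather than project code::

    def red_black_sor(u, f, h, omega=1.5, sweeps=200):
        """Relax -laplace(u) = f on a square grid with fixed boundary values.

        u, f: n x n lists of lists; the boundary of u is held fixed.
        Returns the largest pointwise update made during the final sweep.
        """
        n = len(u)
        max_delta = 0.0
        for _ in range(sweeps):
            max_delta = 0.0
            for colour in (0, 1):  # 0 = red points, 1 = black points
                for i in range(1, n - 1):
                    # First interior column with (i + j) % 2 == colour.
                    start = 1 + (i + 1 + colour) % 2
                    for j in range(start, n - 1, 2):
                        gauss_seidel = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                               + u[i][j - 1] + u[i][j + 1]
                                               + h * h * f[i][j])
                        delta = omega * (gauss_seidel - u[i][j])
                        u[i][j] += delta
                        max_delta = max(max_delta, abs(delta))
        return max_delta


    n = 9
    h = 1.0 / (n - 1)
    # Boundary values follow u(x, y) = x, whose discrete Laplacian is
    # zero, so the converged interior must reproduce x as well.
    u = [[j * h if i in (0, n - 1) or j in (0, n - 1) else 0.0
          for j in range(n)] for i in range(n)]
    f = [[0.0] * n for _ in range(n)]
    last_update = red_black_sor(u, f, h)
    centre_error = abs(u[n // 2][n // 2] - 0.5)
    print(last_update, centre_error)

Points of one colour depend only on points of the other colour, so each
half-sweep can update all of its points in parallel.  This is what makes
the ordering a natural fit for a compute shader.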

**Acceptance gates** for each phase:

- Correctness: residual behaviour matches CPU reference within
  configured tolerance
- Performance: measured speedups meet declared minimum floors
- Operations: runtime capability detection + automatic CPU fallback
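
The three gates can be expressed as a simple check over one benchmark
run.  The sketch below is one possible encoding under stated
assumptions: ``GateResult``, ``evaluate_gate``, and all numeric values
are hypothetical and are not taken from the roadmap document::

    from dataclasses import dataclass


    @dataclass
    class GateResult:
        correctness: bool
        performance: bool
        operations: bool

        @property
        def passed(self):
            return self.correctness and self.performance and self.operations


    def evaluate_gate(cpu_residual, gpu_residual, tolerance,
                      cpu_seconds, gpu_seconds, speedup_floor,
                      fallback_works):
        """Apply the three acceptance gates to one benchmark run."""
        speedup = cpu_seconds / gpu_seconds
        return GateResult(
            correctness=abs(gpu_residual - cpu_residual) <= tolerance,
            performance=speedup >= speedup_floor,
            operations=fallback_works,
        )


    # Illustrative numbers only, not measurements.
    result = evaluate_gate(cpu_residual=1.0e-8, gpu_residual=1.2e-8,
                           tolerance=1.0e-8, cpu_seconds=0.010,
                           gpu_seconds=0.004, speedup_floor=2.0,
                           fallback_works=True)
    print(result.passed)

Keeping the three outcomes separate makes it clear which gate failed
when a phase does not pass.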

The ``gpu_runtime`` module (``core/gpu_runtime.py``) provides the
``GPURuntimeBridge`` class for managing GPU device detection, memory
allocation, and kernel dispatch.

Benchmarking
--------------

Criterion micro-benchmarks are included in the Rust workspace::

    cd scpn-fusion-rs
    cargo bench

Available benchmarks:

- ``sor_bench.rs`` -- Red-Black SOR stencil at 65x65 and 128x128
- ``inverse_bench.rs`` -- Levenberg-Marquardt inverse reconstruction
  (FD vs analytical Jacobian comparison)
- ``neural_transport_bench.rs`` -- Neural transport MLP inference

Python-side profiling is available via::

    python profiling/profile_kernel.py --top 50
    python profiling/profile_geometry_3d.py --toroidal 48 --poloidal 48 --top 50

Results are written to ``artifacts/profiling/``.
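
The internals of the profiling scripts are not shown here.  As a rough
sketch of the kind of report they may produce, assuming ``--top`` limits
the output to the N most expensive entries, the standard-library
``cProfile`` and ``pstats`` modules can be combined as follows
(``profile_top`` and ``workload`` are hypothetical names)::

    import cProfile
    import io
    import pstats


    def profile_top(func, top=50):
        """Run `func` under cProfile and return a report of the `top` entries."""
        profiler = cProfile.Profile()
        profiler.enable()
        func()
        profiler.disable()
        buffer = io.StringIO()
        stats = pstats.Stats(profiler, stream=buffer)
        stats.sort_stats("cumulative").print_stats(top)
        return buffer.getvalue()


    def workload():
        # Placeholder for a solver call.
        return sum(i * i for i in range(10_000))


    report = profile_top(workload, top=5)
    print(report)

Sorting by cumulative time puts the call paths that dominate a run at
the top of the report.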

Performance Summary
^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 20 30 20

   * - Metric
     - Rust (release)
     - Python (NumPy)
     - Speedup
   * - 65x65 equilibrium
     - ~100 ms
     - ~5 s
     - ~50x
   * - 128x128 equilibrium
     - ~1 s
     - ~30 s
     - ~30x
   * - SOR step (65x65)
     - microseconds
     - milliseconds
     - ~100x
   * - Neural transport MLP
     - ~5 microseconds/point
     - ~500 microseconds/point
     - ~100x
   * - Inverse reconstruction
     - ~4 s (5 LM iters)
     - ~60 s
     - ~15x

.. note::

   These are internal measurements on specific hardware.  We encourage
   independent reproduction using ``cargo bench`` and
   ``benchmarks/collect_results.sh``.

Related Modules
-----------------

- :mod:`scpn_fusion.hpc.hpc_bridge` -- C++ FFI bridge
- :mod:`scpn_fusion.core.gpu_runtime` -- GPU runtime management
- :mod:`scpn_fusion.core._rust_compat` -- Rust backend auto-detection