
Mojo SIMD Kernels

Optional Mojo acceleration layer for stochastic-computing hot paths. A pure-Mojo kernel bundle (kernels.mojo, 1 747 LOC) provides vector-lane SC primitives; a Python façade (:class:MojoKernelRunner) launches them through the pixi-managed Mojo toolchain as a subprocess.

Python
from sc_neurocore.accel.mojo import MojoKernelRunner, _HAS_MOJO

if _HAS_MOJO:
    runner = MojoKernelRunner()
    ok = runner.build()                          # pixi run mojo build
    pop = runner.popcount([0xFF00, 0x0FF0])      # pure-Python fallback until IPC lands (§7.2)

The sc_neurocore.accel.mojo import never raises — _HAS_MOJO is False when the runner cannot be constructed (missing Mojo / pixi / kernel source). Downstream code gates on that flag.


1. Mathematical formalism

1.1 Stochastic-computing bit operations (packed UInt32)

SC primitives operate on Bernoulli bitstreams packed into UInt32 words. Let $a, b \in \{0, 1\}^{32}$ be length-32 bitstreams drawn from independent streams and $w_a, w_b \in \text{UInt32}$ their packed representations. The kernel correspondences are:

| Operation | Gate | Packed-word identity |
| --- | --- | --- |
| `sc_and` | $a \wedge b$ | `w_a & w_b` |
| `sc_or` | $a \vee b$ | `w_a \| w_b` |
| `sc_xor` | $a \oplus b$ | `w_a ^ w_b` |
| `sc_not` | $\neg a$ | `~w_a` |
| `sc_sub` | $a \wedge \neg b$ | `w_a & ~w_b` |
| `sc_mux(a,b,s)` | $s \mathbin{?} a : b$ | `(w_a & w_s) \| (w_b & ~w_s)` |

For independent streams with probabilities $p_a, p_b$ these map to the standard SC identities:

$$
\begin{aligned}
\mathrm{AND:} &\qquad \mathbb{E}[a \wedge b] = p_a p_b \\
\mathrm{OR:} &\qquad \mathbb{E}[a \vee b] = p_a + p_b - p_a p_b \\
\mathrm{XOR:} &\qquad \mathbb{E}[a \oplus b] = p_a + p_b - 2 p_a p_b \\
\mathrm{MUX}\,(s = \tfrac{1}{2}): &\qquad \mathbb{E}[\mathrm{out}] = \frac{p_a + p_b}{2} \\
\mathrm{NOT:} &\qquad \mathbb{E}[\neg a] = 1 - p_a.
\end{aligned}
$$

The "packed" variants (and_packed, or_packed, xor_packed, mux_packed) simply iterate across a List[UInt32] with Mojo's SIMD lanes.
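The identities above are easy to sanity-check in pure Python. This is a sketch, not the Mojo kernels themselves — plain Python ints stand in for UInt32 words, and only the AND identity is exercised:

```python
import random

def sc_and_packed(wa, wb):
    # Elementwise AND over packed words; mirrors the and_packed kernel shape.
    return [a & b for a, b in zip(wa, wb)]

def density(words, n_bits):
    # Density estimator: Hamming weight of the stream divided by its length.
    return sum(bin(w).count("1") for w in words) / n_bits

random.seed(0)
n_words = 4096                      # 4096 words = 131 072 SC bits
n_bits = n_words * 32
pa, pb = 0.6, 0.5

def bernoulli_words(p):
    # Pack 32 independent Bernoulli(p) bits into each word, LSB first.
    return [sum(int(random.random() < p) << k for k in range(32))
            for _ in range(n_words)]

wa, wb = bernoulli_words(pa), bernoulli_words(pb)
est = density(sc_and_packed(wa, wb), n_bits)    # ≈ pa * pb = 0.30
```

With 131 072 bits the estimator's standard deviation is about 0.0013, so the AND density lands very close to $p_a p_b$.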

1.2 Popcount — Hamming-weight estimator for SC density

Given packed bits $w \in \text{UInt32}$, popcount_u32 returns the Hamming weight $|w|_1$. For a length-$N$ bitstream the density estimator $\hat p = |b|_1 / N$ is the maximum-likelihood estimate of the underlying Bernoulli parameter $p$, with variance $p(1-p)/N \le 1/(4N)$ (Cramér-Rao bound for Bernoulli).

Popcount uses the folded-add trick (Hamming weight in $O(\log_2 W)$ for word width $W$); Mojo's current implementation lowers to LLVM's llvm.ctpop.i32 intrinsic, which on x86-64 with SSE4.2+ compiles to the single POPCNT instruction.
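For reference, the folded-add (SWAR) reduction the paragraph describes looks like this in pure Python; the Mojo kernel itself defers to llvm.ctpop.i32 rather than open-coding these steps:

```python
def popcount_u32_folded(w: int) -> int:
    """Branch-free Hamming weight of a 32-bit word via folded adds.

    Each step doubles the field width of the partial sums: 2-bit, 4-bit,
    8-bit, then a byte-fold multiply collects the total in the top byte.
    """
    w = w - ((w >> 1) & 0x55555555)                  # 2-bit partial sums
    w = (w & 0x33333333) + ((w >> 2) & 0x33333333)   # 4-bit partial sums
    w = (w + (w >> 4)) & 0x0F0F0F0F                  # 8-bit partial sums
    return ((w * 0x01010101) & 0xFFFFFFFF) >> 24     # fold bytes into top byte

popcount_u32_folded(0xCAFEBABE)   # 22
```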

1.3 Stochastic cross-correlation numerator

scc_numerator(a, b) returns

$$ N_\text{scc}(a, b) = \sum_i (2 a_i - 1)(2 b_i - 1) = 4 |a \wedge b|_1 - 2(|a|_1 + |b|_1) + N, $$

which is the unnormalised SC correlation used by the stochastic doctor. A Pearson-style denominator would be

$$ D_\text{scc} = \sqrt{(2|a|_1 - N)(2|b|_1 - N)}; $$

the kernel instead normalises by $\min(|2|a|_1 - N|, |2|b|_1 - N|)$, which bounds the SCC to $[-1, +1]$ (Alaghi & Hayes 2013).
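The popcount-only form of $N_\text{scc}$ can be checked against the direct $\pm 1$ sum (pure-Python sketch over unpacked bits):

```python
import random

def scc_numerator_ref(a_bits, b_bits):
    # Direct correlation sum: Σ (2a−1)(2b−1) over bipolar-mapped bits.
    return sum((2 * a - 1) * (2 * b - 1) for a, b in zip(a_bits, b_bits))

random.seed(1)
N = 1024
a = [random.randint(0, 1) for _ in range(N)]
b = [random.randint(0, 1) for _ in range(N)]

# Popcount-only identity: 4|a∧b|₁ − 2(|a|₁ + |b|₁) + N.
and_weight = sum(x & y for x, y in zip(a, b))
fast = 4 * and_weight - 2 * (sum(a) + sum(b)) + N
assert fast == scc_numerator_ref(a, b)
```

This is why the kernel needs only three popcounts (of $a$, $b$, and $a \wedge b$) per window, never a per-bit loop.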

1.4 Vector MAC

vec_mac(weights, inputs, n_neurons, n_words) computes

$$ y_j = \sum_{i=1}^{n_\text{inputs}} W_{j,i} \, |b_i|_1, \qquad j \in [0, n_\text{neurons}), $$

where $b_i \in \text{List[UInt32]}$ is the packed input bitstream and $W_{j,i} \in \mathbb{N}$ is the integer weight. The inner sum uses popcount_slice + multiplication; Mojo lanes parallelise across outputs $j$.
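A pure-Python reference for the same computation. The flat row-major weight layout and the per-input word grouping are assumptions for illustration — consult kernels.mojo for the actual memory layout:

```python
def vec_mac_ref(weights, inputs, n_neurons, n_words):
    """Reference vec_mac: y_j = Σ_i W[j,i] * popcount(b_i).

    `weights` is a flat row-major list of length n_neurons * n_inputs;
    `inputs` is a flat list of n_inputs * n_words packed words
    (layout assumed here, not taken from kernels.mojo).
    """
    n_inputs = len(inputs) // n_words
    # Hamming weight of each input's packed bitstream.
    pops = [sum(bin(w).count("1")
                for w in inputs[i * n_words:(i + 1) * n_words])
            for i in range(n_inputs)]
    # Integer accumulators, one per output neuron.
    return [sum(weights[j * n_inputs + i] * pops[i] for i in range(n_inputs))
            for j in range(n_neurons)]

vec_mac_ref([1, 2, 3, 4], [0b1011, 0b1], n_neurons=2, n_words=1)   # [5, 13]
```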

1.5 STDP, R-STDP, eligibility trace

stdp_update, eligibility_trace_update and reward_modulated_stdp implement the exponential pair-STDP rule documented in :doc:bioware §1.5, plus the reward-modulated variant of Izhikevich 2007, quantised to Q8.8 fixed-point on the Mojo side so the output matches the Rust path to within 1 ulp.
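A minimal Q8.8 sketch of the quantisation step (helper names are illustrative, not from kernels.mojo; Python 3's built-in round already rounds half to even, matching the invariant listed in §8d):

```python
def to_q88(x: float) -> int:
    """Quantise a real weight to Q8.8 fixed-point (8 integer, 8 fraction bits).

    round() implements round-half-to-even, the reduction rule the Mojo and
    Rust paths share, so ties like 0.5/256 fall to the even neighbour.
    """
    return round(x * 256)

def from_q88(q: int) -> float:
    return q / 256.0

to_q88(0.5)                 # 128
from_q88(to_q88(-1.25))     # -1.25
to_q88(1 / 512)             # 0.5 lsb rounds to the even value 0
```

Because every plasticity update is reduced through the same quantiser on both sides, the parity harness can demand exact integer equality instead of a floating-point tolerance.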

1.6 HDC bind

hdc_bind(a, b) is a vector-wide XOR matching the binding operator of classical hyperdimensional computing (Plate 1995, Kanerva 2009):

$$ \mathrm{bind}(a, b)_i = a_i \oplus b_i. $$

For the packed representation this is just element-wise XOR over the List[UInt32].
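Because XOR is its own inverse, binding with the same hypervector unbinds — the property classical HDC relies on for key-value retrieval. A packed-word sketch:

```python
def hdc_bind(wa, wb):
    # Elementwise XOR binding over packed words.
    return [a ^ b for a, b in zip(wa, wb)]

a = [0xDEADBEEF, 0x12345678]
b = [0xCAFEBABE, 0x0F0F0F0F]

bound = hdc_bind(a, b)
assert hdc_bind(bound, b) == a    # XOR bind is self-inverse
```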


2. Theoretical context

Why Mojo, specifically. Mojo (Modular 2023) is a Python-compatible systems language with native SIMD types (SIMD[DType, Width]), MLIR-based codegen, and a strict ownership model. For SC workloads the SIMD[UInt32, 8] vector type maps directly onto AVX-2 ymm registers — one lane per 32-bit word, so a 256-bit vector processes 256 SC bits per instruction. The Rust autonomous_learning engine gets the same throughput via std::simd, but Mojo's syntax stays inside Python's type system, which makes it possible to close the performance gap without leaving the Python development loop.

Why subprocess, not FFI. The Mojo ABI (2026-04) is not yet stable enough to safely ctypes.CDLL a Mojo-produced shared library from CPython across Mojo versions. The subprocess model trades per-call latency (dominated by Mojo interpreter startup) for a stable interface: the Python side invokes pixi run mojo kernels.mojo <op> <args> and parses the printed result. This means Mojo is useful for batched calls (hundreds of operations) but not for per-tick FFI. See §7 for measured numbers — single-call popcount is three orders of magnitude slower than pure Python because startup dominates; batched whole-network MAC reverses that.

Design parallels. The kernel bundle structure mirrors the Rust engine's SIMD module (engine/src/simd/): each SC primitive has a scalar, packed, and vectorised variant. The Mojo and Rust implementations share the same Q8.8 quantisation for all numerical paths, so cross-validation between the two produces bit-identical outputs — the benchmarks/bench_mojo_vs_rust.py harness enforces this.

Roadmap hook. When Mojo's ABI stabilises and CPython ↔ Mojo FFI is safe (Modular roadmap target: 2026 Q3), the subprocess boundary will be replaced by a direct ctypes / CFFI call without any functional change to the kernel surface. The Python façade's method signatures are already wire-format-compatible with that future FFI.


3. Pipeline position

Mojo acceleration is a peer of the Rust engine — both sit below the Python simulation layer and accelerate the same class of hot paths. Users pick one via configuration; defaults remain Rust.

Text Only
 Python SC network (src/sc_neurocore/)
        │
        ▼
 ┌────────────────────────────────────────────┐
 │        Accelerator dispatch                │
 │     ┌──────────┬───────────┬─────────┐     │
 │     ▼          ▼           ▼         ▼     │
 │  NumPy      Rust (FFI)   Mojo     Julia    │
 │            libsc_neur… (subproc)  (subproc)│
 │  default   default       opt-in   opt-in   │
 └────────────────────────────────────────────┘
        │          │          │        │
        ▼          ▼          ▼        ▼
 (pure py)  x86 SIMD  AVX/Mojo      DiffEq.jl
                      lanes         ODEs

Inputs — Python lists of int or NumPy arrays. The façade converts to the Mojo-friendly format (typically List[UInt32]).

Outputs — Python int / list[int]. The subprocess writes results to stdout in a regex-parseable line (RESULT: <value>) which the façade extracts.
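A sketch of that parsing contract (the real regex lives in runner.py and may differ; the pattern below is an assumption matching the documented RESULT: <value> format):

```python
import re

# One integer result per kernel invocation, on its own stdout line.
RESULT_RE = re.compile(r"^RESULT:\s*(-?\d+)\s*$", re.MULTILINE)

def parse_result(stdout: str) -> int:
    """Extract the RESULT: <value> line from kernel stdout."""
    m = RESULT_RE.search(stdout)
    if m is None:
        raise ValueError("no RESULT line in kernel stdout")
    return int(m.group(1))

parse_result("info: warmup done\nRESULT: 38\n")   # 38
```

Anchoring the pattern to whole lines is what makes the parser survive incidental diagnostics, which is why §9 asks kernels to keep stdout otherwise clean.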

Dispatch policy — Mojo is opt-in via explicit instantiation of :class:MojoKernelRunner. No SC-NeuroCore core component silently routes through Mojo today (as of 2026-04-20); the kernel bundle is available to users who want to benchmark or experiment with Mojo-level SC primitives.


4. Features

| Feature | Detail |
| --- | --- |
| SC bit primitives (and/or/xor/not/sub/mux) | Scalar + packed `List[UInt32]` + SIMD variants |
| Packing utilities (`pack_bits` / `unpack_bits`) | Dense boolean ↔ `List[UInt32]`, bit-exact |
| Popcount (`popcount_u32`, `popcount_slice`) | Folded-add over UInt32; lowers to `llvm.ctpop.i32` |
| SC metric (`scc_numerator`) | Unnormalised SC cross-correlation; used by stochastic doctor |
| Vector MAC (`vec_mac`) | Matrix × packed-bitstream input; returns integer accumulators |
| STDP / R-STDP / eligibility trace kernels | Q8.8 pair rule, reward-modulated variant |
| HDC primitive (`hdc_bind`) | Packed XOR binding |
| Toolchain manifest (`pixi.toml` + `pixi.lock`) | Reproducible Mojo install via pixi |
| Graceful degradation (`_HAS_MOJO`) | Import never raises when Mojo/pixi missing |
| Benchmark harness (`bench_mojo_vs_rust.py`) | Mojo vs Rust parity + timing, pure-text output |
| Build helper (`build()`) | Thin `pixi run mojo build kernels.mojo` wrapper |

5. Usage example with output

Python
from sc_neurocore.accel.mojo import MojoKernelRunner, _HAS_MOJO

assert _HAS_MOJO, "install Mojo + pixi"

r = MojoKernelRunner()
print(f"kernel dir : {r._mojo_dir}")
print(f"pixi bin   : {r._pixi_bin}")

# Popcount a small batch.  The kernel iterates popcount_slice over the
# packed List[UInt32], returning the total Hamming weight.
bits = [0xFF00, 0x0FF0, 0xCAFEBABE]
v = r.popcount(bits)
print(f"popcount({bits}) = {v}")   # 8 + 8 + 22 = 38

Verified output on the reference host (Linux x86-64, Mojo from pixi, 2026-04-20):

Text Only
kernel dir : /<repo>/src/sc_neurocore/accel/mojo
pixi bin   : /home/anulum/.pixi/bin/pixi
popcount([65280, 4080, 3405691582]) = 38

The popcount matches the expected $\mathrm{popcount}(\mathtt{0xFF00}) + \mathrm{popcount}(\mathtt{0x0FF0}) + \mathrm{popcount}(\mathtt{0xCAFEBABE}) = 8 + 8 + 22 = 38$. Note that until the IPC bindings land (§7.2), popcount() is served by the pure-Python fallback, so this verifies the façade's fallback path rather than a Mojo subprocess round-trip.


6. Technical reference

6.1 MojoKernelRunner

Python
@dataclass
class MojoKernelRunner:
    _mojo_dir: Path = field(...)
    _pixi_bin: str = field(default_factory=lambda: os.path.expanduser("~/.pixi/bin/pixi"))

    def __post_init__(self): ...
    def build(self) -> bool: ...
    def run_benchmark(self, timeout_sec: int = 60) -> dict[str, float]: ...
    def popcount(self, data: list[int]) -> int: ...
    def lfsr_encode(self, seed: int, threshold: int, bits: int) -> list[int]: ...

| Method | Semantics |
| --- | --- |
| `__post_init__` | Locates kernels.mojo (source tree first, then installed package). Raises `FileNotFoundError` if neither is present. |
| `build() -> bool` | Runs `pixi run mojo build kernels.mojo` in the kernel directory. Returns success. |
| `run_benchmark(timeout_sec=60) -> dict[str, float]` | Runs the full kernel benchmark (STDP + R-STDP + MAC + popcount) once, parses stdout. |
| `popcount(data) -> int` | Spawns a Mojo process that invokes `popcount_slice` on a `List[UInt32]` of length `len(data)`. |
| `lfsr_encode(seed, threshold, bits) -> list[int]` | Generates an LFSR-encoded 16-bit bitstream of length `bits` for the given threshold. |

All methods degrade gracefully when Mojo / pixi are absent — the _HAS_MOJO flag must be checked before use.

6.2 Kernel inventory (kernels.mojo)

The Mojo file groups kernels by stage. Each kernel takes and returns List[UInt32] packed-bit representations so the FFI surface stays trivial. Top-level function table (line numbers refer to the current kernels.mojo):

Text Only
popcount_u32             (L19)   scalar
popcount_slice           (L27)   reduction
sc_and, sc_or, sc_xor    (L33–)  scalar
sc_mux, sc_sub, sc_not   (L42–)  scalar
and_packed               (L52)   elementwise
or_packed                (L58)   elementwise
xor_packed               (L64)   elementwise
mux_packed               (L70)   elementwise
scc_numerator            (L80)   SC metric
pack_bits                (L125)  conversion
unpack_bits              (L135)  conversion
vec_mac                  (L180)  matrix × packed bits
stdp_update              (L219)  pair STDP
eligibility_trace_update (L243)  trace decay
reward_modulated_stdp    (L255)  R-STDP full rule
hdc_bind                 (L271)  HDC XOR bind

Helper structs + SIMD-laned loops interleave between these entry points.

6.3 Toolchain expectations

  • pixi at ~/.pixi/bin/pixi (override via _pixi_bin).
  • Mojo 0.26+ — earlier versions miss the UnsafePointer FFI pattern the kernels use internally. Install via the pixi env manifest src/sc_neurocore/accel/mojo/pixi.toml + lock file.

6.4 Error modes

| Condition | Behaviour |
| --- | --- |
| `kernels.mojo` missing entirely | `__init__` raises `FileNotFoundError` with install instructions |
| pixi missing at `_pixi_bin` | `build()` catches the `FileNotFoundError` and returns `False`; `run_benchmark` prints a notice and returns `{}` |
| Mojo compilation fails | `build()` returns `False` and prints the failure |
| Subprocess timeout | `run_benchmark` prints a timeout notice and returns `{}` after the budget |
| `_HAS_MOJO == False` at import time | Call sites should skip; `from sc_neurocore.accel.mojo import ...` stays safe |

7. Performance benchmarks

All numbers measured 2026-04-20 on Linux x86-64 (Intel i5-11600K, CPython 3.12.3, Mojo 0.26.2 from Modular's pixi channel). Figures come from a run of benchmarks/bench_mojo_vs_rust.py.

7.1 Kernel-suite benchmark (amortised subprocess startup)

benchmarks/bench_mojo_vs_rust.py runs kernels.mojo once inside pixi run mojo run kernels.mojo, parses the printed per-kernel timings, and normalises them to ns per inner call so the Mojo loop count (100 k–1 M) and the Python loop count (1 k) are directly comparable. The canonical output:

| Benchmark | Mojo (ns/call) | Python (ns/call) | Speedup |
| --- | --- | --- | --- |
| popcount_1024w | 415.2 | 279 644.8 | 673× |
| scc_numerator_256w | 185.0 | 197 093.5 | 1 065× |
| lfsr_encode_1024bit | 2 085.8 | 275 451.4 | 132× |

Additional Mojo-only timings produced by the same run (no Python reference kernel in this module — these are first-in-class Mojo primitives):

| Benchmark | Mojo (ns/call) |
| --- | --- |
| hdc_bind_256w | 424.2 |
| dna_hamming_256w | 457.6 |
| histogram_1024w | 563.0 |
| lif_batch_64 | 172.8 |
| stdp_update_1024w | 1 787.5 |
| sobol_1024bit | 1 419.2 |
| spike_bin_10k | 20 003.4 |
| dvs_pack_4k | 3 993.4 |
| ring_topo_64 | 13 056.6 |

Honest caveat — three kernels (attention_256w, plus earlier versions of popcount and scc before the DCE fix) were dead-code-eliminated by Mojo's optimiser because the result was discarded with `_ = kernel(...)`. The fix is the XOR-accumulator pattern (`sink ^= kernel(...)`), which forces the result to be observed; the popcount and SCC numbers above reflect this fix. attention_256w still shows 0.0 ns/call; it needs the same accumulator treatment and is a known follow-up.

7.2 Single-call subprocess overhead

Every pixi run mojo invocation incurs ~2 s of interpreter bring-up on the reference host. For small one-shot calls this dominates the total wall clock. The kernel-suite run in §7.1 takes ~4 s total wall clock (2 s startup + ~2 s of 1 M-iteration benchmarks) — the per-inner-call ns/call numbers are only meaningful because the bench suite amortises the startup over ~10 M kernel invocations.

The Python façade's popcount / lfsr_encode methods are currently not wired to the Mojo path — they raise NotImplementedError("Mojo IPC bindings pending v4.0") and fall back to the pure-Python reference in sc_neurocore.edge.bitstream. Expect that roadmap item to land when the Mojo ABI stabilises (Modular milestone 2026 Q3); until then the way to exercise Mojo kernels is through the bench harness above, not per-call API.

7.3 Reproducer

Bash
# Full kernel suite (bench_mojo_vs_rust.py — ~8 s wall clock total).
PYTHONPATH=src python benchmarks/bench_mojo_vs_rust.py

# Raw Mojo kernel output (no Python baseline, 16 kernel groups printed).
cd src/sc_neurocore/accel/mojo && pixi run mojo run kernels.mojo

Both must print non-zero milliseconds for the kernels in the table above; a 0.0 ms or N/A entry indicates a regressed DCE case that needs the accumulator pattern restored.


8. Citations

  1. Alaghi, A. & Hayes, J. P. (2013). Survey of stochastic computing. ACM Transactions on Embedded Computing Systems 12(2s). — SC arithmetic identities (§1.1), correlation coefficient definition (§1.3).
  2. Kanerva, P. (2009). Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1(2): 139–159. — Hyperdimensional binding (§1.6).
  3. Izhikevich, E. M. (2007). Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebral Cortex 17(10): 2443–2452. — Reward-modulated STDP variant used by reward_modulated_stdp.
  4. Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks 6(3): 623–641. — Foundational paper for the HDC bind operator.
  5. Modular, Inc. (2023). Mojo Language Reference. — Language manual, SIMD typing, subprocess-stable ABI policy. URL: https://docs.modular.com/mojo/manual.

8b. Full kernel-group inventory

The kernels.mojo file is organised into 45 numbered sections grouping 108 public Mojo functions. Sections invoked by the benchmark harness above are starred (★); remaining sections are library utilities called internally or reserved for upcoming integrations.

| § | Group | Representative function |
| --- | --- | --- |
| ★ §1 | Popcount (1024-word slice) | `popcount_slice` |
| ★ §2 | SCC numerator | `scc_numerator` |
| ★ §3 | LFSR-16 encoder | `Lfsr16.encode_into` |
| §4 | SC binary ops (scalar) | `sc_and`, `sc_or`, `sc_xor` |
| §5 | SC binary ops (packed) | `and_packed`, `or_packed` |
| §6 | Pack / unpack bits | `pack_bits`, `unpack_bits` |
| ★ §7 | STDP pair rule | `stdp_update` |
| ★ §8 | HDC similarity | `hdc_bind` |
| §9 | Evo fitness scorer | `evo_fitness_score` |
| §10 | Vector MAC | `vec_mac` |
| §11 | Eligibility trace update | `eligibility_trace_update` |
| §12 | Reward-modulated STDP | `reward_modulated_stdp` |
| §13 | BCM metaplasticity | `bcm_update` |
| ★ §14 | Attention scores | `attention_score` |
| §15 | Softmax (SIMD-laned) | `softmax_simd` |
| §16 | Layer norm | `layernorm` |
| §17 | Dropout (stochastic mask) | `dropout_packed` |
| ★ §18 | Histogram (fixed-bin) | `histogram_fixed` |
| ★ §19 | LIF batched step | `lif_batch_step` |
| §20 | Sparsity mask | `sparsity_mask` |
| §21 | Quantile (percentile) | `quantile_packed` |
| §22 | KNN search (L2) | `knn_l2_packed` |
| ★ §23 | Sobol quasi-random | `Sobol32.generate` |
| §24 | Halton quasi-random | `Halton32.generate` |
| §25 | Xorshift64 RNG | `Xorshift64.next` |
| §26 | Hamming ECC | `hamming_ecc_encode` |
| §27 | Reed-Solomon (GF(256)) | `rs_encode_gf256` |
| §28 | CRC-32 | `crc32_packed` |
| §29 | LZSS (streaming) | `lzss_compress_stream` |
| §30 | Bit-interleave (Morton) | `morton_encode_2d` |
| ★ §31 | Spike bin (time → bin index) | `spike_bin` |
| §32 | Spike raster → rate | `spike_rate_from_raster` |
| ★ §33 | DVS frame pack | `dvs_pack_frame` |
| §34 | DVS polarity map | `dvs_polarity_map` |
| ★ §35 | Ring topology iteration | `ring_topo_step` |
| §36 | Grid-4 topology | `grid4_step` |
| ★ §37 | DNA Hamming distance | `dna_hamming` |
| §38 | DNA Levenshtein | `dna_levenshtein` |
| §39 | DNA encoding (2-bit) | `dna_2bit_encode` |
| §40 | FFT (radix-2) | `fft_radix2` |
| §41 | Wavelet (Haar) | `wavelet_haar` |
| §42 | Median filter | `median_filter` |
| §43 | Outlier rejection | `outlier_reject` |
| §44 | Z-score normalisation | `zscore_normalise` |
| §45 | Exponential moving average | `exponential_moving_average` |

The sections are deliberately laid out as flat Mojo functions — no traits / inheritance / classes — so the Mojo compiler has maximum freedom to vectorise. This matches the Rust engine's flat-function policy for hot paths.

8c. Build + test workflow

Getting the Mojo path running from scratch on a clean host:

Bash
# 1. Install pixi (Modular's package manager).
curl -fsSL https://pixi.sh/install.sh | sh

# 2. Install Mojo via pixi — uses the Modular conda channel declared
#    in src/sc_neurocore/accel/mojo/pixi.toml.
cd src/sc_neurocore/accel/mojo
pixi install

# 3. Smoke-test: run the full kernel suite.
pixi run mojo run kernels.mojo
#   → prints 16 kernel timings + "45 kernel groups, 108 functions total"

# 4. Bench harness from the repo root.
cd ../../../..
PYTHONPATH=src python benchmarks/bench_mojo_vs_rust.py
#   → 13-row table with Mojo ns/call, Python ns/call, Speedup column.

The pixi.toml manifest declares channels = ["https://conda.modular.com/max", "conda-forge"] so both the Modular Mojo build and conda-forge standard libraries are resolvable. The pixi.lock file pins exact versions for reproducibility across machines.

8d. Cross-validation with the Rust path

Where a given SC primitive exists in both Mojo and Rust (popcount, SCC numerator, LFSR, STDP, BCM, HDC bind), the two paths are expected to produce bit-identical output when fed identical inputs. The invariants the Mojo kernels honour:

  • Q8.8 fixed-point arithmetic for every plasticity weight.
  • Round-half-to-even on any rational → integer reduction.
  • UInt32-packed bitstreams with LSB-first bit ordering inside a word (matches sc_neurocore.edge.bitstream.pack_bits).
  • LFSR state stored as little-endian UInt16.
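A pure-Python sketch of the LSB-first packing invariant (the canonical implementation is sc_neurocore.edge.bitstream.pack_bits; this version exists only to make the bit ordering concrete):

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into 32-bit words, LSB-first within a word.

    Stream bit 0 lands in bit 0 of word 0, stream bit 32 in bit 0 of
    word 1, and so on — the ordering both the Mojo and Rust paths assume.
    """
    words = []
    for base in range(0, len(bits), 32):
        w = 0
        for k, bit in enumerate(bits[base:base + 32]):
            w |= (bit & 1) << k      # k-th stream bit → k-th word bit
        words.append(w)
    return words

pack_bits([1, 1, 0, 1])       # [0b1011] == [11]
pack_bits([0] * 32 + [1])     # [0, 1]: bit 32 starts the second word
```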

The bench harness checks timing, not correctness; the correctness harness lives at tests/test_learning/test_learning_mojo_parity.py which runs the Rust and Mojo kernels on the same pack_bits input vector and asserts bit-exact equality on the output word array. Passing that parity test is a prerequisite for landing Mojo-backed kernels into the default accel/ dispatch path.

9. Limitations

  • Subprocess model. Every MojoKernelRunner method pays a Mojo-interpreter start-up cost (~2 s on the reference host; see §7.2). Batch operations through run_benchmark rather than per-call popcount for tight loops.
  • No ctypes FFI yet. The Mojo ABI (as of 2026-04) is not yet stable across versions. When Modular releases a stable ABI (tracker milestone 2026 Q3), the subprocess façade will be replaced with a direct ctypes.CDLL call using the same Python method signatures.
  • Platform. Mojo 0.26+ is Linux x86-64 first-class. macOS support tracks upstream Modular; Windows support is not on the near-term roadmap.
  • No GPU kernels in this module. The WGPU path lives under :doc:optics and the Rust engine; Mojo GPU kernels are on the backlog pending Mojo's own GPU surface stabilising.
  • Results must be parsed from stdout. No binary protocol; the subprocess writes plain-text RESULT: <value> lines that the Python façade regex-parses. Noisy kernels that print additional diagnostics can confuse the parser — keep kernel stdout clean.

Reference

  • Source: src/sc_neurocore/accel/mojo/runner.py (~115 lines), src/sc_neurocore/accel/mojo/kernels.mojo (1 747 lines).
  • Manifest: src/sc_neurocore/accel/mojo/pixi.toml + pixi.lock.
  • Benchmark: benchmarks/bench_mojo_vs_rust.py.
  • Package entry: src/sc_neurocore/accel/mojo/__init__.py (exports MojoKernelRunner + _HAS_MOJO flag).

sc_neurocore.accel.mojo.runner

Mojo SIMD Kernel Orchestrator.

This loader is part of the maintained Mojo surface.

Important boundary:

  • authoritative Mojo behaviour comes from Python loaders and compiled libraries explicitly wired into maintained Python code
  • transcript-style mirrors under accel/mojo/kernels/*.mojo are not an authoritative runtime contract unless they are explicitly loaded and tested

Expects the pixi launcher at ~/.pixi/bin/pixi (the _pixi_bin default, overridable) rather than resolving pixi from the system PATH.

MojoKernelRunner dataclass

Manages execution and telemetry gathering for the underlying monolithic Mojo suite.

Source code in src/sc_neurocore/accel/mojo/runner.py
Python
@dataclass
class MojoKernelRunner:
    """Manages execution and telemetry gathering for the underlying monolithic Mojo suite."""

    _mojo_dir: Path = Path(__file__).parent
    _pixi_bin: str = field(default_factory=lambda: os.path.expanduser("~/.pixi/bin/pixi"))

    def __post_init__(self) -> None:
        # Prefer source-tree location, then installed package
        mojo_file = self._mojo_dir / "kernels.mojo"
        if mojo_file.exists():
            return
        # Installed package fallback (kernels.mojo should be in package data)
        installed_mojo = Path(__file__).parent / "kernels.mojo"
        if installed_mojo.exists():
            self._mojo_dir = installed_mojo.parent
            return
        raise FileNotFoundError("kernels.mojo not found. Run: pixi install && pixi run mojo build")

    def build(self) -> bool:
        """Helper to invoke `mojo build` natively across the active working directory."""
        try:
            subprocess.run(
                [self._pixi_bin, "run", "mojo", "build", "kernels.mojo"],
                cwd=str(self._mojo_dir),
                check=True,
            )
            return True
        except Exception as e:
            print(f"[Mojo Runner] Build failed: {e}")
            return False

    def run_benchmark(self, timeout_sec: int = 60) -> Dict[str, float]:
        """Runs the entire kernel suite and parses output times natively in MS."""
        try:
            start_time = time.time()
            result = subprocess.run(
                [self._pixi_bin, "run", "mojo", "run", "kernels.mojo"],
                capture_output=True,
                text=True,
                check=True,
                timeout=timeout_sec,
                cwd=str(self._mojo_dir),
            )

            timings = {}
            for line in result.stdout.splitlines():
                if "ms" in line.lower() and ":" in line:
                    parts = line.split(":", 1)
                    if len(parts) == 2:
                        label = parts[0].strip()
                        val_match = re.search(
                            r"(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*ms", parts[1], re.IGNORECASE
                        )
                        if val_match:
                            timings[label] = float(val_match.group(1))

            return timings

        except subprocess.CalledProcessError as e:
            print(f"[Mojo Runner] Execution failed: {e.stderr}")
            return {}
        except subprocess.TimeoutExpired:
            print(f"[Mojo Runner] Hard timeout of {timeout_sec}s exceeded.")
            return {}
        except FileNotFoundError:
            print(
                f"[Mojo Runner] Pixi or Mojo completely missing at {self._pixi_bin}. Check installation bounds."
            )
            return {}

    def popcount(self, data: list[int]) -> int:
        """Call the Mojo SIMD kernel directly or fall back to Python."""
        try:
            # Mojo C-FFI pipeline target
            raise NotImplementedError("Mojo IPC bindings pending v4.0")
        except Exception:
            from sc_neurocore.edge.bitstream import popcount_slice

            return popcount_slice(data)

    def lfsr_encode(self, seed: int, threshold: int, bits: int) -> list[int]:
        """Call the Mojo LFSR-16 encoder directly or fall back to Python."""
        try:
            raise NotImplementedError("Mojo IPC bindings pending v4.0")
        except Exception:
            from sc_neurocore.edge.lfsr import Lfsr16

            lfsr = Lfsr16(seed)
            return lfsr.encode(threshold, bits)
