Transcriptomic Foundation Model Interfaces

Module: sc_neurocore.bio.transcriptomic
Rust path: sc_neurocore_engine::analysis::neural_decoders::gaussian_attention
References: Li et al. (2025) Genome Biology 26:402; Theodoris et al. (2023) Nature 619
Family: Single-cell transcriptomic foundation models
Tier: Research (explicit import from sc_neurocore.bio)
Exports: ScKGBERTInterface, GeneformerInterface, rank_value_encode


1. Mathematical Formalism

1.1 Shared: Rank-value encoding

Rank-value encoding (Theodoris et al. 2023) converts a cell's gene expression vector into an ordered sequence of gene tokens. The ordering reflects both expression magnitude and gene rarity across the corpus.

Weighting by inverse corpus frequency:

For gene $g$ with expression $x_g$ and corpus median $\tilde{x}_g$:

$$w_g = \frac{1}{\tilde{x}_g + \epsilon}$$

where $\epsilon = 10^{-10}$ prevents division by zero.

Weighted expression:

$$\hat{x}_g = x_g \cdot w_g$$

Filtering and sorting:

  1. Exclude all genes where $x_g = 0$.
  2. Sort remaining genes in descending order of $\hat{x}_g$.
  3. Return the sorted gene indices as an integer array.

The effect is that rarely expressed genes (low corpus median) receive higher weight, placing them earlier in the sequence. This is analogous to TF-IDF weighting in natural language processing: rare genes carry more information.

When global_medians is None, uniform weighting is applied ($w_g = 1$ for all $g$), and the ranking reduces to a simple descending sort by raw expression.
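A minimal numpy sketch of the procedure, consistent with the steps above and the algorithmic detail in Section 6.1 (the packaged rank_value_encode may differ in implementation detail):

Python
import numpy as np

def rank_value_encode_sketch(
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
) -> np.ndarray:
    """Weight by inverse corpus median, drop zero-expression genes, sort descending."""
    eps = 1e-10
    if global_medians is None:
        weighted = expression.astype(np.float64)          # uniform weighting
    else:
        weighted = expression / (global_medians + eps)    # inverse-frequency weighting
    nonzero = np.flatnonzero(expression > 0)              # exclude unexpressed genes
    order = np.argsort(-weighted[nonzero])                # descending by weighted value
    return nonzero[order].astype(np.int64)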

1.2 ScKGBERTInterface — Dual-encoder architecture

Li et al. (2025) introduced scKGBERT, a knowledge-enhanced foundation model for single-cell transcriptomics. The architecture consists of two parallel encoders that share a gene token embedding table.

S-Encoder (Sequence Encoder)

The S-Encoder processes the rank-value-encoded gene sequence through Gaussian self-attention. Given $n$ expressed genes, each represented by a $d$-dimensional token embedding $\mathbf{e}_i$:

  1. Gather token embeddings for expressed genes from the shared embedding table.
  2. Scale each token by its rank position: $\mathbf{t}_i = \mathbf{e}_i / (i + 1)$, so higher-ranked genes have stronger representation.
  3. Project to query, key, value spaces: $\mathbf{q}_i = \mathbf{t}_i \mathbf{W}_Q$, $\mathbf{k}_i = \mathbf{t}_i \mathbf{W}_K$, $\mathbf{v}_i = \mathbf{t}_i \mathbf{W}_V$.
  4. Apply Gaussian attention (see below).
  5. Mean-pool the attended representations to produce a $d$-dimensional cell embedding.
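A numpy sketch of steps 1-3 with random stand-in parameters (E, Wq, Wk, Wv are hypothetical names for the shared embedding table and the projection matrices; steps 4-5 apply the Gaussian attention kernel sketched later in this section):

Python
import numpy as np

rng = np.random.default_rng(42)
d, n_genes = 64, 200
E = rng.normal(size=(n_genes, d)) / np.sqrt(d)                    # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

expression = rng.poisson(3, n_genes).astype(np.float64)
# steps 1-2: gather expressed genes in rank order, then scale by 1/(i + 1)
ranked = np.argsort(-expression)[: np.count_nonzero(expression)]  # uniform weighting case
tokens = E[ranked] / (np.arange(len(ranked)) + 1.0)[:, None]
# step 3: project to query, key, value spaces
q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv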

K-Encoder (Knowledge Graph Encoder)

The K-Encoder aggregates neighbourhood information from the protein-protein interaction (PPI) knowledge graph (STRING database). For each expressed gene $g$:

  1. Identify the PPI neighbourhood: all genes $j$ where the STRING confidence score $c_{gj} > 0$.
  2. Compute neighbourhood embedding as a confidence-weighted mean:

$$\mathbf{h}_g = \frac{\sum_{j : c_{gj} > 0} c_{gj}\, \mathbf{e}_j}{\sum_{j : c_{gj} > 0} c_{gj}}$$

If gene $g$ has no neighbours in the graph, fall back to its own embedding: $\mathbf{h}_g = \mathbf{e}_g$.

  3. Apply Gaussian attention over the neighbourhood embeddings.
  4. Mean-pool to produce the K-Encoder cell embedding.
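A sketch of the confidence-weighted aggregation in step 2, including the no-neighbour fallback (C and E are hypothetical names for the PPI confidence matrix and the shared embedding table):

Python
import numpy as np

def ppi_neighbourhood_embeddings(
    expressed: np.ndarray,   # indices of expressed genes
    C: np.ndarray,           # (n_genes, n_genes) PPI confidence scores
    E: np.ndarray,           # (n_genes, d) shared token embeddings
) -> np.ndarray:
    H = np.empty((len(expressed), E.shape[1]))
    for i, g in enumerate(expressed):
        nbrs = np.flatnonzero(C[g] > 0)                   # PPI neighbourhood of gene g
        if nbrs.size == 0:
            H[i] = E[g]                                   # fallback: own embedding
        else:
            w = C[g, nbrs]
            H[i] = (w[:, None] * E[nbrs]).sum(axis=0) / w.sum()
    return H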

Gaussian attention

The central attention mechanism in scKGBERT replaces the scaled dot-product attention of standard Transformers with a Gaussian kernel over Euclidean distances:

$$\alpha_{ij} = \frac{\exp\left(-\frac{\|\mathbf{q}_i - \mathbf{k}_j\|^2}{2\sigma^2}\right)}{\sum_{m=1}^{M} \exp\left(-\frac{\|\mathbf{q}_i - \mathbf{k}_m\|^2}{2\sigma^2}\right)}$$

where $\sigma$ is the bandwidth parameter controlling attention sharpness:

  • Small $\sigma$ concentrates attention on the nearest keys (sharp, selective).
  • Large $\sigma$ distributes attention uniformly (broad, smoothing).

The output for query $i$ is:

$$\mathbf{o}_i = \sum_{j=1}^{M} \alpha_{ij}\, \mathbf{v}_j$$

Numerical stability is maintained by subtracting the maximum log-weight before exponentiation (log-sum-exp trick).
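A self-contained numpy sketch of the kernel with that stabilisation applied (the 1e-30 denominator guard follows Appendix A.4):

Python
import numpy as np

def gaussian_attention_sketch(q, k, v, sigma=1.0):
    """Gaussian-kernel attention over Euclidean distances."""
    d2 = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)  # (n, m) squared L2 distances
    logw = -d2 / (2.0 * sigma ** 2)
    logw -= logw.max(axis=1, keepdims=True)                   # max subtraction (LSE trick)
    w = np.exp(logw)
    alpha = w / (w.sum(axis=1, keepdims=True) + 1e-30)        # each row sums to ~1
    return alpha @ v                                          # o_i = sum_j alpha_ij v_j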

Fusion

The final cell embedding is the arithmetic mean of the S-Encoder and K-Encoder outputs:

$$\mathbf{z} = \frac{\mathbf{z}_S + \mathbf{z}_K}{2}$$

Gene importance scoring

Gene importance is derived from Gaussian attention column sums. For the attention weight matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$:

$$\text{importance}(g) = \sum_{i=1}^{n} \alpha_{ig}$$

Genes that receive high total incoming attention are those that other genes attend to most strongly — indicating biological centrality.

1.3 GeneformerInterface — Masked gene prediction

Theodoris et al. (2023) introduced Geneformer, a foundation model pretrained on ~30 million single-cell transcriptomes (v1) and ~95 million (v2). The model uses rank-value tokenisation and learns gene network dynamics via masked gene prediction.

Tokenisation

The tokenisation procedure applies rank-value encoding (Section 1.1), then filters to the gene vocabulary of size $V$:

  1. Compute rank-value encoding of the expression vector.
  2. Retain only gene indices $g < V$ (within vocabulary).
  3. Return as an ordered token sequence.
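A compact sketch of this tokenisation, i.e. rank-value encoding followed by the vocabulary filter (tokenise in Section 6.3 documents the actual signature; this mirrors its documented behaviour):

Python
import numpy as np

def tokenise_sketch(expression, global_medians=None, vocab_size=2000):
    """Rank-value encode, then keep only in-vocabulary gene indices."""
    weighted = expression if global_medians is None else expression / (global_medians + 1e-10)
    nonzero = np.flatnonzero(expression > 0)
    ranked = nonzero[np.argsort(-weighted[nonzero])]          # step 1: rank-value encoding
    return ranked[ranked < vocab_size].astype(np.int64)       # step 2: vocabulary filter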

Multi-head self-attention (Vaswani et al. 2017)

Given a sequence of $n$ gene token embeddings $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $H$ attention heads with head dimension $d_h = d / H$:

For each head $h \in \{1, \ldots, H\}$:

$$\mathbf{Q}^{(h)} = \mathbf{X} \mathbf{W}_Q^{(h)}, \quad \mathbf{K}^{(h)} = \mathbf{X} \mathbf{W}_K^{(h)}, \quad \mathbf{V}^{(h)} = \mathbf{X} \mathbf{W}_V^{(h)}$$

$$\text{score}^{(h)} = \frac{\mathbf{Q}^{(h)} (\mathbf{K}^{(h)})^\top}{\sqrt{d_h}}$$

$$\alpha^{(h)}_{ij} = \frac{\exp(\text{score}^{(h)}_{ij})}{\sum_{m} \exp(\text{score}^{(h)}_{im})}$$

Collecting these weights into the matrix $\mathbf{A}^{(h)} = [\alpha^{(h)}_{ij}]$:

$$\text{head}^{(h)} = \mathbf{A}^{(h)} \mathbf{V}^{(h)}$$

The heads are concatenated and projected:

$$\text{MultiHead}(\mathbf{X}) = [\text{head}^{(1)}; \ldots; \text{head}^{(H)}] \mathbf{W}_O$$

In the implementation, $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ are stored as single $d \times d$ matrices, with head-specific slices extracted at runtime. Numerical stability uses the log-sum-exp trick (max subtraction before softmax).
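A sketch of that slicing scheme (the weight names are illustrative; each head reads a contiguous column block of the shared $d \times d$ matrices):

Python
import numpy as np

def multi_head_attention_sketch(X, Wq, Wk, Wv, Wo, n_heads=4):
    """Vaswani-style MHA with per-head slices of single d x d projection matrices."""
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)                # head-specific column block
        score = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        score -= score.max(axis=1, keepdims=True)      # max subtraction before softmax
        A = np.exp(score)
        A /= A.sum(axis=1, keepdims=True) + 1e-30
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=1) @ Wo          # concatenate heads, project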

Masked gene prediction (MLM objective)

Following BERT (Devlin et al. 2018), Geneformer masks 15% of gene tokens and predicts them from context:

  1. Select $\lfloor 0.15 \cdot n \rfloor$ positions uniformly at random (minimum 1).
  2. Replace selected token embeddings with zero vectors (sentinel).
  3. Apply multi-head self-attention to the partially masked sequence.
  4. For each masked position, project the attended representation through the MLM head: $\text{logits} = \mathbf{h}_\text{masked} \mathbf{W}_\text{MLM}^\top$, where $\mathbf{W}_\text{MLM} \in \mathbb{R}^{V \times d}$.
  5. Predict gene identity as $\hat{g} = \arg\max(\text{logits})$.

The MLM objective forces the model to learn gene-gene dependencies: predicting a masked gene requires understanding what other genes in the cell co-express with it.
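A sketch of steps 1-5, assuming attend is any sequence-to-sequence attention function (for example the MHA sketch above) and mlm_head is a (V, d) projection:

Python
import numpy as np

def predict_masked_sketch(tokens, E, mlm_head, attend, mask_ratio=0.15, seed=42):
    """Zero-mask ~mask_ratio of token embeddings, attend, decode masked positions.
    tokens: int64 gene token IDs; E: (V, d) embedding table."""
    rng = np.random.default_rng(seed)
    n_mask = max(1, int(mask_ratio * len(tokens)))       # at least one masked position
    masked_pos = rng.choice(len(tokens), size=n_mask, replace=False)
    X = E[tokens].copy()
    X[masked_pos] = 0.0                                  # zero-vector sentinel
    H = attend(X)
    logits = H[masked_pos] @ mlm_head.T                  # (n_mask, V)
    return masked_pos, tokens[masked_pos], logits.argmax(axis=1)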

Cell embedding

A cell's embedding is obtained by mean-pooling over the attended gene representations (without masking):

$$\mathbf{z}_\text{cell} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i$$

where $\mathbf{h}_i$ is the output of multi-head self-attention for gene token $i$.

Gene network attention

Theodoris et al. (2023) demonstrated that the averaged attention weights across heads encode network hierarchy. The gene-gene interaction matrix is:

$$\mathbf{A}_\text{net} = \frac{1}{H} \sum_{h=1}^{H} \mathbf{A}^{(h)}$$

Each row sums to 1.0 (proper probability distribution). Column sums indicate how much other genes attend to a given gene — a proxy for network centrality. Hub genes in regulatory networks receive higher total incoming attention.


2. Theoretical Context

2.1 Single-cell transcriptomics

Single-cell RNA sequencing (scRNA-seq) measures gene expression at individual cell resolution, producing a count matrix with rows as cells and columns as genes. A typical experiment yields 10,000--100,000 cells and 20,000--30,000 genes. The data is sparse: most genes in most cells have zero counts.

scRNA-seq enables cell-type-specific gene expression profiling, revealing the molecular identity of each cell. Distinct cell types (neurons, astrocytes, microglia, oligodendrocytes) express characteristic gene sets — particularly ion channels, receptors, and signalling molecules.

2.2 Foundation model paradigm

Foundation models for single-cell transcriptomics follow the same principle as large language models: pretrain on a large, unlabelled corpus, then transfer the learned representations to downstream tasks. The cell is the "document", genes are "words", and expression levels determine the "word order" (rank-value encoding).

Pretraining objectives:

  • Masked gene prediction (Geneformer): analogous to masked language modelling in BERT.
  • Contrastive learning (scKGBERT): learn embeddings where similar cells cluster.

Transfer tasks include cell type classification, gene perturbation prediction, disease state identification, and regulatory network inference.

2.3 scKGBERT innovation: knowledge graph integration

Li et al. (2025) identified a limitation of pure sequence-based models: they learn gene relationships only from co-expression patterns, ignoring known biological interactions. scKGBERT addresses this by incorporating the STRING protein-protein interaction database (Szklarczyk et al. 2023).

STRING aggregates interactions from:

  • Experimental data (co-crystallisation, yeast two-hybrid, etc.)
  • Curated pathway databases (KEGG, Reactome)
  • Text mining of biomedical literature
  • Gene co-expression across species
  • Genomic context (gene neighbourhood, fusion, co-occurrence)

The database covers >5,000 species and >30 billion interactions. Each interaction has a combined confidence score in [0, 1].

The K-Encoder aggregates PPI neighbourhood information: for each gene, it computes a confidence-weighted mean of neighbour embeddings. This injects prior biological knowledge into the representation, allowing the model to leverage known gene-gene interactions even when co-expression data is sparse.

The Gaussian attention mechanism (replacing standard dot-product attention) was motivated by the observation that gene-gene interactions have a distance-dependent structure: genes with similar functions have similar embeddings, and the Gaussian kernel naturally captures this proximity.

2.4 Geneformer innovation: nonparametric tokenisation

Theodoris et al. (2023) introduced rank-value encoding, a nonparametric tokenisation scheme that avoids the pitfalls of raw count input:

  • Library size invariance: ranking by expression eliminates the need for normalisation by total counts. A gene ranked first in a cell with 1,000 total UMIs and a cell with 100,000 UMIs is treated identically.
  • Inverse frequency weighting: analogous to IDF in TF-IDF, this upweights rare genes that carry more cell-type-specific information.
  • Zero filtering: excluding zero-expression genes produces variable-length sequences, with the model attending only to expressed genes.

Geneformer v1 was pretrained on ~30 million single-cell transcriptomes from Genecorpus-30M. Geneformer v2 expanded the corpus to ~95 million cells and increased model capacity.

2.5 Relevance to spiking neural networks

Transcriptomic profiles directly determine a neuron's electrophysiological properties. Ion channel gene expression (e.g., SCN1A for Na_v 1.1, KCNQ2 for K_v 7.2) sets the membrane conductances that govern spike shape, firing rate, and adaptation.

In SC-NeuroCore, transcriptomic models provide the bridge between molecular and circuit-level neuroscience:

  1. Gene expression → ion channel densities: Transcriptomic cell embeddings capture the ion channel expression profile of each cell type.
  2. Ion channel densities → neuron parameters: The GeneticRegulatoryLayer translates protein levels into threshold and conductance modulations.
  3. Neuron parameters → network dynamics: These parameters feed into the biophysical neuron models (Hodgkin-Huxley, Izhikevich, etc.).

This pipeline allows SC-NeuroCore to initialise neuron parameters from transcriptomic data rather than hand-tuning, grounding simulations in measured biological data.


3. Pipeline Position

3.1 Module location

Text Only
sc_neurocore/
  bio/
    __init__.py          # exports rank_value_encode, ScKGBERTInterface, GeneformerInterface
    transcriptomic.py    # this module
    grn.py               # GeneticRegulatoryLayer (downstream consumer)
    neuromodulation.py   # NeuromodulatorSystem
    dna_storage.py       # DNAEncoder

The sc_neurocore.bio package is classified as research tier, requiring explicit import. It is not loaded by default when importing sc_neurocore.

3.2 Data flow

Text Only
Raw scRNA-seq counts (or normalised expression)
    │
    ▼
rank_value_encode()          ← shared utility
    │
    ├──► ScKGBERTInterface
    │       ├── encode_expression()        → cell embedding (S-Encoder only)
    │       ├── encode_with_knowledge()    → cell embedding (S+K fusion)
    │       ├── predict_cell_type()        → cell type label
    │       └── gene_importance()          → gene importance scores
    │
    └──► GeneformerInterface
            ├── tokenise()                 → gene token sequence
            ├── encode_cell()              → cell embedding
            ├── predict_masked_genes()     → masked gene predictions
            └── gene_network_attention()   → gene-gene attention matrix

3.3 Integration with GeneticRegulatoryLayer

The GeneticRegulatoryLayer in sc_neurocore.bio.grn models the pathway from neural activity to protein production to parameter modulation:

Text Only
dP/dt = alpha * spikes - beta * P

Transcriptomic models feed into this layer by providing baseline protein levels and gene importance scores. The flow is:

  1. scRNA-seq data → transcriptomic model → cell type classification.
  2. Cell type → ion channel expression profile → initial protein levels.
  3. GeneticRegulatoryLayer.step() evolves protein levels during simulation.
  4. get_threshold_modulators() returns current protein levels for threshold modulation.
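An explicit-Euler reading of that update, shown as a sketch (an assumption: GeneticRegulatoryLayer.step() may use different constants or a different integration scheme):

Python
import numpy as np

def grn_euler_step(P, spikes, alpha=0.1, beta=0.05, dt=1.0):
    """One explicit-Euler step of dP/dt = alpha * spikes - beta * P."""
    return P + dt * (alpha * spikes - beta * P)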

3.4 Rust acceleration

The Gaussian attention kernel used by ScKGBERTInterface has a parallel Rust implementation in engine/src/analysis/neural_decoders.rs:

Rust
pub fn gaussian_attention(
    queries: &[f64],
    keys: &[f64],
    values: &[f64],
    nq: usize,
    nk: usize,
    d: usize,
    sigma: f64,
) -> Vec<f64>

This implementation uses rayon parallel iteration over query rows, providing wall-clock speedup on multi-core systems. The Python implementation uses vectorised numpy operations for the same computation.

3.5 Inputs and outputs

| Interface | Input | Output |
| --- | --- | --- |
| rank_value_encode | expression [n_genes], medians | gene indices [n_expressed] |
| ScKGBERTInterface | expression [n_genes] | embedding [d_model] |
| GeneformerInterface | expression [n_genes], medians | embedding [d_model] |

Output types by method:

| Method | Return type |
| --- | --- |
| encode_expression | np.ndarray shape (d_model,) |
| encode_with_knowledge | np.ndarray shape (d_model,) |
| predict_cell_type | str |
| gene_importance | np.ndarray shape (n_genes,) |
| tokenise | np.ndarray dtype int64 |
| encode_cell | np.ndarray shape (d_model,) |
| predict_masked_genes | tuple[ndarray, ndarray, ndarray] |
| gene_network_attention | np.ndarray shape (n_expr, n_expr) |

4. Features

4.1 Interfaces provided

| Name | Type | Source publication |
| --- | --- | --- |
| ScKGBERTInterface | dataclass | Li et al. (2025) GB 26:402 |
| GeneformerInterface | dataclass | Theodoris et al. (2023) Nat 619 |
| rank_value_encode | function | Theodoris et al. (2023) |

4.2 No external deep learning dependencies

Both interfaces are implemented in pure numpy. There is no dependency on PyTorch, TensorFlow, JAX, or HuggingFace Transformers for the core operations. This is deliberate: SC-NeuroCore is a spiking neural network framework, not a deep learning framework. The interfaces implement the publication-exact algorithms (attention, tokenisation, masking, prediction) to allow integration with transcriptomic data without pulling in GPU-oriented dependencies.

For production use with actual pretrained weights, one would load weights from the respective model checkpoints and inject them into the interface's internal arrays. The interface classes are designed to be weight-agnostic: their logic is correct regardless of whether weights are random initialisations or trained.

4.3 Gaussian attention (scKGBERT-specific)

The gaussian_attention method implements the Gaussian kernel attention from Li et al. (2025). This is distinct from the standard scaled dot-product attention used in Transformers:

| Property | Dot-product attention | Gaussian attention |
| --- | --- | --- |
| Similarity metric | Dot product (cosine-like) | Euclidean distance (L2) |
| Kernel | $\exp(q \cdot k / \sqrt{d})$ | $\exp(-\Vert q - k \Vert^2 / 2\sigma^2)$ |
| Bandwidth param | $\sqrt{d}$ (fixed) | $\sigma$ (tuneable) |
| Geometric interp | Angular similarity | Spatial proximity |

4.4 Multi-head self-attention (Geneformer-specific)

Standard Vaswani et al. (2017) multi-head self-attention with $H$ heads. The implementation slices the single $d \times d$ projection matrices into per-head blocks of dimension $d_h = d / H$ rather than maintaining separate per-head weight matrices.

4.5 Rust acceleration path

The Gaussian attention kernel has a Rust implementation in engine/src/analysis/neural_decoders.rs. The Rust kernel uses rayon for parallel row computation. This accelerates the inner loop of ScKGBERTInterface.gaussian_attention for large gene sets.

4.6 Deterministic initialisation

Both interface classes accept a seed parameter (default 42) that controls the random number generator for weight initialisation. This ensures reproducible embeddings across runs, which is important for testing and benchmarking.


5. Usage Examples

5.1 Rank-value encoding

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import rank_value_encode

# Simulated expression for 5 genes
expression = np.array([0.0, 5.0, 1.0, 10.0, 3.0])

# Basic ranking (uniform weighting)
ranked = rank_value_encode(expression)
# Result: array([3, 1, 4, 2])
# Gene 3 (expr=10) first, gene 0 (expr=0) excluded

# With corpus medians: rare genes get upweighted
medians = np.array([10.0, 0.1, 1.0, 5.0, 0.5])
ranked_weighted = rank_value_encode(expression, medians)
# Gene 1 has low median → high weight → moves up in ranking
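# Result: array([1, 4, 3, 2]); gene 1 (weight 1/0.1 = 10) now ranks first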

5.2 scKGBERT: cell type prediction

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import ScKGBERTInterface

# Initialise with small vocabulary for demonstration
model = ScKGBERTInterface(d_model=64, n_genes=200, sigma=1.0, seed=42)

# Simulate expression profiles for two cell types
rng = np.random.default_rng(0)
neuron_expr = rng.poisson(5, 200).astype(np.float64)
glia_expr = rng.poisson(1, 200).astype(np.float64)

# Build prototypes from reference cells
proto_neuron = model.encode_with_knowledge(neuron_expr)
proto_glia = model.encode_with_knowledge(glia_expr)
prototypes = np.array([proto_neuron, proto_glia])
labels = ["neuron", "glia"]

# Classify a query cell
query = rng.poisson(4, 200).astype(np.float64)
cell_type = model.predict_cell_type(query, prototypes, labels)
print(f"Predicted cell type: {cell_type}")

5.3 scKGBERT: gene importance analysis

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import ScKGBERTInterface

model = ScKGBERTInterface(d_model=64, n_genes=500, sigma=1.0, seed=42)
expression = np.random.default_rng(1).poisson(3, 500).astype(np.float64)

importance = model.gene_importance(expression)

# Top 10 most important genes
top_genes = np.argsort(-importance)[:10]
for g in top_genes:
    print(f"Gene {g}: importance = {importance[g]:.4f}")

5.4 scKGBERT: comparing S-Encoder vs dual encoder

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import ScKGBERTInterface

model = ScKGBERTInterface(d_model=64, n_genes=100, sigma=1.0, seed=42)
expr = np.random.default_rng(7).poisson(3, 100).astype(np.float64)

# S-Encoder only (ignores PPI knowledge graph)
s_embedding = model.encode_expression(expr)

# Dual encoder (S-Encoder + K-Encoder with PPI fusion)
fused_embedding = model.encode_with_knowledge(expr)

# The two embeddings differ because the K-Encoder incorporates
# neighbourhood information from the STRING PPI graph
distance = np.linalg.norm(s_embedding - fused_embedding)
print(f"L2 distance between S-only and fused: {distance:.4f}")

5.5 Geneformer: cell embedding extraction

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import GeneformerInterface

model = GeneformerInterface(d_model=256, n_genes=2000, n_heads=4, seed=42)
expression = np.random.default_rng(0).poisson(3, 2000).astype(np.float64)

# Extract cell-level embedding (mean-pooled over attended gene tokens)
cell_embedding = model.encode_cell(expression)
print(f"Cell embedding shape: {cell_embedding.shape}")  # (256,)

5.6 Geneformer: masked gene prediction

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import GeneformerInterface

model = GeneformerInterface(d_model=128, n_genes=500, n_heads=4, seed=42)
expression = np.random.default_rng(5).poisson(3, 500).astype(np.float64)

# Predict masked genes (15% of tokens masked by default)
mask, true_ids, predicted_ids = model.predict_masked_genes(
    expression, rng_seed=42,
)

n_correct = (true_ids == predicted_ids).sum()
n_total = len(true_ids)
print(f"Masked gene prediction: {n_correct}/{n_total} correct")

5.7 Geneformer: gene network attention matrix

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import GeneformerInterface

model = GeneformerInterface(d_model=64, n_genes=100, n_heads=4, seed=42)
expression = np.random.default_rng(3).poisson(2, 100).astype(np.float64)

# Extract the attention-derived gene-gene interaction matrix
attn_matrix = model.gene_network_attention(expression)
print(f"Attention matrix shape: {attn_matrix.shape}")

# Each row sums to 1.0 (probability distribution)
row_sums = attn_matrix.sum(axis=1)
print(f"Row sums (should be ~1.0): {row_sums[:5]}")

# Column sums indicate network centrality (hub genes)
col_sums = attn_matrix.sum(axis=0)
hub_genes = np.argsort(-col_sums)[:5]
print(f"Top 5 hub genes by attention centrality: {hub_genes}")

5.8 Integration with GeneticRegulatoryLayer

Python
import numpy as np
from sc_neurocore.bio.transcriptomic import ScKGBERTInterface
from sc_neurocore.bio.grn import GeneticRegulatoryLayer

# Step 1: classify cell types from transcriptomic data
model = ScKGBERTInterface(d_model=64, n_genes=200, seed=42)

# (In practice, prototypes come from reference atlas data)
rng = np.random.default_rng(0)
proto_excitatory = model.encode_with_knowledge(
    rng.poisson(5, 200).astype(np.float64)
)
proto_inhibitory = model.encode_with_knowledge(
    rng.poisson(2, 200).astype(np.float64)
)

# Step 2: use gene importance to set initial protein levels
importance = model.gene_importance(
    rng.poisson(3, 200).astype(np.float64)
)

# Step 3: feed into GeneticRegulatoryLayer
grn = GeneticRegulatoryLayer(n_neurons=10)
# protein_levels can be initialised from transcriptomic data
grn.protein_levels = importance[:10] / importance.max()

# Step 4: evolve during simulation
spikes = rng.integers(0, 2, 10).astype(np.float64)
grn.step(spikes)
threshold_mod = grn.get_threshold_modulators()
print(f"Threshold modulation: {threshold_mod}")

6. Technical Reference

6.1 rank_value_encode

Python
def rank_value_encode(
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
) -> np.ndarray

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| expression | np.ndarray [n_genes] | Raw counts or normalised expression values |
| global_medians | np.ndarray \| None | Corpus median per gene; None for uniform |

Returns: np.ndarray dtype int64 — gene indices sorted by weighted expression (descending), with zero-expression genes excluded.

Raises: No explicit exceptions. Empty expression vectors return an empty array.

Algorithmic detail:

  • Weighting: weighted = expression * (1 / (global_medians + 1e-10)) if medians provided.
  • Filter: indices where expression > 0.
  • Sort: np.argsort(-weighted[nonzero_indices]).
  • Cast to int64.

6.2 ScKGBERTInterface

Python
@dataclass
class ScKGBERTInterface:
    d_model: int = 64
    n_genes: int = 2000
    sigma: float = 1.0
    seed: int = 42

Dataclass fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| d_model | int | 64 | Embedding dimensionality |
| n_genes | int | 2000 | Gene vocabulary size |
| sigma | float | 1.0 | Gaussian attention bandwidth parameter |
| seed | int | 42 | RNG seed for weight initialisation |

Internal state (set in __post_init__):

| Attribute | Shape | Description |
| --- | --- | --- |
| _gene_embeddings | (n_genes, d_model) | Shared gene token embeddings |
| _kg_adjacency | (n_genes, n_genes) | PPI confidence scores (symmetric) |
| _w_q | (d_model, d_model) | Query projection matrix |
| _w_k | (d_model, d_model) | Key projection matrix |
| _w_v | (d_model, d_model) | Value projection matrix |
| _cls_embedding | (d_model,) | Classification head embedding |

The knowledge graph adjacency matrix is initialised with n_genes * 5 random edges (capped at the maximum possible undirected edges), each with a confidence score uniformly sampled from [0.15, 1.0]. The matrix is symmetric.
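A sketch of that initialisation as a random stand-in (not STRING data; duplicate edge draws simply collapse onto the same cell):

Python
import numpy as np

def random_ppi_adjacency(n_genes, seed=42, edges_per_gene=5):
    """Symmetric random confidence matrix mirroring the documented init."""
    rng = np.random.default_rng(seed)
    n_edges = min(n_genes * edges_per_gene, n_genes * (n_genes - 1) // 2)
    A = np.zeros((n_genes, n_genes))
    i = rng.integers(0, n_genes, n_edges)
    j = rng.integers(0, n_genes, n_edges)
    A[i, j] = rng.uniform(0.15, 1.0, n_edges)   # confidence scores in [0.15, 1.0]
    A = np.maximum(A, A.T)                      # symmetrise
    np.fill_diagonal(A, 0.0)                    # no self-edges
    return A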

gaussian_attention

Python
def gaussian_attention(
    self,
    queries: np.ndarray,
    keys: np.ndarray,
    values: np.ndarray,
) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| queries | (n, d) | Query vectors |
| keys | (m, d) | Key vectors |
| values | (m, d) | Value vectors |
| Return | (n, d) | Attended output |

Uses the instance's sigma parameter for bandwidth.

encode_expression

Python
def encode_expression(self, expression: np.ndarray) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| Return | (d_model,) | S-Encoder cell embedding |

Returns a zero vector if no genes are expressed or all gene indices exceed the vocabulary size.

encode_with_knowledge

Python
def encode_with_knowledge(self, expression: np.ndarray) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| Return | (d_model,) | Fused S-Encoder + K-Encoder embedding |

Falls back to S-Encoder-only embedding if no expressed genes are within the vocabulary.

predict_cell_type

Python
def predict_cell_type(
    self,
    expression: np.ndarray,
    prototypes: np.ndarray,
    labels: list[str],
) -> str

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Query cell expression |
| prototypes | (n_types, d_model) | Mean embeddings per cell type |
| labels | list of str | Cell type names (same order) |
| Return | str | Predicted cell type label |

Classification is nearest-neighbour in embedding space (L2 distance).

gene_importance

Python
def gene_importance(self, expression: np.ndarray) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| Return | (n_genes,) | Gene importance scores (0 for unexpressed) |

Importance is the column sum of the Gaussian attention weight matrix. Only expressed genes receive non-zero scores.

6.3 GeneformerInterface

Python
@dataclass
class GeneformerInterface:
    d_model: int = 256
    n_genes: int = 2000
    n_heads: int = 4
    mask_ratio: float = 0.15
    seed: int = 42

Dataclass fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| d_model | int | 256 | Embedding dimensionality |
| n_genes | int | 2000 | Gene vocabulary size |
| n_heads | int | 4 | Number of attention heads |
| mask_ratio | float | 0.15 | Fraction of tokens masked in MLM |
| seed | int | 42 | RNG seed for weight initialisation |

Internal state (set in __post_init__):

| Attribute | Shape | Description |
| --- | --- | --- |
| _gene_embeddings | (n_genes, d_model) | Gene token embeddings |
| _w_q | (d_model, d_model) | Query projection |
| _w_k | (d_model, d_model) | Key projection |
| _w_v | (d_model, d_model) | Value projection |
| _w_o | (d_model, d_model) | Output projection |
| _mlm_head | (n_genes, d_model) | MLM prediction head |

tokenise

Python
def tokenise(
    self,
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| global_medians | (n_genes,) or None | Corpus medians for weighting |
| Return | int64 array | Gene token IDs within vocabulary |

mask_tokens

Python
def mask_tokens(
    self,
    token_ids: np.ndarray,
    rng_seed: int | None = None,
) -> tuple[np.ndarray, np.ndarray]

| Parameter | Shape | Description |
| --- | --- | --- |
| token_ids | (n,) | Gene token IDs from tokenise |
| rng_seed | int \| None | RNG seed; falls back to self.seed |
| Return[0] | (n,) | Masked token IDs (-1 at masked positions) |
| Return[1] | (n,) | Boolean mask (True where masked) |

multi_head_attention

Python
def multi_head_attention(self, x: np.ndarray) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| x | (seq_len, d) | Input token representations |
| Return | (seq_len, d) | Attended representations |

encode_cell

Python
def encode_cell(
    self,
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| global_medians | (n_genes,) or None | Corpus medians |
| Return | (d_model,) | Mean-pooled cell embedding |

Returns zero vector if no genes are expressed.

predict_masked_genes

Python
def predict_masked_genes(
    self,
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
    rng_seed: int | None = None,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| global_medians | (n_genes,) or None | Corpus medians |
| rng_seed | int \| None | RNG seed for mask selection |
| Return[0] | (n,) bool | Mask positions |
| Return[1] | int64 array | True gene IDs at masked positions |
| Return[2] | int64 array | Predicted gene IDs at masked positions |

Returns empty arrays if fewer than 2 genes are expressed.

gene_network_attention

Python
def gene_network_attention(
    self,
    expression: np.ndarray,
    global_medians: np.ndarray | None = None,
) -> np.ndarray

| Parameter | Shape | Description |
| --- | --- | --- |
| expression | (n_genes,) | Raw counts or normalised expr |
| global_medians | (n_genes,) or None | Corpus medians |
| Return | (n, n) | Head-averaged attention matrix |

Each row sums to 1.0. Returns [[]] if fewer than 2 genes are expressed.


7. Performance Benchmarks

7.1 Measurement methodology

All benchmarks measured on:

  • CPU: Intel Core i5-11600K (6 cores / 12 threads, 3.9 GHz base, 4.9 GHz boost)
  • Python: CPython 3.12
  • NumPy: system-installed (no MKL)
  • Method: pytest-benchmark with automatic round calibration

Default interface parameters unless otherwise noted:

  • ScKGBERTInterface: d_model=64, n_genes=2000, sigma=1.0
  • GeneformerInterface: d_model=256, n_genes=2000, n_heads=4

7.2 ScKGBERTInterface benchmarks

| Method | Mean (ns/call) | Notes |
| --- | --- | --- |
| encode_expression | 516,602 | S-Encoder only, single cell |
| encode_with_knowledge | 3,237,677 | S+K dual encoder, PPI neighbourhood walk |

The encode_with_knowledge method is approximately 6.3x slower than encode_expression due to the K-Encoder's per-gene PPI neighbourhood aggregation loop. This loop iterates over expressed genes and, for each, computes a confidence-weighted mean of neighbour embeddings. The cost scales with the number of expressed genes multiplied by the average PPI neighbourhood size.

For batch processing of large cell populations, the Rust gaussian_attention kernel in engine/src/analysis/neural_decoders.rs can accelerate the attention computation. The Rust kernel parallelises across query rows with rayon, yielding wall-clock improvement on multi-core systems.

7.3 GeneformerInterface benchmarks

| Method | Mean (ns/call) | Notes |
| --- | --- | --- |
| encode_cell | 1,035,978 | Mean-pooled cell embedding |
| predict_masked_genes | 1,447,413 | MLM prediction with 15% masking |
| gene_network_attention | 885,411 | Head-averaged attention matrix |

The encode_cell and predict_masked_genes timings are dominated by the multi-head self-attention computation, which involves three matrix multiplications (Q, K, V projections), per-head score computation, softmax, and the output projection. predict_masked_genes is slower than encode_cell because it additionally runs the MLM head (a matrix multiplication of masked representations against the full vocabulary embedding table).

gene_network_attention is the fastest of the three because it skips the value computation and output projection — it only computes the Q and K projections and the per-head attention weight matrices.

7.4 Scaling considerations

The attention computation is $O(n^2 d)$ where $n$ is the number of expressed genes and $d$ is the embedding dimension. For a typical scRNA-seq cell with ~3,000 expressed genes (within a vocabulary of 20,000), the attention matrix is $3000 \times 3000$, which is manageable in numpy but becomes the bottleneck for large-scale batch processing.

The K-Encoder neighbourhood aggregation in ScKGBERTInterface has additional cost of $O(n \cdot \bar{k})$ where $\bar{k}$ is the mean PPI neighbourhood size. In the STRING database, the mean degree for human proteins is approximately 10--20 high-confidence interactions, so this is typically a modest multiplicative factor.

7.5 Comparison of attention mechanisms

| Mechanism | Time complexity | Used in |
| --- | --- | --- |
| Gaussian attention | $O(n^2 d)$ | ScKGBERTInterface |
| Scaled dot-product | $O(n^2 d)$ | GeneformerInterface |

Both mechanisms have the same asymptotic complexity. Gaussian attention is marginally more expensive per element due to the L2 distance computation (requiring $\|q\|^2 + \|k\|^2 - 2\,q \cdot k$ vs. the single $q \cdot k$ dot product), but the difference is small relative to the overall $O(n^2 d)$ cost.


8. Citations

Primary references

scKGBERT:

Li Y, Qiao G, Du H, Gao X, Wang G. "scKGBERT: a knowledge-enhanced foundation model for single-cell transcriptomics." Genome Biology 26:402 (2025). DOI: 10.1186/s13059-025-03564-3

Geneformer:

Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, Manber H, Bowman AGL, Dias N, Tsao PS, Bhatt DL, Ellinor PT. "Transfer learning enables predictions in network biology." Nature 619:616--624 (2023). DOI: 10.1038/s41586-023-06139-9

Supporting references

Transformer attention:

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. "Attention is all you need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). arXiv: 1706.03762

BERT masked language modelling:

Devlin J, Chang MW, Lee K, Toutanova K. "BERT: Pre-training of deep bidirectional transformers for language understanding." Proceedings of NAACL-HLT 2019:4171--4186 (2019). arXiv: 1810.04805

STRING database:

Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. "The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest." Nucleic Acids Research 51(D1):D599--D606 (2023). DOI: 10.1093/nar/gkac1000

scRNA-seq methodology:

Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, Schnall-Levin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens EA, Radich JP, Mikkelsen TS, Hindson BJ, Bielas JH. "Massively parallel digital transcriptional profiling of single cells." Nature Communications 8:14049 (2017). DOI: 10.1038/ncomms14049

Geneformer v2 corpus:

Theodoris CV. "Geneformer v2: Foundation model trained on 95 million single-cell transcriptomes." Hugging Face model repository (2024). https://huggingface.co/ctheodoris/Geneformer


Appendix A: Edge Cases and Defensive Behaviour

A.1 All-zero expression

Both interfaces handle the case where all genes have zero expression:

  • rank_value_encode returns an empty int64 array.
  • ScKGBERTInterface.encode_expression returns a zero vector of shape (d_model,).
  • ScKGBERTInterface.encode_with_knowledge returns a zero vector.
  • GeneformerInterface.encode_cell returns a zero vector of shape (d_model,).
  • GeneformerInterface.predict_masked_genes returns three empty arrays.
  • GeneformerInterface.gene_network_attention returns [[]].

A.2 Gene indices exceeding vocabulary

If rank_value_encode produces gene indices >= n_genes, they are filtered out in all methods before embedding lookup. This prevents index-out-of-bounds errors when the expression vector is longer than the model's vocabulary.

A.3 Single expressed gene

  • ScKGBERTInterface: works normally (attention over a single token is trivially itself).
  • GeneformerInterface.predict_masked_genes: returns empty arrays (requires >= 2 tokens).
  • GeneformerInterface.gene_network_attention: returns [[]] (requires >= 2 tokens).

A.4 Numerical stability

Both attention mechanisms use the log-sum-exp trick:

  1. Subtract the maximum log-weight per row before exponentiation.
  2. Add $10^{-30}$ to the denominator to prevent division by zero.

This prevents floating-point overflow in the exponential and division-by-zero in the normalisation, even for extreme distance/score values.


Appendix B: Relationship to Full Model Architectures

The interfaces in this module implement the core algorithmic components of scKGBERT and Geneformer, not the complete training pipelines. Specifically:

What is implemented:

  • Rank-value encoding (tokenisation)
  • Gaussian attention (scKGBERT)
  • Multi-head self-attention (Geneformer)
  • Dual-encoder fusion (scKGBERT)
  • Masked gene prediction (Geneformer)
  • Gene importance extraction (scKGBERT)
  • Gene network attention extraction (Geneformer)
  • Cell type prediction via prototype matching (scKGBERT)

What is not implemented (out of scope):

  • Pretraining loops (gradient descent, loss computation, backpropagation)
  • Layer normalisation, residual connections, feed-forward sub-layers
  • Positional encoding beyond rank-position weighting
  • Multiple Transformer layers (both models use single-layer attention here)
  • Actual pretrained weights (weights are randomly initialised)

The interfaces are designed for integration into the SC-NeuroCore pipeline, where the concern is extracting biologically meaningful representations from gene expression data, not training foundation models from scratch. For production use with actual pretrained weights, the weight arrays can be replaced by loading from model checkpoints (e.g., HuggingFace).