Tutorial 46: NeuroBench Benchmarking¶
Generate standardised evaluation reports compatible with the NeuroBench framework for fair comparison with other neuromorphic systems. Reports include accuracy, spike efficiency, parameter count, and energy estimates.
Why NeuroBench¶
Every framework reports different metrics in different ways. NeuroBench provides a standard: same tasks, same metrics, same reporting format. This makes cross-framework comparison honest and reproducible.
Compute Metrics¶
Python
import numpy as np
from sc_neurocore.benchmarks import compute_metrics
rng = np.random.default_rng(42)
result = compute_metrics(
    predictions=np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0]),
    targets=np.array([0, 1, 2, 0, 0, 2, 0, 1, 2, 1]),
    spike_counts=rng.integers(10, 50, size=10),
    weights=[rng.standard_normal((128, 64)).astype(np.float32)],
    timesteps=16,
    task="mnist",
)
print(result.summary())
# Task: mnist
# Accuracy: 80.0%
# Total spikes: 312
# Spikes per sample: 31.2
# Spike efficiency: 2.56% accuracy per spike
# Parameters: 8,192
# Compute ops: 131,072 (params × timesteps)
# Energy estimate: 0.013 mJ (at 45nm CMOS)
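The derived metrics in the summary follow directly from the inputs. As a sanity check, a minimal sketch in plain NumPy (independent of `compute_metrics`) reproduces the arithmetic, using the fixed total of 312 spikes from the seeded run above:

```python
import numpy as np

predictions = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
targets = np.array([0, 1, 2, 0, 0, 2, 0, 1, 2, 1])
weights = [np.zeros((128, 64), dtype=np.float32)]  # same shape as above
timesteps = 16
total_spikes = 312  # total from the seeded example run

# Accuracy: fraction of matching predictions (8 of 10 here -> 0.8).
accuracy = float(np.mean(predictions == targets))

# Parameter count: total elements across all weight arrays (128 * 64).
n_params = sum(w.size for w in weights)

# Compute ops as reported in the summary: params * timesteps.
compute_ops = n_params * timesteps

# Spike efficiency: percent accuracy per spike per sample.
spikes_per_sample = total_spikes / len(predictions)
efficiency = 100 * accuracy / spikes_per_sample

print(accuracy, n_params, compute_ops, round(efficiency, 2))
```

This matches the summary line by line: 80.0% accuracy, 8,192 parameters, 131,072 ops, and 2.56% accuracy per spike.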
Export NeuroBench JSON¶
The standard NeuroBench reporting format:
Python
json_report = result.to_neurobench_json()
print(json_report)
# {
# "task": "mnist",
# "accuracy": 0.80,
# "n_parameters": 8192,
# "n_timesteps": 16,
# "total_spikes": 312,
# "spikes_per_sample": 31.2,
# "energy_mj": 0.013,
# "framework": "sc-neurocore",
# "version": "3.14.0"
# }
# Save for submission
from pathlib import Path
Path("neurobench_result.json").write_text(json_report)
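Before submitting, it is worth validating the exported JSON. A sketch using only the standard library, with the required field set inferred from the example output above (the `REQUIRED` set is an assumption, not part of the NeuroBench specification):

```python
import json

# Example report mirroring the fields shown above (values illustrative).
json_report = json.dumps({
    "task": "mnist",
    "accuracy": 0.80,
    "n_parameters": 8192,
    "n_timesteps": 16,
    "total_spikes": 312,
    "spikes_per_sample": 31.2,
    "energy_mj": 0.013,
    "framework": "sc-neurocore",
    "version": "3.14.0",
})

# Assumed minimal field set, taken from the example output above.
REQUIRED = {"task", "accuracy", "n_parameters", "n_timesteps",
            "total_spikes", "spikes_per_sample", "energy_mj"}

report = json.loads(json_report)
missing = REQUIRED - report.keys()
assert not missing, f"missing fields: {missing}"
assert 0.0 <= report["accuracy"] <= 1.0
print("report valid")
```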
Available Tasks¶
NeuroBench defines standardised benchmarks:
Python
from sc_neurocore.benchmarks import TASKS
for name, task in TASKS.items():
    print(f"{name:20s} | {task.n_classes:>3d} classes | "
          f"baseline {task.baseline_accuracy:.1%} | {task.description}")
| Task | Classes | Baseline | Description |
|---|---|---|---|
| mnist | 10 | 99.5% (ANN) | Handwritten digits |
| shd | 20 | 85% (SNN) | Spiking Heidelberg Digits (speech) |
| dvs_gesture | 11 | 95% (SNN) | DVS128 Gesture recognition |
| heartbeat | 2 | 90% (ANN) | ECG anomaly detection |
| keyword | 12 | 96% (ANN) | Speech keyword spotting |
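A common question is how far a measured result sits from the published baseline. A hedged sketch, where `BASELINES` and `gap_to_baseline` are hypothetical names (not part of `sc_neurocore`) and the values are copied from the table above:

```python
# Baselines copied from the table above (hypothetical helper, not library API).
BASELINES = {
    "mnist": 0.995,
    "shd": 0.85,
    "dvs_gesture": 0.95,
    "heartbeat": 0.90,
    "keyword": 0.96,
}

def gap_to_baseline(task: str, accuracy: float) -> float:
    """Signed gap: positive means above the published baseline, negative below."""
    return accuracy - BASELINES[task]

# e.g. a 99.49% MNIST result sits just below the 99.5% ANN baseline.
print(f"{gap_to_baseline('mnist', 0.9949):+.4f}")
```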
Full Benchmark Pipeline¶
Python
from sc_neurocore.training import SpikingNet, train_epoch, evaluate, auto_device
from sc_neurocore.benchmarks import compute_metrics
from sc_neurocore.training.utils import SpikeMonitor
# 1. Train model
device = auto_device()
model = SpikingNet(n_input=784, n_hidden=128, n_output=10).to(device)
# ... training loop ...
# 2. Evaluate with spike counting
monitor = SpikeMonitor(model)
test_loss, test_acc = evaluate(model, test_loader, n_timesteps=25, device=device)
total_spikes = sum(
    monitor.get(name).sum().item() for name in monitor.layer_names
    if monitor.get(name) is not None
)
# 3. Generate NeuroBench report
result = compute_metrics(
    predictions=all_predictions,
    targets=all_targets,
    spike_counts=spike_counts_per_sample,
    weights=[p.detach().cpu().numpy() for p in model.parameters()],
    timesteps=25,
    task="mnist",
)
print(result.summary())
print(result.to_neurobench_json())
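The pipeline above references `all_predictions` and `all_targets` without showing how they are collected. One common pattern is to accumulate per batch and concatenate; a framework-agnostic sketch, where the loader and forward pass are stand-ins (not `sc_neurocore` API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a test loader: three (inputs, labels) batches of size 4.
fake_loader = [(rng.standard_normal((4, 784)), rng.integers(0, 10, size=4))
               for _ in range(3)]

def fake_forward(x):
    # Stand-in for model inference: argmax over random logits.
    return rng.standard_normal((x.shape[0], 10)).argmax(axis=1)

preds, targs = [], []
for x, y in fake_loader:
    preds.append(fake_forward(x))
    targs.append(y)

# Concatenate per-batch arrays into the flat arrays compute_metrics expects.
all_predictions = np.concatenate(preds)
all_targets = np.concatenate(targs)
print(all_predictions.shape)  # (12,)
```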
SC-NeuroCore vs NeuroBench Leaderboard¶
Honest comparison (measured, not estimated):
| Task | SC-NeuroCore | Best Published | Gap |
|---|---|---|---|
| MNIST | 99.49% (ConvSNN) | 99.72% (SEW-ResNet) | -0.23% |
| SHD | not yet measured | 95.1% (SpikFormer) | — |
| DVS Gesture | not yet measured | 98.2% (TET) | — |
We publish measured numbers only. SHD and DVS Gesture benchmarks are pending — we'll update this table when results are available.
References¶
- Yik et al. (2024). "NeuroBench: Advancing Neuromorphic Computing through Collaborative, Fair and Representative Benchmarking." Nature Communications.
- Cramer et al. (2020). "The Heidelberg Spiking Data Sets for the Systematic Evaluation of Spiking Neural Networks." IEEE TNNLS.