Research Proposal: New Inference Paradigms for ML at the Edge
Edge AI, RISC-V, FPGA, CFU, Quantization, Spiking Neural Networks, Neuromorphic Computing, ANN-to-SNN, Compute-in-Memory, Real-time Detection, Event Cameras, Co-design.
Motivation
Over the past decade, cloud-based training and inference pipelines have faced growing challenges: high latency, bandwidth bottlenecks, data privacy concerns [1], and escalating training energy costs [2]. Edge GPUs partially alleviate these issues but remain too general-purpose, lacking flexibility in numeric precision, memory hierarchy, and data movement [3]. Moreover, different applications stress different aspects:
- Autonomous systems, UAVs, and controlled nuclear fusion: demand sub-millisecond latency for control stability and decision safety [4], [5], [6].
- Medical and IoT devices: prioritize data privacy and local analytics to meet regulatory and ethical standards [1].
Recent evidence [7] shows that FPGA-based accelerators can outperform GPUs in both energy efficiency and deterministic latency when tuned for application-specific workloads. FPGA-based hardware–software co-design therefore emerges as a promising solution for low-power, low-latency edge computing tailored to specific application needs.
Problem Statement & Research Questions
Problem. Edge AI is constrained by end-to-end latency, energy per inference, and privacy. General-purpose accelerators lack support for application-specific numerics and event-driven workloads. SNNs promise ultra-low-power inference but lack portable toolchains (IR/quantization) and competitive accuracy at low time-step counts.
- RQ1: How far can RISC-V + FPGA co-design push latency/energy for real-time perception while preserving accuracy?
- RQ2: Can SNNs with internal complexity (multi-timescale dynamics, adaptive thresholds) match ANN accuracy at few-time-step inference on FPGA?
- RQ3: What portable SNN IR + quantization API enables “write once, target multiple back-ends” without accuracy regressions?
- RQ4: Which dataflows (streaming vs. memory-centric) minimize data motion for attention/convolution bottlenecks at the edge?
Project Objectives & Expected Contributions
- An open-source, reproducible full-stack edge ML system (models \(\to\) compiler \(\to\) RTL \(\to\) bitstream) that surpasses the ANN GPU-edge baseline on \(\ge 2\) of {latency, energy per inference, accuracy} for a selected application scenario.
- A portable SNN IR + quantization API for spike dynamics, demonstrated on FPGA.
- Evidence that internal complexity can reduce depth/width and time-steps while maintaining accuracy on edge tasks.
- An open benchmark harness with NeuroBench-style reporting for fair cross-hardware comparison.
Two design variables remain to be chosen:
- Application scenario: one that suits edge computing and demands extremely low latency (possibly video stream processing or controlled nuclear fusion).
- Target network structure: possibly accelerating RT-DETR [8] or YOLO, or directly optimizing existing Spiking Neural Networks (SNNs).
Methodology
To achieve the project objectives, the proposed research will follow a three-stage methodology. The warm-up phase establishes an ANN baseline and a RISC-V-coupled FPGA accelerator with deterministic low-latency dataflows. The mid-term phase investigates brain-inspired computing, especially SNNs, focusing on improving accuracy, enhancing the leaky integrate-and-fire (LIF) model (increasing “internal complexity” [9], [10]), sparse computation, quantizing/pruning models, and establishing standard SNN APIs. The long-term phase explores emergent intelligence, including hypercomputation beyond Turing limits. Details are given in the following sections.
Warm-up: Von Neumann path (RISC-V + FPGA, ANN first)
This phase has been planned as my graduation project, with the aim of pushing the limits of the von Neumann architecture (RISC-V) in edge computing by accelerating and optimizing existing ANNs. Taking RT-DETR [8] as an example target network, the steps are as follows:
Quantization & pruning for edge: We adopt sub-8-bit quantization (INT8 \(\to\) INT4/INT2 if accuracy allows) and structured sparsity to minimize off-chip transfers. We will report accuracy-latency-energy trade-offs under identical datasets and input resolutions.
RISC-V-based accelerator with CFU Playground: As an undergraduate warm-up project, we will imitate the design flow in [11] as shown in Figure 1. We will implement a VexRiscv + custom function unit (CFU) design within the open-source CFU Playground framework [12] shown in Figure fig-cfu-playground. The main difficulties are designing custom RISC-V instruction extensions for compute-intensive operators (e.g., depthwise/pointwise convolution, \(QK^T\), low-bit GEMM) and writing the corresponding RTL.
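Before committing to RTL, we plan to keep a bit-exact software reference of each candidate CFU instruction. The sketch below models one hypothetical instruction (the packing format and lane count are assumptions, not a fixed ISA proposal): a dot product over eight INT4 lanes packed into the two 32-bit source registers.

```python
def pack_int4(values):
    """Pack eight signed 4-bit values (range -8..7) into one 32-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xF) << (4 * i)
    return word

def cfu_mac_int4(rs1: int, rs2: int) -> int:
    """Bit-exact reference model of a hypothetical CFU instruction:
    dot product of eight packed INT4 lanes from rs1 and rs2."""
    acc = 0
    for i in range(8):
        a = (rs1 >> (4 * i)) & 0xF
        b = (rs2 >> (4 * i)) & 0xF
        # Sign-extend the 4-bit lanes before multiplying.
        a = a - 16 if a & 0x8 else a
        b = b - 16 if b & 0x8 else b
        acc += a * b
    return acc

# Example: an eight-element INT4 dot product in a single "instruction".
print(cfu_mac_int4(pack_int4([1, -2, 3, -4, 5, -6, 7, -8]),
                   pack_int4([1, 1, 1, 1, 1, 1, 1, 1])))  # -> -4
```

The same reference model doubles as the golden output for RTL simulation of the CFU.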
Figure 1: Warm-up phase design flow (adapted from [11])

Full-stack design and DSE: As a learning experience, we will reinvent the wheel by designing the entire accelerator in Chisel from scratch and open-sourcing it on GitHub. We will then explore a large multi-dimensional design space using automated design space exploration (DSE) methods (such as heuristic or evolutionary algorithms) to identify configurations that balance accuracy, energy, and latency. Finally, we will use an Arty A7-100T FPGA as the hardware platform for real measurements.
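As a rough illustration of the intended automation, the sketch below runs a random search over a toy design space (parameter names and the cost model are placeholders; the real objective would be fed by RTL/analytical models and measured accuracy):

```python
import random

# Hypothetical design space: PE array shape, on-chip buffer size, weight bit width.
SPACE = {"pe_rows": [4, 8, 16], "pe_cols": [4, 8, 16],
         "buffer_kb": [64, 128, 256], "w_bits": [2, 4, 8]}

def random_point():
    return {k: random.choice(v) for k, v in SPACE.items()}

def cost(p):
    """Placeholder objective: in the real flow this would combine measured
    latency, energy, and accuracy into a single weighted score."""
    latency = 1e6 / (p["pe_rows"] * p["pe_cols"] * (8 / p["w_bits"]))
    energy = p["buffer_kb"] * 0.1 + p["pe_rows"] * p["pe_cols"] * 0.5
    accuracy_penalty = {2: 30.0, 4: 5.0, 8: 0.0}[p["w_bits"]]
    return latency + energy + accuracy_penalty

best = min((random_point() for _ in range(1000)), key=cost)
print("best configuration:", best, "cost:", round(cost(best), 2))
```

Evolutionary operators (mutation/crossover over the configuration dict) would replace the pure random sampling once the search space grows.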
Mid-term: Brain-Inspired path (SNN on FPGA, portable toolchain)
This phase is planned as my potential PhD research topic. We will explore Brain-Inspired Computing (BIC) in computer vision tasks, with a particular focus on EdgeSNNs (parameters \(< 100\text{ M}\) [1]) and Compute-in-Memory (CIM).
SNNs have shown promise for ultra-low-power, event-driven inference at the edge. SNNs model neurons with explicit membrane dynamics (the LIF model, as shown in Figure fig-lif-model). Unlike conventional ANNs, SNNs process information in the temporal domain using binary spikes (event-driven coding), which makes them particularly well suited to SpikeCV cameras [13]; in my view, these are the next-generation vision sensors for edge applications such as autonomous driving.
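For concreteness, one common discrete-time formulation of the LIF dynamics (hard reset; \(u_t\) is the membrane potential, \(\tau\) the leak constant, \(V_{th}\) the firing threshold, and \(s_t\) the emitted spike) is

\[
u_t = \left(1 - \tfrac{1}{\tau}\right) u_{t-1}\,\bigl(1 - s_{t-1}\bigr) + \sum_i w_i\, x_{i,t},
\qquad
s_t = \Theta\!\left(u_t - V_{th}\right) =
\begin{cases}
1, & u_t \ge V_{th},\\
0, & \text{otherwise.}
\end{cases}
\]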
Previous work on SNNs in autonomous driving includes Spiking-YOLO [14], EMS-YOLO [15], etc. However, their performance still generally lags behind ANNs [16]. ECS-LIF [16] suggests that high-performance spiking detectors with ANN-matched accuracy and ultra-low energy costs are possible; however, one still needs to carefully select the hardware architecture. For video classification, research is even more nascent [13]. Early attempts include spiking recurrent networks (processing up to 300 time steps of video frames), hybrid ANN-SNN approaches for action recognition [13], SpikeVideoTransformer [17], and SpikeYOLO [18]. However, there is no SNN equivalent yet for many popular video models (e.g., no published spiking variant of SlowFast or the DETR detection transformer as of 2025).
Since the research steps here are less clear than in the warm-up phase, I list some potential research directions, to be pursued in parallel, below:
Explore SNN design / training methods: There are two routes to obtaining optimized SNNs: ANN-to-SNN conversion [1], [8] or direct training of SNNs. We will explore both routes to find the best-performing SNNs for our target application. We will also investigate advanced training techniques such as surrogate gradients, temporal backpropagation, and biologically inspired learning rules to improve SNN performance.
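As a minimal illustration of the surrogate-gradient route (a PyTorch sketch; the rectangular surrogate and its width `alpha` are one common choice, not the method we are committed to):

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate in the backward pass."""
    alpha = 1.0  # surrogate window width (hyperparameter)

    @staticmethod
    def forward(ctx, u_minus_thresh):
        ctx.save_for_backward(u_minus_thresh)
        return (u_minus_thresh >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u_minus_thresh,) = ctx.saved_tensors
        # Pass gradients only near the threshold; zero elsewhere.
        surrogate = (u_minus_thresh.abs() < SpikeFn.alpha / 2).float() / SpikeFn.alpha
        return grad_output * surrogate

# Usage inside an LIF layer: spikes = SpikeFn.apply(membrane - threshold)
```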
Develop a unified SNN toolchain: SNNs lack standard APIs, SNN-specific IRs (analogous to ONNX [1], [19]), and quantization strategies tailored to spike dynamics [19]. We will prototype a minimal graph-level SNN IR (ops, neuron nodes, timing semantics) plus a quantization API (e.g., \(\text{INT8} \times \text{INT2}\) spike ops, ternary spikes, per-layer time-step budgets) to decouple front-end training from back-end compilers, following the direction of NIR and recent co-design work that targets FPGA-friendly spike arithmetic. This addresses today’s fragmentation across neuromorphic stacks and enables “write once, target FPGA/Loihi/MCU” [20].
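A first sketch of what the graph-level IR might capture is shown below (all names are hypothetical; the point is that neuron dynamics, timing budgets, and quantization metadata live alongside the dataflow graph so a back-end can lower them together):

```python
from dataclasses import dataclass, field

@dataclass
class QuantSpec:
    weight_bits: int = 8          # e.g. INT8 weights
    spike_levels: int = 2         # binary (2) or ternary (3) spikes
    time_steps: int = 4           # per-layer time-step budget

@dataclass
class OpNode:
    name: str
    op: str                       # "conv2d", "linear", ...
    inputs: list
    attrs: dict = field(default_factory=dict)

@dataclass
class NeuronNode:
    name: str
    model: str                    # "LIF", "adaptive-LIF", ...
    tau: float                    # membrane time constant
    threshold: float
    quant: QuantSpec = field(default_factory=QuantSpec)

@dataclass
class SNNGraph:
    nodes: list                   # topologically ordered OpNode / NeuronNode
    global_time_steps: int = 4

# A two-layer toy graph a back-end compiler could lower to FPGA, Loihi, or MCU targets.
g = SNNGraph(nodes=[
    OpNode("conv1", "conv2d", inputs=["input"], attrs={"k": 3, "c_out": 16}),
    NeuronNode("lif1", model="LIF", tau=2.0, threshold=1.0),
])
```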
Further explore SNN internal complexity: A recent study, “Network model with internal complexity bridges artificial intelligence and neuroscience” [9], [10], marks a pivotal shift in thinking: instead of simply growing neural networks by adding more layers or parameters (“external complexity”), we can embed richer dynamics inside each neuron or module, a paradigm the authors term a “small model with internal complexity.”
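A small example of what “internal complexity” can mean in practice: a single neuron with two membrane timescales and a spike-driven adaptive threshold (a toy sketch with assumed constants, not the specific model of [9], [10]):

```python
import numpy as np

def adaptive_lif(inputs, tau_fast=2.0, tau_slow=20.0, v_th0=1.0, beta=0.2):
    """Single neuron with two leak timescales and spike-driven threshold adaptation.

    inputs: 1-D array of input currents, one entry per time step.
    Returns the binary spike train.
    """
    u_fast = u_slow = 0.0
    v_th = v_th0
    spikes = []
    for x in inputs:
        u_fast += (-u_fast / tau_fast) + x        # fast membrane component
        u_slow += (-u_slow / tau_slow) + x        # slow membrane component
        u = 0.5 * (u_fast + u_slow)
        s = 1 if u >= v_th else 0
        if s:                                     # reset and raise the threshold
            u_fast = u_slow = 0.0
            v_th += beta
        v_th += (v_th0 - v_th) / tau_slow         # threshold decays back toward v_th0
        spikes.append(s)
    return np.array(spikes)

print(adaptive_lif(np.ones(20) * 0.6))
```

The richer per-neuron dynamics are exactly what the portable IR above must be able to express, so that such neurons remain targetable on FPGA without per-backend rewrites.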
Long-term: Emergent Intelligence
This phase is my long-term aspiration toward artificial general intelligence (AGI). This stage will explore theoretical and practical computational boundaries and brand-new distributed computing paradigms inspired by the human brain, guided by recent theoretical investigations into emergence in artificial systems such as Berti et al. (2025) [21], who survey emergent abilities in LLMs and identify conditions, such as scaling, criticality, and compression, that contribute to spontaneous capability gains. Continuing the exploration of internal complexity from the mid-term phase, there are two major directions: Turing-complete machines and hypercomputation beyond Turing limits.
- Decentralized, event-driven architectures: We will first explore Turing-complete hardware and software systems that mimic the highly distributed, asynchronous nature and learning-while-inferencing behavior of the brain. Turing-equivalent cellular automata such as Conway’s Game of Life (CGOL) [25] (Figure fig-cgol), Langton’s ant, and Particle Life already demonstrate how simple local rules can give rise to complex, emergent patterns (Figure fig-gosper-glider-gun, Figure fig-pl1, Figure fig-pl2); a minimal sketch of such a rule is given after this list. While using CGOL itself as a practical “computer” is inefficient, it serves as a proof of concept that emergence can arise from simple components. The challenge is discovering the right set of rules (i.e., internal complexity) or learning algorithms that yield robust emergent intelligence, not just (external) complexity for its own sake [26]. We would then turn to nature (the hardware) to find out whether there is a physical system under our control that performs this set of rules intrinsically.
- Hypercomputation beyond Turing limits: The human brain might exploit computational principles beyond the scope of traditional Turing machines. Penrose and others (such as Stuart Hameroff) have hypothesized that quantum effects in neural microstructures (e.g., microtubules) could enable the brain to do things standard computers cannot [27]. Achieving AI with brain-like cognition might then require tapping into quantum computing [28], ONNs [29], Organoid Intelligence (OI) [30], [31], and beyond.
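To make the “simple local rules” point from the first direction concrete, here is a minimal NumPy implementation of Conway’s Game of Life update rule (a proof-of-concept toy for illustrating emergence, not a proposed architecture):

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous update of Conway's Game of Life on a toroidal grid."""
    # Count the eight neighbours of every cell using wrap-around shifts.
    neighbours = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

# A glider, the classic emergent "particle" of the Game of Life.
grid = np.zeros((16, 16), dtype=np.uint8)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):   # after 4 steps the glider has moved one cell diagonally
    grid = life_step(grid)
```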