A real-time and energy-efficient SRAM with mixed-signal in-memory computing near CMOS sensors


Journal of Real-Time Image Processing (2024) 21:143
https://doi.org/10.1007/s11554-024-01520-x

RESEARCH

Jose‑Angel Diaz‑Madrid¹ · Gines Domenech‑Asensi² · Ramon Ruiz‑Merino² · Juan‑Francisco Zapata‑Perez²

Received: 21 February 2024 / Accepted: 17 July 2024
© The Author(s) 2024

Abstract

In-memory computing (IMC) represents a promising approach to reducing latency and enhancing the energy efficiency of operations required for calculating convolution products of images. This study proposes a fully differential current-mode architecture for computing image convolutions across all four quadrants, intended for deep learning applications within CMOS imagers utilizing IMC near the CMOS sensor. This architecture processes analog signals provided by a CMOS sensor without the need for analog-to-digital conversion. Furthermore, it eliminates the necessity for data transfer between memory and analog operators as convolutions are computed within modified SRAM memory. The paper suggests modifying the structure of a CMOS SRAM cell by incorporating transistors capable of performing multiplications between binary (−1 or +1) weights and analog signals. Modified SRAM cells can be interconnected to sum the multiplication results obtained from individual cells. This approach facilitates connecting current inputs to different SRAM cells, offering highly scalable and parallelized calculations. For this study, a configurable module comprising nine modified SRAM cells with peripheral circuitry has been designed to calculate the convolution product on each pixel of an image using a 3 × 3 mask with binary values (−1 or +1). Subsequently, an IMC module has been designed to perform 16 convolution operations in parallel, with input currents shared among the 16 modules. This configuration enables the computation of 16 convolutions simultaneously, processing a column per cycle.
A digital control circuit manages both the reading and storing of digital weights and the multiply-and-add operations in real time. The architecture was tested by performing convolutions between 3 × 3 binary masks and 32 × 32 pixel images to assess accuracy and scalability when two IMC modules are vertically integrated. Convolution weights are stored locally as 1-bit digital values. The circuit was synthesized in 180 nm CMOS technology, and simulation results indicate that it can perform a complete convolution in 3.2 ms, achieving an efficiency of 11,522 1-b TOPS/W (1-b tera-operations per second per watt) with a similarity to ideal processing of 96%.

Keywords Processing near sensor · In-memory computing · CMOS · Computer vision · Binarized-weight neural network · SRAM · Real-time processing

* Jose‑Angel Diaz‑Madrid
¹ Departamento de Ingeniería y Técnicas Aplicadas, CUDUPCT, San Javier, Spain
² Departamento de Electrónica, UPCT, Pl. del Hospital 1, 30202 Cartagena, Spain

1 Introduction

Currently, numerous computer vision applications demand real-time operation and low power consumption, relying on algorithms tailored for low-resolution images. These characteristics allow for partial implementation on a single CMOS silicon chip, forming what are known as smart image sensors. However, in recent years, we have witnessed the rapid expansion of vision algorithms based on deep neural networks (DNNs) and their variant, convolutional neural networks (CNNs). While these algorithms offer unprecedented accuracy, they come with a significant cost in terms of hardware and energy consumption. Due to their inherently parallel nature, most high-accuracy DNN algorithms run on multi-core computer architectures such as graphical processing units (GPUs), which provide a cost-effective solution for high-performance computing. However, these platforms often suffer from two major drawbacks.
Firstly, they rely on the traditional von Neumann architecture, which presents a significant challenge: the physical separation between processing units and memory modules. This separation leads to inefficiencies in data transfer and power consumption as data volume increases. Specifically, the architecture is constrained by the bottleneck created by data transfer between processing and memory units, resulting in delays and increased energy consumption. Secondly, the layer connecting the CMOS sensor to the DNN typically requires analog-to-digital conversion, or is slower or less energy-efficient than the other modules. Addressing these challenges is crucial for advancing the efficiency and performance of computer vision systems.

In this context, in-memory computing (IMC) is being developed as a promising approach to improving the performance and efficiency of computers [1]. The basic idea behind IMC is to execute computational tasks inside the memory itself, eliminating the need to transfer data between memory and the processor. This approach can significantly enhance speed and energy efficiency, since data can be processed locally and swiftly, bypassing time-consuming data transfers. In terms of computational nature, the majority of a typical DNN’s algorithmic load consists of multiply and accumulate (MAC) operations, which IMC architectures can address in both the digital and the mixed analog/digital domains. In the latter, input vectors and/or operation results are encoded as voltage, current, electrical charge, or width-modulated pulses. Currently, there is no consensus on the optimal type of IMC architecture, and research efforts have focused on achieving better trade-offs between various performance metrics, such as bandwidth, latency, and energy consumption, while maintaining result accuracy.
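
As a rough software analogue of the MAC workload described above, the sketch below computes a 3 × 3 convolution with binary (−1 or +1) weights, the kind of operation the abstract assigns to each nine-cell module. It is a plain-Python illustration only and does not model the analog current-mode circuit. The function name `binary_mac_conv3x3`, the absence of border padding, and the use of cross-correlation (no kernel flip) are assumptions made for this example.

```python
def binary_mac_conv3x3(image, mask):
    """Slide a 3x3 mask of +1/-1 weights over a 2-D list of numbers.

    Hypothetical helper for illustration. Assumptions: no padding
    (output shrinks by 2 in each dimension) and no kernel flip.
    """
    if len(mask) != 3 or any(len(row) != 3 for row in mask):
        raise ValueError("mask must be 3x3")
    if any(w not in (-1, 1) for row in mask for w in row):
        raise ValueError("weights must be -1 or +1")

    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0  # accumulator for the nine multiply-accumulate steps
            for i in range(3):
                for j in range(3):
                    # With binary weights, each "multiply" only selects a sign.
                    acc += mask[i][j] * image[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Arbitrary 4x4 test image and mask (not taken from the paper).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
mask = [
    [-1, 1, 1],
    [-1, 1, 1],
    [-1, 1, 1],
]
result = binary_mac_conv3x3(image, mask)
print(result)  # prints [[24, 27], [36, 39]]
```

Each output value takes nine sign-selected additions, which is why binary weights are attractive for in-memory hardware: the multiplier reduces to choosing the polarity of the input signal.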
This lack of consensus has spurred considerable research to identify the most efficient and effective IMC architectures, highlighting the fundamental importance of benchmarking IMC architectures to compare different proposals quantitatively and qualitatively [2]. Moreover, IMC architectures can be implemented using various types of memory, including static random-access memory (SRAM), dynamic random-access memory (DRAM), resistive random-access memory (RRAM), phase change memory (PCM), and magnetic random-access memory (MRAM) [3]. SRAM is commonly used due to its high-speed access to memory cells, low power consumption, and high endurance, all of which are crucial for IMC operations, where data in the memory cells is frequently accessed and modified. Additionally, SRAM is relatively easy to modify to integrate operations in the mixed analog/digital domain. The more compact SRAM designs typically rely on 6-transistor (6T) bit cells similar to those in conventional SRAMs, with the peripheral circuitry of the bit-cell array completing the computing operations. However, activating multiple rows may create short-circuit paths, leading to stochastic flipping of cell states. To mitigate this issue, 8T bit (...truncated)