A real-time and energy-efficient SRAM with mixed-signal in-memory computing near CMOS sensors
Journal of Real-Time Image Processing
(2024) 21:143
https://doi.org/10.1007/s11554-024-01520-x
RESEARCH
Jose‑Angel Diaz‑Madrid¹ · Gines Domenech‑Asensi² · Ramon Ruiz‑Merino² · Juan‑Francisco Zapata‑Perez²
Received: 21 February 2024 / Accepted: 17 July 2024
© The Author(s) 2024
Abstract
In-memory computing (IMC) represents a promising approach to reducing latency and enhancing the energy efficiency of
operations required for calculating convolution products of images. This study proposes a fully differential current-mode
architecture for computing image convolutions across all four quadrants, intended for deep learning applications within
CMOS imagers utilizing IMC near the CMOS sensor. This architecture processes analog signals provided by a CMOS
sensor without the need for analog-to-digital conversion. Furthermore, it eliminates the necessity for data transfer between
memory and analog operators as convolutions are computed within modified SRAM memory. The paper suggests modifying
the structure of a CMOS SRAM cell by incorporating transistors capable of performing multiplications between binary (−1
or +1) weights and analog signals. Modified SRAM cells can be interconnected to sum the multiplication results obtained
from individual cells. This approach facilitates connecting current inputs to different SRAM cells, offering highly scalable
and parallelized calculations. For this study, a configurable module comprising nine modified SRAM cells with peripheral
circuitry has been designed to calculate the convolution product on each pixel of an image using a 3 × 3 mask with binary
values (−1 or +1). Subsequently, an IMC module has been designed to perform 16 convolution operations in parallel, with
input currents shared among the 16 modules. This configuration enables the computation of 16 convolutions simultaneously,
processing a column per cycle. A digital control circuit manages the readout and storage of the digital weights, as
well as the multiply-and-add operations, in real time. The architecture was tested by performing convolutions between
binary masks of 3 × 3 values and images of 32 × 32 pixels to assess accuracy and scalability when two IMC modules are
vertically integrated. Convolution weights are stored locally as 1-bit digital values. The circuit was synthesized in 180 nm
CMOS technology, and simulation results indicate its capability to perform a complete convolution in 3.2 ms, achieving
an efficiency of 11,522 1-b TOPS/W (1-b tera-operations per second per watt) with a similarity to ideal processing of 96%.
Keywords Processing near sensor · In-memory computing · CMOS · Computer vision · Binarized-weight neural network ·
SRAM · Real-time processing
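As an idealized numerical reference for the operation described in the abstract, the following minimal sketch convolves an image with a 3 × 3 mask whose weights are restricted to −1 or +1. It is not the authors' circuit or simulation code. The function name `binary_conv3x3`, the "valid" border handling (no padding), and the unflipped (correlation-style) application of the mask are assumptions made for illustration only; no analog non-idealities are modeled.

```python
def binary_conv3x3(image, mask):
    """Ideal 3x3 convolution of a 2-D list `image` with a +/-1 `mask`.

    Hypothetical helper: multiplying by a binary weight only keeps or
    flips the sign of the input, and the nine signed values are summed.
    Borders are not padded (assumption), so the output is 2 smaller
    than the input in each dimension.
    """
    if len(mask) != 3 or any(len(row) != 3 for row in mask):
        raise ValueError("mask must be 3x3")
    if any(w not in (-1, 1) for row in mask for w in row):
        raise ValueError("mask weights must be -1 or +1")

    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    x = image[r + i][c + j]
                    # Binary weight: pass the input or invert its sign.
                    acc += x if mask[i][j] == 1 else -x
            out_row.append(acc)
        out.append(out_row)
    return out


if __name__ == "__main__":
    # Synthetic 32 x 32 test pattern (made-up values, not sensor data).
    image = [[float((r * 32 + c) % 7) for c in range(32)] for r in range(32)]
    mask = [[1, -1, 1], [-1, 1, -1], [1, -1, 1]]
    result = binary_conv3x3(image, mask)
    print(len(result), len(result[0]))  # 30 30
```

Because inputs and weights may each be positive or negative, the accumulated result covers all four quadrants. A reference of this kind is what a hardware result would be compared against when quoting a similarity to ideal processing.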
* Jose‑Angel Diaz‑Madrid
¹ Departamento de Ingeniería y Técnicas Aplicadas, CUD-UPCT, San Javier, Spain
² Departamento de Electrónica, UPCT, Pl. del Hospital 1, 30202 Cartagena, Spain
1 Introduction
Currently, numerous computer vision applications demand
real-time operation and low power consumption, relying
on algorithms tailored for low-resolution images. These
characteristics allow for partial implementation on a single
CMOS silicon chip, forming what are known as smart image
sensors. However, in recent years, we have witnessed the
rapid expansion of vision algorithms based on deep neural
networks (DNNs) and their variant, convolutional neural
networks (CNNs). While these algorithms offer unprecedented accuracy, they come with a significant cost in terms
of hardware and energy consumption. Due to their inherent parallel nature, most high-accuracy DNN algorithms
operate on multi-core computer architectures like graphical
processing units (GPUs), providing a cost-effective solution
for high-performance computing. However, these platforms
often suffer from two major drawbacks. Firstly, they rely on
the traditional von Neumann architecture, which presents a
significant challenge: the physical separation between processing units and memory modules. This separation leads
to inefficiencies in data transfer and power consumption as
data volume increases. Specifically, the architecture is constrained by the bottleneck created by data transfer between
processing and memory units, resulting in delays and
increased energy consumption. Secondly, the layer connecting the CMOS sensor and the DNNs typically requires
analog-to-digital conversion or is less efficient in speed or energy
than the other modules.
Addressing these challenges is crucial for advancing the efficiency and performance of computer vision systems.
In this context, in-memory computing (IMC) is being
developed as a promising approach to improving the performance and efficiency of computers [1]. The basic idea
behind IMC is to execute computational tasks inside the
memory itself, eliminating the need to transfer data between
memory and the processor. This approach can significantly
enhance speed and energy efficiency since data can be
processed locally and swiftly, bypassing time-consuming
data transfers. In terms of computational nature, the majority of a typical DNN’s algorithmic load relies on multiply
and accumulate (MAC) operations, which IMC architectures can address in both digital and mixed analog/digital domains. In the latter, input vectors and/or operation
results are encoded as voltage, current, electrical charge,
or width-modulated pulses. Currently, there is no consensus on the optimal type of IMC architecture, and research
efforts have focused on achieving better trade-offs between
various performance metrics, such as bandwidth, latency,
and energy consumption, while maintaining result accuracy.
This has spurred considerable research to identify the most
efficient and effective IMC architectures, highlighting the
fundamental importance of benchmarking IMC architectures
to compare different proposals quantitatively and qualitatively [2]. Moreover, IMC architectures can be implemented
using various types of memory, including static random-access memory (SRAM), dynamic random-access memory (DRAM), resistive random-access memory (RRAM),
phase change memory (PCM), and magnetic random-access
memory (MRAM) [3]. SRAM is commonly used due to its
high-speed access to memory cells, low power consumption,
and high endurance, crucial for IMC operations where data
requires frequent access and modification within memory
cells. Additionally, SRAM is relatively easy to modify to
integrate operations in the mixed analog/digital domain.
Regarding SRAM cells, more compact designs typically
rely on 6-transistor (6T) bit cell designs similar to those
in conventional SRAMs, with the bit cell grid’s peripheral
circuitry completing computing operations. However, activating multiple rows may create short-circuit paths, leading
to stochastic flipping of cell states. To mitigate this issue, 8T
bit (...truncated)