Reposted from SIGDA Newsletter Seq 1, 2019.
Classical computer architectures adopt separate hardware components for computation (e.g., CPUs and GPGPUs) and for data storage (e.g., DRAM-based main memory or disk-based secondary storage). As a result, the performance of modern data-intensive applications is often limited by the speed of the memory/storage devices, as well as by the bandwidth of the datapath between CPUs/GPUs and memory/storage, creating the well-known von Neumann bottleneck. To overcome this problem, processing-in-memory (PIM) technology has been proposed to incorporate processing capability into random-access memory (RAM) on a single chip. By moving some computation from the processor to the memory, the overhead of transmitting data between the two can be remarkably reduced. As reported in prior literature, PIM becomes more feasible as main memory capacity increases, and it is especially valuable for emerging data-centric computing scenarios such as data mining and machine learning.
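The bottleneck can be made concrete with a back-of-the-envelope sketch (my illustration, not from the article): for a memory-bound kernel such as a dot product, the number of arithmetic operations per byte moved over the processor–memory datapath is tiny, so the datapath, not the ALUs, bounds performance. The element size and kernel below are illustrative assumptions.

```python
# Toy arithmetic-intensity estimate for dot(a, b) on vectors of length n,
# assuming 8-byte elements streamed once from main memory (illustrative only).

def traffic_bytes(n, elem_size=8):
    """Bytes moved over the CPU<->memory datapath: both operand vectors read once."""
    return 2 * n * elem_size

def flops(n):
    """Floating-point operations in a dot product: n multiplies + n adds."""
    return 2 * n

n = 1_000_000
intensity = flops(n) / traffic_bytes(n)   # FLOPs per byte moved
print(f"arithmetic intensity = {intensity:.3f} FLOP/byte")  # 0.125: memory-bound
```

At 0.125 FLOP/byte, even a modest processor outruns the memory system, which is exactly the gap PIM aims to close by computing where the data resides.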
Because early PIM proposals introduced fully programmable computation units (e.g., general-purpose processors or field-programmable gate arrays) into the memory, their design effort could be high and changes to the hardware/software stack were often inevitable. Thus, recent development of PIM has been driven by the emergence of modern nonvolatile memories (NVMs), such as phase-change memory (PCM), metal-oxide resistive RAM (ReRAM), and spin-transfer torque RAM (STT-RAM), which can directly perform logical and arithmetic operations in memory. Among these choices, ReRAM supports efficient matrix–vector multiplication and is widely used for neural computation, graph algorithms, and bulk bitwise operations. Nevertheless, PIM is not preferable to CPUs/GPUs for all flavors of computation, and extra research effort is needed to fully unleash its power.
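The matrix–vector capability of ReRAM follows from its crossbar structure: each cell's conductance encodes a matrix weight, input voltages drive the columns, and Kirchhoff's current law sums the per-cell currents on each row wire in a single analog step. A minimal functional model (my sketch, not a circuit simulator) of that summation:

```python
# Functional model of a ReRAM crossbar: row currents I[i] = sum_j G[i][j] * V[j],
# where G holds cell conductances and V the column input voltages.

def crossbar_mvm(G, V):
    """Currents collected on each row wire of a conductance crossbar."""
    return [sum(g * v for g, v in zip(row, V)) for row in G]

G = [[0.5, 0.0, 1.0],
     [0.25, 0.5, 0.0]]   # conductances (arbitrary units)
V = [1.0, 2.0, 0.5]      # input voltages
print(crossbar_mvm(G, V))  # -> [1.0, 1.25]
```

Every multiply-accumulate happens in place at a cell, which is why an m-by-n product costs one read cycle in the crossbar rather than m*n memory fetches on a CPU.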
Given the heterogeneous computation capabilities of CPUs/GPGPUs and of PIM, a promising approach is to selectively determine whether specific instructions, referred to as PIM-enabled instructions (PEIs), should be executed by PIM in main memory, as suggested by Ahn et al. With PEIs, programmers can designate the instructions that should be executed by PIM, thereby optimizing system performance. Representative PEIs include integer increment, finding the minimum element, floating-point addition, Euclidean distance computation, and vector dot products.
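The dispatch decision can be sketched as follows. This is a hypothetical toy model of the PEI idea, not the mechanism of Ahn et al.: the class name, the single-entry "locality predictor", and the address layout are all my inventions for illustration. The real proposal tracks cache-line locality in hardware and executes a PEI on the host core when its operand is likely cached, or near memory otherwise.

```python
# Toy PEI dispatcher (hypothetical): offload an operation to PIM when its
# operand address looks cold, run it on the host core when it looks hot.

class PEIHost:
    def __init__(self):
        self.hot = set()   # crude locality predictor: addresses touched before

    def execute(self, op, addr, memory):
        near_memory = addr not in self.hot   # cold data -> execute near memory
        self.hot.add(addr)
        where = "PIM" if near_memory else "host"
        return op(memory[addr]), where

memory = {0x10: [7, 3, 9]}
host = PEIHost()
print(host.execute(min, 0x10, memory))   # first touch: runs near memory
print(host.execute(min, 0x10, memory))   # now hot: runs on the host core
```

The point of the example is the policy split, not the predictor: the same instruction produces the same result either way, and only the execution site changes with locality.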
PIM technology is especially suitable for applications in the artificial intelligence area, as observed in prior studies. This is because certain NVMs, such as ReRAM, inherently support neural computation with their crossbar physical structure. By allocating a full-function (FF) subarray in the ReRAM space, remarkable energy savings are observed at only a slight area overhead. Because the FF subarray is established dynamically in the ReRAM, the ReRAM space can morph between storing data and hosting the FF subarray, allowing further performance enhancements.
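The morphing idea can be modeled as a mode bit per subarray. The sketch below is my simplification of the PRIME-style FF subarray: it only tracks which subarrays are lent to compute mode and how much ordinary storage remains, not the peripheral circuitry that actually switches roles.

```python
# Toy manager for a ReRAM space whose subarrays morph between ordinary
# storage ("mem") and full-function compute mode ("ff").

class ReRAMSpace:
    def __init__(self, n_subarrays):
        self.mode = ["mem"] * n_subarrays

    def morph_to_compute(self, idx):
        self.mode[idx] = "ff"          # subarray now acts as a full-function unit

    def morph_to_memory(self, idx):
        self.mode[idx] = "mem"         # reclaim it as ordinary storage

    def storage_capacity(self):
        return self.mode.count("mem")  # subarrays currently usable for data

space = ReRAMSpace(8)
space.morph_to_compute(0)
space.morph_to_compute(1)
print(space.storage_capacity())  # 6 subarrays left for data
space.morph_to_memory(1)
print(space.storage_capacity())  # 7
```

Because the mode switch is dynamic, the system can dedicate subarrays to neural computation only while a workload needs them and return the capacity to storage afterwards.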
In next-generation computing systems of explosive scale, the datapath between processing components and memory/storage components may become one of the major performance bottlenecks. Here, PIM is a promising way to build active memory cubes (AMCs), which provide not only data storage but also energy-efficient computation in a single component. Multiple AMCs can then be interconnected to construct a coherent and scalable main memory for performance-demanding applications such as scientific computing. Furthermore, the importance of PIM is expected to grow in the foreseeable future as the scale of computing systems rapidly increases.
As new applications keep emerging on the horizon, existing computer architectures are pushed ever further to deliver better performance and energy efficiency for the whole system. Among the potential technologies, PIM provides remarkably high performance and energy efficiency, and is a promising candidate for a key role in data-centric computing scenarios. We believe that PIM merits further research attention to reveal its value to a wider spectrum of applications.
P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, IEEE Press, 2016.
M. Rouse, "What is processing in memory (PIM)?," available online: https://searchbusinessanalytics.techtarget.com/definition/processing-in-memory-PIM
X. Yang, Y. Hou, and H. He, "A processing-in-memory architecture programming paradigm for wireless Internet-of-Things applications," MDPI Sensors, vol. 19, no. 140, 2019.
L. Song et al., "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017.
Y. Long, T. Na, and S. Mukhopadhyay, "ReRAM-based processing-in-memory architecture for recurrent neural network acceleration," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 12, pp. 2781–2794, 2018.
S. Li et al., "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories," Proceedings of the 53rd Annual Design Automation Conference (DAC), ACM, 2016.
J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015, pp. 336–348.
R. Nair et al., "Active Memory Cube: A processing-in-memory architecture for exascale systems," IBM Journal of Research and Development, vol. 59, no. 2/3, pp. 17:1–17:14, March–May 2015.