• [CNN Accelerator on FPGA, FPGA 2016] Going Deeper with Embedded FPGA Platform for Convolutional Neural Network.
• [LSTM Accelerator on FPGA, FPGA 2017 Best Paper] ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.
• [FPGA in the Cloud, CF 2014] Enabling FPGAs in the Cloud.
• [MapReduce on FPGA, FPGA 2010] FPMR: MapReduce Framework on FPGA: A Case Study of RankBoost Acceleration.
• [PIM with ReRAM, ISCA 2016] PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory.

1. Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu, J., Tang, T., Xu, N., Song, S., Wang, Yu (2016): Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. Accepted in: The 24th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’16).

CNN-based approaches have achieved great success in many applications; however, the computationally intensive and resource-consuming nature of CNNs makes them difficult to run on embedded systems. In this paper, we make an in-depth investigation of the memory footprint and bandwidth problems involved in accelerating state-of-the-art CNN models for ImageNet classification on an embedded FPGA platform, and show that CONV layers are computation-centric while fully-connected (FC) layers are memory-centric. In addition, a dynamic-precision data quantization method and a convolver design that is efficient for all CNN layer types are proposed to improve bandwidth and resource utilization. A data arrangement method for FC layers is proposed to further ensure high utilization of the external memory bandwidth. Empirical experiments show that the proposed methods are very efficient, allowing CNNs to run at high speed on the embedded platform without significantly reducing accuracy. The analysis in this work provided key inspiration for subsequent FPGA-based CNN accelerators.
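The dynamic-precision idea can be illustrated in a few lines: instead of a single global fixed-point format, each layer searches for the radix point that minimizes its own quantization error. The sketch below is a minimal model of that per-layer search, not the paper's implementation; the 8-bit width, function names, and example weight ranges are illustrative assumptions.

```python
import numpy as np

def quantize(w, total_bits, frac_bits):
    """Round to fixed point with `frac_bits` fractional bits, saturating
    at the representable range of a signed `total_bits`-bit integer."""
    scale = 2.0 ** frac_bits
    qmax = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(w * scale), -qmax - 1, qmax) / scale

def best_frac_bits(w, total_bits=8):
    """Per-layer search for the radix point minimizing quantization error:
    the core of dynamic-precision quantization (each layer gets its own
    fixed-point format rather than one shared format)."""
    errors = {f: np.abs(w - quantize(w, total_bits, f)).sum()
              for f in range(total_bits)}
    return min(errors, key=errors.get)

# Layers with different dynamic ranges end up with different radix points:
conv_w = np.linspace(-0.2, 0.2, 64)   # small-magnitude weights -> many fractional bits
fc_w = np.linspace(-6.0, 6.0, 64)     # large-magnitude weights -> few fractional bits
```

A layer with small weights keeps more fractional bits (fine resolution), while a layer with large weights sacrifices fractional bits to avoid saturation, which is why a fixed global format wastes precision somewhere.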

2. Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao, S., Wang, Yu, Yang, H., Dally, W. J. (2017): ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. Accepted in: The 25th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’17).

Increasingly large LSTM models are used to improve the accuracy of speech recognition, but they are both computation- and memory-intensive, leading to high power consumption. In this paper, we propose a method to compress the LSTM model by $20\times$ with high hardware utilization and without sacrificing prediction accuracy. We then propose a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow. A hardware architecture, named ESE, is designed to deal with the irregularity introduced by compression, so that it can work directly on the sparse LSTM model. Experiments show that LSTM implemented with ESE on Xilinx FPGAs outperforms LSTM on state-of-the-art CPUs and GPUs while achieving higher energy efficiency. This work allows us to accelerate not only CNNs but also LSTMs on FPGAs, giving FPGA-based accelerators a wider range of applications and greater scalability potential.
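The pipeline behind such compression can be sketched as magnitude pruning followed by a compressed-sparse-row (CSR) matrix-vector product, which is the core computation an engine like ESE performs on the sparse weights. This is a minimal software model under assumed parameters (90% sparsity, dense CSR built in Python), not the paper's encoding or hardware data flow.

```python
import numpy as np

def prune(W, sparsity=0.9):
    """Magnitude pruning: zero out the smallest-magnitude weights."""
    thresh = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= thresh, W, 0.0)

def to_csr(W):
    """Compressed sparse row: store only nonzeros plus column indices,
    so the pruned zeros cost neither storage nor multiply cycles."""
    vals, cols, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        row_ptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(row_ptr)

def spmv(vals, cols, row_ptr, x):
    """Sparse matrix-vector product: the gate computations of a sparse
    LSTM reduce to products of this shape on the compressed weights."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        s, e = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[s:e] @ x[cols[s:e]]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
x = rng.standard_normal(16)
Wp = prune(W)
y_sparse = spmv(*to_csr(Wp), x)   # matches the dense product Wp @ x
```

The irregularity the abstract mentions is visible here: rows have different nonzero counts, so PEs processing different rows finish at different times unless the scheduler balances the partitioning.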

3. Chen, F., Shan, Y., Zhang, Y., Wang, Yu, Franke, H., Chang, X., Wang, K. (2014): Enabling FPGAs in the Cloud. Accepted in: Proceedings of the 11th ACM Conference on Computing Frontiers (CF’14).

Cloud computing is becoming a major trend for delivering and accessing infrastructure on demand via the network. Many types of cloud workloads can be accelerated by FPGAs, as FPGAs achieve high throughput and predictable latency while providing programmability, low power consumption, and fast time-to-value. However, integrating FPGAs into the cloud is nontrivial due to several FPGA-specific issues. In this paper, we analyze the impediments to offering FPGAs as a shareable cloud resource and, to overcome them, present a framework and a prototype that provide an FPGA cloud solution within the scope of FPGA technology at the time. The prototype enables isolation between multiple processes in multiple VMs, precise quantitative allocation of acceleration resources, and priority-based workload scheduling. Using the prototype, we also demonstrate how abstraction, sharing, compatibility, and security can be achieved when using FPGAs in the cloud. This work provides a viable solution for FPGA-based computational acceleration in the cloud.
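The priority-based scheduling mentioned above can be modeled very simply: a fixed pool of accelerator slots is granted to the highest-priority requests first, and the rest wait. This is a toy illustration only; the job names, priority scale, and policy are assumptions, not the prototype's actual scheduler.

```python
import heapq

def schedule(jobs, num_slots):
    """Grant FPGA accelerator slots to the highest-priority jobs first.
    `jobs` is a list of (name, priority) pairs; higher priority wins."""
    # heapq is a min-heap, so negate priorities to pop the highest first.
    heap = [(-prio, name) for name, prio in jobs]
    heapq.heapify(heap)
    running = [heapq.heappop(heap)[1]
               for _ in range(min(num_slots, len(heap)))]
    waiting = [name for _, name in sorted(heap)]  # still priority order
    return running, waiting

# Three VMs request acceleration, but only two slots exist.
jobs = [("vm1-encrypt", 2), ("vm2-infer", 5), ("vm3-search", 3)]
running, waiting = schedule(jobs, num_slots=2)
```

Real schedulers must also handle preemption and reconfiguration cost, which is part of what makes FPGA sharing in the cloud nontrivial.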

4. Shan, Y., Wang, B., Yan, J., Wang, Yu, Xu, N., Yang, H. (2010): FPMR: MapReduce Framework on FPGA: A Case Study of RankBoost Acceleration. Accepted in: The 18th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’10).

FPGAs provide a highly parallel, low-power, and flexible hardware platform for machine learning and data mining, but the difficulty of programming FPGAs greatly limits their adoption. MapReduce is a parallel programming framework that can easily exploit the inherent parallelism in algorithms. In this paper, we propose FPMR, a MapReduce framework on FPGA that provides a programming abstraction, a hardware architecture, and basic building blocks to developers. High parallelism can be easily achieved on FPMR while programming effort is reduced: designers only need to map their applications onto the mapper and reducer modules, and task scheduling, communication, and data synchronization are handled automatically by the framework. In addition, we discuss the trade-offs among resources, performance, and memory bandwidth, and show that memory bandwidth is the limiting factor for application acceleration on FPMR. This work helps developers build and test machine learning accelerators on FPGAs faster, which benefits the development of the community.
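The division of labor FPMR automates in hardware can be sketched in software: the framework owns the map phase, the shuffle (group-by-key), and the reduce phase, while the developer supplies only the mapper and reducer. The sketch below is a generic MapReduce skeleton with the classic word-count example, not FPMR's hardware interface.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """MapReduce skeleton: the framework runs the phases and groups
    intermediate (key, value) pairs; users supply mapper and reducer."""
    groups = defaultdict(list)
    for rec in records:                  # map phase
        for key, value in mapper(rec):
            groups[key].append(value)    # shuffle: group values by key
    return {k: reducer(k, vs) for k, vs in groups.items()}  # reduce phase

# Word count: the developer writes only these two one-liners.
docs = ["fpga fpga cloud", "cloud mapreduce"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda k, vs: sum(vs),
)
# counts == {"fpga": 2, "cloud": 2, "mapreduce": 1}
```

In FPMR the same split holds in hardware: mapper and reducer become user-designed modules, while scheduling and data movement are fixed framework logic, which is also where the memory-bandwidth bottleneck the paper analyzes arises.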

5. Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Yu, Xie, Y. (2016): PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. Accepted in: The 43rd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA’16).

Processing-in-memory (PIM) is a promising solution to the “memory wall” challenge facing future computer systems. Instead of putting additional computation logic in or near memory, we propose a novel PIM architecture, called PRIME, to accelerate NN applications in ReRAM-based main memory, where ReRAM is an emerging metal-oxide resistive random access memory. In PRIME, a portion of the ReRAM crossbar arrays can be configured either as accelerators for NN applications or as normal memory for a larger memory space, taking advantage of both the PIM architecture and the efficiency of ReRAM-based computation. We then present our designs from the circuit level to the system level and conduct experiments demonstrating the ability of the proposed architecture to save energy and accelerate various NN applications based on MLPs and CNNs. This work uses the PIM architecture to reduce memory limitations while speeding up NN applications, providing a new type of accelerator structure.
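The reason a ReRAM crossbar computes "for free" is physics: input voltages on the row wires push a current I = V·G through each cell (Ohm's law), and each column wire sums its cells' currents (Kirchhoff's current law), so every column delivers one dot product in a single analog step. The sketch below is a numerical model of that behavior under idealized assumptions (no device noise or nonlinearity; the conductance values are illustrative), not PRIME's circuit design.

```python
import numpy as np

def crossbar_mvm(G, v):
    """Model of an analog matrix-vector product in a ReRAM crossbar:
    v are the input voltages on the rows, G[i, j] is the conductance of
    the cell at row i / column j, and the returned vector is the current
    summed on each column wire."""
    currents = np.zeros(G.shape[1])
    for j in range(G.shape[1]):          # each column wire...
        for i in range(G.shape[0]):      # ...sums its cells' currents
            currents[j] += v[i] * G[i, j]   # Ohm's law per cell: I = V * G
    return currents

# Weights stored as conductances, one column per output neuron.
G = np.array([[0.2, 0.5],
              [0.1, 0.3],
              [0.4, 0.1]])
v = np.array([1.0, 0.5, 0.25])
y = crossbar_mvm(G, v)   # same result as v @ G, but computed "in memory"
```

Because the multiply-accumulate happens where the weights are stored, no weight ever crosses the memory bus, which is exactly how PRIME sidesteps the memory wall for NN workloads.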