
SCIENTIA SINICA Informationis, Volume 49, Issue 3: 277-294 (2019). https://doi.org/10.1360/N112018-00291

Accelerating convolutional neural networks on FPGAs

  • Received: Oct 30, 2018
  • Accepted: Feb 26, 2019
  • Published: Mar 19, 2019

Abstract

In recent years, convolutional neural networks (CNNs) have become widely adopted for computer vision tasks. Owing to their high performance, energy efficiency, and reconfigurability, FPGAs have been extensively explored as promising hardware accelerators for CNNs. However, previous FPGA designs based on the conventional convolution algorithm are often bounded by the computational capability of the device. This paper first introduces four convolution algorithms: the 6-loop (spatial) algorithm, general matrix-matrix multiplication (GEMM), the Winograd algorithm, and the fast Fourier transform (FFT) algorithm. We then survey the domestic and international FPGA implementations of these algorithms and describe the corresponding optimization techniques.
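
The conventional (spatial) algorithm referred to above is usually written as six nested loops over output channels, output pixels, input channels, and the two kernel dimensions. The following C++ sketch illustrates this 6-loop form; the dimension names and sizes are illustrative assumptions, not values from the paper:

```cpp
// Illustrative sizes (not from the paper): N_IN/N_OUT input/output channels,
// R x C output feature map, K x K kernel, stride 1, no padding.
constexpr int N_IN = 3, N_OUT = 8;
constexpr int R = 32, C = 32, K = 3;

// 6-loop spatial convolution: every output element is one chain of
// multiply-accumulate (MAC) operations over N_IN * K * K inputs.
void conv_6loop(const float in[N_IN][R + K - 1][C + K - 1],
                const float w[N_OUT][N_IN][K][K],
                float out[N_OUT][R][C]) {
    for (int m = 0; m < N_OUT; ++m)                  // loop 1: output channels
        for (int r = 0; r < R; ++r)                  // loop 2: output rows
            for (int c = 0; c < C; ++c) {            // loop 3: output columns
                float acc = 0.0f;
                for (int n = 0; n < N_IN; ++n)       // loop 4: input channels
                    for (int i = 0; i < K; ++i)      // loop 5: kernel rows
                        for (int j = 0; j < K; ++j)  // loop 6: kernel columns
                            acc += w[m][n][i][j] * in[n][r + i][c + j];
                out[m][r][c] = acc;
            }
}
```

FPGA accelerators built on this form typically tile and unroll some of the six loops so that the MAC operations map onto parallel DSP blocks and on-chip buffers.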


Funded by

National Natural Science Foundation of China (Grant No. 61672048)

Beijing Natural Science Foundation (Grant No. L172004)


References

[1] He K M, Zhang X Y, Ren S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

[2] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[3] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2015.

[5] Anderson A, Vasudevan A, Keane C, et al. Low-memory GEMM-based convolution algorithms for deep neural networks. arXiv preprint, 2017.

[6] Winograd S. Arithmetic Complexity of Computations. Vol 33. Philadelphia: SIAM, 1980.

[7] Qiu J T, Wang J, Yao S, et al. Going deeper with embedded FPGA platform for convolutional neural network. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016. 1--10.

[8] Zhang C, Li P, Sun G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2015. 1--10.

[9] Zhang C, Fang Z M, Zhou P P, et al. Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks. In: Proceedings of the International Conference on Computer-Aided Design (ICCAD), Austin, 2016. 1--9.

[10] Li H M, Fan X T, Jiao L, et al. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In: Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2016. 1--9.

[11] Wei X C, Yu C H, Zhang P, et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In: Proceedings of the Annual Design Automation Conference (DAC), Austin, 2017. 1--6.

[12] Lin X H, Yin S Y, Tu F B, et al. LCP: a layer clusters paralleling mapping method for accelerating inception and residual networks on FPGA. In: Proceedings of the Annual Design Automation Conference (DAC), San Francisco, 2018. 1--6.

[13] Zhao W L, Fu H H, Luk W, et al. F-CNN: an FPGA-based framework for training convolutional neural networks. In: Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), London, 2016. 1--8.

[14] Venieris S I, Bouganis C S. fpgaConvNet: a framework for mapping convolutional neural networks on FPGAs. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016. 1--8.

[15] Zhao R Z, Luk W. Optimizing CNN-based object detection algorithms on embedded FPGA platforms. In: Proceedings of the International Symposium on Applied Reconfigurable Computing (ARC), 2017. 1--13.

[16] Alwani M, Chen H, Ferdman M, et al. Fused-layer CNN accelerators. In: Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016. 1--12.

[17] Shen Y M, Ferdman M, Milder P. Escher: a CNN accelerator with flexible buffering to minimize off-chip transfer. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. 1--8.

[18] Shen Y M, Ferdman M, Milder P. Maximizing CNN accelerator efficiency through resource partitioning. In: Proceedings of the International Symposium on Computer Architecture (ISCA), Toronto, 2017. 1--13.

[19] Ma Y F, Suda N, Cao Y, et al. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In: Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2016. 1--8.

[20] Ma Y F, Cao Y, Vrudhula S, et al. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2017. 1--10.

[21] Ma Y F, Cao Y, Vrudhula S, et al. Optimizing the convolution operation to accelerate deep neural networks on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018. 1--14.

[22] Sharma H, Park J, Mahajan D, et al. From high-level deep neural models to FPGAs. In: Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016. 1--12.

[23] Zhang J L, Li J. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2017. 1--10.

[24] Zhang X F, Wang J S, Zhu C, et al. AccDNN: an IP-based DNN generator for FPGAs. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018. 1--8.

[25] Liu X H, Kim D H, Wu C, et al. Resource and data optimization for hardware implementation of deep neural networks targeting FPGA-based edge devices. In: Proceedings of the Annual Design Automation Conference (DAC), San Francisco, 2018. 1--6.

[26] Guan Y J, Liang H, Xu N Y, et al. FP-DNN: an automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. 1--8.

[27] Cong J, Xiao B J. Minimizing computation in convolutional neural networks. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN), Hamburg, 2014. 1--10.

[28] Shen J Z, Huang Y, Wang Z L, et al. Towards a uniform template-based architecture for accelerating 2D and 3D CNNs on FPGA. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2018. 1--10.

[29] Lu L Q, Liang Y, Xiao Q C, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. 1--8.

[30] Xiao Q C, Liang Y, Lu L Q, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs. In: Proceedings of the 54th Annual Design Automation Conference (DAC), 2017.

[31] Aydonat U, O'Connell S, Capalija D, et al. An OpenCL deep learning accelerator on Arria 10. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2017. 1--10.

[32] Podili A, Zhang C, Prasanna V, et al. Fast and efficient implementation of convolutional neural networks on FPGA. In: Proceedings of the IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2017. 1--8.

[33] Lu L Q, Liang Y, Xiao Q C, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018.

[34] Zhang C, Prasanna V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2017. 1--10.

[35] Ko J H, Mudassar B, Na T, et al. Design of an energy-efficient accelerator for training of convolutional neural networks using frequency-domain computation. In: Proceedings of the Annual Design Automation Conference (DAC), Austin, 2017. 1--6.

[36] Wei X C, Liang Y, Li X H, et al. TGPA: tile-grained pipeline architecture for low latency CNN inference. In: Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018.

[37] Lu L Q, Liang Y. SpWA: an efficient sparse Winograd convolutional neural networks accelerator on FPGAs. In: Proceedings of the 55th Annual Design Automation Conference (DAC), 2018.

[38] Zhao R, Song W N, Zhang W T, et al. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2017. 1--10.

[39] DiCecco R, Lacey G, Vasiljevic J, et al. Caffeinated FPGAs: FPGA framework for convolutional neural networks. In: Proceedings of the International Conference on Field-Programmable Technology (FPT), 2016. 1--8.

[40] Suda N, Chandra V, Mohanty A, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016. 1--10.

[41] Motamedi M, Gysel P, Akella V, et al. Design space exploration of FPGA-based deep convolutional neural networks. In: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2016. 1--6.

[42] Venieris S I, Bouganis C S. Design space exploration of FPGA-based deep convolutional neural networks. In: Proceedings of the IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016. 1--8.

[43] Sharma H, Park J, Amaro E, et al. DnnWeaver: from high-level deep network models to FPGA acceleration. In: Proceedings of the Workshop on Cognitive Architectures, 2016.

[44] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding. arXiv preprint, 2014.

[45] Wang Y, Xu J, Han Y H, et al. DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family. In: Proceedings of the Design Automation Conference (DAC), 2016.

[46] Zeng H Q, Chen R, Zhang C, et al. A framework for generating high throughput CNN implementations on FPGAs. In: Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA), 2018.

[47] Püschel M, Moura J M F, Johnson J R, et al. SPIRAL: code generation for DSP transforms. Proceedings of the IEEE, Special Issue on Program Generation, Optimization, and Adaptation, 2005. 232--275.

[48] Han S, Mao H Z, Dally W J. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint, 2015.

[49] Zeng H, Chen R, Zhang C, et al. A framework for generating high throughput CNN implementations on FPGAs. In: Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, 2018. 1--10.

[50] Lavin A, Gray S. Fast algorithms for convolutional neural networks. arXiv preprint, 2015.

  • Figure 1

    (Color online) CNN applications. (a) Face recognition; (b) lane detection; (c) stereo matching; (d) speech recognition

  • Figure 2

    (Color online) FPGA structure and a general architecture for accelerating CNNs on FPGA. (a) FPGA structure; (b) architecture for accelerating CNNs on FPGA

  • Figure 3

    (Color online) Process flow of HLS

  • Figure 4

    (Color online) Converting a convolution operation into a GEMM operation (a minimal im2col sketch is given after the tables below)

  • Table 1   Summary of the four convolution algorithms and related work (a minimal Winograd example is given after the tables below)
    | Algorithm | Arithmetic reduction | Transformation overhead | Inner computation | Domestic related work | International related work |
    | --- | --- | --- | --- | --- | --- |
    | Spatial | None | None | MAC | [7-13] | [14-25] |
    | GEMM | None | Low | Vector product | [26] | [27] |
    | Winograd | High | High | EWMM (element-wise matrix multiplication) | [28-30] | [31,32] |
    | FFT | Medium | Medium | EWMM (complex) | [33] | [34,35] |
  • Table 2   Implementations of non-conventional convolution algorithms on FPGA
    Algorithm: Winograd, FFT, GEMM
    Reference: [28], [29], [38], [31], [34], [39], [8], [26]
    FPGA: VCU440, ZCU102, Virtex7 VX690T, Arria10 GX1150, Stratix5 QPI, Stratix5 GXA7, Kintex KU060, Stratix-V GSMD5
    DSP: 2880, 2520, 3683, 1576, 224, 256, 1058, 1590
    Logic (K): 5541, 600, 505, 246, 201, 228, 150, 172
    Frequency (MHz): 200, 200, 200, 303, 200, 194, 200, 150
    Precision: 16 bit, 16 bit, FP32, FP16, 16 bit, 16 bit, 16 bit, FP16
    Network: VGG, VGG, AlexNet, AlexNet, VGG, AlexNet, VGG, VGG
    Performance (GOP/s): 821, 3045, 46, 1382, 123, 66, 360, 364.36
    Power (W): 23.6, 44.3, 13.2, 33.9, 25.0, 25.0
  • Table 4   Optimization techniques for accelerating CNNs on FPGA
    | | Design space exploration | Data transfer optimization | On-chip resource optimization | Hardware unit generation |
    | --- | --- | --- | --- | --- |
    | Related work | [7,11,20,21,29,40,41] | [11,12,16,23] | [18,25,30] | [15,22-24,42] |
  • Table 5   Tool-chains for deploying CNNs on FPGA
    | | [14] | [44] | [42] | [8] | [26] | [45] | [29] |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | FPGA | Zynq ZC702, Zynq ZC706 | Zynq ZC702, Stratix-V SGSD5 | Zynq Z7045, Virtex7 VX960T | Kintex KU060 | Stratix-V GSMD5 | Intel HARP | Zynq ZC706, Zynq ZCU102 |
    | Language | HLS | HDL | ISA | HLS | OpenCL | SPIRAL [46] | HLS |
    | Software interface | Caffe & Torch | Caffe | Caffe | Caffe | TensorFlow | Specific input program | C++ |

Copyright 2019 Science China Press Co., Ltd.