
SCIENTIA SINICA Informationis, Volume 49, Issue 3: 256-276(2019) https://doi.org/10.1360/N112018-00288

Deep learning hardware acceleration based on general vector DSP

  • Received: Dec 12, 2018
  • Accepted: Feb 26, 2019
  • Published: Mar 19, 2019

Abstract

As deep learning (DL) plays an increasingly significant role in many fields, designing high-performance, low-power, low-latency hardware accelerators for DL has become a topic of great interest in computer architecture. Based on the structure and optimization methods of DL algorithms, this study analyzes the difficulties and challenges in DL hardware design. Compared with the current mainstream DL hardware acceleration platforms, the advantages of DL hardware acceleration based on a general vector DSP are discussed, and acceleration techniques such as vector broadcasting and matrix conversion are described. To address the shortcomings of the general vector DSP, optimization techniques such as reconfigurable computing arrays that support both general vector computation and dedicated DL acceleration are discussed in depth.
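The "matrix conversion" mentioned above (illustrated in the paper by Figure 6, "Acceleration of convolution with Toeplitz matrix") lowers a convolution to a dense matrix-matrix product, which maps naturally onto the MAC units of a vector DSP. The following is a minimal NumPy sketch of this im2col-style lowering; the stride-1 "valid" setting, the function names, and the correctness check are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) tensor into a Toeplitz-style matrix whose columns
    are the flattened receptive fields of a stride-1 'valid' convolution."""
    c, h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_gemm(x, weights):
    """Convolution lowered to a single dense matrix multiplication (GEMM)."""
    k, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw)                 # (C*kh*kw, OH*OW)
    w_mat = weights.reshape(k, c * kh * kw)  # (K, C*kh*kw)
    out = w_mat @ cols                       # the GEMM a vector MAC array executes
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(k, oh, ow)

# Cross-check the lowering against a direct sliding-window computation.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))      # 3 channels, 8x8 feature map
w = rng.standard_normal((4, 3, 5, 5))   # 4 filters of size 3x5x5
ref = np.zeros((4, 4, 4))
for k in range(4):
    for i in range(4):
        for j in range(4):
            ref[k, i, j] = (x[:, i:i + 5, j:j + 5] * w[k]).sum()
assert np.allclose(conv2d_as_gemm(x, w), ref)
```

Once lowered this way, the convolution reuses the same dense GEMM kernel that already runs efficiently on a general vector DSP, at the cost of duplicating overlapping input pixels in the unfolded matrix.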


Funded by

National Natural Science Foundation of China (61832018)

National Natural Science Foundation of China (61572025)


References

[1] Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets. Neural Computation, 2006, 18: 1527-1554

[2] Rumelhart D E. Learning internal representations by error propagation. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press, 1986

[3] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of International Conference on Neural Information Processing Systems, 2012. 1097-1105

[4] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv

[5] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

[6] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

[7] Park E, Kim D, Yoo S. Energy-efficient neural network accelerator based on outlier-aware low-precision computation. In: Proceedings of International Symposium on Computer Architecture, 2018. 688-698

[8] Parashar A, Rhu M, Mukkara A, et al. SCNN: an accelerator for compressed-sparse convolutional neural networks. In: Proceedings of International Symposium on Computer Architecture, 2017. 27-40

[9] Yu J, Lukefahr A, Palframan D. Scalpel: customizing DNN pruning to the underlying hardware parallelism. SIGARCH Comput Archit News, 2017, 45: 548-560

[10] Chen Y H, Emer J, Sze V. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In: Proceedings of International Symposium on Computer Architecture, 2016. 367-379

[11] Akhlaghi V, Yazdanbakhsh A, Samadi K, et al. SnaPEA: predictive early activation for reducing computation in deep convolutional neural networks. In: Proceedings of International Symposium on Computer Architecture, 2018. 662-673

[12] Hegde K, Yu J, Agrawal R, et al. UCNN: exploiting computational reuse in deep neural networks via weight repetition. In: Proceedings of International Symposium on Computer Architecture, 2018. 674-687

[13] Zhang S J, Du Z D, Zhang L, et al. Cambricon-X: an accelerator for sparse neural networks. In: Proceedings of International Symposium on Microarchitecture, 2016

[14] Peemen M, Setio A A A, Mesman B, et al. Memory-centric accelerator design for convolutional neural networks. In: Proceedings of International Conference on Computer Design, 2013

[15] Yazdani R, Riera M, Arnau J M, et al. The dark side of DNN pruning. In: Proceedings of International Symposium on Computer Architecture, 2018. 790-801

[16] Yu J, Lukefahr A, Palframan D, et al. Scalpel: customizing DNN pruning to the underlying hardware parallelism. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017. 548-560

[17] Sze V, Chen Y H, Yang T J. Efficient processing of deep neural networks: a tutorial and survey. Proc IEEE, 2017, 105: 2295-2329

[18] Lavin A, Gray S. Fast algorithms for convolutional neural networks. In: Proceedings of Computer Vision and Pattern Recognition, 2016. 4013-4021

[19] Cong J S, Xiao B J. Minimizing computation in convolutional neural networks. In: Proceedings of International Conference on Artificial Neural Networks, 2014. 281-290

[20] Zhang J Y, Guo Y, Hu X. Design and implementation of deep neural network for edge computing. IEICE Trans Inf Syst, 2018, 101: 1982-1996

[21] Lu L, Liang Y, Xiao Q, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs. In: Proceedings of the 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2017. 101-108

[22] Kung H T. Why systolic architectures? Computer, 1982, 15: 37-46

[23] Jouppi N P, Young C, Patil N, et al. In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

[24] Du Z D, Fasthuber R, Chen T S, et al. ShiDianNao: shifting vision processing closer to the sensor. In: Proceedings of the 42nd ACM/IEEE International Symposium on Computer Architecture, 2015

[25] Desoli G, Chawla N, Boesch T, et al. 14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28 nm for intelligent embedded systems. In: Proceedings of Solid-State Circuits Conference, 2017. 238-239

[26] Liu S, Du Z, Tao J. Cambricon: an instruction set architecture for neural networks. SIGARCH Comput Archit News, 2016, 44: 393-405

[27] Li M, Huang R. Device and integration technologies for VLSI in post-Moore era (in Chinese). Sci Sin Inform, 2018, 48: 963-977

[28] Intel Xeon Phi Knights Mill for Machine Learning. 2017-10-18. https://www.servethehome.com/intel-knights-mill-for-machine-learning/

[29] Oh K S, Jung K. GPU implementation of neural networks. Pattern Recogn, 2004, 37: 1311-1314

[30] Coates A, Baumstarck P, Le Q, et al. Scalable learning for object detection with GPU hardware. In: Proceedings of International Conference on Intelligent Robots and Systems, 2009. 4287-4293

[31] Yun S B, Kim Y J, Dong S S, et al. Hardware implementation of neural network with expansible and reconfigurable architecture. In: Proceedings of International Conference on Neural Information Processing, 2002. 970-975

[32] Farabet C, Martini B, Corda B, et al. NeuFlow: a runtime reconfigurable dataflow processor for vision. In: Proceedings of Computer Vision and Pattern Recognition Workshops, 2011. 109-116

[33] Zhang C, Prasanna V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In: Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017. 35-44

[34] Pham P H, Jelaca D, Farabet C, et al. NeuFlow: dataflow vision processing system-on-a-chip. In: Proceedings of the 55th International Midwest Symposium on Circuits and Systems, 2012. 1044-1047

[35] Chen T, Du Z, Sun N, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices, 2014, 49: 269-284

[36] Chen Y J, Luo T, Liu S L, et al. DaDianNao: a machine-learning supercomputer. In: Proceedings of International Symposium on Microarchitecture, 2014. 609-622

[37] Liu D, Chen T, Liu S. PuDianNao: a polyvalent machine learning accelerator. SIGARCH Comput Archit News, 2015, 43: 369-381

[38] Du Z D, Fasthuber R, Chen T S, et al. ShiDianNao: shifting vision processing closer to the sensor. In: Proceedings of the 42nd International Symposium on Computer Architecture, 2015

[39] Han S, Liu X, Mao H. EIE: efficient inference engine on compressed deep neural network. SIGARCH Comput Archit News, 2016, 44: 243-254

[40] Merolla P A, Arthur J V, Alvarez-Icaza R. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 2014, 345: 668-673

[41] Kumar S. Introducing Qualcomm Zeroth processors: brain-inspired computing. 2013. https://www.qualcomm.com/news/onq/2013/10/10/introducing-qualcomm-zeroth-processors-brain-inspired-computing

[42] Demler M. CEVA XM6 accelerates neural nets. 2016. https://www.ceva-dsp.com/wp-content/uploads/2017/02/MPR-CEVA-XM6-Accelerates-Neural-Nets.pdf

[43] Demler M. CEVA NeuPro accelerates neural nets. 2018. https://www.ceva-dsp.com/wp-content/uploads/2018/02/Ceva-NeuPro-Accelerates-Neural-Nets.pdf

[44] Chen S, Wang Y, Liu S. FT-Matrix: a coordination-aware architecture for signal processing. IEEE Micro, 2014, 34: 64-73

[45] Tan H B, Chen H Y, Liu S, et al. Modeling and evaluation for gather/scatter operations in vector-SIMD architectures. In: Proceedings of the 28th International Conference on Application-specific Systems, Architectures and Processors, 2017

  • Figure 1

    (Color online) Evolution of CNN algorithms [2-5]

  • Figure 2

    (Color online) Algorithm model of LeNet5

  • Figure 3

    (Color online) Multidimensional convolution algorithm

  • Figure 4

    (Color online) Data repetition in convolution

  • Figure 5

    (Color online) Relationship among precision, bit width, and accuracy in the LeNet algorithm

  • Figure 6

    (Color online) Acceleration of convolution with Toeplitz matrix

  • Figure 7

    (Color online) Acceleration of convolution with FFT

  • Figure 8

    (Color online) Systolic array in TPU

  • Figure 9

    (Color online) Requirements of deep learning hardware acceleration

  • Figure 10

    (Color online) Architecture of CEVA NeuPro [43]

  • Figure 11

    (Color online) Architecture of Synopsys EV6x

  • Figure 12

    (Color online) Cadence Tensilica Vision C5

  • Figure 13

    (Color online) Architecture of STM DSP [25]

  • Figure 14

    (Color online) Multi-core architecture of FT-M7002 DSP

  • Figure 15

    (Color online) Software support for FT-M7002 DSP

  • Figure 16

    (Color online) Architecture of single core of FT-Matrix

  • Figure 17

    (Color online) Matrix multiplication on FT-Matrix DSP

  • Figure 18

    (Color online) Matrix transpose transfer by DMA in FT-Matrix

  • Figure 19

    (Color online) Problem of sparse matrix in deep learning hardware

  • Figure 20

    (Color online) Configurable deep learning acceleration based on FT-Matrix

  • Figure 21

    (Color online) Configurable computing array based on FT-Matrix VPU

  • Figure 22

    (Color online) Comparison of MAC structures before and after optimization

  • Figure 23

    (Color online) VGather/VScatter instruction of FT-Matrix

  • Table 1   Comparison of deep learning hardware acceleration platforms
    Hardware platform | Advantages | Disadvantages | Technical solutions
    CPU/GPU | Flexible programming, support for multiple algorithms, extensive technical accumulation | Designed for general-purpose computing; high energy consumption when used for deep learning acceleration | CPU/GPU + function expansion/specialized processing module
    FPGA | Configurable, short design cycle | Relatively high unit energy consumption and delay | Customized for the algorithm
    ASIC | Customized; advantageous in performance, power consumption, and latency | Long development cycle, large manpower and material cost; not flexible enough | Customized for the algorithm
    Brain-inspired chips | Low power consumption; consistent with the biological neural network prototype | Low precision; limited by current technology | Imitate biological neural networks using new technologies and materials
  • Table 2   Comparison among FT-Matrix DSP, CPU, and GPU
    Chip | Convolution: 144$\times$5 | Convolution: 16$\times$5 | Matrix$\times$matrix: 144$\times$144 | Matrix$\times$vector: 144
    CPU | 0.0013558 | 0.0001026 | 0.000802 | 0.0004747
    GPU | 0.0002323 | 0.0001902 | 0.0002809 | 0.0002488
    FT-Matrix | 0.000218406 | 0.000005787 | 0.000026067 | 0.000000541
    Speedup (FT-Matrix vs. CPU) | 6.207704917 | 17.72939347 | 30.76686999 | 877.4491682
    Speedup (FT-Matrix vs. GPU) | 1.063615468 | 32.86677035 | 10.77607703 | 459.8890943
  • Table 3   Instruction set of FT-Matrix
    Instruction type | Main function
    1. Flow control | Scalar branch, vector branch, wait, nop, etc.
    2. Scalar load/store | Scalar load/store of half/one/double/quad word with linear or circular addressing
    3. Scalar MAC1; 4. Scalar MAC2 | Basic operations (+/$-$, $\times$, /), FMA, dot product, complex multiplication, square root, elementary functions (sine/cosine/exp/log), format conversion, floating-point logic ops, etc.
    5. Scalar BP | Fixed-point +/$-$, shift, test (=, !=, $>$, $<$, etc.), logical ops, bit ops, broadcast ops, etc.
    6. Vector load/store 1; 7. Vector load/store 2 | (16x) vector load/store of half/one/double/quad word with linear or circular addressing
    8. Vector MAC1; 9. Vector MAC2; 10. Vector MAC3 | (16x) vector basic operations (+/$-$, $\times$), FMA, dot product, complex multiplication, data format conversion, floating-point logic ops, etc.
    11. Vector BP | (16x) vector fixed-point +/$-$, shift, test, logical ops, bit ops, shuffle, reduction, etc.
  • Table 4   MCPC probability distribution for different SIMD widths (SIMD : Bank = 1 : 1); a simulation sketch follows the tables
    Banks | MCPC $\leq$ 2 | MCPC $\leq$ 3 | MCPC $\leq$ 4 | MCPC $\leq$ 5 | Expected MCPC
    4 | 0.7969 | 0.9844 | 0.9999 | 0.9999 | 2.1249
    8 | 0.5008 | 0.9101 | 0.9902 | 0.9993 | 2.5934
    16 | 0.1948 | 0.7688 | 0.9629 | 0.9956 | 3.0515
    32 | 0.0293 | 0.5456 | 0.9073 | 0.9866 | 3.4508
    64 | 0.0000 | 0.2740 | 0.8041 | 0.9682 | 3.7608
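The statistics in Table 4 can be reproduced with a small Monte Carlo experiment, under the assumption (suggested by the numbers themselves, e.g. P(MCPC $\leq$ 2) = 0.7969 = 204/256 for 4 banks) that MCPC is the maximum number of lane requests landing on a single memory bank when each SIMD lane of a gather issues an independent, uniformly random bank address. The NumPy sketch below is an illustrative reading of the bank-conflict model in [45], not the paper's own simulator.

```python
import numpy as np

def mcpc_samples(num_banks, trials=100_000, seed=0):
    """Sample the worst-case bank occupancy (MCPC) of random vector gathers.

    Each trial draws one bank index per SIMD lane (SIMD width = num_banks,
    i.e. SIMD : Bank = 1 : 1) uniformly at random; the gather then needs as
    many memory cycles as the most heavily hit bank receives requests.
    """
    rng = np.random.default_rng(seed)
    worst = np.empty(trials, dtype=np.int64)
    for t in range(trials):
        lanes = rng.integers(0, num_banks, size=num_banks)        # one gather
        worst[t] = np.bincount(lanes, minlength=num_banks).max()  # serialized cycles
    return worst

for n in (4, 8, 16, 32, 64):
    w = mcpc_samples(n)
    print(f"banks={n:2d}  P(MCPC<=2)={np.mean(w <= 2):.4f}  "
          f"P(MCPC<=3)={np.mean(w <= 3):.4f}  E[MCPC]={w.mean():.4f}")
# For 4 banks this converges to roughly P(MCPC<=2) ~ 0.797 and E[MCPC] ~ 2.125,
# matching the first row of Table 4.
```

The trend in the table follows directly from this model: as the SIMD width (and matching bank count) grows, the chance that some bank receives several requests in the same cycle rises, so the expected number of serialized memory cycles per gather increases.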

Copyright 2019 Science China Press Co., Ltd. All rights reserved.
