logo

SCIENTIA SINICA Informationis, Volume 48, Issue 8: 978-999(2018) https://doi.org/10.1360/N112017-00024

GPU-based adaptive compacting neighborhood tabu search for hardware/software partitioning

More info
  • ReceivedFeb 7, 2017
  • AcceptedSep 18, 2017
  • PublishedFeb 7, 2018

Abstract

Hardware/software (HW/SW) partitioning is an essential step in hardware/software co-design as it determines the functions to be implemented in the hardware and software. HW/SW partitioning problems are NP hard. The increasing complexities of modern embedded systems worsen the problem, thus motivating the search for solutions by heuristics. The tabu search algorithm is an effective method for HW/SW partitioning. However, the process is time consuming. The existing tabu search methods for HW/SW partitioning focus on sequential implementations that involve a trade-off between solution time and solution quality. This trade-off sacrifices the solution quality. This paper presents a GPU-based adaptive compacting neighborhood tabu search for HW/SW partitioning. First, we present an adaptive strategy that enhances the search intensification to improve the solution quality. The massive parallelism of GPU architecture can reduce the solution time of the proposed strategy. Next, to ensure that the algorithm execute efficiently on the GPU, we further propose several implementation strategies such as the representation of task graph on the GPU, the mapping between the GPU thread and candidate, and data layout and memory access optimization. Finally, we realize the proposed method in a computing unified device architecture, namely CUDA, and verify the effectiveness according to the related benchmark with different communication-to-computation ratios and real-time constraint requirements. The results show that the proposed method can yield a better solution quality compared to the existing methods. By comparing with the naive GPU implementation of adaptive compacting neighborhood tabu search, the proposed implementation strategies on the GPU significantly reduced the solution time. In addition, we verified the advantage of the GPU-based method for very large HW/SW partitioning.


Funded by

国家自然科学基金(61472289)

湖北省自然科学基金(2015CBF254)


Acknowledgment

在此, 我们对给予本文的工作提出宝贵修改意见和建议的匿名审稿人表示感谢.


References

[1] Teich J. Hardware/software codesign: the past, the present, and predicting the future. Proc IEEE, 2012, 100:łinebreak 1411--1430. Google Scholar

[2] Sha E, Wang L, Zhuge Q. Power Efficiency for Hardware/Software Partitioning with Time and Area Constraints on MPSoC. Int J Parallel Prog, 2015, 43: 381-402 CrossRef Google Scholar

[3] Lin G, Zhu W X, Ali M M. A tabu search-based memetic algorithm for hardware/software partitioning. Math Probl Eng, 2014: 1--16. Google Scholar

[4] Zhang Y D, Wu L N, Wei G, et al. Hardware/software partition using adaptive ant colony algorithm. Control Decis, 2009, 24: 1385--1389. Google Scholar

[5] Erbas C, Cerav-Erbas S, Pimentel A D. Multiobjective optimization and evolutionary algorithms for the application mapping problem in multiprocessor system-on-chip design. IEEE Trans Evol Computat, 2006, 10: 358-374 CrossRef Google Scholar

[6] Luo S Q, Ma X X, Lu Y. An advanced non-dominated sorting genetic algorithm based SOC hardware/software partitioning. Acta Electron Sin, 2009, 37: 2595--2599. Google Scholar

[7] Luo L, Xia J, He H J, et al. Effective multi-objective genetic algorithm for hardware-software partitioning. Comput Sci, 2010, 37: 275--279. Google Scholar

[8] Nath P K, Datta D. Multi-objective hardware-software partitioning of embedded systems: A case study of JPEG encoder. Appl Soft Computing, 2014, 15: 30-41 CrossRef Google Scholar

[9] Shi W, Wu J, Lam S. Algorithms for bi-objective multiple-choice hardware/software partitioning. Comput Electrical Eng, 2016, 50: 127-142 CrossRef Google Scholar

[10] Wang R, Hung W, Yang G W, et al. Uncertainty model for configurable hardware/software and resource partitioning. IEEE Trans Comput, 2016, 66: 3217--3223. Google Scholar

[11] Arato P, Juhasz S, Mann Z A, et al. Hardware-software partitioning in embedded system design. In: Proceedings of IEEE International Symposium on Intelligent Signal Processing, Budapest, 2003. 197--202. Google Scholar

[12] Arató P, Mann Z, Orbán A. Algorithmic aspects of hardware/software partitioning. ACM Trans Des Autom Electron Syst, 2005, 10: 136-156 CrossRef Google Scholar

[13] Zhu J F, Wu J G, Shi W J, et al. Computign model and dynamic programming algorithm for multi-choice hardware/software partitioning on MPSoC. Comput Eng Sci, 2015, 37: 41--46. Google Scholar

[14] Jiang G, Wu J, Lam S K. Algorithmic aspects of graph reduction for hardware/software partitioning. J Supercomput, 2015, 71: 2251-2274 CrossRef Google Scholar

[15] Ma Y C, Zhang C, Luk W. Hybrid two-stage HW/SW partitioning algorithm for dynamic partial reconfigurable FPGAs. J Tsinghua Univ: Nat Sci Ed, 2016, 56: 246--252. Google Scholar

[16] Ma Y C, Liu J L, Zhang C, et al. HW/SW partitioning for region-based dynamic partial reconfigurable FPGAs. In: Proceedings of the 32nd International Conference on Computer Design, Seoul, 2014. 470--476. Google Scholar

[17] Chatha K S, Vemuri R. Hardware-software partitioning and pipelined scheduling of transformative applications. IEEE Trans VLSI Syst, 2002, 10: 193-208 CrossRef Google Scholar

[18] Wu J G, Chang B F, Srikanthan T. A hybrid branch-and-bound strategy for hardware/software partitioning. In: Proceedings of the 8th IEEE/ACIS International Conference on Computer and Information Science, Shanghai, 2009. 641--644. Google Scholar

[19] Gupta R K, De Micheli G. Hardware-software cosynthesis for digital systems. IEEE Des Test Comput, 1993, 10: 29-41 CrossRef Google Scholar

[20] Ernst R, Henkel J, Benner T. Hardware-software cosynthesis for microcontrollers. IEEE Des Test Comput, 1993, 10: 64-75 CrossRef Google Scholar

[21] Mishra A, Vakharia D, Hati A J, et al. Hardware software partitioning of task graph using genetic algorithm. In: Proceedings of Recent Advances and Innovations in Engineering, Jaipur, 2014. 1--5. Google Scholar

[22] Ferrandi F, Lanzi P L, Pilato C, et al. Ant colony optimization for mapping, scheduling and placing in reconfigurable systems. In: Proceedings of IEEE NASA/ESA Conference on Adaptive Hardware and Systems, Torino, 2013. 47--54. Google Scholar

[23] Abdelhalim M B, Habib S E D. An integrated high-level hardware/software partitioning methodology. Des Autom Embed Syst, 2011, 15: 19-50 CrossRef Google Scholar

[24] Henkel J, Ernst R. An approach to automated hardware/software partitioning using a flexible granularity that is driven by high-level estimation techniques. IEEE Trans VLSI Syst, 2001, 9: 273-289 CrossRef Google Scholar

[25] Zhu W B, Jin T B, Yin J Y. Hardware/software partitioning of RISP based on improved simulated annealing algorithm. Comput Measur Contrl, 2014, 22: 2991--2993. Google Scholar

[26] Zhang Y, Luo W, Zhang Z. A hardware/software partitioning algorithm based on artificial immune principles. Appl Soft Computing, 2008, 8: 383-391 CrossRef Google Scholar

[27] Koudil M, Benatchba K, Tarabet A. Using artificial bees to solve partitioning and scheduling problems in codesign. Appl Math Computation, 2007, 186: 1710-1722 CrossRef Google Scholar

[28] Wiangtong T, Cheung P Y K, Luk W. Comparing three heuristic search methods for functional partitioning in hardware/software co-design. Des Automation Embedded Syst, 2002, 6: 425-449 CrossRef Google Scholar

[29] Xiong Z H. Hardware/Software Partitioning Based on Dynamic Combination of Genetic Algorithm and Ant Algorithm. J Software, 2005, 16: 503-512 CrossRef Google Scholar

[30] Liu A, Feng J, Liang X. Algorithm of hardware/software partitioning based on genetic particle swarm optimization. J Comput-Aided Des Comput Graphics, 2010, 22: 927-933 CrossRef Google Scholar

[31] Jiang Y, Zhang H, Jiao X, et al. Uncertain model and algorithm for hardware/software partitioning. In: Proceesings of IEEE Computer Society Annual Symposium on VLSI, Amherst, 2012. 243--248. Google Scholar

[32] Li G S, Feng J F, Wang C, et al. Hardware/software partitioning algorithm based on the combination of genetic algorithm and tabu search. Eng Rev, 2014, 34: 151--160. Google Scholar

[33] Farahani A F, Kamal M, Salmani-Jelodar M. Parallel genetic algorithm based HW/SW partitioning. In: Proceedings of International Symposium on Parallel Computing in Electrical Engineering, Bialystok, 2006. 337--342. Google Scholar

[34] Wu Y, Zhang H, Yang H B. Research on parallel HW/SW partitioning based on hybrid PSO algorithm. In: Proceedings of the 9th International Conference on Algorithms and Architectures for Parallel Processing, Taipei, 2009. 449--459. Google Scholar

[35] Yan X H, He F Z, Chen Y L. A parallel hardware/software partitioning method based on conformity particle-swarm optimization with harmony search. Sci Sin Inform, 2016, 46: 1321--1338. Google Scholar

[36] Owens J D, Houston M, Luebke D. GPU Computing. Proc IEEE, 2008, 96: 879-899 CrossRef Google Scholar

[37] Luong T V, Melab N, Talbi E G. GPU Computing for Parallel Local Search Metaheuristic Algorithms. IEEE Trans Comput, 2013, 62: 173-185 CrossRef Google Scholar

[38] Zhu W, Curry J, Marquez A. SIMD tabu search for the quadratic assignment problem with graphics hardware acceleration. Int J Production Res, 2010, 48: 1035-1047 CrossRef Google Scholar

[39] Czapi??ski M. An effective Parallel Multistart Tabu Search for Quadratic Assignment Problem on CUDA platform. J Parallel Distributed Computing, 2013, 73: 1461-1468 CrossRef Google Scholar

[40] Czapi??ski M, Barnes S. Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform. J Parallel Distributed Computing, 2011, 71: 802-811 CrossRef Google Scholar

[41] Bukata L, ??cha P, Hanzálek Z. Solving the Resource Constrained Project Scheduling Problem using the parallel Tabu Search designed for the CUDA platform. J Parallel Distributed Computing, 2015, 77: 58-68 CrossRef Google Scholar

[42] Zhu F J, Wu J G, Song Q Z. Heuristic algorithm comparisons for multi-choice hardware/software partitioning. Comput Applic Soft, 2015, 32: 215--219. Google Scholar

[43] Jigang W, Srikanthan T, Jiao T. Algorithmic aspects for functional partitioning and scheduling in hardware/software co-design. Des Autom Embed Syst, 2008, 12: 345-375 CrossRef Google Scholar

[44] Wu J, Wang P, Lam S K. Efficient heuristic and tabu search for hardware/software partitioning. J Supercomput, 2013, 66: 118-134 CrossRef Google Scholar

[45] Wu J, Srikanthan T, Chen G. Algorithmic Aspects of Hardware/Software Partitioning: 1D Search Algorithms. IEEE Trans Comput, 2010, 59: 532-544 CrossRef Google Scholar

[46] Quan H J, Zhang T, Liu Q, et al. Comments on “algorithmic aspects of hardware/software partitioning: 1D search algorithms". IEEE Trans Comput, 2014, 4: 1055--1056. Google Scholar

[47] Chen Z, Wu J G, Song G Z, et al. NodeRank: an efficient algorithm for hardware/software partitioning. Chin J Comput, 2013, 36: 2033--2040. Google Scholar

[48] Glover F, Taillard E, Taillard E. A user's guide to tabu search. Ann Oper Res, 1993, 41: 1-28 CrossRef Google Scholar

[49] Billeter M, Olsson O, Assarsson U. Efficient stream compaction on wide SIMD many-core architectures. In: Proceedings of the Conference on High Performance Graphics, New Orleans, 2009. 159--166. Google Scholar

[50] Wilt N. The CUDA Handbook: a Comprehensive Guide to GPU Programming. Upper Saddle River: Pearson Education, 2013. 385--419. Google Scholar

[51] Harris M. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology. 2007. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf. Google Scholar

  • Figure 1

    Neighborhood generation

  • Figure 2

    GPU-based framework of proposed algorithm

  • Figure 3

    CSR representation of task graph

  • Figure 4

    Thread-candidate mapping. (a) Sequential and parallel semantic of naive loop; (b) sequential and parallel semantic of proposed loop

  • Figure 5

    GPU compacting neighborhood based on reduce-scan-scan pattern

  • Figure 6

    (Color online) Average hardware cost over 30 random instances by different algorithms under different CCR and $R$. (a) CCR $=0.1,~R={\rm~low}$; (b) CCR $=0.1,~R={\rm~high}$; (c) CCR $=1,~R={\rm~low}$; (d) CCR $=1,~R={\rm~high}$; (e) CCR $=10,~R={\rm~low}$; (f) CCR $=10,~R={\rm~high}$

  • Figure 7

    (Color online) Average improvement of GACNTS over HEUR in each case

  • Figure 8

    (Color online) Comparison of excution time between sequential implemtation and GPU implementation.protect łinebreak (a) CCR $=0.1,~R={\rm~low}$; (b) CCR $=0.1,~R={\rm~high}$; (c) CCR $=1,~R={\rm~low}$; (d) CCR $=1,~R={\rm~high}$; (e) CCR $=10,~R={\rm~low}$; (f) CCR $=10,~R={\rm~high}$

  •   

    Algorithm 1 Update of communication cost

    Require:Communication cost of current solution $C(\boldsymbol{x}_{\rm~current})$;

    Output:Communication cost of new solution $C(\boldsymbol{x}_{\rm~new})$;

    $C(\boldsymbol{x}_{\rm~new})~\Leftarrow~C(\boldsymbol{x}_{\rm~current})$;

    while adjacent nodes $x_{\rm~adjacent}$ of $x_i$ exist do

    if $x_i=x_{\rm~adjacent}$ then

    $C(\boldsymbol{x}_{\rm~new})~\Leftarrow~C(\boldsymbol{x}_{\rm~new})-c(x_i,~x_{\rm~adjcent})$;

    else

    $C(\boldsymbol{x}_{\rm~new})~\Leftarrow~C(\boldsymbol{x}_{\rm~new})+c(x_i,~x_{\rm~adjcent})$;

    end if

    end while

  • Table 1   Benchmark
    Name $n$ $m$ Size
    crc32 25 34 152
    patricia 21 50 192
    dijkstra 26 71 265
    clustering 150 333 1299
    rc6 329 448 2002
    random1 1000 1000 5000
    random2 1000 2000 8000
    random3 1000 3000 11000
    random4 1500 1500 7500
    random5 1500 3000 12000
    random6 1500 4500 16500
    random7 2000 2000 10000
    random8 2000 4000 16000
    random9 2000 6000 22000
  •   

    Algorithm 2 Tabu evaluation

    Require:Feasible candidates, tabu list, tabu status table;

    Output:Optimal candidate;

    Initialize the optimal candidate as maximal value;

    for all feasible candidates

    if tabu status is true then

    if $H(\boldsymbol{x})~\leq~H(\boldsymbol{x}_{\rm~global~optimal})$ then

    Update the optimal candidate by aspiration criteria;

    end if

    else

    if $H(\boldsymbol{x})~\leq~H(\boldsymbol{x}_{\rm~optimal})$ then

    $H(\boldsymbol{x}_{\rm~optimal})~\Leftarrow~H(\boldsymbol{x})$;

    end if

    end if

    end for

  • Table 2   Comparison among improvements of algorithm over alg-new3
    CCR $R$ TABU NODERANK ACNTS GACNTS
    0.1 low 5.2 7.9 9.6 9.6
    0.1 high 20.8 11.3 27.8 27.8
    1 low 9.4 11.5 13.2 13.7
    1 high 30.7 26.6 37.5 37.4
    10 low 3.8 6.2 6.5 6.5
    10 high 10.5 11.1 14.4 14.8
  •   

    Algorithm 3 GPU-based adaptive compacting neighborhood tabu search for hardware/software partitioning

    Require:Initial solution, task graph, node information;

    Output:Optimal solution;

    Initialize the current solution and global optimal;

    Initialize the tabu list at host side;

    Initialize the tabu status at host side;

    Allocate the node cost of task graph space on GPU memory;

    Allocate the neighborhood space on GPU memory;

    Bind the task graph on GPU texture memory;

    Transfer the node cost of task graph to GPU memory;

    while stop criterion isn't satisfied do

    Copy the current solution to GPU memory;

    Compute the costs of candidates in the neighborhood at GPU side;

    Compating neighborhood at GPU side;

    if the number of feasible solutions is 0 then

    Reinitialize the current solution;

    goto Compute the costs of candidates in the neighborhood at GPU side;

    end if

    Launch tabu evaluations kernel;

    Transfer back the optimal candidate to the host side;

    if all feasible solutions are tabu then

    Randomly choose a feasible solution;

    Update the optimal candidate;

    end if

    Update the tabu list at the host side;

    Update the tabu status table at the host side;

    Replace the current solution with the optimal candidate solution at the host side;

    if $H(\boldsymbol{x}_{\rm~current})~\leq~H(\boldsymbol{x}_{\rm~global})$ then

    Update the global solution;

    Clear ${\rm~count}_{\rm~no~improvement}$;

    else

    Increase ${\rm~count}_{\rm~no~improvement}$ by 1;

    end if

    end while

  • Table 3   Running time of serial execution and GPU-based execution when the number of nodes is 10000 (min)
    $n$ $m$ ACNTS GACNTS
    10000 10000 102 9
    10000 20000 148 12
    10000 30000 215 16

Copyright 2019 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备18024590号-1