
SCIENCE CHINA Information Sciences, Volume 62, Issue 12: 220103 (2019). https://doi.org/10.1007/s11432-019-2673-8

Irregular scene text detection via attention guided border labeling

  • Received: Jun 20, 2019
  • Accepted: Sep 25, 2019
  • Published: Nov 8, 2019

Abstract

Scene text detection plays an important role in many computer vision applications. With the help of recent deep learning techniques, multi-oriented text detection, which was considered quite challenging, has been solved to some extent. However, most existing methods still perform poorly on curved text, mainly due to the limitation of their text representations (e.g., horizontal boxes, rotated rectangles, or quadrangles). To solve this problem, we propose a novel method to detect irregular scene texts based on instance-aware segmentation. The key idea is to design an attention guided semantic segmentation model to precisely label the weighted borders of text regions. Experiments conducted on several widely used benchmarks demonstrate that our method achieves superior results on curved text datasets (i.e., F-scores of 80.1% and 78.8% on CTW1500 and Total-Text, respectively) and obtains performance comparable to state-of-the-art approaches on multi-oriented text datasets.
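
The weighted border labeling is what lets the segmentation output be split into individual text instances. As a rough illustration of this idea only (not the authors' exact procedure), the sketch below assumes the network produces two probability maps of the same size, a text-center map and a border map, and recovers instances by suppressing border pixels and grouping the rest into connected components; the thresholds and the helper name extract_text_instances are hypothetical.

    import numpy as np
    from scipy import ndimage

    def extract_text_instances(center_map, border_map,
                               center_thr=0.7, border_thr=0.5, min_area=10):
        """Recover text instances from predicted center/border probability maps."""
        # Fuse the two maps: keep confident text pixels that are not border pixels,
        # so that neighbouring instances become disconnected.
        fused = (center_map > center_thr) & (border_map < border_thr)

        # Each connected component of the fused map is one candidate instance.
        labels, num = ndimage.label(fused)

        instances = []
        for idx in range(1, num + 1):
            mask = labels == idx
            if mask.sum() < min_area:  # drop tiny, noisy components
                continue
            ys, xs = np.nonzero(mask)
            instances.append({"mask": mask,
                              "bbox": (xs.min(), ys.min(), xs.max(), ys.max())})
        return instances

In a full system, each component would typically be expanded back toward the predicted border before a polygon is fitted; the sketch stops at the instance masks.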


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672056, 61672043) and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).


References

[1] Shi B G, Bai X, Belongie S. Detecting oriented text in natural images by linking segments. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2550--2558

[2] Tian Z, Huang W L, He T, et al. Detecting text in natural image with connectionist text proposal network. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 56--72

[3] Lyu P Y, Yao C, Wu W H, et al. Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7553--7563

[4] Yao C, Bai X, Sang N, et al. Scene text detection via holistic, multi-channel prediction. ArXiv preprint, 2016

[5] Zhang Z, Zhang C Q, Shen W, et al. Multi-oriented text detection with fully convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 4159--4167

[6] He D F, Yang X, Liang C, et al. Multi-scale FCN with cascaded instance aware segmentation for arbitrary oriented word spotting in the wild. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3519--3528

[7] Wu Y, Natarajan P. Self-organized text detection with minimal post-processing via border learning. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 5000--5009

[8] Polzounov A, Ablavatski A, Escalera S, et al. WordFence: text detection in natural images with border awareness. In: Proceedings of IEEE International Conference on Image Processing, Beijing, 2017. 1222--1226

[9] Woo S, Park J, Lee J Y, et al. CBAM: convolutional block attention module. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 3--19

[10] Epshtein B, Ofek E, Wexler Y. Detecting text in natural scenes with stroke width transform. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, 2010. 2963--2970

[11] Neumann L, Matas J. A method for text localization and recognition in real-world images. In: Proceedings of Asian Conference on Computer Vision, Queenstown, 2010. 770--783

[12] Tian S X, Lu S J, Li C S. WeText: scene text detection under weak supervision. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 1492--1500

[13] Tian S X, Pan Y F, Huang C, et al. Text flow: a unified text detection system in natural scene images. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 4651--4659

[14] Liao M H, Shi B G, Bai X, et al. TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, 2017

[15] Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 21--37

[16] Ma J, Shao W, Ye H, et al. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans Multimedia, 2018, 20: 3111--3122

[17] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2015. 91--99

[18] Lyu P Y, Yao C, Wu W H, et al. Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7553--7563

[19] Xu Y, Wang Y, Zhou W, et al. TextField: learning a deep direction field for irregular scene text detection. IEEE Trans Image Process, 2019, 28: 5566--5579

[20] Xue C H, Lu S J, Zhan F N. Accurate scene text detection through border semantics awareness and bootstrapping. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 355--372

[21] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3431--3440

[22] Lin T Y, Dollar P, Girshick R, et al. Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2117--2125

[23] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, 2015. 234--241

[24] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. ArXiv preprint, 2014

[25] Milletari F, Navab N, Ahmadi S A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of the 4th International Conference on 3D Vision (3DV), California, 2016. 565--571

[26] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 2315--2324

[27] Liu Y L, Jin L W, Zhang S T, et al. Detecting curve text in the wild: new dataset and new solution. ArXiv preprint, 2017

[28] Ch'ng C K, Chan C S. Total-Text: a comprehensive dataset for scene text detection and recognition. In: Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017. 935--942

[29] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. ICDAR 2015 competition on robust reading. In: Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, 2015. 1156--1160

[30] Yao C, Bai X, Liu W Y, et al. Detecting texts of arbitrary orientations in natural images. In: Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 1083--1090

[31] Kingma D P, Ba J. Adam: a method for stochastic optimization. ArXiv preprint, 2014

[32] Zhou X Y, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5551--5560

[33] Liu Y L, Jin L W. Deep matching prior network: toward tighter multi-oriented text detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1962--1969

[34] Liu Y L, Jin L W, Zhang S T, et al. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 2019, 90: 337--345

[35] Long S B, Ruan J Q, Zhang W J, et al. TextSnake: a flexible representation for detecting text of arbitrary shapes. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 20--36

[36] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 1520--1528

[37] Hu H, Zhang C Q, Luo Y X, et al. WordSup: exploiting word annotations for character based text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 4940--4949

[38] Wang F F, Zhao L M, Li X, et al. Geometry-aware scene text detection with instance transformation network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 1381--1389

[39] Deng D, Liu H F, Li X L, et al. PixelLink: detecting scene text via instance segmentation. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, 2018

[40] He W H, Zhang X Y, Yin F, et al. Deep direct regression for multi-oriented scene text detection. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 745--753

  • Figure 1

    (Color online) Detection results of methods with different representations for text instances. (a) Horizontal box; (b) oriented rectangle; (c) quadrilateral; (d) simple text border; (e) ours. The proposed method precisely locates arbitrary-shaped text, while the others tend to include excess background regions.

  • Figure 2

    (Color online) Pipeline of the proposed method. Given an image, the network first outputs the text center and border maps, which are then fused into one map. Based on the fused map, text instances are obtained via a simple post-processing step.

  • Figure 3

    (Color online) Network architecture. We employ VGG16 as the backbone network and gradually merge the features of pooling layers 1–5. In the feature merging process, we introduce attention mechanisms before each unpooling layer (a minimal sketch of such an attention module is given after the tables below).

  • Figure 4

    (Color online) An illustration of generating the weighted text border (one possible way to construct such a label is sketched after the tables below).

  • Figure 5

    (Color online) Detection results of the proposed method. Sample images in columns 1–4 are from CTW1500, Total-Text, MSRA-TD500, and ICDAR2015, respectively. Some failure cases are presented in the last column, where red contours are ground-truth annotations and green contours are our detection results.

  • Figure 6

    (Color online) Effects of the proposed weighted text border. Detection results using our method with and without weighted text border are shown in the first and second rows, respectively. Red contours are ground-truth annotations and green contours are the predicted results.

  • Figure 7

    (Color online) Effects of the attention mechanisms. Detection results using our method with and without attention mechanisms are shown in the first and second rows, respectively. Red contours are ground-truth annotations and green contours are the predicted results.

  • Table 1   Comparing text detection performance of different methods on CTW1500
    Method Recall (%) Precision (%) F-score (%)
    SegLink [1] 40.0 42.3 40.8
    CTPN [2] 53.8 60.4 56.9
    EAST [32] 49.1 78.7 60.4
    DMPNet [33] 56.0 69.9 62.2
    CTD [34] 65.2 74.3 69.5
    CTD+TLOC [34] 69.8 77.4 73.4
    TextSnake [35] 85.3 67.9 75.6
    TextField [19] 79.8 83.0 81.4
    Ours 76.6 83.9 80.1
  • Table 2   Comparing text detection performance of different methods on Total-Text
    Method Recall (%) Precision (%) F-score (%)
    SegLink [1] 23.8 30.3 26.7
    EAST [32] 36.2 50.0 42.0
    DeconvNet [36] 56.0 69.9 62.2
    CTD+TLOC [34] 71.0 74.0 73.0
    TextSnake [35] 74.5 82.7 78.4
    TextField [19] 79.9 81.2 80.6
    Ours 73.5 84.9 78.8
  • Table 3   Comparing text detection performance of different methods on ICDAR2015
    Method Recall (%) Precision (%) F-score (%) FPS
    Zhang et al. [5] 43.0 70.8 53.6 0.48
    CTPN [2] 51.6 74.2 60.9 7.1
    Yao et al. [4] 58.7 72.3 64.8 1.61
    DMPNet [33] 68.2 73.2 70.6
    SegLink [1] 76.8 73.1 75.0
    EAST [32] 72.8 80.5 76.4 6.52
    RRPN [16] 73.0 82.0 77.0
    WordSup [37] 77.0 79.3 78.2 2
    ITN [38] 74.1 85.7 79.5
    TextField [19] 80.5 84.3 82.4 5.2
    TextSnake [35] 80.4 84.9 82.6 1.1
    PixelLink [39] 82.0 85.5 83.7 3.0
    Ours 81.0 84.3 82.6 4.1
  • Table 4   Comparing text detection performance of different methods on MSRA-TD500
    Method Recall (%) Precision (%) F-score (%)
    He et al. [40] 61.0 76.0 69.0
    EAST [32] 61.6 81.7 70.2
    ITN [38] 65.6 80.3 72.2
    RRPN [16] 68.0 82.0 74.0
    Zhang et al. [5] 67.0 83.0 74.0
    Yao et al. [4] 75.3 76.5 75.9
    Xue et al. (ResNet) [20] 73.3 80.7 76.8
    SegLink [1] 70.0 86.0 77.0
    PixelLink [39] 73.2 83.0 77.8
    TextSnake [35] 73.9 83.2 78.3
    TextField [19] 75.9 87.4 81.3
    Ours 72.0 86.6 78.6
  • Table 5   Ablation studies of our method conducted on Total-Text
    Weighted border Attention mechanisms Recall (%) Precision (%) F-score (%)
    $\times$ $\times$ 72.9 78.9 75.8
    $\checkmark$ $\times$ 73.2 82.1 77.4
    $\times$ $\checkmark$ 72.1 85.3 78.1
    $\checkmark$ $\checkmark$ 73.5 84.9 78.8
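
Figure 3 notes that attention mechanisms are introduced before each unpooling layer during feature merging, with CBAM [9] cited for the attention design. The following is a minimal PyTorch sketch of a CBAM-style channel-plus-spatial attention module as it might be inserted at such a point; the reduction ratio, kernel size, and module name ConvAttention are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class ConvAttention(nn.Module):
        """CBAM-style attention: channel attention followed by spatial attention."""

        def __init__(self, channels, reduction=16):
            super().__init__()
            # Channel attention: a shared MLP over globally pooled features.
            self.channel_mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )
            # Spatial attention: a 7x7 conv over pooled channel statistics.
            self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, x):
            b, c, _, _ = x.shape
            # Channel attention from average- and max-pooled descriptors.
            avg = self.channel_mlp(x.mean(dim=(2, 3)))
            mx = self.channel_mlp(x.amax(dim=(2, 3)))
            x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
            # Spatial attention from per-pixel channel statistics.
            stats = torch.cat([x.mean(dim=1, keepdim=True),
                               x.amax(dim=1, keepdim=True)], dim=1)
            return x * torch.sigmoid(self.spatial_conv(stats))

Such a module would be applied to the merged feature map right before each unpooling step in a U-Net/FPN-style decoder.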
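
Figure 4 illustrates how the weighted text border label is generated. One plausible construction, sketched below, treats the border as the ring between the full text mask and a shrunk kernel and weights each border pixel by its normalized distance to the background; the shrink width, the distance-based weighting, and the OpenCV calls are assumptions for illustration and not necessarily the authors' exact label definition.

    import cv2
    import numpy as np

    def weighted_border_label(polygon, image_shape, shrink_px=4):
        """Build a center mask and a weighted-border map for one text polygon.

        polygon: (N, 2) array of x, y vertices; image_shape: (H, W).
        """
        h, w = image_shape
        full = np.zeros((h, w), np.uint8)
        cv2.fillPoly(full, [polygon.astype(np.int32)], 1)

        # Shrink the text mask to obtain the center kernel.
        kernel = np.ones((2 * shrink_px + 1, 2 * shrink_px + 1), np.uint8)
        center = cv2.erode(full, kernel)

        # The border ring is the full mask minus the center kernel.
        border = (full > 0) & (center == 0)

        # Weight border pixels by their distance to the background, so pixels
        # closer to the text interior receive larger weights.
        dist = cv2.distanceTransform(full, cv2.DIST_L2, 3)
        weights = np.zeros((h, w), np.float32)
        if dist.max() > 0:
            weights[border] = dist[border] / dist.max()

        return center.astype(np.float32), weights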
