SCIENTIA SINICA Informationis, Volume 48, Issue 5: 531-544 (2018) https://doi.org/10.1360/N112018-00003

## Deep learning for scene text detection and recognition

• Accepted Mar 12, 2018
• Published May 11, 2018

### Abstract

Scene text detection and recognition is a general-purpose text recognition technology that has become an active research topic in computer vision and document analysis in recent years. It is widely applied in geographical positioning, license plate recognition, and driverless vehicles. Compared with text in traditional documents, scene text varies far more in font, color, scale, layout, and background. Owing to its excellent performance, deep learning has been widely adopted for this task. In this paper, we mainly review our representative deep-learning-based studies in this area and describe future research trends.

• Figure 1

(Color online) Visualization of different detection targets

• Figure 2

(Color online) Architecture of TextBoxes: a simple fully convolutional network followed by standard non-maximum suppression

• Figure 3

(Color online) Architecture of SegLink. Convolutional filters between the feature layers (some have one more convolutional layer between them) are represented in the format “(#filters), $k$ (kernel size), $s$ (stride)”. Segments (yellow boxes) and links (not displayed) are detected by the predictors on multiple feature layers (indexed by $l$), then combined into whole words by a combining algorithm

• Figure 4

(Color online) Network structure of CRNN

• Figure 5

(Color online) (a) Structure of LSTM and (b) bidirectional LSTM network

• Figure 6

(Color online) Samples of irregular text. (a) Perspectively distorted text; (b) curved text

• Figure 7

(Color online) Structure of the text rectifying network

• Figure 8

(Color online) Structure of the proposed attention model

• Figure 9

(Color online) Structure of the proposed FIAT

Copyright 2019 Science China Press Co., Ltd. (《中国科学》杂志社有限责任公司). All rights reserved.