SCIENCE CHINA Information Sciences, Volume 63 , Issue 2 : 120101(2020) https://doi.org/10.1007/s11432-019-2710-7

## Progressive rectification network for irregular text recognition

• Accepted: Oct 10, 2019
• Published: Jan 14, 2020

### Abstract

Scene text recognition has received increasing attention in the research community. Text in the wild often has irregular arrangements, typically including perspective, curved, and oriented text. Most existing methods do not work well on irregular text, especially severely distorted text. In this paper, we propose a novel progressive rectification network (PRN) for irregular scene text recognition. The PRN progressively rectifies irregular text to a front-horizontal view, which in turn boosts recognition performance. Distortions are removed step by step, leveraging the observation that an intermediate rectified result provides good guidance for a subsequent, higher-quality rectification. Moreover, decomposing the rectification process into multiple steps considerably reduces the difficulty of each step. Specifically, we first perform a rough rectification and then approach the optimal rectification gradually through iterative refinement. To avoid the boundary-damage problem of direct iteration, we further design an envelope-refinement structure that maintains the integrity of the text during the iterative process. Instead of the rectified images, the text-line envelope is tracked and continually refined, which implicitly models the transformation information; the original input image is then consistently used for the transformation based on the refined envelope. In this manner, the original character information is preserved until the final transformation. These designs yield optimal rectification and boost the performance of the subsequent recognition. Extensive experiments on eight challenging datasets demonstrate the superiority of our method, especially on irregular benchmarks.
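The envelope-refinement loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `predict_offsets` is a hypothetical stand-in for the rectification CNN of Table 1, and `warp` for the thin-plate-spline sampler of [8, 22]; the choice of 20 control points follows common practice for TPS rectifiers and is an assumption here.

```python
import numpy as np

def init_envelope(n_ctrl=20):
    """Initial text-line envelope: n_ctrl control points placed evenly
    along the top and bottom image borders (normalized coordinates)."""
    xs = np.linspace(0.0, 1.0, n_ctrl // 2)
    top = np.stack([xs, np.zeros_like(xs)], axis=1)
    bottom = np.stack([xs, np.ones_like(xs)], axis=1)
    return np.concatenate([top, bottom], axis=0)   # shape (n_ctrl, 2)

def progressive_rectify(image, predict_offsets, warp, n_iters=3):
    """Envelope-refinement iteration: only the envelope is refined, and
    every warp resamples the ORIGINAL image from the refined envelope,
    so character pixels near the boundary are never permanently lost."""
    envelope = init_envelope()
    for _ in range(n_iters):
        rectified = warp(image, envelope)                 # warp the original
        envelope = envelope + predict_offsets(rectified)  # refine the envelope
    return warp(image, envelope), envelope
```

With identity placeholders the loop is a no-op; in the actual network the offsets would come from the learned localization CNN and the warp from TPS sampling, so each iteration starts from a better-rectified view.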

### Acknowledgment

This work was supported by National Natural Science Foundation of China (Grant Nos. 61772527, 61806200).

### References

[1] Shi B G, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2298-2304.

[2] He P, Huang W L, Qiao Y, et al. Reading scene text in deep convolutional sequences. In: Proceedings of AAAI Conference on Artificial Intelligence, 2016. 3501--3508.

[3] Lee C Y, Osindero S. Recursive recurrent nets with attention modeling for OCR in the wild. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2231--2239.

[4] Cheng Z Z, Bai F, Xu Y L, et al. Focusing attention: towards accurate text recognition in natural images. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5086--5094.

[5] Shi B G, Wang X G, Lyu P Y, et al. Robust scene text recognition with automatic rectification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 4168--4176.

[6] Shi B G, Yang M K, Wang X G. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell, 2019, 41: 2035-2048.

[7] Yang M K, Guan Y S, Liao M H, et al. Symmetry-constrained rectification network for scene text recognition. In: Proceedings of IEEE International Conference on Computer Vision, 2019.

[8] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Proceedings of Advances in Neural Information Processing Systems, 2015. 2017--2025.

[9] Wang K, Babenko B, Belongie S. End-to-end scene text recognition. In: Proceedings of IEEE International Conference on Computer Vision, 2011. 1457--1464.

[10] Bissacco A, Cummins M, Netzer Y, et al. PhotoOCR: reading text in uncontrolled conditions. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 785--792.

[11] Jaderberg M, Simonyan K, Vedaldi A. Reading text in the wild with convolutional neural networks. Int J Comput Vis, 2016, 116: 1-20.

[12] Rodriguez-Serrano J A, Gordo A, Perronnin F. Label embedding: a frugal baseline for text recognition. Int J Comput Vis, 2015, 113: 193-207.

[13] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of International Conference on Machine Learning, 2006. 369--376.

[14] Bai F, Cheng Z Z, Niu Y, et al. Edit probability for scene text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1508--1516.

[15] Fang S C, Xie H T, Zhang Z J, et al. Attention and language ensemble for scene text recognition with convolutional sequence modeling. In: Proceedings of ACM Multimedia Conference, 2018. 248--256.

[16] Phan T Q, Shivakumara P, Tian S, et al. Recognizing text with perspective distortion in natural scenes. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 569--576.

[17] Yang X, He D F, Zhou Z H, et al. Learning to read irregular text with attention mechanisms. In: Proceedings of International Joint Conference on Artificial Intelligence, 2017. 3280--3286.

[18] Liu W, Chen C F, Wong K Y K. Char-Net: a character-aware neural network for distorted scene text recognition. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

[19] Cheng Z Z, Liu X Y, Bai F, et al. AON: towards arbitrarily-oriented text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 5571--5579.

[20] Zhan F N, Lu S J. ESIR: end-to-end scene text recognition via iterative rectification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 2059--2068.

[21] Chen J, Lian Z H, Wang Y Z. Irregular scene text detection via attention guided border labeling. Sci China Inf Sci, 2019, 62: 220103.

[22] Bookstein F L. Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Mach Intell, 1989, 11: 567-585.

[23] Lin C H, Lucey S. Inverse compositional spatial transformer networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2568--2576.

[24] He K, Zhang X Y, Ren S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 1026--1034.

[25] Saxe A M, McClelland J L, Ganguli S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint, 2013.

[26] Jaderberg M, Simonyan K, Vedaldi A, et al. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint, 2014.

[27] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2315--2324.

[28] Risnumawan A, Shivakumara P, Chan C S. A robust arbitrary text detection system for natural scene images. Expert Syst Appl, 2014, 41: 8027-8048.

[29] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. ICDAR 2015 competition on robust reading. In: Proceedings of International Conference on Document Analysis and Recognition, 2015. 1156--1160.

[30] Mishra A, Alahari K, Jawahar C. Top-down and bottom-up cues for scene text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012. 2687--2694.

[31] Lucas S M, Panaretos A, Sosa L. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 2005, 7: 105-122.

[32] Karatzas D, Shafait F, Uchida S, et al. ICDAR 2013 robust reading competition. In: Proceedings of International Conference on Document Analysis and Recognition, 2013. 1484--1493.

[33] Ch'ng C K, Chan C S. Total-Text: a comprehensive dataset for scene text detection and recognition. In: Proceedings of International Conference on Document Analysis and Recognition, 2017. 935--942.

[35] Ketkar N. Introduction to PyTorch. In: Deep Learning with Python, 2017. 195--208.

[36] Liu W, Chen C F, Wong K K. SAFE: scale aware feature encoder for scene text recognition. In: Proceedings of Asian Conference on Computer Vision, 2018. 196--211.

[37] Luo C J, Jin L W, Sun Z H. MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognition, 2019, 90: 109-118.

[38] Liu Y, Wang Z W, Jin H L, et al. Synthetically supervised feature learning for scene text recognition. In: Proceedings of European Conference on Computer Vision, 2018. 435--451.

[39] Lyu P Y, Yang Z C, Leng X H, et al. 2D attentional irregular scene text recognizer. arXiv preprint, 2019.

[40] Liao M H, Zhang J, Wan Z Y, et al. Scene text recognition from two-dimensional perspective. In: Proceedings of AAAI Conference on Artificial Intelligence, 2019. 8714--8721.

[41] Li H, Wang P, Shen C H, et al. Show, attend and read: a simple and strong baseline for irregular text recognition. In: Proceedings of AAAI Conference on Artificial Intelligence, 2019. 8610--8617.

[42] Wang T, Wu D J, Coates A, et al. End-to-end text recognition with convolutional neural networks. In: Proceedings of International Conference on Pattern Recognition, 2012. 3304--3308.

[43] Yao C, Bai X, Shi B G, et al. Strokelets: a learned multi-scale representation for scene text recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014. 4042--4049.

[44] Jaderberg M, Vedaldi A, Zisserman A. Deep features for text spotting. In: Proceedings of European Conference on Computer Vision, 2014. 512--528.

[45] Jaderberg M, Simonyan K, Vedaldi A, et al. Deep structured output learning for unconstrained text recognition. arXiv preprint, 2014.

[46] Liu W, Chen C F, Wong K K, et al. STAR-Net: a spatial attention residue network for scene text recognition. In: Proceedings of British Machine Vision Conference, 2016. 7.

[47] Wang J F, Hu X L. Gated recurrent convolution neural network for OCR. In: Proceedings of Neural Information Processing Systems, 2017. 334--343.

[48] Liu Z C, Li Y X, Ren F B, et al. SqueezedText: a real-time scene text recognition by binary convolutional encoder-decoder network. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

• Figure 1

(Color online) We propose a novel progressive rectification network that progressively rectifies irregular text to a front-horizontal view, leading to optimal rectification and easier recognition. As our experiments demonstrate, the proposed method enables accurate rectification and considerably improves performance on challenging irregular text benchmarks. Considering the tradeoff between accuracy and speed, we select three iterations.

• Figure 2

(Color online) Comparison of the two iterative methods. The top and bottom rows show the direct-iteration and envelope-refinement structures, respectively. (a) Original input images; (b)–(d) rectified images after the first, second, and last iterations, respectively. The direct-iteration structure discards the information outside the rectified images and causes boundary damage, whereas the envelope-refinement structure can recover the missing information and preserve the intact structure of the characters.
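The boundary-damage contrast in this figure can be made concrete with a toy model. Here `crop_warp` is a deliberately crude, hypothetical stand-in for a sampling grid: like a real warp, it can only keep pixels inside its predicted region. Nothing in this sketch is from the paper; it only illustrates why refining the estimate beats re-warping the output.

```python
import numpy as np

def crop_warp(image, margin):
    """Toy 'rectification': keep only the pixels inside a border of
    `margin` pixels (a sampling grid cannot see outside its region)."""
    m = int(margin)
    if m <= 0:
        return image
    return image[m:image.shape[0] - m, m:image.shape[1] - m]

def direct_iteration(image, margins):
    """Direct iteration: each step warps the PREVIOUS output, so pixels
    cropped at step k are unrecoverable at step k+1 (boundary damage)."""
    out = image
    for m in margins:
        out = crop_warp(out, m)
    return out

def envelope_refinement(image, margins):
    """Envelope refinement: only the margin ESTIMATE is refined; the
    ORIGINAL image is warped with the final estimate, so an over-crop
    in an early step can later be corrected."""
    margin = 0
    for m in margins:
        margin = max(0, margin + m)   # refine the envelope estimate
    return crop_warp(image, margin)
```

With margin estimates `[2, -1]` (over-crop, then correct), direct iteration cannot restore the border it already discarded, while envelope refinement warps the original image with the corrected net margin of 1.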

• Figure 3

(Color online) Overview of our PRN for irregular text recognition. The dotted lines represent the delivery of coordinates.

• Figure 4

(Color online) Visualizations of the rectified images during progressive refinement.

• Table 1   Architecture of our progressive rectification network

| Module | Layer name | Configurations |
| --- | --- | --- |
| Rectification network | Block1 | conv $3\times3$, 32; pool $2\times2$ |
| | Block2 | conv $3\times3$, 64; pool $2\times2$ |
| | Block3 | conv $3\times3$, 128; pool $2\times2$ |
| | Block4 | conv $3\times3$, 256; pool $2\times2$ |
| | Block5 | conv $3\times3$, 256; pool $2\times2$ |
| | Block6 | conv $3\times3$, 256 |
| | Block7 | fc1, 512; fc2, 40 |
| Recognition network | Convolution | $3\times3$, 32 |
| | Residual Unit1_X | [$1\times1$, 32; $3\times3$, 32] $\times$ 3 |
| | Residual Unit2_X | [$1\times1$, 64; $3\times3$, 64] $\times$ 4 |
| | Residual Unit3_X | [$1\times1$, 128; $3\times3$, 128] $\times$ 6 |
| | Residual Unit4_X | [$1\times1$, 256; $3\times3$, 256] $\times$ 6 |
| | Residual Unit5_X | [$1\times1$, 512; $3\times3$, 512] $\times$ 3 |
| | BLSTM 1 | 256 hidden units per LSTM |
| | BLSTM 2 | 256 hidden units per LSTM |
| | LSTM | 256 hidden units |
| | LSTM | 256 hidden units |
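The rectification branch of Table 1 can be sanity-checked by tracing feature-map shapes: a $3\times3$ convolution with padding preserves the spatial size, and each $2\times2$ pooling halves it. Note the assumptions in this sketch: the $32\times64$ input size and the reading of fc2's 40 outputs as 20 $(x, y)$ control points are not stated in the table.

```python
def rectification_net_shapes(h=32, w=64):
    """Trace (channels, height, width) through the rectification branch
    of Table 1. Conv 3x3 (padded) keeps h, w; pool 2x2 halves them.
    Input 32x64 is an assumption, not a value from the paper."""
    shapes = []
    for out_c, pooled in [(32, True), (64, True), (128, True),
                          (256, True), (256, True), (256, False)]:
        if pooled:
            h, w = h // 2, w // 2
        shapes.append((out_c, h, w))
    # fc1: 512 units; fc2: 40 units, presumably (x, y) for 20 control points
    return shapes, 40 // 2
```

Under these assumptions the last conv block yields a $256\times1\times2$ map, small enough for the two fully connected layers that regress the control points.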
• Table 2   Lexicon-free results on several benchmarks with different numbers of iterations$^{\rm a)}$

| Number of iterations | SVT (%) | IIIT5k (%) | IC03 (%) | IC13 (%) | SVTP (%) | CUTE80 (%) | IC15 (%) | Total-Text (multi-oriented) (%) | Total-Text (curved) (%) | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 86.1 | 91.0 | 91.7 | 89.4 | 74.6 | 75.0 | 71.8 | 70.6 | 53.0 | 16.34 |
| 1 | 86.2 | 92.6 | 92.9 | 90.7 | 77.0 | 82.3 | 72.0 | 72.3 | 56.4 | 20.03 |
| 2 | 86.6 | 92.8 | 92.9 | 92.0 | 78.2 | 83.3 | 74.5 | 74.4 | 58.4 | 24.28 |
| 3 | 89.0 | 93.6 | 93.6 | 92.2 | 80.8 | 87.2 | 76.4 | 75.6 | 67.1 | 28.48 |
| 4 | 88.7 | 94.3 | 94.0 | 93.3 | 81.2 | 88.2 | 76.8 | 77.1 | 69.3 | 32.57 |
| 5 | 88.4 | 94.3 | 94.0 | 93.0 | 79.8 | 88.5 | 78.4 | 76.0 | 69.0 | 36.70 |
• Table 3   The effect of the proposed envelope-refinement structure (ER)$^{\rm a)}$

| Method | SVT (%) | IIIT5k (%) | IC03 (%) | IC13 (%) | SVTP (%) | CUTE80 (%) | IC15 (%) | Total-Text (multi-oriented) (%) | Total-Text (curved) (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PRN (w/o ER) | 87.5 | 92.4 | 93.5 | 91.5 | 78.1 | 86.1 | 74.1 | 73.8 | 65.2 |
| PRN | 89.0 | 93.6 | 93.6 | 92.2 | 80.8 | 87.2 | 76.4 | 75.6 | 67.1 |
• Table 4   Scene text recognition accuracies on irregular datasets$^{\rm a)}$

| Method | SVTP 50 (%) | SVTP Full (%) | SVTP None (%) | CUTE80 None (%) | IC15 None (%) | Total-Text (multi-oriented) None (%) | Total-Text (curved) None (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ABBYY [9] | 40.5 | 26.1 | – | – | – | – | – |
| Mishra et al. [30] | 45.7 | 24.7 | – | – | – | – | – |
| Phan et al. [16] | 75.6 | 67.0 | – | – | – | – | – |
| Shi et al. [1] | 92.6 | 72.6 | 66.8 | 54.9 | – | – | – |
| Shi et al. [5] | 91.2 | 77.4 | 71.8 | 59.2 | – | – | – |
| Liu et al. [18] | – | – | 73.5 | – | – | – | – |
| Cheng et al. [19] | 94.0 | 83.7 | 73.0 | 76.8 | 68.2 | – | – |
| Fang et al. [15] | – | – | – | – | 71.2 | – | – |
| Liu et al. [36] | – | – | 74.4 | – | – | – | – |
| Shi et al. [6] | – | – | 78.5 | 79.5 | 76.1 | – | – |
| Luo et al. [37] | _94.3 | _86.7 | 76.1 | 77.4 | 68.8 | – | – |
| Liu et al. [38] | – | – | 73.9 | 62.5 | – | – | – |
| Zhan and Lu [20] | – | – | 79.6 | 83.3 | _76.9 | – | – |
| Lyu et al. [39] | – | – | 82.3 | 86.8 | 76.3 | – | – |
| Yang et al. [17]$^{\rm b)}$ | 93.0 | 80.2 | 75.8 | 69.3 | – | – | – |
| Cheng et al. [4]$^{\rm b)}$ | 92.6 | 81.6 | 71.5 | 63.9 | 66.2 | – | – |
| Yang et al. [7]$^{\rm b)}$ | – | – | 80.8 | _87.5 | 78.7 | – | – |
| Liao et al. [40]$^{\rm b)}$ | – | – | – | 78.1 | – | – | – |
| PRN (ours) | 95.1 | 92.6 | _81.2 | 88.2 | 76.8 | 77.1 | 69.3 |

("50" and "Full" denote lexicon sizes; "None" is lexicon-free. Values prefixed with "_" were underlined in the original table.)

• Table 5   Scene text recognition accuracies on regular datasets$^{\rm a)}$

| Method | SVT 50 (%) | SVT None (%) | IIIT5k 50 (%) | IIIT5k 1000 (%) | IIIT5k None (%) | IC03 50 (%) | IC03 Full (%) | IC03 None (%) | IC13 None (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang et al. [42] | 70.0 | – | – | – | – | 90.0 | 84.0 | – | – |
| Bissacco et al. [10] | 90.4 | 78.0 | – | – | – | – | – | – | 87.6 |
| Yao et al. [43] | 75.9 | – | 80.2 | 69.3 | – | 88.5 | 80.3 | – | – |
| Rodriguez-Serrano et al. [12] | 70.0 | – | 76.1 | 57.4 | – | – | – | – | – |
| Jaderberg et al. [44] | 86.1 | – | – | – | – | 96.2 | 91.5 | – | – |
| Jaderberg et al. [11] | 95.4 | 80.7 | 97.1 | 92.7 | – | 98.7 | 98.6 | 93.1 | 90.8 |
| Jaderberg et al. [45] | 93.2 | 71.7 | 95.5 | 89.6 | – | 97.8 | 97.0 | 89.6 | 81.8 |
| Shi et al. [1] | _97.5 | 82.7 | 97.8 | 95.0 | 81.2 | 98.7 | 98.0 | 91.9 | 89.6 |
| Lee et al. [3] | 96.3 | 80.7 | 96.8 | 94.4 | 78.4 | 97.9 | 97.0 | 88.7 | 90.0 |
| Liu et al. [46] | 95.5 | 83.6 | 97.7 | 94.5 | 83.3 | 96.9 | 95.3 | 89.9 | 89.1 |
| He et al. [2] | 92.0 | – | 94.0 | 91.6 | – | 97.0 | 94.4 | – | – |
| Wang and Hu [47] | 96.3 | 81.5 | 98.0 | 95.6 | 80.8 | 98.8 | 97.8 | 91.2 | – |
| Bai et al. [14] | 96.6 | 87.5 | 99.5 | 97.9 | 88.3 | 98.7 | 97.9 | 94.6 | 94.4 |
| Luo et al. [37] | 96.6 | 88.3 | 97.9 | 96.2 | 91.2 | 98.7 | 97.8 | 95.0 | 92.4 |
| Liu et al. [36] | 97.1 | 85.5 | 98.4 | 96.1 | 85.2 | 98.5 | 97.7 | 92.9 | 90.3 |
| Shi et al. [5] | 95.5 | 81.9 | 96.2 | 93.8 | 81.9 | 98.3 | 96.2 | 90.1 | 88.6 |
| Liu et al. [18] | – | 84.4 | – | – | 83.6 | – | – | 91.5 | 90.8 |
| Cheng et al. [19] | 96.0 | 82.8 | _99.6 | 98.1 | 87.0 | 98.5 | 97.1 | 91.5 | – |
| Liu et al. [38] | 96.8 | 87.1 | 97.3 | 96.1 | 89.4 | 98.1 | 97.5 | _94.7 | _94.0 |
| Shi et al. [6] | 97.4 | 89.5 | _99.6 | 98.8 | 93.4 | 98.8 | 98.0 | 94.5 | 91.8 |
| Zhan and Lu [20] | 97.4 | 90.2 | 97.4 | 98.8 | 93.3 | – | – | – | 91.3 |
| Lyu et al. [39] | 97.2 | _90.1 | 99.8 | 99.1 | _94.0 | 99.4 | _98.1 | 94.3 | 92.7 |
| Liu et al. [48]$^{\rm b)}$ | 96.1 | – | 96.9 | 94.3 | 86.6 | 98.4 | 97.9 | 93.1 | 92.7 |
| Yang et al. [17]$^{\rm b)}$ | 95.2 | – | 97.8 | 96.1 | – | 97.7 | – | – | – |
| Cheng et al. [4]$^{\rm b)}$ | 97.1 | 85.9 | 99.3 | 97.5 | 87.4 | _99.2 | 97.3 | 94.2 | 93.3 |
| Liao et al. [40]$^{\rm b)}$ | 98.5 | 82.1 | 99.8 | _98.9 | 92.0 | – | – | – | 91.4 |
| PRN (ours) | _97.5 | 88.7 | _99.6 | _98.9 | 94.3 | 98.6 | 98.0 | 94.0 | 93.3 |

("50", "1000", and "Full" denote lexicon sizes; "None" is lexicon-free. Values prefixed with "_" were underlined in the original table.)


Copyright 2020 Science China Press Co., Ltd.