
SCIENCE CHINA Information Sciences, Volume 63, Issue 2: 120103 (2020). https://doi.org/10.1007/s11432-019-2713-1

FACLSTM: ConvLSTM with focused attention for scene text recognition

Article history
  • Received: Jul 30, 2019
  • Accepted: Nov 12, 2019
  • Published: Jan 15, 2020

Abstract

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, in which the traditional fully-connected LSTM (FC-LSTM) plays a critical role. Owing to the limitation of FC-LSTM, existing methods must convert 2-D feature maps into 1-D sequential feature vectors, which severely damages the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem because of its 2-D image inputs, and we propose a convolutional LSTM (ConvLSTM)-based scene text recognizer, FACLSTM, i.e., focused attention ConvLSTM, in which the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. In particular, the attention mechanism is incorporated into an efficient ConvLSTM structure via convolutional operations, and additional character center masks are generated to help focus attention on the right feature areas. Experimental results on the benchmark datasets IIIT5K, SVT and CUTE demonstrate that the proposed FACLSTM performs competitively on regular, low-resolution and noisy text images, and outperforms state-of-the-art approaches on curved text images by large margins.
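
To make the contrast with FC-LSTM concrete, below is a minimal ConvLSTM cell sketched in PyTorch (an illustration of the standard ConvLSTM of Shi et al. [11], not the authors' released code; the class name and kernel size are illustrative). The gates are computed with 2-D convolutions rather than matrix products, so the hidden and cell states keep their spatial layout and the 2-D feature maps never have to be flattened into 1-D sequences.

    # Minimal ConvLSTM cell: a sketch, not the paper's exact implementation.
    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            # One convolution yields all four gates (input, forget, output, candidate).
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

        def forward(self, x, state):
            h, c = state                              # each (B, hid_ch, H, W)
            i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    # The states stay 2-D: no conversion of feature maps into 1-D vectors.
    x = torch.randn(2, 64, 16, 48)                    # (B, C, H, W) feature maps
    h = c = torch.zeros(2, 128, 16, 48)
    cell = ConvLSTMCell(64, 128)
    h, c = cell(x, (h, c))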


Acknowledgment

This work was supported by China Scholarship Council (Grant No. 201706140138), Shanghai Natural Science Foundation (Grant No. 19ZR1415900), and Shanghai Knowledge Service Platform Project (Grant No. ZF1213).


References

[1] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9: 1735-1780

[2] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations, 2015

[3] Chorowski J, Bahdanau D, Serdyuk D, et al. Attention-based models for speech recognition. 2015. arXiv preprint

[4] Gao Y Z, Chen Y Y, Wang J Q, et al. Dense chained attention network for scene text recognition. In: Proceedings of International Conference on Image Processing, 2018

[5] Cheng Z Z, Bai F, Xu Y L, et al. Focusing attention: towards accurate text recognition in natural images. In: Proceedings of IEEE International Conference on Computer Vision, 2017

[6] Cheng Z Z, Xu Y L, Bai F, et al. AON: towards arbitrarily-oriented text recognition. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2018

[7] Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2298-2304

[8] Shi B G, Wang X G, Lyu P Y, et al. Robust scene text recognition with automatic rectification. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2016

[9] Bartz C, Yang H J, Meinel C. STN-OCR: a single neural network for text detection and recognition. 2017. arXiv preprint

[10] Liao M H, Zhang J, Wan Z Y, et al. Scene text recognition from two-dimensional perspective. In: Proceedings of AAAI Conference on Artificial Intelligence, 2019

[11] Shi X J, Chen Z R, Wang H, et al. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of Neural Information Processing Systems, 2015

[12] Gao Y Z, Chen Y Y, Wang J Q, et al. Reading scene text with attention convolutional sequence modeling. 2017. arXiv preprint

[13] Wojna Z, Gorban A, Lee D, et al. Attention-based extraction of structured information from street view imagery. In: Proceedings of International Conference on Document Analysis and Recognition, 2017

[14] Liu M, Zhu M L. Mobile video object detection with temporally-aware feature maps. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2018

[15] Ye Q, Doermann D. Text detection and recognition in imagery: a survey. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1480-1500

[16] Shi B, Yang M, Wang X, et al. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell, 2019, 41: 2035-2048

[17] Lee C Y, Osindero S. Recursive recurrent nets with attention modeling for OCR in the wild. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2016

[18] Bai X, Liao M, Shi B. Deep learning for scene text detection and recognition. Sci Sin-Inf, 2018, 48: 531-544

[19] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. 2015. arXiv preprint

[20] Bai F, Cheng Z Z, Niu Y, et al. Edit probability for scene text recognition. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2018

[21] Su B, Lu S. Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recognition, 2017, 63: 397-405

[22] Su B L, Lu S J. Accurate scene text recognition based on recurrent neural network. In: Proceedings of Asian Conference on Computer Vision, 2014

[23] Li H, Wang P, Shen C H, et al. Show, attend and read: a simple and strong baseline for irregular text recognition. In: Proceedings of AAAI Conference on Artificial Intelligence, 2019

[24] Jaderberg M, Vedaldi A, Zisserman A. Deep features for text spotting. In: Proceedings of European Conference on Computer Vision, 2014

[25] Tian S, Bhattacharya U, Lu S, et al. Multilingual scene character recognition with co-occurrence of histogram of oriented gradients. Pattern Recognition, 2016, 51: 125-134

[26] Liu Z C, Li Y X, Ren F B, et al. SqueezedText: a real-time scene text recognition by binary convolutional encoder-decoder network. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018

[27] Huang T J, Tian Y H, Li J. Salient region detection and segmentation for general object recognition and image understanding. Sci China Inf Sci, 2011, 54: 2461-2470

[28] Li Z, Gavrilyuk K, Gavves E, et al. VideoLSTM convolves, attends and flows for action recognition. Comput Vision Image Understanding, 2018, 166: 41-50

[29] Zhang L, Zhu G M, Mei L, et al. Attention in convolutional LSTM for gesture recognition. In: Proceedings of Neural Information Processing Systems, 2018

[30] Zhu G, Zhang L, Shen P, et al. Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access, 2017, 5: 4517-4524

[31] Dai J F, Qi H Z, Xiong Y W, et al. Deformable convolutional networks. In: Proceedings of International Conference on Computer Vision, 2017

[32] Chen J, Lian Z, Wang Y. Irregular scene text detection via attention guided border labeling. Sci China Inf Sci, 2019, 62: 220103

[33] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2016

[34] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localization in natural images. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2016

[35] Mishra A, Alahari K, Jawahar C V. Top-down and bottom-up cues for scene text recognition. In: Proceedings of International Conference on Computer Vision and Pattern Recognition, 2012

[36] Wang K, Babenko B, Belongie S. End-to-end scene text recognition. In: Proceedings of International Conference on Computer Vision, 2011

[37] Risnumawan A, Shivakumara P, Chan C S. A robust arbitrary text detection system for natural scene images. Expert Syst Appl, 2014, 41: 8027-8048

[38] Jaderberg M, Simonyan K, Vedaldi A, et al. Synthetic data and artificial neural networks for natural scene text recognition. 2014. arXiv preprint

  • Figure 1

    (Color online) Challenging samples of scene text recognition.

  • Figure 2

    (Color online) Current solutions for scene text recognition. (a) Solutions with LSTM; (b) solutions without LSTM. When LSTM is used, 2-D feature maps are usually converted to 1-D space by pooling or flattening operations. When LSTM is not used, additional parameters or post-processing steps are involved.

  • Figure 3

    (Color online) Illustration of the FC-LSTM (a) and the ConvLSTM (b). The FC-LSTM operates on 1-D sequential vectors, while the ConvLSTM operates on 2-D feature maps; the gate equations are given below.
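
For reference, the ConvLSTM of Shi et al. [11] illustrated in panel (b) computes its gates with convolutions instead of the matrix products of FC-LSTM [1]:

$$\begin{aligned} i_t &= \sigma(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \circ \mathcal{C}_{t-1} + b_i),\\ f_t &= \sigma(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + W_{cf} \circ \mathcal{C}_{t-1} + b_f),\\ \mathcal{C}_t &= f_t \circ \mathcal{C}_{t-1} + i_t \circ \tanh(W_{xc} * \mathcal{X}_t + W_{hc} * \mathcal{H}_{t-1} + b_c),\\ o_t &= \sigma(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + W_{co} \circ \mathcal{C}_t + b_o),\\ \mathcal{H}_t &= o_t \circ \tanh(\mathcal{C}_t), \end{aligned}$$

where $*$ denotes convolution, $\circ$ the Hadamard product and $\sigma$ the sigmoid function; the inputs $\mathcal{X}_t$, states $\mathcal{H}_t$, $\mathcal{C}_t$ and all gates are 3-D tensors. Replacing every $*$ with an ordinary matrix product on 1-D vectors recovers the FC-LSTM of panel (a).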

  • Figure 4

    (Color online) Overview of the proposed FACLSTM. $F$ and $M$ denote the extracted feature maps and character center masks. $T$ groups of feature maps are produced by the proposed attention-equipped ConvLSTM, where $T$ is the maximal string length, and the subsequent softmax classifier converts these $T$ groups of feature maps into character predictions. Note that the softmax classifier and the preceding fully connected layer are shared across the $T$ groups of feature maps.
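
A shape-level sketch of this pipeline follows, reusing the ConvLSTMCell sketched after the abstract. It is an assumption-laden illustration: the channel counts, the global average pooling before the shared fully connected layer, and the multiplicative use of the mask $M$ are guesses for illustration, not the paper's exact design.

    # Hypothetical shapes: B images, T character slots, n_cls character classes.
    import torch
    import torch.nn as nn

    B, T, n_cls = 2, 25, 37
    F = torch.randn(B, 128, 16, 48)        # extracted feature maps F
    M = torch.rand(B, 1, 16, 48)           # character center masks M

    cell = ConvLSTMCell(128, 128)          # attention-equipped in the paper
    fc = nn.Linear(128, n_cls)             # shared by all T groups of feature maps
    h = c = torch.zeros(B, 128, 16, 48)

    logits = []
    for _ in range(T):                     # one ConvLSTM step per character slot
        h, c = cell(F * M, (h, c))         # mask-guided input (illustrative only)
        logits.append(fc(h.mean(dim=(2, 3))))   # pool, then shared FC + softmax
    probs = torch.softmax(torch.stack(logits, dim=1), dim=-1)  # (B, T, n_cls)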

  • Figure 5

    (Color online) Illustration of our proposed attention-equipped ConvLSTM, where the inputs are weighted by attention scores derived from previous cell states and cell outputs.
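
A minimal sketch of this weighting step, under stated assumptions: the single 3x3 score convolution and the pixel-wise softmax are illustrative choices, while the figure only specifies that the scores are derived from the previous cell state and cell output.

    # Focused-attention weighting: a sketch, not the authors' exact module.
    import torch
    import torch.nn as nn

    class FocusedAttention(nn.Module):
        def __init__(self, ch):
            super().__init__()
            # Scores come from the previous output H_{t-1} and cell state C_{t-1}.
            self.score = nn.Conv2d(2 * ch, 1, 3, padding=1)

        def forward(self, x, h_prev, c_prev):
            a = self.score(torch.cat([h_prev, c_prev], dim=1))    # (B, 1, H, W)
            w = torch.softmax(a.flatten(2), dim=-1).view_as(a)    # sums to 1 over pixels
            return x * w                                          # weighted input for the gates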

  • Figure 6

    (Color online) Visualization results of the predicted masks and the attention shift procedure.

  • Figure 7

    (Color online) Visualization results of attention predicted by FACLSTM and FLSTM_base1. Values of the attention maps are normalized and truncated for better visualization. Note that FACLSTM directly produces 2-D attention maps, while FLSTM_base1 generates 1-D attention vectors, which are then reshaped to 2-D space.

  • Table 1   Result comparison across different methods and datasets (recognition accuracy, %; "-" indicates no result reported). IIIT5K results are reported without a lexicon (None) and with 50-word and 1k-word lexicons.

    Method                   | LSTM     | Samples | IIIT5K_None | IIIT5K_50 | IIIT5K_1k | SVT  | CUTE
    FAN [5]                  | FC-LSTM  | 12M*    | 87.4        | 99.3      | 97.5      | -    | 63.9
    AON [6]                  | FC-LSTM  | 12M*    | 87.0        | 99.6      | 98.1      | -    | 76.8
    CRNN [7]                 | FC-LSTM  | 8M*     | 78.2        | 97.6      | 94.4      | -    | -
    Gao et al. [4]           | FC-LSTM  | 8M*     | 83.6        | 99.1      | 97.2      | -    | -
    RARE [8]                 | FC-LSTM  | 8M*     | 81.9        | 96.2      | 93.8      | -    | 59.2
    R$^2$AM [17]             | FC-LSTM  | 7M*     | 78.4        | 96.8      | 94.4      | -    | -
    SqueezedText_binary [26] | FC-LSTM  | 1M      | 86.6        | 96.9      | 94.3      | -    | -
    SqueezedText [26]        | FC-LSTM  | 1M      | 87.0        | 97.0      | 94.1      | -    | -
    CA-FCN [10]              | No       | 7M      | 92.0        | 99.8      | 98.9      | 82.1 | 78.1
    Gao et al. [12]          | No       | 8M*     | 81.8        | 99.1      | 97.9      | -    | -
    STN-OCR [9]              | No       | -       | 86.0        | -         | -         | 79.8 | -
    FLSTM_base1              | FC-LSTM  | 7M      | 73.7        | 99.0      | 97.4      | 58.7 | 67.4
    FAFLSTM_base2            | FC-LSTM  | 7M      | 87.8        | 99.3      | 98.1      | 78.2 | 75.7
    FACLSTM (proposed)       | ConvLSTM | 7M      | 90.5        | 99.5      | 98.6      | 82.2 | 83.3

