SCIENCE CHINA Information Sciences, Volume 63, Issue 2: 120105 (2020) https://doi.org/10.1007/s11432-019-2737-0

## SynthText3D: synthesizing scene text images from 3D virtual worlds

• Received Oct 15, 2019
• Accepted Dec 3, 2019
• Published Jan 15, 2020

### Abstract

With the development of deep neural networks, the demand for large amounts of annotated training data has become a performance bottleneck in many fields of research and application. Image synthesis can generate annotated images automatically and freely, and has therefore attracted increasing attention. In this paper, we propose to synthesize scene text images from 3D virtual worlds, which provide precise scene descriptions, editable illumination and visibility, and realistic physics. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and the text instances as a whole. In this way, real-world variations, including complex perspective transformations, various illuminations, and occlusions, can be realized in the synthesized scene text images. Moreover, the same text instances can be captured from various viewpoints by randomly moving and rotating the virtual camera, which acts as the human eye. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method.

### Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61733007). Xiang BAI was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team (Grant No. 2017QYTD08).

### References

[1] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016. 2315--2324.

[2] Zhan F, Lu S, Xue C. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In: Proceedings of European Conference on Computer Vision, 2018.

[3] Jaderberg M, Simonyan K, Vedaldi A, et al. Synthetic data and artificial neural networks for natural scene text recognition. 2014. arXiv.

[4] Zhu Z, Huang T, Shi B, et al. Progressive pose attention transfer for person image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 2347--2356.

[5] Varol G, Romero J, Martin X, et al. Learning from synthetic humans. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 109--117.

[6] Papon J, Schoeler M. Semantic pose using deep networks trained on synthetic rgb-d. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 774--782.

[7] McCormac J, Handa A, Leutenegger S, et al. Scenenet RGB-D: 5m photorealistic images of synthetic indoor trajectories with ground truth. 2016. arXiv.

[8] Ros G, Sellart L, Materzynska J, et al. The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016. 3234--3243.

[9] Saleh F S, Aliakbarian M S, Salzmann M, et al. Effective use of synthetic data for urban scene semantic segmentation. In: Proceedings of European Conference on Computer Vision, 2018. 86--103.

[10] Peng X, Sun B, Ali K, et al. Learning deep object detectors from 3D models. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 1278--1286.

[11] Tremblay J, To T, Birchfield S. Falling things: a synthetic dataset for 3D object detection and pose estimation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition Workshop, 2018. 2038--2041.

[12] Hinterstoisser S, Pauly O, Heibel H, et al. An annotation saved is an annotation earned: using fully synthetic training for object instance detection. 2019. arXiv.

[13] Ye Y Y, Zhang C, Hao X L. Arpnet: attention regional proposal network for 3D object detection. Sci China Inf Sci, 2019, 62: 220104.

[14] Cao J, Pang Y, Li X. Learning Multilayer Channel Features for Pedestrian Detection. IEEE Trans Image Process, 2017, 26: 3210--3220.

[15] Cao J, Pang Y, Li X. Pedestrian Detection Inspired by Appearance Constancy and Shape Symmetry. IEEE Trans Image Process, 2016, 25: 5538--5551.

[16] Quiter C, Ernst M. deepdrive/deepdrive: 2.0. 2018.

[17] Martinez M, Sitawarin C, Finch K, et al. Beyond grand theft auto V for training, testing and enhancing deep learning in self driving cars. 2017. arXiv.

[18] Qiu W, Yuille A. Unrealcv: connecting computer vision to unreal engine. In: Proceedings of European Conference on Computer Vision, 2016. 909--916.

[19] Ganoni O, Mukundan R. A framework for visually realistic multi-robot simulation in natural environment. 2017. arXiv.

[20] Wang T, Wu J D, Coates A, et al. End-to-end text recognition with convolutional neural networks. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR), 2012. 3304--3308.

[21] Zhan F, Zhu H, Lu S. Spatial fusion gan for image synthesis. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2019. 3653--3662.

[22] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680.

[23] Ye Q, Doermann D. Text Detection and Recognition in Imagery: A Survey. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 1480--1500.

[24] Bai X, Liao M K, Shi B G. Deep learning for scene text detection and recognition. Sci Sin Inform, 2018, 48: 531--544.

[25] Liu Y, Jin L, Zhang S, et al. Detecting curve text in the wild: New dataset and new solution. 2017. arXiv.

[26] Liao M, Shi B, Bai X, et al. TextBoxes: a fast text detector with a single deep neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2017. 4161--4167.

[27] Ma J, Shao W, Ye H. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Trans Multimedia, 2018, 20: 3111--3122.

[28] Liu Y, Jin L. Deep matching prior network: toward tighter multi-oriented text detection. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017.

[29] He W, Zhang Y-X, Yin F, et al. Deep direct regression for multi-oriented scene text detection. In: Proceedings of IEEE International Conference on Computer Vision, 2017.

[30] Zhou X, Yao C, Wen H, et al. EAST: An efficient and accurate scene text detector. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017.

[31] Liao M, Zhu Z, Shi B, et al. Rotation-sensitive regression for oriented scene text detection. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018. 5909--5918.

[32] Liao M, Lyu P, He M, et al. Mask TextSpotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans Pattern Anal Mach Intell, 2019.

[33] Ren S, He K, Girshick R, et al. Faster r-cnn: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems, 2015. 91--99.

[34] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector. In: Proceedings of European Conference on Computer Vision, 2016. 21--37.

[35] Liao M, Shi B, Bai X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Trans Image Process, 2018, 27: 3676--3690.

[36] Shi B, Bai X, Belongie S. Detecting oriented text in natural images by linking segments. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017.

[37] Wu Y, Natarajan P. Self-organized text detection with minimal post-processing via border learning. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 5000--5009.

[38] Long S, Ruan J, Zhang W, et al. Textsnake: a flexible representation for detecting text of arbitrary shapes. In: Proceedings of European Conference on Computer Vision, 2018.

[39] Deng D, Liu H, Li X, et al. Pixellink: Detecting scene text via instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2018. 6773--6780.

[40] Lyu P, Yao C, Wu W, et al. Multi-oriented scene text detection via corner localization and region segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018.

[41] Chen J, Lian Z, Wang Y, et al. Irregular scene text detection via attention guided border labeling. Sci China Inf Sci, 2019, 62: 220103.

[42] Arbeláez P, Maire M, Fowlkes C. Contour detection and hierarchical image segmentation. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 898--916.

[43] Lu S J, Tan C, Lim J-H. Robust and efficient saliency modeling from image co-occurrence histograms. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 195--201.

[44] Lin T-Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context. In: Proceedings of European Conference on Computer Vision, 2014. 740--755.

[45] Roth S D. Ray casting for modeling solids. Computer Graphics & Image Processing, 1982, 18: 109--144.

[46] Karatzas D, Shafait F, Uchida S, et al. Icdar 2013 robust reading competition. In: Proceedings of International Conference on Document Analysis and Recognition, 2013. 1484--1493.

[47] Karatzas D, Gomez-Bigorda L, Nicolaou A, et al. Icdar 2015 competition on robust reading. In: Proceedings of International Conference on Document Analysis and Recognition, 2015. 1156--1160.

[48] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016.

• Figure 1

(Color online) Image samples synthesized by different methods. Compared to existing methods that render text with static background images, our method can realize realistic occlusions, perspective transformations and various illuminations. (a) Gupta et al. [1]; (b) Zhan et al. [2]; (c) ours.

• Figure 2

(Color online) The pipeline of SynthText3D. The blue dashed boxes represent the corresponding modules described in Section 3. (a) Camera anchor generation; (b) text region generation; (c) text generation; (d) 3D rendering.

• Figure 3

(Color online) Camera anchors. The first row depicts the randomly produced anchors (left: in dim light; right: inside a building model) and the second row shows the manually selected anchors.

• Figure 4

(Color online) Comparison of the text regions between SynthText (a) and ours (b). The first column: original images without text; the second column: depth maps; the third column: segmentation/surface normal maps; the last column: rendered images. The red boxes mark unsuitable text regions.

• Figure 5

(Color online) Illustration of text region generation. (a) The original images; (b) the surface normal maps; (c) the normal boundary maps; (d) and (e) the generated text regions.

• Figure 6

(Color online) The coordinates calculated from the integer depth and the floating-point depth. $P_v$: coordinates of the viewpoint; $P_i$: 3D location using the integer depth; $P_f$: 3D location using the floating-point depth.

• Figure 7

(Color online) Illustration of text placing.

• Figure 8

(Color online) Examples of our synthetic images. SynthText3D can realize illumination and visibility adjustment, perspective transformation, and occlusion. (a) Various illuminations and visibility of the same text instances; (b) different viewpoints of the same text instances; (c) different occlusion cases of the same text instances.

• Figure 9

(Color online) Detection results with different training sets. Green boxes: correct results; red boxes: wrong results. As shown, detection in occlusion and perspective cases improves when our synthetic data are used. Trained with (a) ICDAR 2015 (1k); (b) ours (10k) and ICDAR 2015 (1k).
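The effect that the Figure 6 caption describes can be sketched numerically. The snippet below is only an illustration under stated assumptions: depth is taken as the distance from the viewpoint $P_v$ along a viewing ray, and the "integer depth" is obtained by truncation; neither detail is given in the text above. The helper name `point_from_depth` is hypothetical.

```python
import numpy as np

def point_from_depth(p_v, ray_dir, depth):
    """Return the 3D point at distance `depth` from viewpoint p_v along ray_dir.

    Assumption: depth is measured along the (normalized) viewing ray.
    """
    ray_dir = np.asarray(ray_dir, dtype=float)
    unit = ray_dir / np.linalg.norm(ray_dir)
    return np.asarray(p_v, dtype=float) + depth * unit

p_v = np.array([0.0, 0.0, 0.0])      # viewpoint P_v
ray = np.array([1.0, 2.0, 2.0])      # viewing direction (length 3 before normalizing)
depth_float = 7.6                    # floating-point depth
depth_int = int(depth_float)         # integer depth (assumed truncation)

p_f = point_from_depth(p_v, ray, depth_float)  # P_f
p_i = point_from_depth(p_v, ray, depth_int)    # P_i

# Distance between the two recovered 3D locations.
error = np.linalg.norm(p_f - p_i)
```

Under these assumptions, the gap between $P_i$ and $P_f$ equals the fractional part of the depth that the integer representation discards.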

• Algorithm 1   Stochastic binary search

Input: binary normal boundary map $M \in \mathbb{R}^{H \times W}$; initial center point $(X_0, Y_0)$.

Output: top-left point $(X_1, Y_1)$; bottom-right point $(X_2, Y_2)$.

$S_W = 96$; $S_H = 64$; // the width and height of the minimal initial box, as mentioned in Subsection 3.3.2

$X_{(1,{\rm upper})} = 0$; $X_{(1,{\rm lower})} = X_0 - S_W/2$;

$Y_{(1,{\rm upper})} = 0$; $Y_{(1,{\rm lower})} = Y_0 - S_H/2$;

$X_{(2,{\rm lower})} = X_0 + S_W/2$; $X_{(2,{\rm upper})} = W$;

$Y_{(2,{\rm lower})} = Y_0 + S_H/2$; $Y_{(2,{\rm upper})} = H$;

while expandable do // "expandable" indicates whether the box was successfully expanded (without reaching boundaries) in the last step

randomly select side from left, right, top, bottom;

if side is left and ${\rm abs}(X_{(1,{\rm lower})} - X_{(1,{\rm upper})}) > 1$ then

${\rm mid} = \lfloor (X_{(1,{\rm lower})} + X_{(1,{\rm upper})})/2 \rfloor$;

if ${\rm sum}(M[Y_{(1,{\rm lower})}:Y_{(2,{\rm lower})},\ {\rm mid}:X_{(2,{\rm lower})}]) \geqslant 1$ then $X_{(1,{\rm upper})} = {\rm mid}$;

else $X_{(1,{\rm lower})} = {\rm mid}$;

end if

else $\cdots$; // update $X_{(2,\cdot)}$, $Y_{(1,\cdot)}$, $Y_{(2,\cdot)}$ similarly when side is right, top, bottom

end if

end while

$X_1 = X_{(1,{\rm lower})}$; $Y_1 = Y_{(1,{\rm lower})}$; $X_2 = X_{(2,{\rm lower})}$; $Y_2 = Y_{(2,{\rm lower})}$;

return $(X_1, Y_1)$, $(X_2, Y_2)$.
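Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Two details are assumptions because the pseudocode leaves them open: the loop is taken to stop once no side can be bisected any further, and the random side is drawn only from the sides that can still move (which guarantees termination). The function name, the `rng` parameter, and the clamping of the initial box to the image are likewise hypothetical.

```python
import random
import numpy as np

SIDES = ("left", "right", "top", "bottom")

def stochastic_binary_search(M, x0, y0, s_w=96, s_h=64, rng=None):
    """Expand a box around (x0, y0) without covering boundary pixels of M.

    M is a binary (H, W) normal boundary map; the box is [x1, x2) x [y1, y2).
    """
    rng = rng or random.Random()
    H, W = M.shape
    # "lower": current box edge; "upper": farthest position that edge may reach.
    lo = {"left": max(x0 - s_w // 2, 0), "right": min(x0 + s_w // 2, W),
          "top": max(y0 - s_h // 2, 0), "bottom": min(y0 + s_h // 2, H)}
    up = {"left": 0, "right": W, "top": 0, "bottom": H}

    while True:
        # Assumed stopping rule: no side has a gap larger than one pixel.
        candidates = [s for s in SIDES if abs(lo[s] - up[s]) > 1]
        if not candidates:
            break
        side = rng.choice(candidates)
        mid = (lo[side] + up[side]) // 2
        # Trial box with the chosen edge moved to mid.
        trial = dict(lo, **{side: mid})
        region = M[trial["top"]:trial["bottom"], trial["left"]:trial["right"]]
        if region.sum() >= 1:
            up[side] = mid   # a boundary is hit: shrink the search range
        else:
            lo[side] = mid   # no boundary: accept the expanded edge
    return (lo["left"], lo["top"]), (lo["right"], lo["bottom"])

# Usage on a toy boundary map with a frame of boundary pixels.
demo = np.zeros((200, 300), dtype=np.uint8)
demo[:, 20] = 1
demo[:, 250] = 1
demo[30, :] = 1
demo[170, :] = 1
top_left, bottom_right = stochastic_binary_search(demo, 150, 100, rng=random.Random(0))
```

Because each step checks the whole trial box, a box that starts free of boundary pixels stays free of them after every accepted expansion.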

• Table 1   Detection results with different synthetic data$^{\rm a)}$

| Training data | ICDAR 2015 P | ICDAR 2015 R | ICDAR 2015 F | ICDAR 2013 P | ICDAR 2013 R | ICDAR 2013 F | MLT P | MLT R | MLT F |
|---|---|---|---|---|---|---|---|---|---|
| SynthText 10k | 40.1 | 54.8 | 46.3 | 54.5 | 69.4 | 61.1 | 34.3 | 41.4 | 37.5 |
| SynthText 800k | 67.1 | 51.0 | 57.9 | 68.9 | 66.4 | 67.7 | 53.9 | 36.5 | 43.5 |
| VISD 10k | 73.3 | 59.5 | 65.7 | 73.2 | 68.5 | 70.8 | 58.9 | 40.0 | 47.6 |
| Ours 10k (10 scenes) | 64.5 | 56.7 | 60.3 | 75.8 | 65.6 | 70.4 | 50.4 | 39.0 | 44.0 |
| Ours 10k (20 scenes) | 69.8 | 58.1 | 63.4 | 76.6 | 66.0 | 70.9 | 51.3 | 41.1 | 45.6 |
| Ours 10k (30 scenes) | 71.2 | 62.1 | 66.3 | 77.1 | 67.3 | 71.9 | 55.4 | 43.3 | 48.6 |
| Ours 5k (10 scenes) + VISD 5k | 71.1 | 64.4 | 67.6 | 76.5 | 71.4 | 73.8 | 57.6 | 44.2 | 49.8 |

a

• Table 2   Detection results on ICDAR 2015$^{\rm a)}$

| Training data | Precision | Recall | F-measure |
|---|---|---|---|
| Real (1k) | 84.8 | 78.1 | 81.3 |
| SynthText (10k) + Real (1k) | 85.7 | 79.5 | 82.5 |
| VISD (10k) + Real (1k) | 86.5 | 80.0 | 83.1 |
| Ours (10k, 30 scenes) + Real (1k) | 87.3 | 80.5 | 83.8 |

a
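The F-measure values in the tables appear consistent with the standard harmonic mean of precision and recall, $F = 2PR/(P+R)$. The page does not state this definition, so it is an assumption here; the short check below recomputes the F-measure column of Table 2 from its precision and recall columns. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

# (training data, precision, recall, reported F-measure) copied from Table 2.
table2 = [
    ("Real (1k)", 84.8, 78.1, 81.3),
    ("SynthText (10k) + Real (1k)", 85.7, 79.5, 82.5),
    ("VISD (10k) + Real (1k)", 86.5, 80.0, 83.1),
    ("Ours (10k, 30 scenes) + Real (1k)", 87.3, 80.5, 83.8),
]

# Recompute F for every row and round to one decimal, as in the table.
recomputed = [round(f_measure(p, r), 1) for _, p, r, _ in table2]
reported = [f for _, _, _, f in table2]
```

Comparing `recomputed` with `reported` gives a quick sanity check that the reconstructed rows were not shifted or mis-split.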


Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.