
SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120102 (2021). https://doi.org/10.1007/s11432-020-2900-x

Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

More info
  • Received: Jan 21, 2020
  • Accepted: Apr 26, 2020
  • Published: Nov 17, 2020

Abstract

For a given text, previous text-to-image synthesis methods commonly utilize a multistage generation model to produce high-resolution images in a coarse-to-fine manner. However, these methods ignore the interaction among stages, and they do not constrain the cross-sample relations of the images generated at different stages to be consistent. These deficiencies result in inefficient generation and discrimination. In this study, we propose an interstage cross-sample similarity distillation model based on a generative adversarial network (GAN) for learning efficient text-to-image synthesis. To strengthen the interaction among stages, we perform interstage knowledge distillation from the refined stage to the coarse stages with novel interstage cross-sample similarity distillation (CSD) blocks. To enhance the constraint on the cross-sample relations of the images generated at different stages, we conduct cross-sample similarity distillation among the stages. Extensive experiments on the Oxford-102 and Caltech-UCSD Birds-200-2011 (CUB) datasets show that our model generates visually pleasing images and achieves performance quantitatively comparable with state-of-the-art methods.
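
The cross-sample relation used for distillation can be viewed as a batch-wise similarity matrix computed over the samples generated for one mini-batch. The following minimal sketch illustrates this quantity; it assumes PyTorch and cosine similarity as the pairwise measure, neither of which is prescribed by the paper (the reported implementation is based on TensorFlow [46]).

```python
import torch
import torch.nn.functional as F

def cross_sample_similarity(features: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-sample similarity within a mini-batch.

    features: (N, ...) tensor, one row per generated sample (e.g., a
    flattened or pooled feature map).  Returns an (N, N) matrix whose
    entry (i, j) is the cosine similarity between samples i and j; this
    is the kind of relation the CSD blocks keep consistent across stages.
    """
    v = F.normalize(features.flatten(start_dim=1), dim=1)  # unit-norm rows
    return v @ v.t()                                        # (N, N)
```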


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and the Fundamental Research Funds for the Central Universities.


References

[1] Xi Chen, Yan Duan, Rein Houthooft, et al. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2172--2180.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680.

[3] Diederik P Kingma, Max Welling. Auto-encoding variational Bayes. arXiv preprint, 2013.

[4] Andrew Brock, Jeff Donahue, Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In: Proceedings of International Conference on Learning Representations, 2019.

[5] Tero Karras, Samuli Laine, Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint, 2018.

[6] Wei Xiong, Wenhan Luo, Lin Ma, et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2364--2373.

[7] Wei Xiong, Zhe Lin, Jimei Yang, et al. Foreground-aware image inpainting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5840--5848.

[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. Image-to-image translation with conditional adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1125--1134.

[9] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2223--2232.

[10] Chen J, Shu Z, Miao Y. Structure-preserving shape completion of 3D point clouds with generative adversarial network. SCIENTIA SINICA Informationis, 2020, 50: 675--691.

[11] Yuanhao Li, Dongyang Ao, Corneliu Octavian Dumitru, et al. Super-resolution of geosynchronous synthetic aperture radar images using dialectical GANs. SCIENCE CHINA Information Sciences, 2019, 62(10): 209302.

[12] Scott Reed, Zeynep Akata, Xinchen Yan, et al. Generative adversarial text to image synthesis. In: Proceedings of International Conference on Machine Learning, 2016.

[13] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5907--5915.

[14] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947--1962.

[15] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1316--1324.

[16] Zizhao Zhang, Yuanpu Xie, Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 6199--6208.

[17] Fengling Mao, Bingpeng Ma, Hong Chang, et al. MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators. In: Proceedings of British Machine Vision Conference, 2019.

[18] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint, 2015.

[19] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint, 2015.

[20] Maria-Elena Nilsback, Andrew Zisserman. Automated flower classification over a large number of classes. In: Proceedings of Indian Conference on Computer Vision, Graphics & Image Processing, 2008. 722--729.

[21] Catherine Wah, Steve Branson, Peter Welinder, et al. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2234--2242.

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 6626--6637.

[24] Mehdi Mirza, Simon Osindero. Conditional generative adversarial nets. arXiv preprint, 2014.

[25] Yanwei Pang, Jin Xie, Xuelong Li. Visual haze removal by a unified generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[26] Sangwoo Mo, Minsu Cho, Jinwoo Shin. InstaGAN: Instance-aware image-to-image translation. In: Proceedings of International Conference on Learning Representations, 2019.

[27] Zhen Zhu, Tengteng Huang, Baoguang Shi, et al. Progressive pose attention transfer for person image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[28] Zhijie Zhang, Yanwei Pang. CGNet: Cross-guidance network for semantic segmentation. SCIENCE CHINA Information Sciences, 2020, 63: 120104.

[29] Minghui Liao, Boyu Song, Shangbang Long, et al. SynthText3D: Synthesizing scene text images from 3D virtual worlds. SCIENCE CHINA Information Sciences, 2020, 63: 120105.

[30] Scott Reed, Zeynep Akata, Honglak Lee, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 49--58.

[31] Zhong Ji, Haoran Wang, Jungong Han, et al. Saliency-guided attention network for image-sentence matching. In: Proceedings of IEEE International Conference on Computer Vision, 2019.

[32] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. MirrorGAN: Learning text-to-image generation by redescription. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1505--1514.

[33] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. Learn, imagine and create: Text-to-image generation from prior knowledge. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 885--895.

[34] Wenbo Li, Pengchuan Zhang, Lei Zhang, et al. Object-driven text-to-image synthesis via adversarial training. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[35] Zehao Huang, Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. arXiv preprint, 2017.

[36] Junho Yim, Donggyu Joo, Jihoon Bae, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4133--4141.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, et al. FitNets: Hints for thin deep nets. arXiv preprint, 2014.

[38] Sergey Zagoruyko, Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint, 2016.

[39] Xinqian Gu, Bingpeng Ma, Hong Chang, et al. Temporal knowledge propagation for image-to-video person re-identification. In: Proceedings of IEEE International Conference on Computer Vision, 2019. 9647--9656.

[40] Mingkuan Yuan, Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of ACM International Conference on Multimedia, 2018.

[41] Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

[42] Yann N Dauphin, Angela Fan, Michael Auli, et al. Language modeling with gated convolutional networks. In: Proceedings of International Conference on Machine Learning, 2017. 933--941.

[43] Diederik P Kingma, Jimmy Ba. Adam: A method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, 2015.

[44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2818--2826.

[45] Jia Deng, Wei Dong, Richard Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009. 248--255.

[46] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A system for large-scale machine learning. In: Proceedings of Symposium on Operating Systems Design and Implementation, 2016. 265--283.

[47] Scott E Reed, Zeynep Akata, Santosh Mohan, et al. Learning what and where to draw. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 217--225.

  • Figure 1

    (Color online) Architecture of the proposed ICSD-GAN model. The whole network contains three stages (i.e., ${\rm Stage}^0$, ${\rm Stage}^1$ and ${\rm Stage}^2$) and generates images with $64\times 64$, $128\times 128$, and $256\times 256$ resolutions. $G_i$ and $D_i$ are the generator and discriminator for ${\rm Stage}^i$, respectively. We conduct interstage knowledge distillation between ${\rm Stage}^2$ and ${\rm Stage}^0$/${\rm Stage}^1$ via the CSD block. On the red line, “T” means the teacher branch, and “S” means the student branch.

  • Figure 2

    (Color online) Detailed structure of the CSD block. This block takes the student feature $F_{\rm S}$ and the teacher feature $F_{\rm T}$ as input and outputs the CSD loss. In this figure, the black solid line denotes forward propagation and the blue dotted line denotes backpropagation. We optimize the output loss without backpropagation to the teacher branch.
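
    A minimal sketch of a CSD-style loss consistent with this figure: the student and teacher features are each reduced to per-sample vectors, turned into cross-sample similarity matrices, and matched, with the teacher branch detached so that no gradient flows back to it. Global average pooling, cosine similarity, and the mean-squared error below are illustrative assumptions, not the paper's exact formulation.

    ```python
    import torch
    import torch.nn.functional as F

    def csd_loss(f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        """Sketch of a cross-sample similarity distillation (CSD) loss.

        f_student: (N, C_s, H_s, W_s) features from a coarse stage.
        f_teacher: (N, C_t, H_t, W_t) features from the refined stage.
        """
        def sim_matrix(f: torch.Tensor) -> torch.Tensor:
            v = F.adaptive_avg_pool2d(f, 1).flatten(start_dim=1)  # (N, C)
            v = F.normalize(v, dim=1)
            return v @ v.t()                                      # (N, N)

        s = sim_matrix(f_student)
        t = sim_matrix(f_teacher.detach())  # no backpropagation to the teacher branch
        return F.mse_loss(s, t)
    ```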

  • Figure 3

    (Color online) Visualization results of $256\times 256$ images generated by StackGAN, HDGAN, MS-GAN, and our ICSD-GAN model on the CUB and Oxford-102 datasets.

  • Figure 4

    (Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the CUB dataset.

  • Figure 5

    (Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the Oxford-102 dataset.

  • Algorithm 1

    Algorithm 1 ICSD-GAN training algorithm

    Require: Training set, number of stages $n_{\rm stage}=3$, number of training iterations $T$, batch size $N$, learning rate of the generator $\alpha_g$, learning rate of the discriminator $\alpha_d$, coefficients $\lambda$ and $\lambda_1$ for loss terms, Adam hyperparameters $\beta_1$ and $\beta_2$.

    Output: Generator parameters $\theta_g$.

    Initialize discriminator parameters $\theta_d = [\theta_{d_0}, \theta_{d_1}, \theta_{d_2}]$ and generator parameters $\theta_g = [\theta_{g_0}, \theta_{g_1}, \theta_{g_2}]$;

    for $t = 1,\ldots, T$ do

    Sample real images $x$, text $t$, mismatching images $x_w$ and random noise $z$;

    $\hat{x} = G(z, t)$ $(\hat{x} = [\hat{x}_0\in\mathbb{R}^{N\times3\times64\times 64}, \hat{x}_1\in\mathbb{R}^{N\times3\times128\times128}, \hat{x}_2\in\mathbb{R}^{N\times3\times256\times 256}])$;

    for $i = 0,\ldots, 2$ do

    $L_{D_i} \leftarrow L_{D_i}^{\rm TF} + \lambda L_{D_i}^{\rm ASL}$;

    $\theta_{d_i} \leftarrow \mathrm{Adam}(L_{D_i}, \theta_{d_i}, \alpha_d, \beta_1, \beta_2)$;

    end for

    $L_{G} \leftarrow \sum_{i=0}^{2} L_{G_i} + \lambda_1 L_{\rm DAMSM} + L_{\rm CA} + L_{\rm CSD}(\hat{x}_0, \hat{x}_2) + L_{\rm CSD}(\hat{x}_1, \hat{x}_2)$;

    $\theta_{g} \leftarrow \mathrm{Adam}(L_{G}, \theta_{g}, \alpha_g, \beta_1, \beta_2)$;

    end for

    return $\theta_{g}$.
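
    For reference, the loop above can be written as the following PyTorch-style sketch. The generator G, the per-stage discriminators D[i], and the loss helpers (tf_loss, asl_loss, gen_loss, damsm_loss, ca_loss, csd_loss) are hypothetical placeholders for the paper's real/fake, text-matching, DAMSM, conditioning-augmentation, and CSD terms; the data pipeline and optimizer settings are likewise assumptions, not the official implementation.

    ```python
    import torch

    def train_icsd_gan(G, D, loader, T, lam, lam1, alpha_g, alpha_d, betas=(0.5, 0.999)):
        """Hypothetical training loop mirroring Algorithm 1 (not the official code)."""
        opt_g = torch.optim.Adam(G.parameters(), lr=alpha_g, betas=betas)
        opt_d = [torch.optim.Adam(d.parameters(), lr=alpha_d, betas=betas) for d in D]

        # x and x_wrong are lists of real/mismatching images at 64/128/256 resolution.
        for _, (x, text, x_wrong) in zip(range(T), loader):
            z = torch.randn(x[0].size(0), G.z_dim)     # random noise, N x z_dim
            x_hat = G(z, text)                         # [x_hat_0, x_hat_1, x_hat_2]

            # Update the discriminator of each stage separately.
            for i in range(3):
                loss_d = tf_loss(D[i], x[i], x_hat[i].detach(), text) \
                         + lam * asl_loss(D[i], x[i], x_wrong[i], text)
                opt_d[i].zero_grad()
                loss_d.backward()
                opt_d[i].step()

            # Joint generator update, including the two interstage CSD terms.
            loss_g = sum(gen_loss(D[i], x_hat[i], text) for i in range(3)) \
                     + lam1 * damsm_loss(x_hat[2], text) \
                     + ca_loss(text) \
                     + csd_loss(x_hat[0], x_hat[2]) + csd_loss(x_hat[1], x_hat[2])
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()

        return G
    ```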

  • Table 1  

    Table 1  Structure of generators$^{\rm a)}$

    $G_0$ | $G_1$/$G_2$
    Concat, FC, BN, GLU, Reshape | Attn, Concat
    [UpSample($2$), Conv($3\times3/1$), BN, GLU] $\times 4$ | AM block $\times N_{\rm AM}$
    Conv($3\times3/1$), Tanh | UpSample($2$), Conv($3\times3/1$), BN, GLU
    -- | Conv($3\times3/1$), Tanh

    a) AM block denotes the attention-modulation block [17]. GLU denotes the gated linear units layer [42]. UpSample($2$) means that the upsampling stride is 2. Conv($3\times3/1$) means that the kernel size is 3 and the stride is 1 for the convolutional layer.
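
    As an illustration of the repeated UpSample($2$), Conv($3\times3/1$), BN, GLU unit in Table 1, a minimal PyTorch sketch is given below; the channel widths are illustrative, and nn.GLU(dim=1) halves the channel dimension, so the convolution emits twice the desired number of output channels.

    ```python
    import torch.nn as nn

    def upsample_glu_block(in_ch: int, out_ch: int) -> nn.Sequential:
        """UpSample(2) -> Conv(3x3/1) -> BN -> GLU, as listed in Table 1."""
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            # GLU gates half of the channels with the other half, so produce 2*out_ch.
            nn.Conv2d(in_ch, 2 * out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(2 * out_ch),
            nn.GLU(dim=1),
        )
    ```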

  • Table 2  

    Table 2  Structure of downsample blocks$^{\rm a)}$

    Downsample block in $D_0$ | Downsample block in $D_1$ | Downsample block in $D_2$
    Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU
    Conv($4\times4/2$), BN, LeakyReLU $\times 3$ | Conv($4\times4/2$), BN, LeakyReLU $\times 4$ | Conv($4\times4/2$), BN, LeakyReLU $\times 5$
    -- | Conv($3\times3/1$), BN, LeakyReLU | Conv($3\times3/1$), BN, LeakyReLU $\times 2$

    a) Conv($4\times4/2$) means that the kernel size is 4 and the stride is 2 for the convolutional layer.
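
    A minimal sketch of the $D_0$ downsample block from Table 2, assuming PyTorch; the base channel width ndf and the LeakyReLU slope of 0.2 are conventional choices, not values given in the paper.

    ```python
    import torch.nn as nn

    def downsample_block_d0(ndf: int = 64) -> nn.Sequential:
        """D_0 block: Conv(4x4/2)+LeakyReLU, then [Conv(4x4/2)+BN+LeakyReLU] x 3.

        Four stride-2 convolutions map a 64x64 input image to a 4x4 feature map.
        """
        layers = [nn.Conv2d(3, ndf, kernel_size=4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        for i in range(3):  # the "x3" in Table 2
            layers += [nn.Conv2d(ndf * 2 ** i, ndf * 2 ** (i + 1), kernel_size=4,
                                 stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(ndf * 2 ** (i + 1)),
                       nn.LeakyReLU(0.2, inplace=True)]
        return nn.Sequential(*layers)
    ```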

  • Table 3  

    Table 3  Details of the CUB and Oxford-102 datasets

    Dataset | #images | Captions per image | #categories | #train categories | #test categories
    CUB [21] | 11788 | 10 | 200 | 150 | 50
    Oxford-102 [20] | 8189 | 10 | 102 | 82 | 20
  • Table 4  

    Table 4  Quantitative comparison with state-of-the-art methods on the CUB and Oxford-102 datasets

    Method | CUB Inception score ($\uparrow$) | CUB FID ($\downarrow$) | Oxford-102 Inception score ($\uparrow$) | Oxford-102 FID ($\downarrow$)
    GAN_CLS_INT [12] | 2.88 $\pm$ 0.04 | 68.79 | 2.66 $\pm$ 0.03 | 79.55
    GAWWN [47] | 3.62 $\pm$ 0.07 | 67.22 | -- | --
    StackGAN [13] | 3.70 $\pm$ 0.04 | 51.89 | 3.20 $\pm$ 0.01 | 55.28
    StackGAN-v2 [14] | 4.04 $\pm$ 0.05 | 15.30 | 3.26 $\pm$ 0.01 | 48.68
    HDGAN [16] | 4.15 $\pm$ 0.05 | 18.23 | 3.45 $\pm$ 0.07 | --
    AttnGAN [15] | 4.36 $\pm$ 0.03 | 10.65 | 3.75 $\pm$ 0.02 | --
    MirrorGAN [32] | 4.56 $\pm$ 0.05 | -- | -- | --
    LeicaGAN [33] | 4.62 $\pm$ 0.06 | -- | 3.92 $\pm$ 0.02 | --
    MS-GAN [17] | 4.56 $\pm$ 0.02 | 10.41 | 3.95 $\pm$ 0.03 | 36.24
    ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35 | 3.87 $\pm$ 0.05 | 32.64
  • Table 5  

    Table 5  Performance of the baseline model, the model with $L_{\rm CSD}(I_0, I_2)$, the model with $L_{\rm CSD}(I_1, I_2)$, and our full model with both losses on the CUB dataset

    Method | Inception score ($\uparrow$) | FID ($\downarrow$)
    Baseline | 4.56 $\pm$ 0.02 | 10.41
    With $L_{\rm CSD}(I_0, I_2)$ | 4.72 $\pm$ 0.06 | 10.97
    With $L_{\rm CSD}(I_1, I_2)$ | 4.58 $\pm$ 0.04 | 9.58
    ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35