
SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120102 (2021). https://doi.org/10.1007/s11432-020-2900-x

Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

  • Received: Jan 21, 2020
  • Accepted: Apr 26, 2020
  • Published: Nov 17, 2020

Abstract


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and the Fundamental Research Funds for the Central Universities.


References

[1] Xi Chen, Yan Duan, Rein Houthooft, et al. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2172--2180.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680.

[3] Diederik P Kingma, Max Welling. Auto-encoding variational Bayes. 2013. arXiv preprint.

[4] Andrew Brock, Jeff Donahue, Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In: Proceedings of International Conference on Learning Representations, 2019.

[5] Tero Karras, Samuli Laine, Timo Aila. A style-based generator architecture for generative adversarial networks. 2018. arXiv preprint.

[6] Wei Xiong, Wenhan Luo, Lin Ma, et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2364--2373.

[7] Wei Xiong, Zhe Lin, Jimei Yang, et al. Foreground-aware image inpainting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5840--5848.

[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. Image-to-image translation with conditional adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1125--1134.

[9] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2223--2232.

[10] Chen J, Shu Z, Miao Y. Structure-preserving shape completion of 3D point clouds with generative adversarial network. Sci Sin-Inf, 2020, 50: 675--691.

[11] Yuanhao Li, Dongyang Ao, Corneliu Octavian Dumitru, et al. Super-resolution of geosynchronous synthetic aperture radar images using dialectical GANs. SCIENCE CHINA Information Sciences, 2019, 62(10): 209302.

[12] Scott Reed, Zeynep Akata, Xinchen Yan, et al. Generative adversarial text to image synthesis. In: Proceedings of International Conference on Machine Learning, 2016.

[13] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5907--5915.

[14] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947--1962.

[15] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1316--1324.

[16] Zizhao Zhang, Yuanpu Xie, Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 6199--6208.

[17] Fengling Mao, Bingpeng Ma, Hong Chang, et al. MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators. In: Proceedings of British Machine Vision Conference, 2019.

[18] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. 2015. arXiv preprint.

[19] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the knowledge in a neural network. 2015. arXiv preprint.

[20] Maria-Elena Nilsback, Andrew Zisserman. Automated flower classification over a large number of classes. In: Proceedings of Indian Conference on Computer Vision, Graphics & Image Processing, 2008. 722--729.

[21] Catherine Wah, Steve Branson, Peter Welinder, et al. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2234--2242.

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 6626--6637.

[24] Mehdi Mirza, Simon Osindero. Conditional generative adversarial nets. 2014. arXiv preprint.

[25] Yanwei Pang, Jin Xie, Xuelong Li. Visual haze removal by a unified generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[26] Sangwoo Mo, Minsu Cho, Jinwoo Shin. InstaGAN: instance-aware image-to-image translation. In: Proceedings of International Conference on Learning Representations, 2019.

[27] Zhen Zhu, Tengteng Huang, Baoguang Shi, et al. Progressive pose attention transfer for person image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[28] Zhijie Zhang, Yanwei Pang. CGNet: cross-guidance network for semantic segmentation. SCIENCE CHINA Information Sciences, 2020, 63: 120104.

[29] Minghui Liao, Boyu Song, Shangbang Long, et al. SynthText3D: synthesizing scene text images from 3D virtual worlds. SCIENCE CHINA Information Sciences, 2020, 63: 120105.

[30] Scott Reed, Zeynep Akata, Honglak Lee, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 49--58.

[31] Zhong Ji, Haoran Wang, Jungong Han, et al. Saliency-guided attention network for image-sentence matching. In: Proceedings of IEEE International Conference on Computer Vision, 2019.

[32] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. MirrorGAN: Learning text-to-image generation by redescription. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1505--1514.

[33] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. Learn, imagine and create: Text-to-image generation from prior knowledge. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 885--895.

[34] Wenbo Li, Pengchuan Zhang, Lei Zhang, et al. Object-driven text-to-image synthesis via adversarial training. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[35] Zehao Huang, Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. 2017. arXiv preprint.

[36] Junho Yim, Donggyu Joo, Jihoon Bae, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4133--4141.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, et al. FitNets: Hints for thin deep nets. 2014. arXiv preprint.

[38] Sergey Zagoruyko, Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. 2016. arXiv preprint.

[39] Xinqian Gu, Bingpeng Ma, Hong Chang, et al. Temporal knowledge propagation for image-to-video person re-identification. In: Proceedings of IEEE International Conference on Computer Vision, 2019. 9647--9656.

[40] Mingkuan Yuan, Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of ACM International Conference on Multimedia, 2018.

[41] Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

[42] Yann N Dauphin, Angela Fan, Michael Auli, et al. Language modeling with gated convolutional networks. In: Proceedings of International Conference on Machine Learning, 2017. 933--941.

[43] Diederik P Kingma, Jimmy Ba. Adam: A method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, 2015.

[44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2818--2826.

[45] Jia Deng, Wei Dong, Richard Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009. 248--255.

[46] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A system for large-scale machine learning. In: Proceedings of Symposium on Operating Systems Design and Implementation, 2016. 265--283.

[47] Scott E Reed, Zeynep Akata, Santosh Mohan, et al. Learning what and where to draw. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 217--225.

  • Figure 1

    (Color online) Architecture of the proposed ICSD-GAN model. The whole network contains three stages (i.e., ${\rm Stage}^0$, ${\rm Stage}^1$, and ${\rm Stage}^2$) and generates images at $64\times 64$, $128\times 128$, and $256\times 256$ resolutions. $G_i$ and $D_i$ are the generator and discriminator of ${\rm Stage}^i$, respectively. We conduct interstage knowledge distillation between ${\rm Stage}^2$ and ${\rm Stage}^0$/${\rm Stage}^1$ via the CSD block. On the red line, "T" denotes the teacher branch and "S" denotes the student branch.
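
Figure 1 describes a cascade in which each stage refines the previous stage's hidden features and emits an image at its own resolution. The PyTorch skeleton below is only an illustrative sketch of that data flow, not the authors' released code; the stage feature generators (`f0`--`f2`), the Conv-Tanh image heads (`to_img0`--`to_img2`), and their call signatures are assumed placeholders.

```python
# Sketch of the three-stage cascade in Figure 1 (hypothetical module interfaces).
import torch
import torch.nn as nn

class MultiStageGenerator(nn.Module):
    def __init__(self, f0, f1, f2, to_img0, to_img1, to_img2):
        super().__init__()
        # f_i produces the hidden feature map of Stage^i; to_img_i renders it to RGB.
        self.f = nn.ModuleList([f0, f1, f2])
        self.to_img = nn.ModuleList([to_img0, to_img1, to_img2])

    def forward(self, z, sent_emb, word_embs):
        # Stage^0: noise + sentence embedding -> 64x64 image.
        h0 = self.f[0](torch.cat([z, sent_emb], dim=1))
        x0 = self.to_img[0](h0)              # N x 3 x 64 x 64
        # Stage^1: refine previous features with word-level attention -> 128x128.
        h1 = self.f[1](h0, word_embs)
        x1 = self.to_img[1](h1)              # N x 3 x 128 x 128
        # Stage^2: refine again -> 256x256.
        h2 = self.f[2](h1, word_embs)
        x2 = self.to_img[2](h2)              # N x 3 x 256 x 256
        # The CSD block treats Stage^2 outputs as the teacher ("T") and
        # Stage^0/Stage^1 outputs as students ("S").
        return [x0, x1, x2], [h0, h1, h2]
```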

  • Figure 2

    (Color online) Detailed structure of the CSD block. The block takes the student feature $F_{\rm S}$ and the teacher feature $F_{\rm T}$ as input and outputs the CSD loss. In this figure, the black solid line denotes forward propagation and the blue dotted line denotes backpropagation. We optimize the output loss without backpropagating through the teacher branch.
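
The CSD block compares how samples within a batch relate to each other in the student and teacher feature spaces. The sketch below is one plausible instantiation, not necessarily the paper's exact formulation: it builds an $N\times N$ cosine-similarity matrix over the batch for each branch and penalizes their squared difference, detaching the teacher so that, as in Figure 2, no gradient flows back into the teacher branch.

```python
# A minimal sketch of a cross-sample similarity distillation (CSD) loss.
# The exact similarity measure and matching loss used in the paper may differ.
import torch
import torch.nn.functional as F

def csd_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """f_s: student features (N, C, H, W); f_t: teacher features (N, C', H', W')."""
    n = f_s.size(0)
    # Flatten each sample to a vector and L2-normalize it.
    s = F.normalize(f_s.reshape(n, -1), dim=1)
    t = F.normalize(f_t.reshape(n, -1), dim=1)
    # Cross-sample similarity matrices (N x N): entry (i, j) compares samples i and j.
    sim_s = s @ s.t()
    # detach() blocks backpropagation into the teacher branch, as in Figure 2.
    sim_t = (t @ t.t()).detach()
    return F.mse_loss(sim_s, sim_t)
```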

  • Figure 3

    (Color online) Visualization of $256\times 256$ images generated by StackGAN, HDGAN, MS-GAN, and our ICSD-GAN model on the CUB and Oxford-102 datasets.

  • Figure 4

    (Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the CUB dataset.

  • Figure 5

    (Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the Oxford-102 dataset.

  •   

    Algorithm 1 ICSD-GAN training algorithm

    Require: Training set, number of stages $n_{\rm stage}=3$, number of training iterations $T$, batch size $N$, generator learning rate $\alpha_g$, discriminator learning rate $\alpha_d$, loss coefficients $\lambda$ and $\lambda_1$, Adam hyperparameters $\beta_1$ and $\beta_2$.

    Output: Generator parameters $\theta_g$.

    Initialize discriminator parameters $\theta_d = [\theta_{d_0}, \theta_{d_1}, \theta_{d_2}]$ and generator parameters $\theta_g = [\theta_{g_0}, \theta_{g_1}, \theta_{g_2}]$;

    for $t = 1, \ldots, T$ do

    Sample real images $x$, text $t$, mismatched images $x_w$, and random noise $z$;

    $\hat{x} = G(z, t)$, where $\hat{x} = [\hat{x}_0\in\mathbb{R}^{N\times3\times64\times64}, \hat{x}_1\in\mathbb{R}^{N\times3\times128\times128}, \hat{x}_2\in\mathbb{R}^{N\times3\times256\times256}]$;

    for $i = 0, \ldots, 2$ do

    $L_{D_i} \leftarrow L_{D_i}^{\rm TF} + \lambda L_{D_i}^{\rm ASL}$;

    $\theta_{d_i} \leftarrow \mathrm{Adam}(L_{D_i}, \theta_{d_i}, \alpha_d, \beta_1, \beta_2)$;

    end for

    $L_{G} \leftarrow \sum_{i=0}^{2} L_{G_i} + \lambda_1 L_{\rm DAMSM} + L_{\rm CA} + L_{\rm CSD}(\hat{x}_0, \hat{x}_2) + L_{\rm CSD}(\hat{x}_1, \hat{x}_2)$;

    $\theta_{g} \leftarrow \mathrm{Adam}(L_{G}, \theta_{g}, \alpha_g, \beta_1, \beta_2)$;

    end for

    return $\theta_{g}$.
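
For readers who prefer code, the following PyTorch sketch mirrors Algorithm 1: each discriminator is updated with its true/fake and auxiliary similarity losses, and the generator is updated jointly with the per-stage adversarial terms, the DAMSM and CA losses, and the two interstage CSD terms. The helper interfaces (`losses.tf`, `losses.asl`, `losses.g_adv`, `losses.damsm`, `losses.ca`, `G.z_dim`) and the default learning rates, $\lambda$, $\lambda_1$, and Adam betas are illustrative assumptions, not values from the paper.

```python
# Sketch of the Algorithm 1 training loop (hypothetical interfaces, assumed defaults).
import torch

def train_icsd_gan(G, Ds, loader, losses, csd_loss, T,
                   alpha_g=2e-4, alpha_d=2e-4, lam=1.0, lam1=5.0,
                   betas=(0.5, 0.999)):
    opt_g = torch.optim.Adam(G.parameters(), lr=alpha_g, betas=betas)
    opt_ds = [torch.optim.Adam(D.parameters(), lr=alpha_d, betas=betas) for D in Ds]

    for step, (x, txt, x_w) in enumerate(loader):
        if step >= T:
            break
        # x and x_w are lists of real / mismatched images at 64, 128, 256 resolution.
        z = torch.randn(x[0].size(0), G.z_dim, device=x[0].device)

        # Generator forward pass: fake images at the three resolutions.
        x_hat = G(z, txt)                      # [x_hat_0, x_hat_1, x_hat_2]

        # Per-stage discriminator update: true/fake loss plus the auxiliary
        # similarity loss weighted by lambda.
        for i, (D, opt_d) in enumerate(zip(Ds, opt_ds)):
            l_d = losses.tf(D, x[i], x_hat[i].detach(), txt) \
                  + lam * losses.asl(D, x[i], x_w[i], txt)
            opt_d.zero_grad()
            l_d.backward()
            opt_d.step()

        # Joint generator update: per-stage adversarial terms, DAMSM and CA losses,
        # and the two interstage CSD terms (Stage^2 distills into Stage^0 and Stage^1).
        l_g = sum(losses.g_adv(Ds[i], x_hat[i], txt) for i in range(3)) \
              + lam1 * losses.damsm(x_hat[2], txt) + losses.ca() \
              + csd_loss(x_hat[0], x_hat[2]) + csd_loss(x_hat[1], x_hat[2])
        opt_g.zero_grad()
        l_g.backward()
        opt_g.step()
    return G
```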

  • Table 1  

    Table 1  Structure of generators$^{\rm a)}$

    $G_0$                                                    | $G_1$/$G_2$
    Concat, FC, BN, GLU, Reshape                             | Attn, Concat
    [UpSample($2$), Conv($3\times3/1$), BN, GLU] $\times 4$  | AM block $\}\times N_{\rm AM}$
    Conv($3\times3/1$), Tanh                                 | UpSample($2$), Conv($3\times3/1$), BN, GLU
    --                                                       | Conv($3\times3/1$), Tanh

    a) AM block denotes the attention-modulation block [17]. GLU denotes the gated linear unit layer [42]. UpSample($2$) means that the upsampling stride is 2. Conv($3\times3/1$) means that the kernel size is 3 and the stride is 1 for the convolutional layer.
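
As a concrete reading of the $G_0$ column, the sketch below stacks the FC-BN-GLU stem, four UpSample-Conv-BN-GLU blocks (taking a $4\times4$ seed to $64\times64$), and the Conv-Tanh image head in PyTorch. The channel widths (`ngf`, `z_dim`, `c_dim`) are illustrative guesses, not the paper's settings; GLU halves the channel dimension, so each layer emits twice the desired width.

```python
# Hedged PyTorch sketch of the G_0 column of Table 1 (illustrative channel sizes).
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # UpSample(2) -> Conv(3x3/1) -> BN -> GLU, as in the second row of Table 1.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch * 2, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch * 2),
        nn.GLU(dim=1),
    )

class G0(nn.Module):
    def __init__(self, z_dim=100, c_dim=128, ngf=64):
        super().__init__()
        self.ngf = ngf
        # Concat(z, c) -> FC -> BN -> GLU -> Reshape to a 4x4 feature map.
        self.fc = nn.Sequential(
            nn.Linear(z_dim + c_dim, ngf * 16 * 4 * 4 * 2),
            nn.BatchNorm1d(ngf * 16 * 4 * 4 * 2),
            nn.GLU(dim=1),
        )
        # Four up_blocks: 4x4 -> 8 -> 16 -> 32 -> 64.
        self.up = nn.Sequential(
            up_block(ngf * 16, ngf * 8),
            up_block(ngf * 8, ngf * 4),
            up_block(ngf * 4, ngf * 2),
            up_block(ngf * 2, ngf),
        )
        # Conv(3x3/1), Tanh image head.
        self.to_img = nn.Sequential(
            nn.Conv2d(ngf, 3, kernel_size=3, stride=1, padding=1),
            nn.Tanh(),
        )

    def forward(self, z, c):
        h = self.fc(torch.cat([z, c], dim=1))
        h = h.view(-1, self.ngf * 16, 4, 4)
        return self.to_img(self.up(h))        # N x 3 x 64 x 64
```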

  • Table 2  

    Table 2  Structure of downsample blocks$^{\rm a)}$

    Downsample block in $D_0$                        | Downsample block in $D_1$                        | Downsample block in $D_2$
    Conv($4\times4/2$), LeakyReLU                    | Conv($4\times4/2$), LeakyReLU                    | Conv($4\times4/2$), LeakyReLU
    [Conv($4\times4/2$), BN, LeakyReLU] $\times 3$   | [Conv($4\times4/2$), BN, LeakyReLU] $\times 4$   | [Conv($4\times4/2$), BN, LeakyReLU] $\times 5$
    --                                               | Conv($3\times3/1$), BN, LeakyReLU                | [Conv($3\times3/1$), BN, LeakyReLU] $\times 2$

    a) Conv($4\times4/2$) means that the kernel size is 4 and the stride is 2 for the convolutional layer.
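
A similar reading of the $D_0$ column of Table 2 is sketched below: one Conv($4\times4/2$)-LeakyReLU block followed by three Conv-BN-LeakyReLU blocks, shrinking a $64\times64$ input to a $4\times4$ feature map. $D_1$ and $D_2$ extend the same pattern with additional Conv($4\times4/2$) and Conv($3\times3/1$) blocks. The base width `ndf` and the LeakyReLU slope are assumptions.

```python
# Hedged PyTorch sketch of the D_0 downsample stack in Table 2.
import torch.nn as nn

def d0_downsample(ndf=64):
    layers = [
        nn.Conv2d(3, ndf, kernel_size=4, stride=2, padding=1),   # 64 -> 32
        nn.LeakyReLU(0.2, inplace=True),
    ]
    ch = ndf
    for _ in range(3):  # [Conv(4x4/2), BN, LeakyReLU] x 3: 32 -> 16 -> 8 -> 4
        layers += [
            nn.Conv2d(ch, ch * 2, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(ch * 2),
            nn.LeakyReLU(0.2, inplace=True),
        ]
        ch *= 2
    return nn.Sequential(*layers)
```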

  • Table 3  

    Table 3  Details of the CUB and Oxford-102 datasets

    Dataset         | #images | Captions per image | #categories | #train categories | #test categories
    CUB [21]        | 11788   | 10                 | 200         | 150               | 50
    Oxford-102 [20] | 8189    | 10                 | 102         | 82                | 20

  • Table 4  

    Table 4  Quantitative comparison with state-of-the-art methods on the CUB and Oxford-102 datasets

    Method           | CUB Inception score ($\uparrow$) | CUB FID ($\downarrow$) | Oxford-102 Inception score ($\uparrow$) | Oxford-102 FID ($\downarrow$)
    GAN_CLS_INT [12] | 2.88 $\pm$ 0.04                  | 68.79                  | 2.66 $\pm$ 0.03                         | 79.55
    GAWWN [47]       | 3.62 $\pm$ 0.07                  | 67.22                  | --                                      | --
    StackGAN [13]    | 3.70 $\pm$ 0.04                  | 51.89                  | 3.20 $\pm$ 0.01                         | 55.28
    StackGAN-v2 [14] | 4.04 $\pm$ 0.05                  | 15.30                  | 3.26 $\pm$ 0.01                         | 48.68
    HDGAN [16]       | 4.15 $\pm$ 0.05                  | 18.23                  | 3.45 $\pm$ 0.07                         | --
    AttnGAN [15]     | 4.36 $\pm$ 0.03                  | 10.65                  | 3.75 $\pm$ 0.02                         | --
    MirrorGAN [32]   | 4.56 $\pm$ 0.05                  | --                     | --                                      | --
    LeicaGAN [33]    | 4.62 $\pm$ 0.06                  | --                     | 3.92 $\pm$ 0.02                         | --
    MS-GAN [17]      | 4.56 $\pm$ 0.02                  | 10.41                  | 3.95 $\pm$ 0.03                         | 36.24
    ICSD-GAN         | 4.66 $\pm$ 0.04                  | 9.35                   | 3.87 $\pm$ 0.05                         | 32.64
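
For reference, the Inception score reported above is the exponentiated mean KL divergence between each image's predicted class distribution and the marginal class distribution, computed from Inception-v3 predictions [22, 44]; the mean and standard deviation over splits give the "$\pm$" entries. The NumPy sketch below assumes precomputed softmax outputs and the common 10-split protocol, which may differ from the paper's exact evaluation setup.

```python
# Minimal NumPy sketch of the Inception score (higher is better).
import numpy as np

def inception_score(probs: np.ndarray, splits: int = 10):
    """probs: (num_images, num_classes) softmax outputs of Inception-v3."""
    scores = []
    for part in np.array_split(probs, splits):
        p_y = part.mean(axis=0, keepdims=True)                 # marginal class distribution
        kl = part * (np.log(part + 1e-12) - np.log(p_y + 1e-12))
        scores.append(np.exp(kl.sum(axis=1).mean()))           # exp of the mean KL divergence
    return float(np.mean(scores)), float(np.std(scores))       # the "mean +/- std" in Table 4
```
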
  • Table 5  

    Table 5  Performance of the baseline model, the model with $L_{\rm CSD}(I_0, I_2)$, the model with $L_{\rm CSD}(I_1, I_2)$, and our full model with both losses on the CUB dataset

    Method                       | Inception score ($\uparrow$) | FID ($\downarrow$)
    Baseline                     | 4.56 $\pm$ 0.02              | 10.41
    With $L_{\rm CSD}(I_0, I_2)$ | 4.72 $\pm$ 0.06              | 10.97
    With $L_{\rm CSD}(I_1, I_2)$ | 4.58 $\pm$ 0.04              | 9.58
    ICSD-GAN                     | 4.66 $\pm$ 0.04              | 9.35