SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120102 (2021) https://doi.org/10.1007/s11432-020-2900-x

## Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

• Accepted: Apr 26, 2020
• Published: Nov 17, 2020

### Acknowledgment

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and Fundamental Research Funds for the Central Universities.

### References

[1] Xi Chen, Yan Duan, Rein Houthooft, et al. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2172--2180.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680.

[3] Diederik P Kingma, Max Welling. Auto-encoding variational Bayes. 2013. arXiv preprint.

[4] Andrew Brock, Jeff Donahue, Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In: Proceedings of International Conference on Learning Representations, 2019.

[5] Tero Karras, Samuli Laine, Timo Aila. A style-based generator architecture for generative adversarial networks. 2018. arXiv preprint.

[6] Wei Xiong, Wenhan Luo, Lin Ma, et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2364--2373.

[7] Wei Xiong, Zhe Lin, Jimei Yang, et al. Foreground-aware image inpainting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5840--5848.

[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. Image-to-image translation with conditional adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1125--1134.

[9] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2223--2232.

[10] Chen J, Shu Z, Miao Y. Structure-preserving shape completion of 3D point clouds with generative adversarial network. SCIENTIA SINICA Informationis, 2020, 50: 675--691.

[11] Yuanhao Li, Dongyang Ao, Corneliu Octavian Dumitru, et al. Super-resolution of geosynchronous synthetic aperture radar images using dialectical GANs. SCIENTIA SINICA Information Sciences, 2019, 62(10): 209302.

[12] Scott Reed, Zeynep Akata, Xinchen Yan, et al. Generative adversarial text to image synthesis. In: Proceedings of International Conference on Machine Learning, 2016.

[13] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5907--5915.

[14] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947--1962.

[15] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1316--1324.

[16] Zizhao Zhang, Yuanpu Xie, Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 6199--6208.

[17] Fengling Mao, Bingpeng Ma, Hong Chang, et al. MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators. In: Proceedings of British Machine Vision Conference, 2019.

[18] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. 2015. arXiv preprint.

[19] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the knowledge in a neural network. 2015. arXiv preprint.

[20] Maria-Elena Nilsback, Andrew Zisserman. Automated flower classification over a large number of classes. In: Proceedings of Indian Conference on Computer Vision, Graphics & Image Processing, 2008. 722--729.

[21] Catherine Wah, Steve Branson, Peter Welinder, et al. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2234--2242.

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 6626--6637.

[24] Mehdi Mirza, Simon Osindero. Conditional generative adversarial nets. 2014. arXiv preprint.

[25] Yanwei Pang, Jin Xie, Xuelong Li. Visual haze removal by a unified generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[26] Sangwoo Mo, Minsu Cho, Jinwoo Shin. InstaGAN: Instance-aware image-to-image translation. In: Proceedings of International Conference on Learning Representations, 2019.

[27] Zhen Zhu, Tengteng Huang, Baoguang Shi, et al. Progressive pose attention transfer for person image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[28] Zhijie Zhang, Yanwei Pang. CGNet: Cross-guidance network for semantic segmentation. SCIENCE CHINA Information Sciences, 2020, 63: 120104.

[29] Minghui Liao, Boyu Song, Shangbang Long, et al. SynthText3D: Synthesizing scene text images from 3D virtual worlds. SCIENTIA SINICA Information Sciences, 2020, 63: 120105.

[30] Scott Reed, Zeynep Akata, Honglak Lee, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 49--58.

[31] Zhong Ji, Haoran Wang, Jungong Han, et al. Saliency-guided attention network for image-sentence matching. In: Proceedings of IEEE International Conference on Computer Vision, 2019.

[32] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. MirrorGAN: Learning text-to-image generation by redescription. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1505--1514.

[33] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. Learn, imagine and create: Text-to-image generation from prior knowledge. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 885--895.

[34] Wenbo Li, Pengchuan Zhang, Lei Zhang, et al. Object-driven text-to-image synthesis via adversarial training. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[35] Zehao Huang, Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. 2017. arXiv preprint.

[36] Junho Yim, Donggyu Joo, Jihoon Bae, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4133--4141.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, et al. FitNets: Hints for thin deep nets. 2014. arXiv preprint.

[38] Sergey Zagoruyko, Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. 2016. arXiv preprint.

[39] Xinqian Gu, Bingpeng Ma, Hong Chang, et al. Temporal knowledge propagation for image-to-video person re-identification. In: Proceedings of IEEE International Conference on Computer Vision, 2019. 9647--9656.

[40] Mingkuan Yuan, Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of ACM International Conference on Multimedia, 2018.

[41] Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

[42] Yann N Dauphin, Angela Fan, Michael Auli, et al. Language modeling with gated convolutional networks. In: Proceedings of International Conference on Machine Learning, 2017. 933--941.

[43] Diederik P Kingma, Jimmy Ba. Adam: A method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, 2015.

[44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2818--2826.

[45] Jia Deng, Wei Dong, Richard Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009. 248--255.

[46] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A system for large-scale machine learning. In: Proceedings of Symposium on Operating Systems Design and Implementation, 2016. 265--283.

[47] Scott E Reed, Zeynep Akata, Santosh Mohan, et al. Learning what and where to draw. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 217--225.

• Figure 1

(Color online) Architecture of the proposed ICSD-GAN model. The whole network contains three stages (i.e., ${\rm Stage}^0$, ${\rm Stage}^1$, and ${\rm Stage}^2$) and generates images at $64\times 64$, $128\times 128$, and $256\times 256$ resolutions. $G_i$ and $D_i$ are the generator and discriminator of ${\rm Stage}^i$, respectively. We conduct interstage knowledge distillation between ${\rm Stage}^2$ and ${\rm Stage}^0$/${\rm Stage}^1$ via the CSD block. On the red line, "T" denotes the teacher branch and "S" denotes the student branch.

• Figure 2

(Color online) Detailed structure of the CSD block. This block takes the student feature $F_{\rm S}$ and the teacher feature $F_{\rm T}$ as input, and outputs the CSD loss. In this figure, the black solid line denotes forward propagation and the blue dotted line denotes backpropagation. We optimize the output loss without backpropagating to the teacher branch.
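The data flow through the CSD block can be sketched numerically. Below is a minimal NumPy illustration, assuming the CSD loss matches batch-wise cosine-similarity matrices between student and teacher features; the exact formulation is given in the main text, and `csd_loss` and its inputs are illustrative names, not the authors' code:

```python
import numpy as np

def csd_loss(f_student, f_teacher):
    """Cross-sample similarity distillation loss (illustrative sketch).

    f_student, f_teacher: (N, D) arrays of per-sample features pooled from
    the student and teacher stages. The teacher similarities act as a fixed
    target: no gradient flows back to the teacher branch.
    """
    def cosine_sim_matrix(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)  # L2-normalize rows
        return f @ f.T                                    # (N, N) pairwise similarities

    s = cosine_sim_matrix(f_student)
    t = cosine_sim_matrix(f_teacher)  # treated as a constant target
    return np.mean((s - t) ** 2)      # match the two similarity structures
```

The key point the figure makes is the asymmetry: the loss is a function of both similarity matrices, but only the student branch is updated to reduce it.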

• Figure 3

(Color online) Visualization results of $256\times 256$ images generated by StackGAN, HDGAN, MS-GAN, and our ICSD-GAN model on the CUB and Oxford-102 datasets.

• Figure 4

(Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the CUB dataset.

• Figure 5

(Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the Oxford-102 dataset.

• Algorithm 1 ICSD-GAN training algorithm

Require: Training set, number of stages $n_{\rm stage}=3$, number of training iterations $T$, batch size $N$, learning rate of the generator $\alpha_g$, learning rate of the discriminator $\alpha_d$, coefficients $\lambda$ and $\lambda_1$ for the loss terms, Adam hyperparameters $\beta_1$ and $\beta_2$.

Output: Generator parameters $\theta_g$.

Initialize discriminator parameters $\theta_d = [\theta_{d_0}, \theta_{d_1}, \theta_{d_2}]$ and generator parameters $\theta_g = [\theta_{g_0}, \theta_{g_1}, \theta_{g_2}]$;

for $t = 1, \ldots, T$ do

Sample real images $x$, text $t$, mismatched images $x_w$, and random noise $z$;

$\hat{x} = G(z, t)$, where $\hat{x} = [\hat{x}_0 \in \mathbb{R}^{N\times3\times64\times64}, \hat{x}_1 \in \mathbb{R}^{N\times3\times128\times128}, \hat{x}_2 \in \mathbb{R}^{N\times3\times256\times256}]$;

for $i = 0, \ldots, 2$ do

$L_{D_i} \leftarrow L_{D_i}^{\rm TF} + \lambda L_{D_i}^{\rm ASL}$;

$\theta_{d_i} \leftarrow \mathrm{Adam}(L_{D_i}, \theta_{d_i}, \alpha_d, \beta_1, \beta_2)$;

end for

$L_{G} \leftarrow \sum_{i=0}^{2} L_{G_i} + \lambda_1 L_{\rm DAMSM} + L_{\rm CA} + L_{\rm CSD}(\hat{x}_0, \hat{x}_2) + L_{\rm CSD}(\hat{x}_1, \hat{x}_2)$;

$\theta_{g} \leftarrow \mathrm{Adam}(L_{G}, \theta_{g}, \alpha_g, \beta_1, \beta_2)$;

end for

return $\theta_g$.
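The update order of Algorithm 1 (one step for each stage's discriminator, then a single joint step for all generator stages) can be sketched as a runnable skeleton. Everything below is a toy stand-in: the quadratic losses, the finite-difference gradients, and the plain gradient step (in place of Adam) are placeholders for the real networks and losses, not the authors' implementation.

```python
import numpy as np

def toy_loss(params, batch):
    # Stand-in for any of the GAN loss terms (L_D^TF, L_G_i, L_CSD, ...).
    return float(np.mean((batch @ params) ** 2))

def grad_step(params, grad, lr):
    # Placeholder for the Adam update used in the paper.
    return params - lr * grad

def numerical_grad(loss_fn, params, batch, eps=1e-5):
    # Central-difference gradient, standing in for backpropagation.
    g = np.zeros_like(params)
    for j in range(params.size):
        e = np.zeros_like(params)
        e[j] = eps
        g[j] = (loss_fn(params + e, batch) - loss_fn(params - e, batch)) / (2 * eps)
    return g

def train(T=2, N=4, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    theta_d = [rng.normal(size=5) for _ in range(3)]  # one discriminator per stage
    theta_g = rng.normal(size=5)                      # joint generator parameters
    for _ in range(T):
        batch = rng.normal(size=(N, 5))               # stands in for (x, t, x_w, z)
        for i in range(3):                            # inner loop: update each D_i
            g = numerical_grad(toy_loss, theta_d[i], batch)
            theta_d[i] = grad_step(theta_d[i], g, lr)
        # single joint update of all generator stages against the summed loss
        g = numerical_grad(toy_loss, theta_g, batch)
        theta_g = grad_step(theta_g, g, lr)
    return theta_g
```

The structural point mirrored here is that the three discriminators are optimized independently per stage, while the generator stages share one combined objective and one update.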

• Table 1

Table 1 Structure of generators$^{\rm a)}$

| $G_0$ | $G_1$/$G_2$ |
|---|---|
| Concat, FC, BN, GLU, Reshape | Attn, Concat |
| [UpSample($2$), Conv($3\times3/1$), BN, GLU] $\times 4$ | AM block $\times N_{\rm AM}$ |
| Conv($3\times3/1$), Tanh | UpSample($2$), Conv($3\times3/1$), BN, GLU |
| | Conv($3\times3/1$), Tanh |

a) AM block denotes the attention-modulation block [17]. GLU denotes the gated linear units layer [42]. UpSample($2$) means that the upsampling stride is 2. Conv($3\times3/1$) means that kernel size is 3 and stride is 1 for the convolutional layer.
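Of the layers listed in the table, GLU [42] is the least standard: it splits its input channels in half and uses one half, passed through a sigmoid, to gate the other half. A small NumPy sketch (the function name and split convention here are ours, not from the paper):

```python
import numpy as np

def glu(x, axis=1):
    """Gated linear unit [42]: split `x` in two along `axis` and gate
    the first half with the sigmoid of the second."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))  # a * sigmoid(b)
```

Note that the layer halves the channel count, e.g. a $(N, 8, H, W)$ input yields a $(N, 4, H, W)$ output; the convolutions before each GLU in the table must therefore produce twice the desired number of channels.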

• Table 2

Table 2 Structure of downsample blocks$^{\rm a)}$

| Downsample block in $D_0$ | Downsample block in $D_1$ | Downsample block in $D_2$ |
|---|---|---|
| Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU |
| Conv($4\times4/2$), BN, LeakyReLU $\times 3$ | Conv($4\times4/2$), BN, LeakyReLU $\times 4$ | Conv($4\times4/2$), BN, LeakyReLU $\times 5$ |
| | Conv($3\times3/1$), BN, LeakyReLU | Conv($3\times3/1$), BN, LeakyReLU $\times 2$ |

a) Conv($4\times4/2$) means that kernel size is 4 and stride is 2 for the convolutional layer.
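The differing block depths track the stage resolutions: assuming a padding scheme in which each stride-2 convolution exactly halves the spatial size, the $64$/$128$/$256$-pixel inputs of $D_0$/$D_1$/$D_2$ need $1{+}3$, $1{+}4$, and $1{+}5$ stride-2 convolutions respectively to reach the same $4\times4$ feature map. A quick arithmetic check (the helper name and halving assumption are ours):

```python
def spatial_size_after(input_size, n_stride2_convs):
    """Spatial size after a chain of stride-2 convolutions, assuming
    padding that makes each layer exactly halve the resolution."""
    size = input_size
    for _ in range(n_stride2_convs):
        size //= 2
    return size
```

With the layer counts from the table, all three discriminators end at the same map size, which is why only $D_1$ and $D_2$ need the extra stride-1 convolutions.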

• Table 3

Table 3 Details of the CUB and Oxford-102 datasets

| Dataset | #images | Captions per image | #categories | #train categories | #test categories |
|---|---|---|---|---|---|
| CUB [21] | 11788 | 10 | 200 | 150 | 50 |
| Oxford-102 [20] | 8189 | 10 | 102 | 82 | 20 |
• Table 4

Table 4 Quantitative comparison with state-of-the-art methods on the CUB and Oxford-102 datasets

| Method | CUB Inception score ($\uparrow$) | CUB FID ($\downarrow$) | Oxford-102 Inception score ($\uparrow$) | Oxford-102 FID ($\downarrow$) |
|---|---|---|---|---|
| GAN_CLS_INT [12] | 2.88 $\pm$ 0.04 | 68.79 | 2.66 $\pm$ 0.03 | 79.55 |
| GAWWN [47] | 3.62 $\pm$ 0.07 | 67.22 | – | – |
| StackGAN [13] | 3.70 $\pm$ 0.04 | 51.89 | 3.20 $\pm$ 0.01 | 55.28 |
| StackGAN-v2 [14] | 4.04 $\pm$ 0.05 | 15.30 | 3.26 $\pm$ 0.01 | 48.68 |
| HDGAN [16] | 4.15 $\pm$ 0.05 | 18.23 | 3.45 $\pm$ 0.07 | – |
| AttnGAN [15] | 4.36 $\pm$ 0.03 | 10.65 | 3.75 $\pm$ 0.02 | – |
| MirrorGAN [32] | 4.56 $\pm$ 0.05 | – | – | – |
| LeicaGAN [33] | 4.62 $\pm$ 0.06 | – | 3.92 $\pm$ 0.02 | – |
| MS-GAN [17] | 4.56 $\pm$ 0.02 | 10.41 | 3.95 $\pm$ 0.03 | 36.24 |
| ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35 | 3.87 $\pm$ 0.05 | 32.64 |
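The Inception score reported in the table is $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$, where $p(y|x)$ is the class distribution a pretrained Inception network [44] assigns to a generated image and $p(y)$ is the marginal over all generated images. A toy NumPy version of the formula, operating on precomputed class probabilities (it assumes every class receives some probability mass overall; the function name is ours):

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (num_images, num_classes), each row the classifier's p(y|x)
    for one generated image. Returns exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = p_yx.mean(axis=0)                # marginal label distribution p(y)
    safe = np.where(p_yx > 0, p_yx, 1.0)   # so that 0 * log(0) contributes 0
    kl = np.sum(p_yx * np.log(safe / p_y), axis=1)
    return float(np.exp(kl.mean()))
```

For instance, three one-hot predictions over three distinct classes give the maximal score of 3, while identical rows give 1: the score rewards images that are each confidently classified (sharp $p(y|x)$) yet diverse as a set (broad $p(y)$).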
• Table 5

Table 5 Performance of the baseline model, the model with $L_{\rm CSD}(I_0, I_2)$, the model with $L_{\rm CSD}(I_1, I_2)$, and our full model with the two losses on the CUB dataset

| Method | Inception score ($\uparrow$) | FID ($\downarrow$) |
|---|---|---|
| Baseline | 4.56 $\pm$ 0.02 | 10.41 |
| With $L_{\rm CSD}(I_0, I_2)$ | 4.72 $\pm$ 0.06 | 10.97 |
| With $L_{\rm CSD}(I_1, I_2)$ | 4.58 $\pm$ 0.04 | 9.58 |
| ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35 |
