SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120102 (2021) https://doi.org/10.1007/s11432-020-2900-x

## Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

• Accepted: Apr 26, 2020
• Published: Nov 17, 2020

### Acknowledgment

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61876171, 61976203) and Fundamental Research Funds for the Central Universities.

### References

[1] Xi Chen, Yan Duan, Rein Houthooft, et al. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2172--2180.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680.

[3] Diederik P Kingma, Max Welling. Auto-encoding variational Bayes. 2013. arXiv preprint.

[4] Andrew Brock, Jeff Donahue, Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In: Proceedings of International Conference on Learning Representations, 2019.

[5] Tero Karras, Samuli Laine, Timo Aila. A style-based generator architecture for generative adversarial networks. 2018. arXiv preprint.

[6] Wei Xiong, Wenhan Luo, Lin Ma, et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2364--2373.

[7] Wei Xiong, Zhe Lin, Jimei Yang, et al. Foreground-aware image inpainting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5840--5848.

[8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. Image-to-image translation with conditional adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1125--1134.

[9] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2223--2232.

[10] Chen J, Shu Z, Miao Y. Structure-preserving shape completion of 3D point clouds with generative adversarial network. SCIENTIA SINICA Informationis, 2020, 50: 675--691.

[11] Yuanhao Li, Dongyang Ao, Corneliu Octavian Dumitru, et al. Super-resolution of geosynchronous synthetic aperture radar images using dialectical GANs. SCIENTIA SINICA Information Sciences, 2019, 62(10): 209302.

[12] Scott Reed, Zeynep Akata, Xinchen Yan, et al. Generative adversarial text to image synthesis. In: Proceedings of International Conference on Machine Learning, 2016.

[13] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 5907--5915.

[14] Han Zhang, Tao Xu, Hongsheng Li, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947--1962.

[15] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1316--1324.

[16] Zizhao Zhang, Yuanpu Xie, Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 6199--6208.

[17] Fengling Mao, Bingpeng Ma, Hong Chang, et al. MS-GAN: Text to image synthesis with attention-modulated generators and similarity-aware discriminators. In: Proceedings of British Machine Vision Conference, 2019.

[18] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. 2015. arXiv preprint.

[19] Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the knowledge in a neural network. 2015. arXiv preprint.

[20] Maria-Elena Nilsback, Andrew Zisserman. Automated flower classification over a large number of classes. In: Proceedings of Indian Conference on Computer Vision, Graphics & Image Processing, 2008. 722--729.

[21] Catherine Wah, Steve Branson, Peter Welinder, et al. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. Improved techniques for training GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 2234--2242.

[23] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 6626--6637.

[24] Mehdi Mirza, Simon Osindero. Conditional generative adversarial nets. 2014. arXiv preprint.

[25] Yanwei Pang, Jin Xie, Xuelong Li. Visual haze removal by a unified generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[26] Sangwoo Mo, Minsu Cho, Jinwoo Shin. InstaGAN: Instance-aware image-to-image translation. In: Proceedings of International Conference on Learning Representations, 2019.

[27] Zhen Zhu, Tengteng Huang, Baoguang Shi, et al. Progressive pose attention transfer for person image generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[28] Zhijie Zhang, Yanwei Pang. CGNet: Cross-guidance network for semantic segmentation. SCIENCE CHINA Information Sciences, 2020, 63: 120104.

[29] Minghui Liao, Boyu Song, Shangbang Long, et al. SynthText3D: Synthesizing scene text images from 3D virtual worlds. SCIENTIA SINICA Information Sciences, 2020, 63: 120105.

[30] Scott Reed, Zeynep Akata, Honglak Lee, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 49--58.

[31] Zhong Ji, Haoran Wang, Jungong Han, et al. Saliency-guided attention network for image-sentence matching. In: Proceedings of IEEE International Conference on Computer Vision, 2019.

[32] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. MirrorGAN: Learning text-to-image generation by redescription. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1505--1514.

[33] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. Learn, imagine and create: Text-to-image generation from prior knowledge. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 885--895.

[34] Wenbo Li, Pengchuan Zhang, Lei Zhang, et al. Object-driven text-to-image synthesis via adversarial training. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[35] Zehao Huang, Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. 2017. arXiv preprint.

[36] Junho Yim, Donggyu Joo, Jihoon Bae, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 4133--4141.

[37] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, et al. FitNets: Hints for thin deep nets. 2014. arXiv preprint.

[38] Sergey Zagoruyko, Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. 2016. arXiv preprint.

[39] Xinqian Gu, Bingpeng Ma, Hong Chang, et al. Temporal knowledge propagation for image-to-video person re-identification. In: Proceedings of IEEE International Conference on Computer Vision, 2019. 9647--9656.

[40] Mingkuan Yuan, Yuxin Peng. Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of ACM International Conference on Multimedia, 2018.

[41] Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang. DarkRank: Accelerating deep metric learning via cross sample similarities transfer. In: Proceedings of AAAI Conference on Artificial Intelligence, 2018.

[42] Yann N Dauphin, Angela Fan, Michael Auli, et al. Language modeling with gated convolutional networks. In: Proceedings of International Conference on Machine Learning, 2017. 933--941.

[43] Diederik P Kingma, Jimmy Ba. Adam: A method for stochastic optimization. In: Proceedings of International Conference on Learning Representations, 2015.

[44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, et al. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2818--2826.

[45] Jia Deng, Wei Dong, Richard Socher, et al. ImageNet: A large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009. 248--255.

[46] Martín Abadi, Paul Barham, Jianmin Chen, et al. TensorFlow: A system for large-scale machine learning. In: Proceedings of Symposium on Operating Systems Design and Implementation, 2016. 265--283.

[47] Scott E Reed, Zeynep Akata, Santosh Mohan, et al. Learning what and where to draw. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 217--225.

• Figure 1

(Color online) Architecture of the proposed ICSD-GAN model. The whole network contains three stages (i.e., ${\rm Stage}^0$, ${\rm Stage}^1$, and ${\rm Stage}^2$) and generates images at $64\times 64$, $128\times 128$, and $256\times 256$ resolutions. $G_i$ and $D_i$ are the generator and discriminator of ${\rm Stage}^i$, respectively. We conduct interstage knowledge distillation between ${\rm Stage}^2$ and ${\rm Stage}^0$/${\rm Stage}^1$ via the CSD block. On the red line, "T" denotes the teacher branch and "S" denotes the student branch.

• Figure 2

(Color online) Detailed structure of the CSD block. This block takes the student feature $F_{\rm S}$ and the teacher feature $F_{\rm T}$ as input, and outputs the CSD loss. In this figure, the black solid line denotes forward propagation and the blue dotted line denotes backpropagation. We optimize the output loss without backpropagating to the teacher branch.
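The data flow through the CSD block can be sketched numerically. Below is a minimal NumPy illustration, assuming the CSD loss matches batch-wise cosine-similarity matrices between student and teacher features; the exact formulation is given in the main text, and `csd_loss` and its inputs are illustrative names, not the authors' code:

```python
import numpy as np

def csd_loss(f_student, f_teacher):
    """Cross-sample similarity distillation loss (illustrative sketch).

    f_student, f_teacher: (N, D) arrays of per-sample features pooled from
    the student and teacher stages. The teacher similarities act as a fixed
    target: no gradient flows back to the teacher branch.
    """
    def cosine_sim_matrix(f):
        f = f / np.linalg.norm(f, axis=1, keepdims=True)  # L2-normalize rows
        return f @ f.T                                    # (N, N) pairwise similarities

    s = cosine_sim_matrix(f_student)
    t = cosine_sim_matrix(f_teacher)  # treated as a constant target
    return np.mean((s - t) ** 2)      # match the two similarity structures
```

The key point the figure makes is the asymmetry: the loss is a function of both similarity matrices, but only the student branch is updated to reduce it.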

• Figure 3

(Color online) Visualization results of $256\times 256$ images generated by StackGAN, HDGAN, MS-GAN, and our ICSD-GAN model on the CUB and Oxford-102 datasets.

• Figure 4

(Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the CUB dataset.

• Figure 5

(Color online) Visualization of text modification to evaluate the generation ability of our ICSD-GAN on the Oxford-102 dataset.

• Algorithm 1 ICSD-GAN training algorithm

Require: Training set, number of stages $n_{\rm stage}=3$, number of training iterations $T$, batch size $N$, learning rate of the generator $\alpha_g$, learning rate of the discriminator $\alpha_d$, coefficients $\lambda$ and $\lambda_1$ for the loss terms, Adam hyperparameters $\beta_1$ and $\beta_2$.

Output: Generator parameters $\theta_g$.

Initialize discriminator parameters $\theta_d = [\theta_{d_0}, \theta_{d_1}, \theta_{d_2}]$ and generator parameters $\theta_g = [\theta_{g_0}, \theta_{g_1}, \theta_{g_2}]$;

for $t = 1, \ldots, T$ do

Sample real images $x$, text $t$, mismatched images $x_w$, and random noise $z$;

$\hat{x} = G(z, t)$, where $\hat{x} = [\hat{x}_0 \in \mathbb{R}^{N\times3\times64\times64}, \hat{x}_1 \in \mathbb{R}^{N\times3\times128\times128}, \hat{x}_2 \in \mathbb{R}^{N\times3\times256\times256}]$;

for $i = 0, \ldots, 2$ do

$L_{D_i} \leftarrow L_{D_i}^{\rm TF} + \lambda L_{D_i}^{\rm ASL}$;

$\theta_{d_i} \leftarrow \mathrm{Adam}(L_{D_i}, \theta_{d_i}, \alpha_d, \beta_1, \beta_2)$;

end for

$L_{G} \leftarrow \sum_{i=0}^{2} L_{G_i} + \lambda_1 L_{\rm DAMSM} + L_{\rm CA} + L_{\rm CSD}(\hat{x}_0, \hat{x}_2) + L_{\rm CSD}(\hat{x}_1, \hat{x}_2)$;

$\theta_{g} \leftarrow \mathrm{Adam}(L_{G}, \theta_{g}, \alpha_g, \beta_1, \beta_2)$;

end for

return $\theta_g$.
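The update order of Algorithm 1 (one step for each stage's discriminator, then a single joint step for all generator stages) can be sketched as a runnable skeleton. Everything below is a toy stand-in: the quadratic losses, the finite-difference gradients, and the plain gradient step (in place of Adam) are placeholders for the real networks and losses, not the authors' implementation.

```python
import numpy as np

def toy_loss(params, batch):
    # Stand-in for any of the GAN loss terms (L_D^TF, L_G_i, L_CSD, ...).
    return float(np.mean((batch @ params) ** 2))

def grad_step(params, grad, lr):
    # Placeholder for the Adam update used in the paper.
    return params - lr * grad

def numerical_grad(loss_fn, params, batch, eps=1e-5):
    # Central-difference gradient, standing in for backpropagation.
    g = np.zeros_like(params)
    for j in range(params.size):
        e = np.zeros_like(params)
        e[j] = eps
        g[j] = (loss_fn(params + e, batch) - loss_fn(params - e, batch)) / (2 * eps)
    return g

def train(T=2, N=4, lr=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    theta_d = [rng.normal(size=5) for _ in range(3)]  # one discriminator per stage
    theta_g = rng.normal(size=5)                      # joint generator parameters
    for _ in range(T):
        batch = rng.normal(size=(N, 5))               # stands in for (x, t, x_w, z)
        for i in range(3):                            # inner loop: update each D_i
            g = numerical_grad(toy_loss, theta_d[i], batch)
            theta_d[i] = grad_step(theta_d[i], g, lr)
        # single joint update of all generator stages against the summed loss
        g = numerical_grad(toy_loss, theta_g, batch)
        theta_g = grad_step(theta_g, g, lr)
    return theta_g
```

The structural point mirrored here is that the three discriminators are optimized independently per stage, while the generator stages share one combined objective and one update.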

• Table 1

Table 1 Structure of generators$^{\rm a)}$

| $G_0$ | $G_1$/$G_2$ |
|---|---|
| Concat, FC, BN, GLU, Reshape | Attn, Concat |
| [UpSample($2$), Conv($3\times3/1$), BN, GLU] $\times 4$ | AM block $\times N_{\rm AM}$ |
| Conv($3\times3/1$), Tanh | UpSample($2$), Conv($3\times3/1$), BN, GLU |
| | Conv($3\times3/1$), Tanh |

a) AM block denotes the attention-modulation block [17]. GLU denotes the gated linear units layer [42]. UpSample($2$) means that the upsampling stride is 2. Conv($3\times3/1$) means that kernel size is 3 and stride is 1 for the convolutional layer.
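Of the layers listed in the table, GLU [42] is the least standard: it splits its input channels in half and uses one half, passed through a sigmoid, to gate the other half. A small NumPy sketch (the function name and split convention here are ours, not from the paper):

```python
import numpy as np

def glu(x, axis=1):
    """Gated linear unit [42]: split `x` in two along `axis` and gate
    the first half with the sigmoid of the second."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))  # a * sigmoid(b)
```

Note that the layer halves the channel count, e.g. a $(N, 8, H, W)$ input yields a $(N, 4, H, W)$ output; the convolutions before each GLU in the table must therefore produce twice the desired number of channels.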

• Table 2

Table 2 Structure of downsample blocks$^{\rm a)}$

| Downsample block in $D_0$ | Downsample block in $D_1$ | Downsample block in $D_2$ |
|---|---|---|
| Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU | Conv($4\times4/2$), LeakyReLU |
| Conv($4\times4/2$), BN, LeakyReLU $\times 3$ | Conv($4\times4/2$), BN, LeakyReLU $\times 4$ | Conv($4\times4/2$), BN, LeakyReLU $\times 5$ |
| | Conv($3\times3/1$), BN, LeakyReLU | Conv($3\times3/1$), BN, LeakyReLU $\times 2$ |

a) Conv($4\times4/2$) means that kernel size is 4 and stride is 2 for the convolutional layer.
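The differing block depths track the stage resolutions: assuming a padding scheme in which each stride-2 convolution exactly halves the spatial size, the $64$/$128$/$256$-pixel inputs of $D_0$/$D_1$/$D_2$ need $1{+}3$, $1{+}4$, and $1{+}5$ stride-2 convolutions respectively to reach the same $4\times4$ feature map. A quick arithmetic check (the helper name and halving assumption are ours):

```python
def spatial_size_after(input_size, n_stride2_convs):
    """Spatial size after a chain of stride-2 convolutions, assuming
    padding that makes each layer exactly halve the resolution."""
    size = input_size
    for _ in range(n_stride2_convs):
        size //= 2
    return size
```

With the layer counts from the table, all three discriminators end at the same map size, which is why only $D_1$ and $D_2$ need the extra stride-1 convolutions.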

• Table 3

Table 3 Details of the CUB and Oxford-102 datasets

| Dataset | #images | Captions per image | #categories | #train categories | #test categories |
|---|---|---|---|---|---|
| CUB [21] | 11788 | 10 | 200 | 150 | 50 |
| Oxford-102 [20] | 8189 | 10 | 102 | 82 | 20 |
• Table 4

Table 4 Quantitative comparison with state-of-the-art methods on the CUB and Oxford-102 datasets

| Method | CUB Inception score ($\uparrow$) | CUB FID ($\downarrow$) | Oxford-102 Inception score ($\uparrow$) | Oxford-102 FID ($\downarrow$) |
|---|---|---|---|---|
| GAN_CLS_INT [12] | 2.88 $\pm$ 0.04 | 68.79 | 2.66 $\pm$ 0.03 | 79.55 |
| GAWWN [47] | 3.62 $\pm$ 0.07 | 67.22 | – | – |
| StackGAN [13] | 3.70 $\pm$ 0.04 | 51.89 | 3.20 $\pm$ 0.01 | 55.28 |
| StackGAN-v2 [14] | 4.04 $\pm$ 0.05 | 15.30 | 3.26 $\pm$ 0.01 | 48.68 |
| HDGAN [16] | 4.15 $\pm$ 0.05 | 18.23 | 3.45 $\pm$ 0.07 | – |
| AttnGAN [15] | 4.36 $\pm$ 0.03 | 10.65 | 3.75 $\pm$ 0.02 | – |
| MirrorGAN [32] | 4.56 $\pm$ 0.05 | – | – | – |
| LeicaGAN [33] | 4.62 $\pm$ 0.06 | – | 3.92 $\pm$ 0.02 | – |
| MS-GAN [17] | 4.56 $\pm$ 0.02 | 10.41 | 3.95 $\pm$ 0.03 | 36.24 |
| ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35 | 3.87 $\pm$ 0.05 | 32.64 |
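The Inception score reported in the table is $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$, where $p(y|x)$ is the class distribution a pretrained Inception network [44] assigns to a generated image and $p(y)$ is the marginal over all generated images. A toy NumPy version of the formula, operating on precomputed class probabilities (it assumes every class receives some probability mass overall; the function name is ours):

```python
import numpy as np

def inception_score(p_yx):
    """p_yx: (num_images, num_classes), each row the classifier's p(y|x)
    for one generated image. Returns exp(mean_x KL(p(y|x) || p(y)))."""
    p_y = p_yx.mean(axis=0)                # marginal label distribution p(y)
    safe = np.where(p_yx > 0, p_yx, 1.0)   # so that 0 * log(0) contributes 0
    kl = np.sum(p_yx * np.log(safe / p_y), axis=1)
    return float(np.exp(kl.mean()))
```

For instance, three one-hot predictions over three distinct classes give the maximal score of 3, while identical rows give 1: the score rewards images that are each confidently classified (sharp $p(y|x)$) yet diverse as a set (broad $p(y)$).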
• Table 5

Table 5 Performance of the baseline model, the model with $L_{\rm CSD}(I_0, I_2)$, the model with $L_{\rm CSD}(I_1, I_2)$, and our full model with the two losses on the CUB dataset

| Method | Inception score ($\uparrow$) | FID ($\downarrow$) |
|---|---|---|
| Baseline | 4.56 $\pm$ 0.02 | 10.41 |
| With $L_{\rm CSD}(I_0, I_2)$ | 4.72 $\pm$ 0.06 | 10.97 |
| With $L_{\rm CSD}(I_1, I_2)$ | 4.58 $\pm$ 0.04 | 9.58 |
| ICSD-GAN | 4.66 $\pm$ 0.04 | 9.35 |
