
SCIENTIA SINICA Informationis, Volume 50, Issue 8: 1217-1238 (2020) https://doi.org/10.1360/N112018-00304

Multi-task learning with shared random effects and specific sparse effects

More info
  • Received: Jan 28, 2019
  • Accepted: Jun 5, 2019
  • Published: Aug 5, 2020

Abstract

In multi-task learning scenarios, random effects may be shared across tasks while each task may also have its own sparse effects. This structure is often observed in sentiment analysis for movie rating. In this study, we consider a multi-task learning problem in which the variables carry both shared random effects and task-specific sparse effects, and we propose MSS (multi-task learning with shared random effects and specific sparse effects) to address it. To build the model, we place appropriate priors on the shared effects and the task-specific effects under the Bayesian framework. To overcome the computational complexity of Bayesian inference, we develop an efficient algorithm based on variational inference that scales to large data analysis problems. The effectiveness of MSS in prediction and variable selection is demonstrated through comprehensive simulation studies and a real data analysis of movie ratings. The results show that characterizing shared weak effects and task-specific sparse effects improves the accuracy of both prediction and variable selection.


Funded by

National Natural Science Foundation of China (71472023, 11501440)


Supplement

Appendix

Variational inference and parameter estimation

E-step

Given the parameters ${\boldsymbol \theta} = \{\sigma_{{\boldsymbol \beta}_0}^2, \sigma_{{\boldsymbol \beta}_j}^2, \sigma_j^2, \pi_j\}$, the joint distribution is \begin{align}{{\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}_j, {\boldsymbol \gamma}_j|{\boldsymbol X};\theta)} = &\prod_{j}\left({{\rm Pr}({\boldsymbol y}_j|{\boldsymbol \beta}_0, {\boldsymbol \beta}_j, {\boldsymbol \gamma}_j;\theta)}{\rm Pr}({\boldsymbol \beta}_j, {\boldsymbol \gamma}_j|\theta)\right){\rm Pr}({\boldsymbol \beta}_0|\theta) \\ = &\prod_{j}\left({\rm Pr}\big({\boldsymbol y}_j|{\boldsymbol X}_j({\boldsymbol \beta}_0+{\boldsymbol \beta}_j\odot{\boldsymbol \gamma}_j)\big)\mathcal{N}({\boldsymbol \beta}_j| \bf{0}, \sigma_{{\boldsymbol \beta}_j}^2{\boldsymbol I}_p)\prod_{k}{\pi_j^{\gamma_{jk}}(1-\pi_j)^{1-\gamma_{jk}}}\right)\mathcal{N}({\boldsymbol \beta}_0| \bf{0}, \sigma_{{\boldsymbol \beta}_0}^2{\boldsymbol I}_p), \tag{16} \end{align} where ${\boldsymbol \beta}_j\odot{\boldsymbol \gamma}_j$ denotes the element-wise product of the two vectors, i.e., the vector whose $k$-th element is $\beta_{jk}\gamma_{jk}$ $(k = 1, \ldots, p)$. Integrating out the latent variables $\boldsymbol{\beta}_0$, $\boldsymbol{\beta}$, and $\boldsymbol{\gamma}$ gives the marginal distribution: \begin{align}{{\rm Pr}({\boldsymbol y}|{\boldsymbol X};\boldsymbol{\theta})} = &\sum_{{\boldsymbol \gamma}}\int\int\prod_{j}\bigg({\mathcal{N}\big({\boldsymbol X}_j({\boldsymbol \beta}_0+{\boldsymbol \beta}_j\odot{\boldsymbol \gamma}_j), \sigma_j^2{\boldsymbol I}_{n_j}\big)}\mathcal{N}({\boldsymbol \beta}_j| \bf{0}, \sigma_{{\boldsymbol \beta}_j}^2{\boldsymbol I}_p) \\ &\cdot \prod_{k}{\pi_j^{\gamma_{jk}}(1-\pi_j)^{1-\gamma_{jk}}}\bigg)\mathcal{N}({\boldsymbol \beta}_0| \bf{0}, \sigma_{{\boldsymbol \beta}_0}^2{\boldsymbol I}_p)\, {\rm d}{\boldsymbol \beta}_0\, {\rm d}{\boldsymbol \beta}. \tag{17} \end{align} We also need the posterior distribution of the latent variables: \begin{equation}{\rm Pr}({\boldsymbol \beta}_0, {\boldsymbol{\beta}}, \boldsymbol{\gamma}|{\boldsymbol y}, {\boldsymbol X};\boldsymbol{\theta}) = \frac{{{\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}_j, {\boldsymbol \gamma}_j|{\boldsymbol X};\boldsymbol{\theta})}}{{{\rm Pr}({\boldsymbol y}|{\boldsymbol X};\boldsymbol{\theta})}}. \tag{18}\end{equation} Because of the integral, Eq. (17) does not admit a closed-form expression, so we adopt a variational inference approach. Assume that the posterior distribution of the latent variables factorizes as \begin{equation}q({\boldsymbol \beta}_0, {\boldsymbol{\beta}}, \boldsymbol{\gamma}) = q({\boldsymbol \beta}_0)\prod_{j}{\prod_{k}\big({q(\beta_{jk}|\gamma_{jk}})q(\gamma_{jk})}\big). \tag{19}\end{equation} By the general result of variational inference, the optimal $\log{q({\boldsymbol \beta}_0)}$ has the form \begin{align}\log{q^*({\boldsymbol \beta}_0)} = & \mathbb{E}_{{\boldsymbol \beta}, {\boldsymbol \gamma}} [\log{{\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X}; \theta)}] \\ = &\sum_j \left(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j} \left[-\frac{n_j}{2}\log(2\pi\sigma_j^2)-\frac{1}{2\sigma_j^2}\big({\boldsymbol y}_j-{\boldsymbol X}_j({\boldsymbol \beta}_0+{\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)\big)^{\rm T}\big({\boldsymbol y}_j-{\boldsymbol X}_j({\boldsymbol \beta}_0+{\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)\big) \right.\right. 
\\ &-\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_j}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\beta_{jk}^2-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\sum_{l\not = k}{\beta_{jl}^2}+\gamma_{jk}{\log(\pi_j)}+(1-\gamma_{jk})\log(1-\pi_j) \\ &+\log(\pi_j)\sum_{l\not = k}\gamma_{jl}+\log(1-\pi_j)\sum_{l\not = k}{(1-\gamma_{jl})}\Bigg]\Bigg) -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_0}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol \beta}_0^{\rm T}{\boldsymbol \beta}_0 \\ = &\sum_{j}\bigg({-\frac{1}{2\sigma_j^2}\big({\boldsymbol \beta}_0^{\rm T}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j{\boldsymbol \beta}_0+2\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \beta}_0-2{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j{\boldsymbol \beta}_0\big)}\bigg) -\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol \beta}_0^{\rm T}{\boldsymbol \beta}_0+{\rm const} \\ = & {\boldsymbol \beta}_0^{\rm T}\left(\sum_{j}{-\frac{1}{2\sigma_{j}^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right){\boldsymbol \beta}_0+\sum_{j}-\frac{1}{\sigma_{j}^2}\left(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\right){\boldsymbol \beta}_0 +{\rm const}. \tag{20} \end{align} Since $\log{q^*({\boldsymbol \beta}_0)}$ is a quadratic form, ${\boldsymbol \beta}_0 \sim \mathcal{N}({\boldsymbol \mu}_0, {\boldsymbol S}_0^2)$, where \begin{align}&{\boldsymbol S}_0^2 = -\frac{1}{2}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)^{-1}, \tag{21} \\ &{\boldsymbol \mu}_0 = {\boldsymbol S}_0^2\sum_{j}-\frac{1}{\sigma_j^2}\left(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\right)^{\rm T}. \tag{22} \end{align} Note that when $p$ is large, the computation associated with ${\boldsymbol S}_0^2$ becomes very expensive, because a matrix inversion is required at every iteration. We therefore further assume that $q({\boldsymbol \beta}_0)$ factorizes as $\prod_{k = 1}^{p}q(\beta_{0k})$, which yields \begin{align}&{\boldsymbol S}_0^2 = -\frac{1}{2}{\rm diag}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)^{-1}, \tag{23} \\ &\mu_{0k} = {\boldsymbol S}_0^2(k, k)\sum_{j}-\frac{1}{\sigma_j^2}\left(\sum_{l\not = k}{\boldsymbol x}_{jl}\mathbb{E}[{\beta_{0l}}]+{\boldsymbol X}_j\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)]}-{\boldsymbol y}_j\right)^{\rm T}{\boldsymbol x}_{jk}, \tag{24} \end{align} where ${\boldsymbol x}_{jk}$ denotes the $k$-th column of the design matrix ${\boldsymbol X}_j$.
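To make the factorized update of $q({\boldsymbol \beta}_0)$ in Eqs. (23) and (24) concrete, the following is a minimal NumPy sketch of the coordinate-wise computation. The function name update_beta0 and the argument layout (lists of per-task design matrices and responses) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def update_beta0(X, y, sigma2, sigma2_beta0, alpha, mu, mu0):
    """Coordinate-wise update of the factorized q(beta_0), Eqs. (23)-(24).

    X, y: lists of per-task design matrices X_j (n_j x p) and responses y_j;
    sigma2: array of noise variances sigma_j^2; alpha, mu: lists of current
    variational parameters alpha_j, mu_j; mu0 is updated in place.
    """
    J, p = len(X), X[0].shape[1]
    col_sq = [np.einsum('ik,ik->k', Xj, Xj) for Xj in X]   # x_jk^T x_jk for all k
    # Eq. (23): diagonal of S_0^2 under the fully factorized q(beta_0)
    s0 = 1.0 / (sum(col_sq[j] / sigma2[j] for j in range(J)) + 1.0 / sigma2_beta0)
    for k in range(p):
        num = 0.0
        for j in range(J):
            # residual of task j excluding the k-th shared-effect coordinate (Eq. (24))
            resid = (y[j] - X[j] @ (alpha[j] * mu[j])
                     - X[j] @ mu0 + X[j][:, k] * mu0[k])
            num += (X[j][:, k] @ resid) / sigma2[j]
        mu0[k] = s0[k] * num
    return mu0, s0
```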

Next, take the logarithm of the joint distribution in Eq. (16) and rearrange it as follows: \begin{align}\log {\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X};\boldsymbol{\theta}) = &\sum_{j}\left(-\frac{n_j}{2}\log(2\pi\sigma_j^2)-\frac{{\boldsymbol y}_j^{\rm T}{\boldsymbol y}_j}{2\sigma_{j}^2}+\frac{(\beta_{0k}+\gamma_{jk}\beta_{jk}){\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j}{\sigma_j^2}+\frac{\sum_{l\not = k}(\beta_{0l}+\gamma_{jl}\beta_{jl}){\boldsymbol x}_{jl}^{\rm T}{\boldsymbol y}_j}{\sigma_j^2} \right. \\ &-\frac{1}{2\sigma_{j}^2}(\beta_{0k}+\gamma_{jk}\beta_{jk})^2{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}-\frac{\sum_{l\not = k}\sum_{l'\not = l{\rm \;and\;}k}(\beta_{0l}+\gamma_{jl}\beta_{jl})(\beta_{0l'}+\gamma_{jl'}\beta_{jl'}){\boldsymbol x}_{jl}^{\rm T}{\boldsymbol x}_{jl'}}{2\sigma_j^2} \\ &-\frac{\sum_{l\not = k}{(\beta_{0l}+\gamma_{jl}\beta_{jl})^2{\boldsymbol x}_{jl}^{\rm T}{\boldsymbol x}_{jl}}}{2\sigma_{j}^2}-\frac{\sum_{l\not = k}(\beta_{0k}+\gamma_{jk}\beta_{jk})(\beta_{0l}+\gamma_{jl}\beta_{jl}){\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}}{\sigma_j^2} \\ &-\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_j}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\beta_{jk}^2-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\sum_{l\not = k}{\beta_{jl}^2}+\gamma_{jk}{\log(\pi_j)}+(1-\gamma_{jk})\log(1-\pi_j) \\ &+\log(\pi_j)\sum_{l\not = k}\gamma_{jl}+\log(1-\pi_j)\sum_{l\not = k}({1-\gamma_{jl}})\Bigg) -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_0}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}\sum_{k}{\beta_{0k}^{\rm T}\beta_{0k}}. \tag{25} \end{align} When $\gamma_{jk} = 1$, taking the expectation of Eq. (25) with respect to $q(\beta_{-jk}, \gamma_{-jk})$ and $q({\boldsymbol \beta}_0)$ gives \begin{align}&\log{q(\beta_{jk}|\gamma_{jk} = 1)} \\ & = \left(-\frac{1}{2\sigma_j^2}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\right)\beta_{jk}^2+\frac{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j-\sum_{l\not = k}{\mathbb{E}_{jl}{[\gamma_{jl}\beta_{jl}]}}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}-\sum_{l = 1}^p{\mathbb{E}_{\beta_{0l}}{[\beta_{0l}]}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}}}{\sigma_j^2}\beta_{jk} +{\rm const} \\ & = \left(-\frac{1}{2\sigma_j^2}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\right)\beta_{jk}^2+\frac{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j-\sum_{l\not = k}{\mathbb{E}_{jl}{[\gamma_{jl}\beta_{jl}]}}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j\mathbb{E}_{\beta_{0}}{[{\boldsymbol \beta}_0]}}{\sigma_j^2}\beta_{jk} +{\rm const}, \tag{26} \end{align} which is a quadratic form in $\beta_{jk}$, so $q^*(\beta_{jk}|\gamma_{jk} = 1) \sim \mathcal{N}(\mu_{jk}, s_{jk}^2)$, where \begin{align}s_{jk}^2 = &\frac{\sigma_{j}^2}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{27} \\ \mu_{jk} = &\frac{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j-\sum_{l\not = k}{\mathbb{E}_{jl}{[\gamma_{jl}\beta_{jl}]}}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j\mathbb{E}_{\beta_{0}}{[{\boldsymbol \beta}_0]}}{\sigma_{j}^2}s_{jk}^2 \\ = &\frac{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j-\sum_{l\not = k}{\mathbb{E}_{jl}{[\gamma_{jl}\beta_{jl}]}}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jl}-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j\mathbb{E}_{\beta_{0}}{[{\boldsymbol \beta}_0]}}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}}. \tag{28} \end{align} When $\gamma_{jk} = 0$, the same approach yields

$\log{q(\beta_{jk}|\gamma_{jk} = 0)} = -\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\beta_{jk}^2+{\rm const},$

and therefore $q(\beta_{jk}|\gamma_{jk} = 0) \sim \mathcal{N}(0, \sigma_{{\boldsymbol \beta}_j}^2)$.
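A corresponding sketch of the per-coordinate slab moments in Eqs. (27) and (28) is given below; when $\gamma_{jk} = 0$ the posterior is simply $\mathcal{N}(0, \sigma_{{\boldsymbol \beta}_j}^2)$ and needs no computation. The name update_beta_jk and its signature are illustrative assumptions; the paper's algorithm maintains the fitted vector $\tilde{\boldsymbol y}_j$ incrementally instead of recomputing it (see Eqs. (56) and (61)).

```python
import numpy as np

def update_beta_jk(Xj, yj, k, sigma2_j, sigma2_beta_j, alpha_j, mu_j, mu0):
    """Posterior mean and variance of beta_jk given gamma_jk = 1, Eqs. (27)-(28)."""
    xk = Xj[:, k]
    denom = xk @ xk + sigma2_j / sigma2_beta_j      # x_jk^T x_jk + sigma_j^2 / sigma_beta_j^2
    s2_jk = sigma2_j / denom                        # Eq. (27)
    # contribution of the other task-specific effects, sum_{l != k} alpha_jl mu_jl x_jl
    other = Xj @ (alpha_j * mu_j) - xk * (alpha_j[k] * mu_j[k])
    mu_jk = (xk @ yj - xk @ other - xk @ (Xj @ mu0)) / denom   # Eq. (28)
    return mu_jk, s2_jk
```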

Since $\gamma_{jk}$ follows a Bernoulli distribution, we define $\alpha_{jk} = q{(\gamma_{jk} = 1)}$ and obtain the joint posterior distribution of $\beta_{jk}$ and $\gamma_{jk}$: \begin{align}q(\beta_{jk}, \gamma_{jk}) = \big[\alpha_{jk}\mathcal{N}(\mu_{jk}, s_{jk}^2)\big]^{\gamma_{jk}}\left[(1-\alpha_{jk})\mathcal{N}(0, \sigma_{{\boldsymbol \beta}_j}^2)\right]^{1-\gamma_{jk}}. \tag{29} \end{align} From the above, we have the following results: \begin{align}&\mathbb{E}{[\gamma_{jk}\beta_{jk}]} = \mathbb{E}_{\gamma_{jk}}{[\mathbb{E}_{\beta_{jk}}{[\gamma_{jk}\beta_{jk}|\gamma_{jk}]}]} = \alpha_{jk}\mu_{jk}+(1-\alpha_{jk})\times 0 = \alpha_{jk}\mu_{jk}, \tag{30} \\ &\mathbb{E}{[(\gamma_{jk}\beta_{jk})^2]} = \mathbb{D}_{\gamma_{jk}, \beta_{jk}}{[\gamma_{jk}\beta_{jk}]}+\mathbb{E}_{\gamma_{jk}, \beta_{jk}}^2{[\gamma_{jk}\beta_{jk}]} \\ & =\mathbb{D}_{\gamma_{jk}}{\big[\mathbb{E}_{\beta_{jk}}{[\gamma_{jk}\beta_{jk}|\gamma_{jk}]}\big]}+\mathbb{E}_{\gamma_{jk}}{\big[\mathbb{D}_{\beta_{jk}}{[\gamma_{jk}\beta_{jk}|\gamma_{jk}]}\big]}+\mathbb{E}_{\gamma_{jk}}^2{[\gamma_{jk}\beta_{jk}]} \\ & = (\alpha_{jk}-\alpha_{jk}^2)\mu_{jk}^2+\alpha_{jk}s_{jk}^2+\alpha_{jk}^2\mu_{jk}^2 \\ & = \alpha_{jk}(\mu_{jk}^2+s_{jk}^2), \tag{31} \\ &\mathbb{E}{[\beta_{jk}^2]} = \mathbb{E}_{\gamma_{jk}}{\big[\mathbb{E}_{\beta_{jk}}{[\beta_{jk}^2|\gamma_{jk}]}\big]} = \alpha_{jk}(\mu_{jk}^2+s_{jk}^2)+(1-\alpha_{jk})\sigma_{{\boldsymbol \beta}_j}^2, \tag{32} \\ &\mathbb{E}{[\gamma_{jk}]} = \alpha_{jk}. \tag{33} \end{align} Next, taking the expectation of $\log {\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X};\boldsymbol{\theta})$ with respect to $q(\beta_{jk}, \gamma_{jk})$ $(k = 1, \ldots, p;\ j = 1, \ldots, J)$ and $q({\boldsymbol \beta}_0)$ gives the lower bound $L(q)$ of the marginal distribution: \begin{align}&\mathbb{E}_q{[\log {\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X};\boldsymbol{\theta})]} -\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]} \\ & = \sum_{j}\left(-\frac{n_j}{2}\log(2\pi\sigma_j^2)-\frac{{\boldsymbol y}_j^{\rm T}{\boldsymbol y}_j}{2\sigma_j^2}+\frac{\sum_{k = 1}{(\mathbb{E}_{q}{[\beta_{0k}]}+\mathbb{E}_q{[\gamma_{jk}\beta_{jk}]})}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j}{\sigma_j^2}\right. 
\\ & -\frac{\sum_{k = 1}\sum_{k'\not = k}\mathbb{E}_{q}{[(\beta_{0k}+\gamma_{jk}\beta_{jk})(\beta_{0k'}+\gamma_{jk'}\beta_{jk'})]}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk'}}{2\sigma_j^2} \\ & -\frac{\sum_{k = 1}{\mathbb{E}_q{[(\beta_{0k}+\gamma_{jk}\beta_{jk})^2]}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}}}{2\sigma_j^2} -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_j}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\sum_{k = 1}{\mathbb{E}_q{[\beta_{jk}^2}]} \\ & +\log(\pi_j)\sum_{k = 1}\mathbb{E}_q{[\gamma_{jk}]}+\log(1-\pi_j)\sum_{k = 1}{(1-\mathbb{E}_q{[\gamma_{jk}]})}\Bigg) \\ & -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_0}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}\mathbb{E}_q{[{\boldsymbol \beta}_0^{\rm T}{\boldsymbol \beta}_0]}-\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]}. \tag{34} \end{align} Using Eq. (20), we can directly obtain the expectation of the terms involving ${\boldsymbol \beta}_0$: \begin{align}&\mathbb{E}_q{\left[{\boldsymbol \beta}_0^{\rm T}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right){\boldsymbol \beta}_0+\sum_{j}-\frac{1}{\sigma_j^2}\left(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\right){\boldsymbol \beta}_0\right]}+{\rm const} \\ &= {\boldsymbol \mu}_0^{\rm T}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right){\boldsymbol \mu}_0+{\rm Tr}\left({\boldsymbol S}_0^2 \left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)\right) \\ & +\sum_{j}-\frac{1}{\sigma_j^2}\left(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\right)\mathbb{E}_q{[{\boldsymbol \beta}_0]}+{\rm const}, \tag{35} \end{align} so the lower bound $L(q)$ is \begin{align*}&\mathbb{E}_q{[\log {\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X};\boldsymbol{\theta})]}-\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]} \\ &= \sum_{j}\left(-\frac{n_j}{2}\log(2\pi\sigma_j^2)-\frac{{\boldsymbol y}_j^{\rm T}{\boldsymbol y}_j}{2\sigma_j^2} +\frac{\sum_{k = 1}{(\mathbb{E}_q{[\gamma_{jk}\beta_{jk}]})}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol y}_j}{\sigma_j^2} \right. 
-\frac{\sum_{k = 1}\sum_{k'\not = k}\mathbb{E}_{q}{[(\gamma_{jk}\beta_{jk})]}\mathbb{E}_q{[(\gamma_{jk'}\beta_{jk'})]}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk'}}{2\sigma_j^2} \\ & -\frac{\sum_{k = 1}{\mathbb{E}_q{[(\gamma_{jk}\beta_{jk})^2]}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}}}{2\sigma_j^2} -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_j}^2)-\frac{1}{2\sigma_{{\boldsymbol \beta}_j}^2}\sum_{k = 1}{\mathbb{E}_q{[\beta_{jk}^2}]} +\log(\pi_j)\sum_{k = 1}\mathbb{E}_q{[\gamma_{jk}]} \\ & +\log(1-\pi_j)\sum_{k = 1}{(1-\mathbb{E}_q{[\gamma_{jk}]})}\Bigg) +{\boldsymbol \mu}_0^{\rm T}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right){\boldsymbol \mu}_0+{\rm Tr}\left({\boldsymbol S}_0^2\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)\right) \\ & +\sum_{j}-\frac{1}{\sigma_j^2}\big(\mathbb{E}_{{\boldsymbol \beta}_j, {\boldsymbol \gamma}_j}{[({\boldsymbol \gamma}_j\odot{\boldsymbol \beta}_j)^{\rm T}]}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big)\mathbb{E}_q{[{\boldsymbol \beta}_0]} -\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_0}^2)-\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]}. \end{align*} Since \begin{align}\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]} = & \mathbb{E}{[\log{q{({\boldsymbol \beta}_0)}]}+\sum_{j = 1}{\sum_{k = 1}{\mathbb{E}{[\log{q{(\beta_{jk}, \gamma_{jk}})}]}}}} \\ = & \mathbb{E}{[\log{q{({\boldsymbol \beta}_0)}}]}+\sum_{j = 1}{\sum_{k = 1}{\mathbb{E}_{\gamma_{jk}, \beta_{jk}}{\big[\log{[\alpha_{jk}\mathcal{N}(\mu_{jk}, s_{jk}^2)]^{\gamma_{jk}}[(1-\alpha_{jk})\mathcal{N}(0, \sigma_{{\boldsymbol \beta}_j}^2)]^{1-\gamma_{jk}}}\big]}}} \\ = & \mathbb{E}{[\log{\mathcal N}({\boldsymbol \mu_0}, {\boldsymbol S}_0^2)]}+\sum_{j = 1}{\sum_{k = 1}{\big(\mathbb{E}_{q}{[\gamma_{jk}]}\log{\alpha_{jk}}+(1-\mathbb{E}_q{[\gamma_{jk}]})\log{(1-\alpha_{jk})}}} \\ &+\alpha_{jk}\mathbb{E}_{\beta_{jk}|\gamma_{jk} = 1}{[\log{\mathcal{N}(\mu_{jk}, s_{jk}^2)}]}+(1-\alpha_{jk})\mathbb{E}_{\beta_{jk}|\gamma_{jk} = 0}{[\log{\mathcal{N}(0, \sigma_{{\boldsymbol \beta}_j}^2)}]}\big), \tag{36} \end{align} the entropy of the normal distribution gives \begin{align}\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]} = &-\frac{1}{2}\log |{\boldsymbol S}_0^2|-\frac{p}{2}(1+\log 2\pi) \\ &+\sum_{j}{\sum_{k}{\big(\alpha_{jk}\log \alpha_{jk}+(1-\alpha_{jk})\log (1-\alpha_{jk})\big)}} \\ &-\sum_j{\sum_k{\frac{1}{2}\alpha_{jk}\big(\log s_{jk}^2-\log \sigma_{{\boldsymbol \beta}_j}^2\big)}}-\sum_j{\frac{p}{2}\log \sigma_{{\boldsymbol \beta}_j}^2}-\sum_j{\frac{p}{2}(1+\log 2\pi)}. 
\tag{37} \end{align} Substituting this into $L(q)$ and plugging in Eqs. (30)-(33), after some rearrangement we obtain \begin{align*}L(q) = & \mathbb{E}_q{[\log {\rm Pr}({\boldsymbol y}, {\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma}|{\boldsymbol X};\boldsymbol{\theta})]}-\mathbb{E}_q{[\log{q({\boldsymbol \beta}_0, {\boldsymbol \beta}, {\boldsymbol \gamma})}]} \\ = &\sum_{j}\left(-\frac{n_j}{2}\log(2\pi\sigma_j^2)-\frac{\left\lVert {\boldsymbol y}_j-\sum_{k}\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}\right\rVert^2}{2\sigma_{j}^2}-\frac{1}{2\sigma_j^2}\sum_{k}\big[\alpha_{jk}(s_{jk}^2+\mu_{jk}^2)-(\alpha_{jk}\mu_{jk})^2\big]{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}\right) \\ &+{\boldsymbol \mu}_0^{\rm T}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right){\boldsymbol \mu}_0+{\rm Tr}\left({\boldsymbol S}_0^2 \left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)\right) \\ &+\sum_{j}-\frac{1}{\sigma_j^2}\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0 \\ &-\frac{p}{2}\log(2\pi\sigma_{{\boldsymbol \beta}_0}^2)+\frac{1}{2}\log |{\boldsymbol S}_0^2|+\sum_{j}{\sum_{k}{\left(\alpha_{jk}\log \left(\frac{\pi_j}{\alpha_{jk}}\right)+(1-\alpha_{jk})\log\left(\frac{1-\pi_j}{1-\alpha_{jk}}\right)\right)}} \\ &+\sum_j{\sum_k{\frac{1}{2}\alpha_{jk}\left(1+\log \frac{s_{jk}^2}{ \sigma_{{\boldsymbol \beta}_j}^2}-\frac{\mu_{jk}^2+s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}\right)}}-\frac{Jp}{2}\log 2\pi+\frac{Jp+p}{2}(1+\log 2\pi)-\frac{Jp}{2}\nonumber. \end{align*} For $\alpha_{jk}$, setting \begin{equation}\begin{split} \frac{\partial L(q)}{\partial{\alpha_{jk}}} = 0, \end{split} \tag{38}\end{equation} we obtain \begin{align}\frac{\partial L(q)}{\partial{\alpha_{jk}}} = &\frac{({\boldsymbol y}_j-\sum_k{\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}})^{\rm T}\mu_{jk}{\boldsymbol x}_{jk}}{\sigma_j^2}+\frac{(\alpha_{jk}\mu_{jk}^2){\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}}{\sigma_j^2}-\frac{(s_{jk}^2+\mu_{jk}^2){\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}}{2\sigma_j^2} \\ &-\frac{1}{\sigma_j^2}\mu_{jk}{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j{\boldsymbol \mu}_0+\log \bigg(\frac{\pi_j}{1-\pi_j}\bigg)+\log \bigg(\frac{1-\alpha_{jk}}{\alpha_{jk}}\bigg) \\ &+\frac{1}{2}\left(1+\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}-\frac{\mu_{jk}^2+s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}\right) = 0. \tag{39} \end{align} From Eqs. (27) and (28), we have \begin{align}&{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2} = \frac{\sigma_j^2}{s_{jk}^2}, \tag{40} \\ &\mu_{jk}\left({\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}\right)+{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j{\boldsymbol \mu}_0 = \left({\boldsymbol y}_j-\sum_{l\not = k}\alpha_{jl}\mu_{jl}{\boldsymbol x}_{jl}\right)^{\rm T}{\boldsymbol x}_{jk}, \tag{41} \end{align} so that \begin{align}\frac{\partial L(q)}{\partial{\alpha_{jk}}} = &\frac{(\mu_{jk}^2-s_{jk}^2){\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}}{2\sigma_j^2}+\log \left(\frac{\pi_j}{1-\pi_j}\right) +\log \left(\frac{1-\alpha_{jk}}{\alpha_{jk}}\right)+\frac{1}{2}\left(1+\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}\right)+\frac{\mu_{jk}^2-s_{jk}^2}{2\sigma_{{\boldsymbol \beta}_j}^2} \\ = &\frac{\mu_{jk}^2}{2s_{jk}^2}+\log 
\left(\frac{\pi_j}{1-\pi_j}\right)+\log \left(\frac{1-\alpha_{jk}}{\alpha_{jk}}\right)+\frac{1}{2}\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2} = 0, \tag{42} \end{align} which gives \begin{equation}\alpha_{jk} = \frac{1}{1+\exp(-u_{jk})}, \tag{43}\end{equation} where \begin{equation}u_{jk} = \frac{\mu_{jk}^2}{2s_{jk}^2}+\log \bigg(\frac{\pi_j}{1-\pi_j}\bigg)+\frac{1}{2}\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}. \tag{44}\end{equation}
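The inclusion probability in Eqs. (43) and (44) is simply a logistic function of $u_{jk}$. A one-line sketch follows; the helper name update_alpha is an illustrative assumption.

```python
import numpy as np

def update_alpha(mu_jk, s2_jk, pi_j, sigma2_beta_j):
    """Posterior inclusion probability alpha_jk = q(gamma_jk = 1), Eqs. (43)-(44)."""
    u_jk = (mu_jk**2 / (2.0 * s2_jk)
            + np.log(pi_j / (1.0 - pi_j))
            + 0.5 * np.log(s2_jk / sigma2_beta_j))
    return 1.0 / (1.0 + np.exp(-u_jk))
```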

M-step

Next, we derive the update equations for $\sigma_j^2$, $\sigma_{{\boldsymbol \beta}_j}^2$, and $\sigma_{{\boldsymbol \beta}_0}^2$.

For $\sigma_{j}^2$, \begin{align}\frac{\partial{L(q)}}{\partial{\sigma_j^2}} = &-\frac{n_j}{2\sigma_j^2}+\frac{1}{2\sigma_j^4}\left({\boldsymbol y}_j-\sum_{l}\alpha_{jl}\mu_{jl}{\boldsymbol x}_{jl}\right)^{\rm T}\left({\boldsymbol y}_j-\sum_{l}\alpha_{jl}\mu_{jl}{\boldsymbol x}_{jl}\right) +\frac{1}{2\sigma_j^4}\sum_{l}\big[\alpha_{jl}(s_{jl}^2+\mu_{jl}^2)-(\alpha_{jl}\mu_{jl})^2\big]{\boldsymbol x}_{jl}^{\rm T}{\boldsymbol x}_{jl} \\ &+\frac{1}{2\sigma_j^4}{\boldsymbol \mu}_0^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \mu}_0+\frac{1}{2\sigma_j^4}{\rm Tr}\big({\boldsymbol S}_0^2({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)\big)+\frac{1}{\sigma_j^4}\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0 = 0, \tag{45} \end{align} which gives \begin{align}\sigma_j^2 = &\frac{1}{n_j}\left(\left({\boldsymbol y}_j-\sum_{l}\alpha_{jl}\mu_{jl}{\boldsymbol x}_{jl}\right)^{\rm T}\left({\boldsymbol y}_j-\sum_{l}\alpha_{jl}\mu_{jl}{\boldsymbol x}_{jl}\right) +\sum_{l}[\alpha_{jl}(s_{jl}^2+\mu_{jl}^2)-(\alpha_{jl}\mu_{jl})^2]{\boldsymbol x}_{jl}^{\rm T}{\boldsymbol x}_{jl} \right. \\ &+{\boldsymbol \mu}_0^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \mu}_0+{\rm Tr}\big({\boldsymbol S}_0^2({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)\big)+2\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0\Bigg). \tag{46} \end{align} For $\sigma_{{\boldsymbol \beta}_j}^2$, setting \begin{equation}\begin{split} \frac{\partial L(q)}{\partial\sigma_{{\boldsymbol \beta}_j}^2} = 0, \end{split} \tag{47}\end{equation} we obtain \begin{equation}\begin{split} \sigma_{{\boldsymbol \beta}_j}^2 = \frac{\sum_l\alpha_{jl}(\mu_{jl}^2+s_{jl}^2)}{\sum_l\alpha_{jl}}. \end{split} \tag{48}\end{equation}

For $\sigma_{{\boldsymbol \beta}_0}^2$, setting \begin{equation}\begin{split} \frac{\partial L(q)}{\partial\sigma_{{\boldsymbol \beta}_0}^2} = \frac{1}{2\sigma_{{\boldsymbol \beta}_0}^4}\big({\boldsymbol \mu}_0^{\rm T}{\boldsymbol \mu}_0+{\rm Tr}({\boldsymbol S}_0^2)\big)-\frac{p}{2\sigma_{{\boldsymbol \beta}_0}^2} = 0, \end{split} \tag{49}\end{equation} we obtain \begin{equation}\begin{split} \sigma_{{\boldsymbol \beta}_0}^2 = \frac{1}{p}\big({\boldsymbol \mu}_0^{\rm T}{\boldsymbol \mu}_0+{\rm Tr}({\boldsymbol S}_0^2)\big). \end{split} \tag{50}\end{equation}

Similarly, we finally obtain \begin{equation}\begin{split} \pi_j = \frac{1}{p}\sum_{l}\alpha_{jl}. \end{split} \tag{51}\end{equation}
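The M-step updates in Eqs. (46), (48), and (51) are per task, while Eq. (50) is shared across tasks. A minimal sketch follows; m_step is a hypothetical helper, s0 holds the diagonal of ${\boldsymbol S}_0^2$, and the small guards on the denominators and on $\pi_j$ are numerical-stability assumptions rather than part of the derivation. The shared-variance update of Eq. (50) would be computed once as `sigma2_beta0 = (mu0 @ mu0 + s0.sum()) / p`.

```python
import numpy as np

def m_step(Xj, yj, alpha_j, mu_j, s2_j, mu0, s0):
    """Per-task M-step: sigma_j^2 (Eq. (46)), sigma_beta_j^2 (Eq. (48)), pi_j (Eq. (51))."""
    n_j, p = Xj.shape
    col_sq = np.einsum('ik,ik->k', Xj, Xj)                  # x_jk^T x_jk for all k
    resid = yj - Xj @ (alpha_j * mu_j)
    Xmu0 = Xj @ mu0
    sigma2_j = (resid @ resid
                + np.sum((alpha_j * (s2_j + mu_j**2) - (alpha_j * mu_j)**2) * col_sq)
                + mu0 @ (Xj.T @ Xmu0) + np.sum(s0 * col_sq)
                + 2.0 * ((alpha_j * mu_j) @ (Xj.T @ Xmu0) - yj @ Xmu0)) / n_j
    sigma2_beta_j = np.sum(alpha_j * (mu_j**2 + s2_j)) / (np.sum(alpha_j) + 1e-12)
    pi_j = float(np.clip(alpha_j.mean(), 1e-6, 1.0 - 1e-6))  # guard against log(0) later
    return sigma2_j, sigma2_beta_j, pi_j
```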

Algorithm details

We now give the details of the variational EM algorithm, which consists of the following iterative procedure:

Initialize ${\boldsymbol \mu}_0$, ${\boldsymbol S}_0^2$, $\mu_{jk}, s_{jk}^2$, $\alpha_{jk}$, $\sigma_j^2$, $\sigma_{{\boldsymbol \beta}_j}^2$, $\sigma_{{\boldsymbol \beta}_0}^2$, $\pi_j$, where $j = 1, \ldots, J$, $k = 1, \ldots, p$. Let $\tilde{\boldsymbol y}_j = \sum_{k}\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}$, $\tilde{\boldsymbol y}_{0j} = \sum_k\mu_{0k}{\boldsymbol x}_{jk}$.

E-step \begin{align}{\boldsymbol S}_0^2 = &-\frac{1}{2}{\rm diag}\left(\sum_{j}{-\frac{1}{2\sigma_j^2}{\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j}-\frac{1}{2\sigma_{{\boldsymbol \beta}_0}^2}{\boldsymbol I}\right)^{-1}, \tag{52} \end{align} For all $j$ and $k$: \begin{align}&\tilde{{\boldsymbol y}}_{0jk} = \tilde{\boldsymbol y}_{0j}-\mu_{0k}{\boldsymbol x}_{jk}, \tag{53} \\ &\mu_{0k} = {\boldsymbol S}_0^2(k, k)\sum_{j}-\frac{1}{\sigma_j^2}\big(\tilde{\boldsymbol y}_{0jk}+{\boldsymbol X}_j({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)-{\boldsymbol y}_j\big)^{\rm T}{\boldsymbol x}_{jk}, \tag{54} \\ &\tilde{{\boldsymbol y}}_{0j} = \tilde{\boldsymbol y}_{0jk}+\mu_{0k}{\boldsymbol x}_{jk}. \tag{55} \end{align} Then, for all $j$ and $k$: \begin{align}&\tilde{{\boldsymbol y}}_{jk} = \tilde{\boldsymbol y}_j-\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}, \tag{56} \\ &s_{jk}^2 = \frac{\sigma_{j}^2}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_j^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{57} \\ &\mu_{jk} = \frac{{\boldsymbol x}_{jk}^{\rm T}({\boldsymbol y}_j-\tilde{\boldsymbol y}_{jk})-{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol X}_j{\boldsymbol \mu}_0}{{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk}+\frac{\sigma_{j}^2}{\sigma_{{\boldsymbol \beta}_j}^2}}, \tag{58} \\ &\alpha_{jk} = \frac{1}{1+\exp(-u_{jk})}, \tag{59} \end{align} where \begin{align}& u_{jk} = \frac{\mu_{jk}^2}{2s_{jk}^2}+\log \bigg(\frac{\pi_j}{1-\pi_j}\bigg)+\frac{1}{2}\log \frac{s_{jk}^2}{\sigma_{{\boldsymbol \beta}_j}^2}, \tag{60} \\ &\tilde{{\boldsymbol y}}_{j} = \tilde{\boldsymbol y}_{jk}+\alpha_{jk}\mu_{jk}{\boldsymbol x}_{jk}. \tag{61} \end{align}

M-step \begin{align}&\sigma_j^2 = \frac{1}{n_j}\left(({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j)^{\rm T}({\boldsymbol y}_j-\tilde{{\boldsymbol y}}_j) +\sum_{k = 1}^p[\alpha_{jk}(s_{jk}^2+\mu_{jk}^2)-(\alpha_{jk}\mu_{jk})^2]{\boldsymbol x}_{jk}^{\rm T}{\boldsymbol x}_{jk} \right. \\ & +{\boldsymbol \mu}_0^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j){\boldsymbol \mu}_0+{\rm Tr}\big({\boldsymbol S}_0^2({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)\big) +2\big(({\boldsymbol \alpha}_j\odot{\boldsymbol \mu}_j)^{\rm T}({\boldsymbol X}_j^{\rm T}{\boldsymbol X}_j)-{\boldsymbol y}_j^{\rm T}{\boldsymbol X}_j\big){\boldsymbol \mu}_0\Bigg), \tag{62} \\ &\sigma_{{\boldsymbol \beta}_j}^2 = \frac{\sum_k\alpha_{jk}(\mu_{jk}^2+s_{jk}^2)}{\sum_k\alpha_{jk}}, \tag{63} \\ &\sigma_{{\boldsymbol \beta}_0}^2 = \frac{1}{p}\big({\boldsymbol \mu}_0^{\rm T}{\boldsymbol \mu}_0+{\rm Tr}({\boldsymbol S}_0^2)\big), \tag{64} \\ &\pi_j = \frac{1}{p}\sum_{k}\alpha_{jk}. \tag{65} \end{align}

Repeat the E-step and M-step above until the change in $L(q)$ falls within a given tolerance, e.g., $1 \times 10^{-6}$.
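Putting the pieces together, the outer loop below sketches how the E-step and M-step alternate until convergence. It assumes the helper sketches above (update_beta0, update_beta_jk, update_alpha, m_step) are in scope; the initialization values and the cheap residual-based convergence surrogate are illustrative assumptions, whereas the paper monitors the lower bound $L(q)$ itself.

```python
import numpy as np

def mss_fit(X, y, max_iter=500, tol=1e-6):
    """Variational EM loop for MSS (a sketch of the iterations in Eqs. (52)-(65))."""
    J, p = len(X), X[0].shape[1]
    sigma2 = np.array([yj.var() for yj in y])       # noise variances sigma_j^2
    sigma2_beta = np.ones(J)                        # slab variances sigma_beta_j^2
    sigma2_beta0 = 1.0                              # shared-effect variance sigma_beta_0^2
    pi = np.full(J, 0.1)                            # inclusion rates pi_j
    mu0, s0 = np.zeros(p), np.ones(p)
    mu = [np.zeros(p) for _ in range(J)]
    s2 = [np.ones(p) for _ in range(J)]
    alpha = [np.full(p, 0.1) for _ in range(J)]
    prev = -np.inf
    for _ in range(max_iter):
        # E-step: shared effects first, then each task's spike-and-slab coordinates
        mu0, s0 = update_beta0(X, y, sigma2, sigma2_beta0, alpha, mu, mu0)
        for j in range(J):
            for k in range(p):
                mu[j][k], s2[j][k] = update_beta_jk(X[j], y[j], k, sigma2[j],
                                                    sigma2_beta[j], alpha[j], mu[j], mu0)
                alpha[j][k] = update_alpha(mu[j][k], s2[j][k], pi[j], sigma2_beta[j])
        # M-step: per-task variances and inclusion rates, then the shared variance
        for j in range(J):
            sigma2[j], sigma2_beta[j], pi[j] = m_step(X[j], y[j], alpha[j], mu[j],
                                                      s2[j], mu0, s0)
        sigma2_beta0 = (mu0 @ mu0 + s0.sum()) / p   # Eq. (64)
        # stop when a cheap surrogate of L(q) changes by less than tol (e.g. 1e-6)
        obj = -0.5 * sum(len(y[j]) * np.log(sigma2[j]) for j in range(J))
        if abs(obj - prev) < tol:
            break
        prev = obj
    return mu0, mu, alpha
```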


  • Figure 1

Graphical model representation of the joint distribution in Eq. (7). Here $\tilde{{\boldsymbol x}}_{ji}$ is the $i$-th row of the design matrix ${\boldsymbol X}_j$ and $y_{ji}$ is the corresponding response variable. In this graphical model, we introduce a node for each variable: latent variables are denoted by open circles and observed variables by shaded circles, while the others are deterministic parameters or constants. Links express probabilistic relationships between the variables. A plate labeled with a number represents that many nodes of the given kind

  • Figure 2

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$, using the data sets from each task separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 3

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0$, using the data set pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 4

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$, using the data sets from each task separately. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 5

(Color online) Comparison of MSS, RSS, ridge regression, and the Lasso with $\rho = 0.5$, using the data set pooled from all three tasks. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 6

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 7

(Color online) Comparison of MSS, the dirty model, and the data shared Lasso with $\rho = 0.5$. (a) MSE of coefficient estimation; (b) AUC of variable selection for each algorithm

  • Figure 8

    (Color online) The comparison of different models' estimation of effect size

  • Figure 9

(Color online) The comparison of different models' estimation of effect size when the shared random effect $\boldsymbol{\beta}_0 = \mathbf{0}$

  • Figure 10

(Color online) The comparison of different models' estimation of effect size when the specific sparse effects $\boldsymbol{\beta}_j = \mathbf{0}$ $(j = 1, \ldots, J)$

  • Figure 11

(Color online) Computing time (CPU seconds) of MSS with respect to different numbers of samples per task, numbers of features, and numbers of tasks

  • Figure 12

(Color online) Word cloud of keywords generated by MSS. Words in red and green represent positive and negative effects, respectively; the size of each word represents the magnitude of its effect

  • Figure 13

(Color online) Convergence behavior of MSS

  • Table 1   Comparison of computing time (s)
                 MSS    Dirty model    DSL
    $p = 500$     10        334         26
    $p = 1000$    18        210         40
    $p = 2000$    33        682         55
  • Table 2   Mean squared error of test results
    All Drama Comedy Horror
    MSS 5.50 5.54 5.74 4.99
    Spike-Slab 5.54 5.57 5.78 5.05
    Ridge 5.77 5.72 6.24 5.13
    Lasso 5.55 5.63 5.77 4.95
