Expectation Maximization

Assume we have some observed data $x=(x_1,\ldots,x_n)$. Assume that this data is generated by a probability distribution $p_{\theta}(x,z)$ where the model parameter $\theta$ is unknown and the $z$ part of the data is unobserved. Typically, $p_{\theta}(x,z)$ is assumed to be an exponential family. Our goal is to estimate $\theta_{MLE}$ and (later) $\theta_{MAP}$.
Recall that
\begin{eqnarray}
\theta_{MLE} = \mathrm{argmax}_{\theta} p_{\theta}(x)
\end{eqnarray}
Note that $p_{\theta}(x)$ is just the marginal distribution of $p_{\theta}(x,z)$ over $z$.
\begin{eqnarray}
p_{\theta}(x)=\sum_z p_{\theta}(x,z)
\end{eqnarray}
As we do not observe the $z$ part of the data, we cannot maximize the complete-data likelihood over $x$ and $z$; hence we use the marginal probability of $x$. But this marginal distribution is generally very hard to compute, and even more difficult to maximize. We will see one such example when we consider Gaussian mixtures.
So, rather than directly computing the marginal density and maximizing it, we iterate:

  1. Choose an initial $\theta_0$
  2. E-Step: Compute the quantity
    \begin{eqnarray}
    Q(\theta,\theta_t)= E_{\theta_t}(\log p_{\theta}(X,Z) | X=x )
    \end{eqnarray}
    where $x$ is the observed value. $\theta_t$ is known, $\theta$ is undetermined.
  3. M-Step:
    \begin{eqnarray}
    \theta_{t+1}=\mathrm{argmax}_{\theta} Q(\theta,\theta_t)
    \end{eqnarray}
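To make the loop concrete, here is a minimal numerical sketch of the iteration. It assumes a toy model that is not used elsewhere in these notes: a mixture of two one-dimensional Gaussians with known means and variances, where the only unknown parameter $\theta$ is the mixing weight $\alpha$. In that case the E-step reduces to computing posterior membership probabilities and the M-step to averaging them.

```python
import numpy as np

# Toy model (an illustrative assumption): two 1-D Gaussians with KNOWN means and
# variances; the only unknown parameter theta is the mixing weight alpha = P(Z = 1).
MU, SIGMA = np.array([-2.0, 3.0]), np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, alpha):
    # Posterior responsibilities p_{theta_t}(Z = 1 | X = x); they determine Q(theta, theta_t).
    p1 = alpha * gauss(x, MU[1], SIGMA[1])
    p0 = (1.0 - alpha) * gauss(x, MU[0], SIGMA[0])
    return p1 / (p0 + p1)

def m_step(gamma):
    # argmax_theta Q(theta, theta_t) is available in closed form here: the mean responsibility.
    return gamma.mean()

# Simulate data from the toy model, then run the EM iteration.
rng = np.random.default_rng(0)
z = (rng.random(2000) < 0.3).astype(int)   # true mixing weight 0.3 (hidden)
x = rng.normal(MU[z], SIGMA[z])            # observed data

alpha = 0.5                                # step 1: initial theta_0
for t in range(50):
    gamma = e_step(x, alpha)               # step 2: E-step
    alpha = m_step(gamma)                  # step 3: M-step
print(alpha)                               # converges near the true value 0.3
```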

EM in Exponential Families

Remember that our goal is an MLE or MAP estimate. Let us concentrate on MLE. An MLE estimate can be found as
\begin{eqnarray}
\max_{\theta}p_{\theta}(x) = \max_{\theta} \sum_z p_{\theta}(x,z)
\end{eqnarray}
We will concentrate our attention on the exponential family of distributions:
\begin{eqnarray}
p_{\theta}(x,z)=e^{\theta^T s(x,z)} \frac{h(x,z)}{C(\theta)} = e^{\left[\theta^T s(x,z)-\log C(\theta)\right]} h(x,z)
\end{eqnarray}
To maximize, we take the derivative and equate it to zero:
\begin{eqnarray}
\frac{\partial}{\partial \theta_i} \log p_{\theta}(x)&=&\frac{1}{p_{\theta}(x)} \frac{\partial }{\partial \theta_i}p_{\theta}(x) \\
&=& \frac{1}{p_{\theta}(x)} \frac{\partial}{\partial \theta_i} \sum_z p_{\theta}(x,z)\\
&=& \frac{1}{p_{\theta}(x)} \sum_z \frac{\partial}{\partial \theta_i} p_{\theta}(x,z)\\
&=& \frac{1}{p_{\theta}(x)} \sum_z \left[s_i(x,z)-\frac{\partial}{\partial \theta_i}\log C(\theta)\right]p_{\theta}(x,z) \nonumber \\
\end{eqnarray}
One of the many beautiful things about exponential families is that
\begin{eqnarray}
\frac{\partial}{\partial \theta_i}\log C(\theta)= E_{\theta}(s_i(X,Z))
\end{eqnarray}
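For completeness, here is a one-line derivation of this identity (written with a sum over $x$ and $z$; for continuous $x$ the sum becomes an integral). Since $C(\theta)=\sum_{x,z} e^{\theta^T s(x,z)} h(x,z)$,
\begin{eqnarray}
\frac{\partial}{\partial \theta_i}\log C(\theta)
= \frac{1}{C(\theta)}\sum_{x,z} s_i(x,z)\, e^{\theta^T s(x,z)} h(x,z)
= \sum_{x,z} s_i(x,z)\, p_{\theta}(x,z)
= E_{\theta}(s_i(X,Z))
\end{eqnarray}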
Substituting this back, the derivative becomes
\begin{eqnarray}
&=& \frac{1}{p_{\theta}(x)} \sum_z \left[s_i(x,z)-E_{\theta}(s_i(X,Z))\right]p_{\theta}(x,z) \nonumber \\
&=& \sum_z s_i(x,z)p_{\theta}(z|x)-\sum_z E_{\theta}(s_i(X,Z))p_{\theta}(z|x) \nonumber \\
\end{eqnarray}
Note that
\begin{eqnarray}
\sum_z s_i(x,z)p_{\theta}(z|x)= E_{\theta}[s_i(X,Z)|X=x]
\end{eqnarray}
$E_{\theta}(s_i(X,Z))$ is a constant and hence can be taken out of the sum.
\begin{eqnarray}
\sum_z E_{\theta}(s_i(X,Z))p_{\theta}(z|x) = E_{\theta}(s_i(X,Z)) \sum_zp_{\theta}(z|x) = E_{\theta}(s_i(X,Z)) \nonumber \\
\end{eqnarray}
Setting this derivative to zero, the equation above becomes
\[\boxed{
E_{\theta}[s_i(X,Z)|X=x] = E_{\theta}(s_i(X,Z)) }
\]
Hence we have to find a $\theta$ for which the conditional expectation on the left becomes equal to the real expectation on the right.
But the LHS of this expression in particular
becomes very complicated after conditioning, and it is very hard to solve for $\theta$ directly. Hence we resort to iteration:
\begin{eqnarray}
E_{\theta_t}[s_i(X,Z)|X=x] = E_{\theta_{t+1}}(s_i(X,Z))
\end{eqnarray}
This does not look like the EM algorithm at this point, but it is in fact equivalent to EM. To see this, let us
write down the E-step (i.e., $Q(\theta,\theta_0)$) for the exponential family:
\begin{eqnarray}
Q(\theta,\theta_0) &=& E_{\theta_0}(\log p_{\theta}(X,Z)|X=x)\\
&=& E_{\theta_0} \left[ \log \left( e^{\theta^T s(X,Z)} \frac{h(X,Z)}{C(\theta)} \right) \Big| X=x \right] \\
&=& E_{\theta_0} \left[ \theta^T s(X,Z)+ \log h(X,Z) -\log C(\theta) | X=x \right] \\
&=& E_{\theta_0}[\theta^T s(X,Z) |X=x]+E_{\theta_0}[\log h(X,Z)| X=x] -\log C(\theta) \nonumber
\end{eqnarray}
Then, for M-step
\begin{eqnarray}
\frac{\partial}{\partial \theta_i}Q(\theta,\theta_0)=E_{\theta_0}[s_i(X,Z) |X=x]-\frac{\partial}{\partial \theta_i}\log C(\theta)=0
\end{eqnarray}
Recall the important property of exponential families:
\begin{eqnarray}
\frac{\partial}{\partial \theta_i}\log C(\theta)= E_{\theta}(s_i(X,Z))
\end{eqnarray}
Substituting this, we get exactly the same iterative formula as the one above:
\begin{eqnarray}
E_{\theta_0}[s_i(X,Z) |X=x]=E_{\theta}(s_i(X,Z))
\end{eqnarray}

Gaussian Mixtures

A Gaussian mixture is a convex combination of Gaussians.
Or, expressed differently, a Gaussian mixture combines two types of distributions:

  • An m-dimensional discrete random variable $Z$
    \begin{eqnarray}
    Z =\left[ \begin{array}{c}
    z_1\\
    z_2\\
    \vdots\\
    z_{m-1}\\
    z_m\\
    \end{array}\right]
    \end{eqnarray}
    whose sample space consists of the $m$ $m$-dimensional unit vectors
    \begin{eqnarray}
    e_1=\left[ \begin{array}{c}
    1\\
    0\\
    \vdots\\
    0\\
    0\\
    \end{array}\right]
    \quad,\quad e_2=\left[
    \begin{array}{c}
    0\\
    1\\
    \vdots\\
    0\\
    0\\
    \end{array}
    \right] \quad, \ldots,\quad e_{m-1}=\left[
    \begin{array}{c}
    0\\
    0\\
    \vdots\\
    1\\
    0\\
    \end{array}
    \right] \quad,\quad e_m= \left[
    \begin{array}{c}
    0\\
    0\\
    \vdots\\
    0\\
    1\\
    \end{array}
    \right] \nonumber
    \end{eqnarray}
    with pmf $P(Z=e_j)=\alpha_j$. Note that only one of the $z_j$’s is 1; the rest are zero.
  • $m$ $d$-dimensional Gaussians $N(x|\mu_k, C_k)$. Here $1\leq k\leq m$, where $\mu_k \in \mathbb{R}^d$ is the mean vector and
    $C_k \in \mathbb{R}^{d \times d}$ is the covariance matrix of the $k$th Gaussian.

Sampling from a Gaussian mixture is done as follows:

  1. First, draw a sample from the discrete random variable $Z$. Assume $Z=e_k$. Note that the probability of this is $\alpha_k$.
  2. Randomly sample $X \sim N(\mu_k, C_k)$.
  3. The result is $X$.
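A minimal sketch of this two-step sampling procedure in Python/NumPy (the dimensions, parameter values, and function name are illustrative assumptions):

```python
import numpy as np

def sample_gmm(r, alphas, mus, covs, rng):
    """Draw r samples from a Gaussian mixture by the two-step procedure above."""
    m = len(alphas)
    # Step 1: draw the component index k, i.e. Z = e_k, with probability alpha_k.
    ks = rng.choice(m, size=r, p=alphas)
    # Step 2: for each draw, sample X ~ N(mu_k, C_k) from the selected component.
    xs = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in ks])
    return xs, ks

# Example: m = 2 components in d = 2 dimensions (numbers chosen for illustration).
alphas = np.array([0.4, 0.6])
mus    = [np.zeros(2), np.array([5.0, 5.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
x, k = sample_gmm(1000, alphas, mus, covs, np.random.default_rng(1))
```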

We want to write the probability density function $P(x,z)$ of the Gaussian mixture model. But, first, let us write the
marginal distribution for $x$:
\begin{eqnarray}
P(x) &=& \sum_{k=1}^m P(x|Z=e_k) P(Z=e_k)\\
&=& \sum_{k=1}^m \alpha_k N(x|\mu_k, C_k)
\end{eqnarray}
this is a convex combination of Gaussian pdf’s, as the $\alpha_k$’s are nonnegative and
they sum up to 1. The $\alpha_k$’s are known as the “mixing coefficients”.
The joint density function $P(x,z)$ is
\begin{eqnarray}
P(x,z=e_k) &=& \alpha_k N(x|\mu_k, C_k)\\
&=& \prod_{k=1}^m \alpha_k^{z_k} N(x|\mu_k, C_k)^{z_k}
\end{eqnarray}
where
\begin{eqnarray}
z=\left[ \begin{array}{c}
z_1\\
z_2\\
\vdots\\
z_{m-1}\\
z_m\\
\end{array}\right]
\end{eqnarray}
Note that only one of the $z_k$’s is 1 and the rest are 0. For example, if $z=e_4$, then $z_4=1$ and all the other $z_k$’s are zero.
Lastly, this formula will also be useful:
\begin{eqnarray}
P(Z=e_k|x)=\frac{P(x|e_k)P(e_k)}{P(x)} = \frac{\alpha_k N(x|\mu_k, C_k)}{\sum_{l=1}^m \alpha_l N(x|\mu_l, C_l)}
\end{eqnarray}
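This posterior, often called the responsibility of component $k$ for an observation $x$, is straightforward to evaluate numerically. A sketch (function and variable names are my own; the log-sum-exp step is only for numerical stability):

```python
import numpy as np

def log_gauss(x, mu, lam):
    """log N(x | mu, C) written with the precision matrix lam = C^{-1}."""
    d = mu.shape[0]
    diff = x - mu
    quad = np.einsum('...i,ij,...j->...', diff, lam, diff)
    return 0.5 * (np.linalg.slogdet(lam)[1] - d * np.log(2.0 * np.pi) - quad)

def responsibilities(x, alphas, mus, lams):
    """P(Z = e_k | x) for each component k, following the formula above."""
    # log of the numerator alpha_k N(x | mu_k, C_k), one column per component k
    log_num = np.stack([np.log(a) + log_gauss(x, mu, lam)
                        for a, mu, lam in zip(alphas, mus, lams)], axis=-1)
    log_num -= log_num.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    num = np.exp(log_num)
    return num / num.sum(axis=-1, keepdims=True)      # normalize over k
```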

Gaussian Mixtures as a special case of the exponential family

We have to show that
\begin{eqnarray}
P(x,z) = \prod_{k=1}^m \alpha_k^{z_k} N(x|\mu_k, C_k)^{z_k}
\end{eqnarray}
is an exponential family, with parameter vector $\theta=\{ \alpha_1, \mu_1, C_1,\ldots,\alpha_m, \mu_m, C_m \}$. Recalling that
\begin{eqnarray}
N(x|\mu_k, C_k) = \sqrt{\frac{|\Lambda_k|}{(2 \pi)^d}}\, e^{-\frac{1}{2}(x-\mu_k)^T \Lambda_k (x-\mu_k)}
\end{eqnarray}
where $\Lambda_k=C_k^{-1}$, we can write the above expression as
\begin{eqnarray}
P(x,z) &=& \prod_{k=1}^m e^{z_k \log(\alpha_k)} \left(\sqrt{\frac{|\Lambda_k|}{(2 \pi)^d}}\right)^{z_k}
e^{-\frac{z_k}{2}(x-\mu_k)^T \Lambda_k (x-\mu_k)}\nonumber\\
&=& \exp\left[ \sum_{k=1}^m \left( z_k \log(\alpha_k) + \frac{z_k}{2} \log(|\Lambda_k|)
- \frac{z_k}{2} \left(x^T\Lambda_k x-2\mu_k^T \Lambda_k x +\mu_k^T \Lambda_k \mu_k\right) \right)\right] \nonumber
\end{eqnarray}
Here we disregarded the constant factor involving $2\pi$ in the second line; it can be absorbed into $h(x,z)$. Now, recall that the form of the exponential
family is
\begin{eqnarray}
e^{[\theta^T s(x,z)]}\frac{h(x,z)}{C(\theta)}
\end{eqnarray}
Now we have to fit $P(x,z)$ into the scheme of exponential distributions. As a first step, we define
\begin{eqnarray}
\beta_k = \log(\alpha_k)+\frac{1}{2}\log|\Lambda_k| -\frac{1}{2} \mu_k^T \Lambda_k \mu_k
\end{eqnarray}
As a second step, note that $x^T\Lambda_k x$ is a quadratic form. Therefore
\begin{eqnarray}
x^T\Lambda_k x=\sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}x_v x_w
\end{eqnarray}
and
\begin{eqnarray}
\mu_k^T\Lambda_k x=\sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}(\mu_k)_v x_w
\end{eqnarray}
where $(\Lambda_k)_{vw}$ denotes the $(v,w)$th element of the matrix $\Lambda_k$. Similarly,
$(\mu_k)_{v}$ denotes the $v$th element of the vector $\mu_k$. Collecting these terms, we get the exponential form of the
Gaussian mixture:
\begin{eqnarray}
P(x,z)=\exp\left[ \sum_{k=1}^m \left( z_k \beta_k - \frac{1}{2}\sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}x_v x_w z_k
+ \sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}(\mu_k)_v x_w z_k \right) \right] \nonumber
\end{eqnarray}
The sufficient statistics are:

  • $z_k$, \qquad $k=1 \ldots m$
  • $z_k x_v x_w$, \qquad $v,w=1 \ldots d$, \quad $k=1 \ldots m$.
    Or, written in matrix form, $z_k xx^T$. Note that every element of this $d \times d$ matrix is a sufficient statistic, and there are $m$ such matrices.
  • $z_k x_v $, \qquad $v=1 \ldots d$, \quad $k=1 \ldots m$. Or, the vector $z_k x$.
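As a quick numerical illustration, the three groups of statistics for one sample can be assembled as follows (a sketch; the array shapes are my own convention, with $x$ of length $d$ and $z$ a one-hot vector of length $m$):

```python
import numpy as np

def single_sample_stats(x, z):
    s1 = z                                   # z_k for k = 1..m
    s2 = np.einsum('k,v,w->kvw', z, x, x)    # z_k x_v x_w, i.e. z_k x x^T for every k
    s3 = np.outer(z, x)                      # z_k x_v,     i.e. z_k x     for every k
    return s1, s2, s3

# Example with m = 3 components and d = 2 dimensions; z = e_2 selects component 2.
x = np.array([1.0, -0.5])
z = np.array([0.0, 1.0, 0.0])
s1, s2, s3 = single_sample_stats(x, z)
```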

Gaussian Mixtures with Multiple iid Samples

The equation above is valid for a single sample $(x,z)$. What if we took $r$ iid samples $x^1,z^1;x^2,z^2;\ldots;x^r,z^r$
from the same Gaussian mixture distribution? In this case, the corresponding pdf will be
\begin{eqnarray}
P(x^1,z^1;x^2,z^2;\ldots;x^r,z^r)=P(x^1,z^1)P(x^2,z^2)\cdots P(x^r,z^r)
\end{eqnarray}
At this stage, recall that each $x^i$ is a $d$-dimensional vector and each $z^i$ is an $m$-dimensional vector, i.e.
\begin{eqnarray}
x^i=\left[ \begin{array}{c}
x_1^i\\
x_2^i\\
\vdots\\
x_{d-1}^i\\
x_d^i\\
\end{array}\right] \qquad , \qquad z^i=\left[ \begin{array}{c}
z_1^i\\
z_2^i\\
\vdots\\
z_{m-1}^i\\
z_m^i\\
\end{array}\right]
\end{eqnarray}
Hence,
\begin{eqnarray}
P(x^i,z^i)=\exp\left[ \sum_{k=1}^m \left( z_k^i \beta_k - \frac{1}{2}\sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}x_v^i x_w^i z_k^i
+ \sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}(\mu_k)_v x_w^i z_k^i \right) \right] \nonumber
\end{eqnarray}
and
\begin{eqnarray}
P(x^1,z^1;x^2,z^2;\ldots;x^r,z^r) = \nonumber \\
\exp\left[ \sum_{k=1}^m \left( \beta_k \sum_{i=1}^r z_k^i
- \frac{1}{2}\sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}\sum_{i=1}^r x_v^i x_w^i z_k^i
+ \sum_{v=1}^d \sum_{w=1}^d (\Lambda_k)_{vw}(\mu_k)_v \sum_{i=1}^r x_w^i z_k^i \right) \right] \nonumber
\end{eqnarray}
The sufficient statistics in this case are

  • $\sum_{i=1}^r z_k^i$, \qquad $k=1 \ldots m$
  • $\sum_{i=1}^r z_k^i x_v^i x_w^i$, \qquad $v,w=1 \ldots d$, \quad $k=1 \ldots m$. Or, the matrix $\sum_{i=1}^r z_k^i x^i (x^i)^T$
  • $\sum_{i=1}^r z_k^i x_v^i $, \qquad $v=1 \ldots d$, \quad $k=1 \ldots m$. Or, the vector $\sum_{i=1}^r z_k^i x^i $
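A compact way to accumulate these sums for all components at once (a sketch with assumed shapes: $x$ is an $r \times d$ array of samples and $z$ an $r \times m$ array of one-hot rows; in the E-step the unobserved $z$ would be replaced by the expected responsibilities):

```python
import numpy as np

def summed_stats(x, z):
    n_k  = z.sum(axis=0)                          # sum_i z_k^i,             shape (m,)
    s_xx = np.einsum('ik,iv,iw->kvw', z, x, x)    # sum_i z_k^i x^i (x^i)^T, shape (m, d, d)
    s_x  = np.einsum('ik,iv->kv', z, x)           # sum_i z_k^i x^i,         shape (m, d)
    return n_k, s_xx, s_x
```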

EM on Gaussian Mixtures

As we have the sufficient statistics, we can substitute them into the iterative formula for exponential families derived above. For the first sufficient statistic, we get
\begin{eqnarray}
E_{\theta_0}\left(\sum_i Z_{ik} | X=x\right)=E_{\theta}\left(\sum_i Z_{ik} \right)
\end{eqnarray}
Let us analyze the RHS and LHS of this equation separately. Using the linearity of expectation, we can expand the RHS:
\begin{eqnarray}
E_{\theta}\left(\sum_i Z_{ik} \right)=\sum_i E_{\theta}\left( Z_{ik} \right)
\end{eqnarray}
The subscript $ik$ on the random variable $Z_{ik}$ denotes the $k$th component of the $i$th sample. As the samples are iid, the pmf of
$Z_{ik}$ is independent of the sample index $i$: $P(Z_{ik}=1)=\alpha_k$, $P(Z_{ik}=0)=1-\alpha_k$. Hence,
\begin{eqnarray}
E_{\theta}\left( Z_{ik} \right)=0\cdot P(Z_{ik}=0)+1\cdot P(Z_{ik}=1)=\alpha_k
\end{eqnarray}
and
\begin{eqnarray}
E_{\theta}\left(\sum_i Z_{ik} \right)=\sum_{i=1}^r \alpha_k=r \alpha_k
\end{eqnarray}
For the LHS,