ML Review -- Expectation Maximization Algorithm

Maximum-likelihood estimation problem

We have a density function $ p(x|\Theta)$ that is governed by the set of parameters $ \Theta$ (e.g., $ p$ might be a set Gaussians and $ \Theta$ could be the means and covariances). We also have a data set of size $ N$, That is , we assume that these data vectors are independent and identically distributed (i.i.d) with distribution $ p$. Therefore, the resulting density for the samples is $ p(X|\Theta)=\sum{i=1}^{N}p(xi|\Theta) = L(\Theta|X)$

This function $ L(\Theta|X)$ is called the liklihood of the parameters given the data, or just the likelihood function. The likelihood is thought of as a function of the parameters $ \Theta$ where the data $X$ is fixed. In the maximum likelihood problem, our goal is to find the $ \Theta$ that maximizes $ L$. That is, we wish to find $ \Theta '$ where $ \Theta ' =argmax_{\Theta} L(\Theta|X)$

Often we maximize $ \log(L(\Theta|X))$ instead because it is analytically easier.

Basic EM

There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values, due to problems with or limitations of the observation process. The second occurs when optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters. The latter application is more common in the computational pattern recognition community.

As before, we assume that data $ X$ is observed and is generated by some distribution. We call $ X$ the incomplete data. We assume that a complete data set exists $ Z=(X,Y)$ and also assume a joint density funciton

$ p(z|\Theta)=p(x,y|\Theta)=p(y|x,\Theta)p(x|\Theta)$

With this new density function, we can define a new likelihood function, $ L(\Theta|Z)=L(\Theta|X,Y)=p(X,Y|\Theta)$, called the complete-data likelihood. Note that this function is in fact a random variable since the missing information $ Y$ is unknown, random and presumably governed by an underlying distribution. That is, we can think of $ L(\Theta|X,Y)=h_{X,\Theta}(Y)$ for some function $ h$ where $ X$ and $ \Theta$ are constant and $ Y$ is a random variable. The original likelihood $ L(\Theta|X)$ is referred to as the incomplete-data likelihood funciton.

The EM algorithm first finds the expected value of the complete-data log-likelihood $ \log p(X,Y|\Theta)$ with respect to the unknown data $ Y$ given the observed data $ X$ and the current parameter estimates. We define:

$ Q(\Theta, \Theta^{i-1})= E(\log p(X,Y|\Theta)|X, \Theta^{i-1})$

Where $ \Theta^{i-1}$ are the current parameters estimates that we used to evaluate the expectation and $\Theta$ are the new parameters that we optimize to increase Q.

The key thing to understand is that in this expression $ X$ and $ \Theta^{i-1}$ are constants, $\Theta$ is a normal variable that we wish to adjust, and $ Y$ is a random variable governed by the distribution $ f(y|X,\Theta^{i-1})$. The right side of $ Q$ can be re-written as:

$ E[\log p(X,Y|\Theta)|X,\Theta^{i-1}]=\int_{y \in \Gamma} \log p(X,y|\Theta)f(y|X,\Theta^{i-1})dy$

Note that $ f(y|X,\Theta^{i-1})$ is the marginal distribution of the unobserved data and is dependent on both the observed data $X$ and on the current parameters, and $ \Gamma$ is the space of values $ y$ can take on.

The evaluation of this expectation is called the E-step of the algorithm. Notice the meaning of the two arguments int function $ Q(\Theta ,\Theta') $ The first argument $ \Theta$ corresponds to the parameters that ultimately will be optimized in an attempt to maximize the likelihood. The second argument $ \Theta$ corresponds to the parameters that we use to evaluate the expectation.

The second step(M-step) of the EM algorithm is to maximize the expectation we computed in the first step. We find:

$ \Theta^{i} = argmax_{\Theta} Q(\Theta,\Theta^{i-1})$

These 2 steps are repeated as necessary. Each iteration is guaranteed to increase the log-likelihood and the algorithm is guaranteed to converge to a local maximum of the likelihood function.