Expectation Maximization Algorithm
Maximum-likelihood estimation problem
We have a density function p(x|Θ)p(x|Θ) that is governed by the set of parameters ΘΘ (e.g., pp might be a set Gaussians and ΘΘ could be the means and covariances). We also have a data set of size NN, That is , we assume that these data vectors are independent and identically distributed (i.i.d) with distribution pp. Therefore, the resulting density for the samples is p(X|Θ)=∑Ni=1p(xi|Θ)=L(Θ|X)p(X|Θ)=∑Ni=1p(xi|Θ)=L(Θ|X)
This function L(Θ|X)L(Θ|X) is called the liklihood of the parameters given the data, or just the likelihood function. The likelihood is thought of as a function of the parameters ΘΘ where the data XX is fixed. In the maximum likelihood problem, our goal is to find the ΘΘ that maximizes LL. That is, we wish to find Θ‘Θ‘ where Θ‘=argmaxΘL(Θ|X)Θ‘=argmaxΘL(Θ|X)
Often we maximize log(L(Θ|X))log(L(Θ|X)) instead because it is analytically easier.
Basic EM
There are two main applications of the EM algorithm. The first occurs when the data indeed has missing values, due to problems with or limitations of the observation process. The second occurs when optimizing the likelihood function is analytically intractable but when the likelihood function can be simplified by assuming the existence of and values for additional but missing (or hidden) parameters. The latter application is more common in the computational pattern recognition community.
As before, we assume that data XX is observed and is generated by some distribution. We call XX the incomplete data. We assume that a complete data set exists Z=(X,Y)Z=(X,Y) and also assume a joint density funciton
p(z|Θ)=p(x,y|Θ)=p(y|x,Θ)p(x|Θ)p(z|Θ)=p(x,y|Θ)=p(y|x,Θ)p(x|Θ)
With this new density function, we can define a new likelihood function, L(Θ|Z)=L(Θ|X,Y)=p(X,Y|Θ)L(Θ|Z)=L(Θ|X,Y)=p(X,Y|Θ), called the complete-data likelihood. Note that this function is in fact a random variable since the missing information YY is unknown, random and presumably governed by an underlying distribution. That is, we can think of L(Θ|X,Y)=hX,Θ(Y)L(Θ|X,Y)=hX,Θ(Y) for some function hh where XX and ΘΘ are constant and YY is a random variable. The original likelihood L(Θ|X)L(Θ|X) is referred to as the incomplete-data likelihood funciton.
The EM algorithm first finds the expected value of the complete-data log-likelihood logp(X,Y|Θ)logp(X,Y|Θ) with respect to the unknown data YY given the observed data XX and the current parameter estimates. We define:
Q(Θ,Θi−1)=E(logp(X,Y|Θ)|X,Θi−1)Q(Θ,Θi−1)=E(logp(X,Y|Θ)|X,Θi−1)
Where Θi−1Θi−1 are the current parameters estimates that we used to evaluate the expectation and ΘΘ are the new parameters that we optimize to increase Q.
The key thing to understand is that in this expression XX and Θi−1Θi−1 are constants, ΘΘ is a normal variable that we wish to adjust, and YY is a random variable governed by the distribution f(y|X,Θi−1)f(y|X,Θi−1). The right side of QQ can be re-written as:
E[logp(X,Y|Θ)|X,Θi−1]=∫y∈Γlogp(X,y|Θ)f(y|X,Θi−1)dyE[logp(X,Y|Θ)|X,Θi−1]=∫y∈Γlogp(X,y|Θ)f(y|X,Θi−1)dy
Note that f(y|X,Θi−1)f(y|X,Θi−1) is the marginal distribution of the unobserved data and is dependent on both the observed data XX and on the current parameters, and ΓΓ is the space of values yy can take on.
The evaluation of this expectation is called the E-step of the algorithm. Notice the meaning of the two arguments int function Q(Θ,Θ′) The first argument Θ corresponds to the parameters that ultimately will be optimized in an attempt to maximize the likelihood. The second argument Θ corresponds to the parameters that we use to evaluate the expectation.
The second step(M-step) of the EM algorithm is to maximize the expectation we computed in the first step. We find:
Θi=argmaxΘQ(Θ,Θi−1)
These 2 steps are repeated as necessary. Each iteration is guaranteed to increase the log-likelihood and the algorithm is guaranteed to converge to a local maximum of the likelihood function.