Topic Model: LDA

Introduction

In real information retrieval systems (e.g., Google, Bing), when we gather many online documents, we need to classify them according to their topic. For this task, we need a machine learning algorithm to do the classification.

One way is to use Naive Bayes to choose the topic with the maximum posterior probability P(topic|document).

In this approach, two models are available: the multivariate Bernoulli model and the multinomial model.
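To make this concrete, here is a minimal sketch of both variants using scikit-learn. The toy corpus, topic labels, and test sentence are hypothetical, purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus with one document per topic.
docs = [
    "stock market shares trading bank",
    "election vote government policy",
    "goal match player score team",
]
topics = ["finance", "politics", "sports"]

# Multinomial model: features are word counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
multinomial = MultinomialNB().fit(X_counts, topics)

# Multivariate Bernoulli model: features are binary word occurrences.
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bernoulli = BernoulliNB().fit(X_binary, topics)

# Both classifiers pick the topic with the maximum posterior P(topic|document).
new_doc = ["the market and bank shares"]
print(multinomial.predict(count_vec.transform(new_doc)))  # ['finance']
print(bernoulli.predict(binary_vec.transform(new_doc)))   # ['finance']
```

The difference between the two models is only in the features: the multinomial model counts how often each word occurs, while the Bernoulli model records only whether each word occurs at all.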

But in 2003, David Blei, Andrew Ng, and Michael Jordan introduced LDA into this area, presenting it as a graphical model for topic discovery.

Basics

In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis, except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document.
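To get some intuition for what the Dirichlet prior buys us, the following sketch (numpy only; the topic count and concentration values are arbitrary illustrative choices) draws per-document topic mixtures under a small and a large concentration parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of topics (an arbitrary illustrative choice)

# Small concentration -> sparse mixtures: each document leans on few topics.
sparse = rng.dirichlet([0.1] * K, size=3)
# Large concentration -> smooth mixtures: mass is spread across topics.
smooth = rng.dirichlet([10.0] * K, size=3)

print(sparse.round(2))  # each row sums to 1, most mass on one or two topics
print(smooth.round(2))  # each row sums to 1, mass spread fairly evenly
```

With small concentration values, each sampled mixture puts most of its mass on one or two topics, matching the intuition that a single document is usually about only a few things.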

Model

(Figure: plate notation for the LDA model)

The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents, and N the number of words in a document (N_i for document i).

α is the parameter of the Dirichlet prior on the per-document topic distributions.
β is the parameter of the Dirichlet prior on the per-topic word distribution.
Θ_i is the topic distribution for document i.
ϕ_k is the word distribution for topic k.
z_ij is the topic for the j-th word in document i.
w_ij is the specific word observed at position j in document i.

The w_ij are the only observable variables; all the other variables are latent.

The generative process behind LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document i in a corpus D:

  1. Choose Θ_i ∼ Dir(α), where i ∈ {1, …, M} and Dir(α) is the Dirichlet distribution with parameter α.

  2. Choose ϕ_k ∼ Dir(β), where k ∈ {1, …, K}.

  3. For each word w_ij, where j ∈ {1, …, N_i}:
     3a. Choose a topic z_ij ∼ Multinomial(Θ_i).
     3b. Choose a word w_ij ∼ Multinomial(ϕ_{z_ij}).
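A compact simulation of this generative story makes the three steps concrete. The corpus sizes, vocabulary, and hyperparameter values below are made up for illustration (numpy only):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes: these are assumptions, not values from the paper.
M, K, V = 5, 3, 8          # documents, topics, vocabulary size
N = [10, 12, 8, 15, 9]     # words per document (N_i)
alpha, beta = 0.5, 0.1     # symmetric Dirichlet hyperparameters

# Step 2: phi_k ~ Dir(beta) for each topic k (per-topic word distributions).
phi = rng.dirichlet([beta] * V, size=K)

corpus = []
for i in range(M):
    # Step 1: theta_i ~ Dir(alpha) (per-document topic distribution).
    theta = rng.dirichlet([alpha] * K)
    doc = []
    for j in range(N[i]):
        # Step 3a: z_ij ~ Multinomial(theta_i) picks a topic for this position.
        z = rng.choice(K, p=theta)
        # Step 3b: w_ij ~ Multinomial(phi_{z_ij}) picks the word itself.
        w = rng.choice(V, p=phi[z])
        doc.append(w)
    corpus.append(doc)

print(corpus[0])  # word ids for the first generated document
```

Inference runs this story in reverse: given only the observed words, we want to recover the latent Θ, ϕ, and z.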

Inference

There is a long, long inference process… So if you are interested in the entire derivation, refer to PRML (Bishop's Pattern Recognition and Machine Learning).
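In the meantime, a practical note: you rarely need to implement the inference yourself. Here is a sketch using scikit-learn's LatentDirichletAllocation, which fits the model with variational Bayes; the toy corpus and the choice of two topics are illustrative assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus mixing two rough themes.
docs = [
    "stock market trading shares bank",
    "match goal player team score",
    "bank loan interest market",
    "team player match win",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# n_components is K, the number of topics; 2 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Per-document topic mixtures (rows of Theta, each summing to 1).
print(lda.transform(X).round(2))

# Top words per topic, from the (unnormalized) per-topic word weights.
words = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [words[t] for t in top])
```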