The Probabilistic Relevance Framework
Introduction
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grouned in work done in the 19701980s, which led to the development of one of the most successful textretrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document metadata (especially structure and linkgraph information). This led to one of the most successful Websearch and cororatesearch algorithms, BM25F.
Basic Model
Some notation
rank equivalence : $\propto_{q}$ eg. $g() \propto_{q} h()$
relevance $Rel$: $rel,\bar{rel}$ (relevant or not)
document: $d = (tf_{1},\ldots, tf_{V})$, where $tf_i$ normally represents the frequency of term $t_i$
query: $q = (qtf_{1},\ldots,qtf_{V})$, where $qtf_i$ represents term frequencies in the query or may represent a binary presence or absence feature.
Development of Basic Model
We start with the very general PRF, and end up with an explicit document scoring function.
$P(reld,q) \propto_{q}\frac{P(reld,q)}{P(\bar{rel}d,q)} = \frac{P(drel,q) P(relq)}{P(d\bar{rel},q)P(\bar{rel}q)}$
$\propto_{q} \frac{P(drel,q)}{P(d\bar{rel},q)} $
$\approx \prod_{i \in V} \frac{P(TF_i = tf_i  rel, q)}{P(TF_i = tf_i \bar{rel}, q)}$
$\approx \prod_{i \in q} \frac{P(TF_i = tf_i  rel)}{P(TF_i = tf_i \bar{rel})}$
$\propto_{q} \sum_q \log \frac{P(TF_i = tf_i  rel)}{P(TF_i = tf_i\bar{rel})}$
Let $U_i(x) = \log \frac{P(TF_i =xrel)}{P(TF_i =x\bar{rel})}$, then we have
$=\sum_q U_i (tf_i)$
$= \sum_{q,tf_i>0} U_i(tf_i) + \sum_{q, tf_i=0} U_i(0)  \sum_{q, tf_i>0}U_i(0) + \sum_{q, tf_i>0} U_i(0)$
$= \sum_{q,tf_i>0} (U_i(tf_i)  U_i(0)) + \sum_q U_i(0)$
$\propto_{q} \sum_{q,tf_i>0} (U_i(tf_i)  U_i(0))$
let $W_i(x) = U_i(x)  U_i(0) = \log \frac{P(TF_i = x  rel) P(TF_i =0 \bar{rel})}{P(TF_i=x\bar{rel})P(TF_i=0rel)}$
and $w_i = W_i(tf_i)$, we finally have:
$\sum_{q,tf_i>0} w_i$
Derived Models
The binary independent model
Suppose that $TF_i$ is a binary variable, having only the values zero and one. Now we got new $w_i$:
$w_i^{BIM} = \log \frac{P(t_irel)(1 P(t_i\bar{rel}))}{1P(t_irel)P(t_i\bar{rel})}$
Let’s get into detail about deriving a “relevance” property since above equation is conditional on the relevance property.
Notation
$N$ size of the judged sample
$n_i$ Number of documents in the judged sample containing term $t_i$
$R$ Relevant set size( number of documents judged relevant )
$r_i$ Number of judged relevant docs containing term $t_i$
Robertson Sprck Jones weight
$w_i^{RSJ} = \log \frac{(r_i+0.5)(NR n_i + r_i +0.5)}{(n_ir_i+0.5)(Rr_i+0.5)}$
Next, we suppose that some documents, probably a small number, have been retrieved and judged—this is the usual relevance feedback scenario.
And, we suppose that we have no relevance information at all(the most usual scenario). In this case, we can only assume that the relevance probability $P(t_irel)$ is fixed, but we can continue to use the complement method for the nonrelevance probability. All this can be achieved by setting $R=r_i=0$ in the RSJ formula. This resulting fomula is a close approximation to classical $IDF$
$w_i^{IDF} = \log \frac{Nn_i+0.5}{n_i+0.5}$
The Eliteness Model
We suppose that for any documentterm pair there is a hidden property which we refer to as $eliteness$ This can be interpreted as a form of aboutness: if the term is elite in the document, in some sense the document is about the concept denoted by the term.
2Poisson Model
New notation: The eliteness random variable $E$ can take two values:
$E$ : $elite$, $\bar{elite}$
We now decompose all the probabilities we want using these tow disjoint events, following this pattern:
$P(\alpha\beta) = P(\alpha elite)P(elite\beta) + P(\alpha  \bar{elite})P(\bar{elite}  \beta)$.
Let $p_{i1} = P(E_i = eliterel)$ and $p_{i0} = P(E_i = elite\bar{rel})$
$E_{i1}(tf) = P(TF_i=TF  E_i = elite)$ and $E_{i0}(tf) = P(TF_i=TF  E_i = \bar{elite})$
We can get this : $P(TF_i = tf rel) = p_{i1} E_{i1}(tf) + (1  p_{i1} E_{i0}(tf))$
This gives us an equation for our termweights:
$w_{i}^{elite} = \log \frac{(p_1E_1(tf) + (1p_1)E_0(tf))(p_0E_1(0)+(1p_0)E_0(0))}{(p_1E_1(0)+(1p_1)E_0(0))(p_0E_1(tf)+(1p_0)E_0(tf))}$
More specifically, we make distributional assumptions about these events. From $Harter$, we assume that the distributions of term frequencies across documents, conditioned on eliteness, are Poisson:
$E_{ie}(tf) \sim Poisson(\lambda_{ie})$ (Poisson with mean $\lambda_{ie}$)
We note the following characterisitics of this model:

The model of topicality is a very simple one – one word one topic.

There is no attempt to normalise the probabilities across the full unigram model for the document.

The model depends fairly crucially on the notion that all documents are of the same(fixed) length
If we plug the Poisson distributional assumptions into above equation, we can express the term weight as a function of the two means $\lambda_{e}$ and the mixing proportion of elite and nonelite documents in the collection(as well as the observed tf). This is a somewhat messy formula.
We note:
 $w_{i}^{elite}(0) = 0$
 $w_{i}^{elite}(tf)$ increases monotonically with $tf$;
 However, asymptotically approaches a maximum value as $tf \rightarrow \infty$
 and the asymptotic limit being:
$\lim_{tf\rightarrow\infty} w_i^{elite}(tf) = \log \frac{p_1(1p_0)}{(1p_1)p_0} = w_i^{BIM}$
BM25
Notation: soft length normlisation: $B = ((1b) + b\frac{dl}{avdl})$ ,
where $dl = \sum_{i\in V} tf_i$ it’s document length,
$avdl$ is the average document length. $b$ is normlisation parameter and setting $b=1$will perform full documentlength normalisation
$tf’ = \frac{tf}{B}$
$w_{i}^{BM25}(tf) = \frac{tf’}{k_1+tf’} w_i^{RSJ}$
$=\frac{tf}{k_1((1b) + b\frac{dl}{avdl})+tf} w_i^{RSJ}$
where $k_1$ is a $saturation$ parameter. For high $k_1$, increments in $tf$ continue to contribute significantly to the score, whereas for low $k_1$, the additional contribution of a newly observed occourrence tails off ver rapidly.
Reference
“The Probabilistic Relevance Framework BM25 and Beyond” by Stephen Robertson and Hugo Zaragoza
“A probabilistic approach to automatic keyword indexing” S.P.Harter