The Probabilistic Relevance Framework

Introduction

The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s and 1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking document meta-data into account (especially structure and link-graph information). This led to one of the most successful Web-search and corporate-search algorithms, BM25F.

Basic Model

Some notation

rank equivalence: $\propto_{q}$, e.g. $g() \propto_{q} h()$ means that $g$ and $h$ rank documents identically for query $q$
relevance: the variable $Rel$ takes the values $rel$ and $\bar{rel}$ (relevant or not)
document: $d = (tf_{1},\ldots, tf_{|V|})$, where $tf_i$ normally represents the frequency of term $t_i$ in the document
query: $q = (qtf_{1},\ldots,qtf_{|V|})$, where $qtf_i$ represents the frequency of term $t_i$ in the query, or may represent a binary presence/absence feature
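For the code sketches later in these notes, a convenient (purely illustrative) representation is a sparse mapping from terms to frequencies:

```python
# Sparse term-frequency vectors: only non-zero entries are stored.
# These particular documents/queries are invented for illustration.
doc = {"probabilistic": 2, "relevance": 3, "framework": 1}
query = {"probabilistic": 1, "relevance": 1}
```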

Development of the Basic Model

We start with the very general PRF and end up with an explicit document scoring function. Ranking by the probability of relevance is rank-equivalent to ranking by the odds of relevance, which Bayes' rule lets us rewrite:

$P(rel|d,q) \propto_{q}\frac{P(rel|d,q)}{P(\bar{rel}|d,q)} = \frac{P(d|rel,q) P(rel|q)}{P(d|\bar{rel},q)P(\bar{rel}|q)}$

$\propto_{q} \frac{P(d|rel,q)}{P(d|\bar{rel},q)}$ (the dropped factor $\frac{P(rel|q)}{P(\bar{rel}|q)}$ is the same for every document)

$\approx \prod_{i \in V} \frac{P(TF_i = tf_i | rel, q)}{P(TF_i = tf_i| \bar{rel}, q)}$ (assuming the $TF_i$ are mutually independent given relevance and the query)

$\approx \prod_{i \in q} \frac{P(TF_i = tf_i | rel)}{P(TF_i = tf_i| \bar{rel})}$ (assuming that only query terms discriminate between relevant and non-relevant documents, which also lets us drop the conditioning on $q$)

$\propto_{q} \sum_{i \in q} \log \frac{P(TF_i = tf_i | rel)}{P(TF_i = tf_i|\bar{rel})}$ (the logarithm is monotonic, so it preserves the ranking)

Let $U_i(x) = \log \frac{P(TF_i =x|rel)}{P(TF_i =x|\bar{rel})}$; then we have

$=\sum_{i \in q} U_i (tf_i)$

$= \sum_{i \in q, tf_i>0} U_i(tf_i) + \sum_{i \in q, tf_i=0} U_i(0) - \sum_{i \in q, tf_i>0}U_i(0) + \sum_{i \in q, tf_i>0} U_i(0)$ (splitting the sum over terms with $tf_i>0$ and $tf_i=0$, then adding and subtracting $\sum_{i \in q, tf_i>0} U_i(0)$)

$= \sum_{i \in q, tf_i>0} (U_i(tf_i) - U_i(0)) + \sum_{i \in q} U_i(0)$

$\propto_{q} \sum_{i \in q, tf_i>0} (U_i(tf_i) - U_i(0))$ (the second sum depends only on the query, not on the document, so it can be dropped under rank equivalence)

Let $W_i(x) = U_i(x) - U_i(0) = \log \frac{P(TF_i = x | rel) P(TF_i =0| \bar{rel})}{P(TF_i=x|\bar{rel})P(TF_i=0|rel)}$

and $w_i = W_i(tf_i)$; we finally have:

$\sum_{i \in q, tf_i>0} w_i$
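As a concrete illustration, here is a minimal Python sketch of this scoring function; `term_weight` is a hypothetical callback standing in for whichever weight is plugged in (RSJ, BM25, and so on, derived below):

```python
def score(doc_tf, query, term_weight):
    """Score a document against a query under the PRF.

    doc_tf:      dict mapping term -> frequency in the document
    query:       iterable of query terms
    term_weight: function (term, tf) -> w_i, e.g. an RSJ or BM25 weight
    """
    # Sum weights only over query terms that actually occur in the
    # document (terms with tf_i = 0 contribute nothing, by construction
    # of W_i, which subtracts U_i(0)).
    return sum(
        term_weight(t, doc_tf[t])
        for t in query
        if doc_tf.get(t, 0) > 0
    )
```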

Derived Models

The Binary Independence Model

Suppose that $TF_i$ is a binary variable, taking only the values zero and one. The term weight $w_i$ then becomes:

$w_i^{BIM} = \log \frac{P(t_i|rel)(1- P(t_i|\bar{rel}))}{(1-P(t_i|rel))P(t_i|\bar{rel})}$

The equation above is conditional on the relevance property, which is not directly observable, so let's look in detail at how these probabilities can be estimated.

Notation:

$N$: size of the judged sample
$n_i$: number of documents in the judged sample containing term $t_i$
$R$: relevant set size (number of documents judged relevant)
$r_i$: number of judged relevant documents containing term $t_i$

Robertson/Spärck Jones weight

$w_i^{RSJ} = \log \frac{(r_i+0.5)(N-R -n_i + r_i +0.5)}{(n_i-r_i+0.5)(R-r_i+0.5)}$
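A direct transcription of the RSJ weight into Python might look like the following sketch (the function name and signature are illustrative):

```python
import math

def rsj_weight(N, n_i, R, r_i):
    """Robertson/Spärck Jones weight for a single term.

    N:   size of the judged sample
    n_i: documents in the sample containing the term
    R:   number of documents judged relevant
    r_i: relevant documents containing the term
    """
    # 0.5 is added throughout as a smoothing correction, so the
    # weight stays finite even when a count is zero.
    return math.log(
        ((r_i + 0.5) * (N - R - n_i + r_i + 0.5))
        / ((n_i - r_i + 0.5) * (R - r_i + 0.5))
    )
```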

Next, suppose that some documents, probably a small number, have been retrieved and judged; this is the usual relevance feedback scenario, and the counts above come from that judged sample.

Now suppose instead that we have no relevance information at all (the most common scenario). In this case we can only assume that the relevance probability $P(t_i|rel)$ is a fixed constant, but we can continue to use the complement method for the non-relevance probability. All of this can be achieved by setting $R=r_i=0$ in the RSJ formula. The resulting formula is a close approximation to the classical $IDF$:

$w_i^{IDF} = \log \frac{N-n_i+0.5}{n_i+0.5}$
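Substituting $R = r_i = 0$ into the RSJ weight makes the reduction explicit:

$w_i^{RSJ}\big|_{R=r_i=0} = \log \frac{(0.5)(N - n_i + 0.5)}{(n_i + 0.5)(0.5)} = \log \frac{N - n_i + 0.5}{n_i + 0.5} = w_i^{IDF}$

Equivalently, `rsj_weight(N, n_i, 0, 0)` in the sketch above computes exactly this value.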

The Eliteness Model

We suppose that for any document-term pair there is a hidden property which we refer to as $eliteness$. This can be interpreted as a form of aboutness: if the term is elite in the document, then in some sense the document is about the concept denoted by the term.

2-Poisson Model

New notation: The eliteness random variable $E$ can take two values:

$E$ : $elite$, $\bar{elite}$

We now decompose all the probabilities we need over these two disjoint events, following this pattern:

$P(\alpha|\beta) = P(\alpha |elite)P(elite|\beta) + P(\alpha | \bar{elite})P(\bar{elite} | \beta)$.

Let $p_{i1} = P(E_i = elite|rel)$ and $p_{i0} = P(E_i = elite|\bar{rel})$

$E_{i1}(tf) = P(TF_i = tf | E_i = elite)$ and $E_{i0}(tf) = P(TF_i = tf | E_i = \bar{elite})$

Applying this pattern, we get: $P(TF_i = tf |rel) = p_{i1} E_{i1}(tf) + (1 - p_{i1}) E_{i0}(tf)$, and similarly for $\bar{rel}$ with $p_{i0}$.

This gives us an equation for our term weights (dropping the subscript $i$ for readability):

$w_{i}^{elite} = \log \frac{(p_1E_1(tf) + (1-p_1)E_0(tf))(p_0E_1(0)+(1-p_0)E_0(0))}{(p_1E_1(0)+(1-p_1)E_0(0))(p_0E_1(tf)+(1-p_0)E_0(tf))}$

More specifically, we make distributional assumptions about these events. Following Harter, we assume that the distributions of term frequencies across documents, conditioned on eliteness, are Poisson:

$E_{ie}(tf) = \frac{\lambda_{ie}^{tf}}{tf!} e^{-\lambda_{ie}}$ (Poisson with mean $\lambda_{ie}$)

We note the following characteristics of this model:

  1. The model of topicality is a very simple one – one word one topic.

  2. There is no attempt to normalise the probabilities across the full unigram model for the document.

  3. The model depends fairly crucially on the notion that all documents are of the same (fixed) length.

If we plug the Poisson distributional assumptions into the equation above, we can express the term weight as a function of the two means $\lambda_{ie}$ (one for $elite$, one for $\bar{elite}$) and the mixing proportion of elite and non-elite documents in the collection (as well as the observed $tf$). This is a somewhat messy formula, written out below.
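Written out explicitly (again dropping the subscript $i$; the common factor $\frac{1}{tf!}$ cancels between numerator and denominator), the weight is:

$w^{elite}(tf) = \log \frac{(p_1\lambda_1^{tf}e^{-\lambda_1} + (1-p_1)\lambda_0^{tf}e^{-\lambda_0})(p_0e^{-\lambda_1} + (1-p_0)e^{-\lambda_0})}{(p_1e^{-\lambda_1} + (1-p_1)e^{-\lambda_0})(p_0\lambda_1^{tf}e^{-\lambda_1} + (1-p_0)\lambda_0^{tf}e^{-\lambda_0})}$

where $\lambda_1$ and $\lambda_0$ are the elite and non-elite Poisson means.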

We note:

  1. $w_{i}^{elite}(0) = 0$;
  2. $w_{i}^{elite}(tf)$ increases monotonically with $tf$;
  3. however, it asymptotically approaches a maximum value as $tf \rightarrow \infty$;
  4. the asymptotic limit being:

$\lim_{tf\rightarrow\infty} w_i^{elite}(tf) = \log \frac{p_1(1-p_0)}{(1-p_1)p_0} = w_i^{BIM}$
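A small numeric sketch (with made-up eliteness parameters, chosen purely for illustration) shows these properties in action: the weight is zero at $tf = 0$, grows with $tf$, and saturates at the BIM limit.

```python
import math

def poisson_pmf(tf, lam):
    # P(TF = tf) for a Poisson with mean lam.
    return lam ** tf * math.exp(-lam) / math.factorial(tf)

def w_elite(tf, p1, p0, lam_elite, lam_nonelite):
    """2-Poisson eliteness weight for a single term (subscript i dropped)."""
    def mix(tf, p):
        # P(TF = tf) as a mixture of the elite and non-elite Poissons.
        return (p * poisson_pmf(tf, lam_elite)
                + (1 - p) * poisson_pmf(tf, lam_nonelite))
    return math.log((mix(tf, p1) * mix(0, p0)) / (mix(0, p1) * mix(tf, p0)))

# Illustrative parameters only (not estimated from any collection).
p1, p0, lam_e, lam_ne = 0.7, 0.2, 10.0, 0.5

for tf in (0, 1, 2, 5, 10, 20):
    print(tf, round(w_elite(tf, p1, p0, lam_e, lam_ne), 4))

# The BIM limit the weight saturates towards:
print(math.log(p1 * (1 - p0) / ((1 - p1) * p0)))
```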

BM25

Notation: soft length normalisation: $B = (1-b) + b\frac{dl}{avdl}$,

where $dl = \sum_{i\in V} tf_i$ is the document length,

$avdl$ is the average document length. $b$ is a normalisation parameter; setting $b=1$ performs full document-length normalisation, while $b=0$ turns normalisation off.

$tf' = \frac{tf}{B}$

$w_{i}^{BM25}(tf) = \frac{tf'}{k_1+tf'} w_i^{RSJ}$

$=\frac{tf}{k_1((1-b) + b\frac{dl}{avdl})+tf} w_i^{RSJ}$

where $k_1$ is a $saturation$ parameter. For high $k_1$, increments in $tf$ continue to contribute significantly to the score, whereas for low $k_1$, the additional contribution of each newly observed occurrence tails off very rapidly.
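Putting the pieces together, here is a minimal BM25 scorer sketch. It uses the IDF form of $w_i^{RSJ}$ from above, since no relevance judgements are assumed; the defaults $k_1 = 1.2$ and $b = 0.75$ are common choices, not mandated by the model:

```python
import math

def bm25_score(doc_tf, query, df, N, avdl, k1=1.2, b=0.75):
    """BM25 score of one document for a query.

    doc_tf: dict term -> frequency in the document
    query:  iterable of query terms
    df:     dict term -> number of documents containing the term (n_i)
    N:      total number of documents in the collection
    avdl:   average document length
    """
    dl = sum(doc_tf.values())          # document length
    B = (1 - b) + b * dl / avdl        # soft length normalisation
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue
        n_i = df.get(t, 0)
        # w_i^RSJ with R = r_i = 0, i.e. the IDF approximation.
        idf = math.log((N - n_i + 0.5) / (n_i + 0.5))
        score += (tf / (k1 * B + tf)) * idf
    return score
```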

References

Stephen Robertson and Hugo Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond”, Foundations and Trends in Information Retrieval, 2009.

S. P. Harter, “A probabilistic approach to automatic keyword indexing”, Journal of the American Society for Information Science, 1975.