Fast Text Cache--Feed Forward Bloom Filter

In my current work, I want to speed up the text pattern match and have good cache behavior. For pattern match, there are several pretty good algorithm : KMP , Rabin Karp.I found the following algorithm is extremely awesome!!!! This algorithm quickly filter out non-related corpus. It reduces quite a lot in terms of problem size.

white paper

Following is my learning note.

Bloom Filter

Definition from wiki:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.Probability of false positives :Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array , and k is the number of the numver of hash functions, then the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is then : $1 - \frac{1}{m}$. The probability that it is not set to 1 by any of the hash functions is $(1 - \frac{1}{m})^k$ . If we have inserted n elements, the probability that a certain bit is still 0 is $(1 - \frac{1}{m})^{kn}$. The probability that it is 1 is therefore $1 - (1-\frac{1}{m})^{kn}$. Now test membership of anelement that is not in the set. Each of the k array positions computed by the hash function is 1 with probability as above. The probability of all of them being 1, which would cause the algorithm erroneously claim that the element is in the set, is often given as $(1 - (1 - \frac{1}{m})^{kn})^{k} = (1 - \exp^{-kn/m})^k$.

So basically, bloom filter has pretty good rejection feature. Only false positives could happen.

Feed Forward Bloom Fitlers

For exact_match acceleration, Bloom filters are typically used as a filter before a traditional matching phase. Like a traditional Bloom Filter, the feed-forward Bloom filter reduces the size of the corpus before cleanup. Unlike traditional filters, however, it also uses information determined while filtering the corpus to eliminate many of the patterns from the second phase. As a result, it reduces drastically the memory used for cleanup.

Cache-partitioned Bloom Filters

A lookup in a typical Bloom filter involves computing a number of hash values for a query, and using these values as indices when accessing a bit vector. Because the hash values must be randomly distributed for the filter to be effective, and since, for millions of pattherns, the bit vector needs to be a lot larger than the cache available on modern CPUs, Bloom filter implementations have poor cache performance. The solution above paper provided is to split the bloom filter into 2 parts. The first part is smaller than the largest CPU cache available(typically L2 cache) and is the only one accessed for the wide majority of the lookups. In consequence, it will remain entirely cache-resident. The second part of the filter is larger, but is accessed infrequently. The result of this paper is that the cache-partitioned Bloom filter is as effective as the classic Bloom filter, but has much better cache performance.

Design and Implementation

High Level

a feed-forward Bloom filter(FFBF) is built from the large set of patterns.
It is used to scan the corpus and discard every item(e.g. line of text, if the patterns cannot span multiple lines, or input fragment) that does not generate hits in the filter and therefore cannot contain any matches.
The set of patterns is then scanned using feed-forward information obtained during the corpus scan. Only those patterns for which there is a chance of a match in the filtered corpus are kept for the next phase.
At this point, all that is left to do is search for a small fraction of the initial number of patterns in a small fragment of the text corpus. Therefore, this exact matching step can be performed quickly and with minimal memory requirements using any traditional multiple pattern matching algorithm. Notice that the large set of patterns does not have to be memory -resident at any point during the execution of our algorithm – we only need to stream it sequentially from external media.

BLOOM

My feeling

Bloom filter works pretty well when acting as reducing problem size. That actually reminds me of other applications that will be improved by this approach, especially before we do distributed computing, we can reduce the problem size quite bit if we define a “domain” bloom filter.