Return to main page

A short overview of Terrier's scoring architecture

Terrier is another Java-based, open source search engine developed at the School of Computing Science, University of Glasgow. It is positioned as a platform for IR research, and as such, has an extensible ranking architecture, with many ranking functions, such as LM, BM25 and the DFR framework already implemented. Therefore, it may be worthwhile to include a short summary and analysis of their architecture here.

The Terrier ranking architecture

The interface Model is top of the model (Similarity) hierarchy.

WeightingModel: abstract implementation of Model. It duplicates the information in the CollectionStatistics and EntryStatistics objects.

There are two classes responsible for providing the statistical parameters to the model.
CollectionStatistics stores the collection-level information.

EntryStatistics contain some information about the term.

The models are rather simple: they override the scoring methods with their own formulas. There are also methods for setting the parameters (e.g. b for BM25).

The last class of interest is Idf. All kinds of idf functions are defined in it, such as idfDFR(), idfENQUIRY (seems to be buggy in 3.0), etc. I have two gripes with this:

Idf also has methods that compute the logarithm of number(s). This is because they use log2 instead of logn.

There are several sub-hierarchies under WeightingModel. These will be introduced in the next sections.

The DFR Framework

DFRWeightingModel is the base class for the modular DFR framework. The scoring formula has been divided into three parts, following the original paper:

These classes define their own custom interfaces, which have little in common with the interface of the WeightingModel class. It is an interesting question if it is possible to maintain a consistent interface (e.g. Similarity) across scoring components and ranking frameworks.

It is worth mentioning that the basic model, aftereffect and normalization implementations are selected by class name and are initialized via the class loader mechanism. This might or might not be necessary in Lucene.

Per-field Normalisation

Field-aware scoring algorithms are implemented as subclasses of PerFieldNormWeightingModel. Currently two such models are implemented: BM25F and PL2F.

As far as I understand, per-field scoring is the default in Lucene, so this particular sub-hierarchy might not be of little interest to us.

Statistics availability in Lucene

The content of the EntryStatistics class, i.e. the term statistics, is conveniently mirrored by the TermContext class. The relevant fields are

If I understand correctly, these statistics are extracted from the index on the fly, per field, so they depend on the fields searched in the query.

Also, as far as df goes, there is also IndexReader.docFreq().

Collection-level statistics seem to be harder to come by.

Conclusion

I have found the scoring hierarchy of Terrier very straightforward and easy to extend. However, I am not really convinced of the merits of the way idf is handled. I would rather have separate Tf, Idf and Query weight (and maybe Smoothing) parts that could be combined freely with each other. The obvious solution is to have separate class hierarchies (however flat) for all of them. This is up for debate, of course.

Return to main page