SummerOfCode2011ProjectRankingNotes

Notes and Q&A

What about deleted docs?

<rmuir> maxDoc() doesn't reflect deletes
<rmuir> docFreq() doesn't reflect deletes
<rmuir> numDocs() reflects deletes
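
To make the difference concrete, here is a minimal sketch in plain Java. ToyIndex is a hypothetical toy model, not the Lucene API; it only mimics the behaviour described above for a single term: deleting a document marks it in a bit set, so maxDoc() and docFreq() stay the same while numDocs() drops.

```java
import java.util.BitSet;

/** Toy model of an index with one term; hypothetical, NOT the Lucene API. */
final class ToyIndex {
    private final int maxDoc;       // number of doc IDs ever assigned
    private final BitSet deleted;   // doc IDs marked as deleted
    private final BitSet postings;  // docs containing the term, as written at index time

    ToyIndex(int maxDoc, int... docsWithTerm) {
        this.maxDoc = maxDoc;
        this.deleted = new BitSet(maxDoc);
        this.postings = new BitSet(maxDoc);
        for (int doc : docsWithTerm) {
            postings.set(doc);
        }
    }

    /** Marks a document as deleted; postings are left untouched. */
    void delete(int docId) {
        deleted.set(docId);
    }

    int maxDoc()  { return maxDoc; }                          // ignores deletes
    int docFreq() { return postings.cardinality(); }          // ignores deletes
    int numDocs() { return maxDoc - deleted.cardinality(); }  // reflects deletes
}
```

With five documents, three of which contain the term, deleting one of those three leaves maxDoc() at 5 and docFreq() at 3, while numDocs() becomes 4.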

What about the other methods?

sumOfNorms can be used as a "sum of lengths", provided the norm encodes the document length itself (and not 1/sqrt(#tokens), as the default does).
Lucene indexes in segments, but ranking needs statistics for the whole index. That is why MockBM25Similarity.avgDocumentLength() climbs to the top of the segment tree via ReaderUtil.getTopLevelContext(context).
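
The reason for going to the top level can be sketched in plain Java. The class and field names below (AvgDocLength, SegmentStats, sumOfLengths, docCount) are hypothetical and not part of Lucene; the sketch assumes the norms store real lengths and that the average is taken over all documents in all segments.

```java
import java.util.List;

/** Hypothetical helper, not Lucene code: index-wide average document length. */
final class AvgDocLength {

    /** Illustrative per-segment statistics. */
    static final class SegmentStats {
        final long sumOfLengths; // sum of document lengths, assuming norms store lengths
        final int docCount;      // number of documents in the segment

        SegmentStats(long sumOfLengths, int docCount) {
            this.sumOfLengths = sumOfLengths;
            this.docCount = docCount;
        }
    }

    /** Sums lengths and counts over all segments before dividing. */
    static double averageLength(List<SegmentStats> segments) {
        long totalLength = 0;
        long totalDocs = 0;
        for (SegmentStats s : segments) {
            totalLength += s.sumOfLengths;
            totalDocs += s.docCount;
        }
        return totalDocs == 0 ? 0.0 : (double) totalLength / totalDocs;
    }
}
```

For two segments with (sumOfLengths, docCount) of (90, 10) and (60, 5), the per-segment averages are 9 and 12, but the index-wide average is 150 / 15 = 10. Scoring against a single segment's average would make scores depend on which segment a document happens to live in.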
In Similarity.computeWeight() (soon to be renamed computeStats) we are already positioned on the term, so term statistics should be computed there.
There are three types of boost:
- score + boost: I do not consider this a boost, but rather a sum of similarity scores, of which one happens to come from outside (e.g. PageRank)
- score * boost
- score = tf(boost * freq) * idf
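
The three variants above can be compared side by side in a small plain-Java sketch. BoostKinds and its method names are hypothetical; tf is assumed to be sqrt(freq) purely for illustration, and idf is passed in as a precomputed number.

```java
/** Hypothetical illustration of the three boost types; not Lucene code. */
final class BoostKinds {

    /** Placeholder term-frequency function (assumption: sqrt of the frequency). */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** score + boost: the boost acts as an extra, externally supplied score. */
    static double additive(double freq, double idf, double boost) {
        return tf(freq) * idf + boost;
    }

    /** score * boost: the boost scales the finished score. */
    static double multiplicative(double freq, double idf, double boost) {
        return tf(freq) * idf * boost;
    }

    /** score = tf(boost * freq) * idf: the boost scales the frequency before tf is applied. */
    static double insideTf(double freq, double idf, double boost) {
        return tf(boost * freq) * idf;
    }
}
```

With freq = 4, idf = 2 and boost = 4, the unboosted score is 4; the additive form gives 8, the multiplicative form 16, and the inside-tf form 8. With a sublinear tf, a boost applied inside tf is dampened compared to one applied to the final score.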
We prefer manual instantiation (for Similarities and parts thereof). Providers should be written manually.

Is it possible to design a scoring interface that is consistent across ranking frameworks?
How do contexts work?
NormConverter? NO
Common Normalization, IDF, etc. TOO
QueryWeight class
What to pass to score()?
LM default parameters?
Factory for DFR (low prio)
lnu.ltc, LM, DFR+