Notes and Q&A
Notes
- In Lucene, ranking is per-field.
What about deleted docs?
<rmuir> maxDoc() doesn't reflect deletes
<rmuir> docFreq() doesn't reflect deletes
<rmuir> the numDocs() reflects deletes
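A minimal Java sketch of why this matters, using plain numbers rather than Lucene calls. The class `DeleteAwareStats`, its method names, and the idf formula log(N / (n + 1)) + 1 are assumptions made for illustration only. The point is that pairing docFreq() (which ignores deletes) with numDocs() (which reflects them) mixes two different document populations.

```java
/** Hypothetical helper for illustration; not part of Lucene. */
final class DeleteAwareStats {
    /** Illustrative idf: log(docCount / (docFreq + 1)) + 1. The formula is an assumption. */
    static double idf(long docCount, long docFreq) {
        return Math.log((double) docCount / (docFreq + 1)) + 1.0;
    }

    /** True when the two statistics can describe the same population, i.e. docFreq <= docCount. */
    static boolean consistent(long docCount, long docFreq) {
        return docFreq <= docCount;
    }
}
```

For example, take an index with maxDoc = 10, numDocs = 6 (four deleted documents) and a term with docFreq = 8. Then `consistent(10, 8)` holds, but `consistent(6, 8)` does not: the term appears to occur in more documents than exist. This suggests pairing docFreq() with maxDoc() as long as neither reflects deletes.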
What about the other methods?
sumOfNorms can be used as a "sum of lengths", provided the norm encodes the document length itself (and not 1/sqrt(#tokens), as it does by default).
Lucene indexes in segments, but ranking needs statistics for the whole index. That is why MockBM25Similarity.avgDocumentLength() climbs up to the top of the segment tree via ReaderUtil.getTopLevelContext(context).
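The need for a whole-index view can be sketched in Java as follows. `SegmentStats` and `GlobalStats` are hypothetical classes with made-up field names, not Lucene API. They only illustrate that a global average document length has to be built from per-segment sums, assuming each segment can report a sum of lengths and a document count.

```java
import java.util.List;

/** Hypothetical per-segment statistics; the field names are assumptions, not Lucene API. */
final class SegmentStats {
    final long sumOfLengths; // sum of document lengths in this segment, in tokens
    final long docCount;     // number of documents in this segment

    SegmentStats(long sumOfLengths, long docCount) {
        this.sumOfLengths = sumOfLengths;
        this.docCount = docCount;
    }
}

final class GlobalStats {
    /** Average document length over all segments; 0 for an empty index. */
    static double avgDocumentLength(List<SegmentStats> segments) {
        long totalLength = 0;
        long totalDocs = 0;
        for (SegmentStats s : segments) {
            totalLength += s.sumOfLengths;
            totalDocs += s.docCount;
        }
        return totalDocs == 0 ? 0.0 : (double) totalLength / totalDocs;
    }
}
```

With one segment of 10 documents totalling 100 tokens and another of 30 documents totalling 900 tokens, the global average is 1000 / 40 = 25, whereas averaging the per-segment averages (10 and 30) would give 20. A score computed from a single segment's statistics would therefore not be comparable across segments.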
In Similarity.computeWeight() (soon to be computeStats) we have already seeked to the term, so term statistics should be computed there.
- There are three types of boost:
  - score + boost: I do not consider this a boost, but rather a sum of similarity scores, of which one happens to come from outside (e.g. PageRank)
  - score * boost
  - score = tf(boost * freq) * idf
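The three variants above can be compared side by side in a small Java sketch. `BoostStyles` is a hypothetical class, not Lucene API, and the square-root tf is an assumption chosen only so that the third variant visibly differs from the second.

```java
/** Hypothetical illustration of the three boost styles; not Lucene API. */
final class BoostStyles {
    /** Illustrative tf: square root of the frequency. The choice is an assumption. */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** score + boost: an external score (e.g. PageRank) is added to the similarity score. */
    static double additive(double freq, double idf, double boost) {
        return tf(freq) * idf + boost;
    }

    /** score * boost: the finished similarity score is scaled. */
    static double multiplicative(double freq, double idf, double boost) {
        return tf(freq) * idf * boost;
    }

    /** score = tf(boost * freq) * idf: the boost scales the frequency before tf is applied. */
    static double insideTf(double freq, double idf, double boost) {
        return tf(boost * freq) * idf;
    }
}
```

With freq = 4, idf = 1.5 and boost = 4, the three methods return 7, 12 and 6 respectively. Under a sublinear tf such as this one, a boost applied inside tf is dampened compared with a boost applied to the finished score.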
- We prefer manual instantiation (for Similarities and parts thereof). Providers should be written by hand.
Problems
- Language modeling would require custom aggregation of query terms
  - product instead of weighted sum (this could be solved by using log, but the query norm still messes it up)
  - decide which documents have a term, and which do not, because we have to weight them accordingly (p_t or 1 - p_t)
- two types of aggregation?
  - per field (definitely Similarity-specific)
  - whole query (should be Similarity-specific too, but might be OK if fixed)
- What about phrases? LATER... sum(DF)
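The language-modeling problem noted above (a product over query terms, with p_t for documents that contain the term and 1 - p_t for those that do not) can be sketched in Java. `ProductAggregation` is a hypothetical class, not Lucene API, and the probabilities are assumed to be given. The sketch only shows that the product can be rewritten as a sum of logs, and that terms absent from the document still contribute a factor.

```java
/** Hypothetical sketch of product-style aggregation; not Lucene API. */
final class ProductAggregation {
    /** Factor for one query term: p if the document contains the term, 1 - p otherwise. */
    static double termFactor(double p, boolean present) {
        return present ? p : 1.0 - p;
    }

    /** Direct product over all query terms. */
    static double product(double[] p, boolean[] present) {
        double result = 1.0;
        for (int i = 0; i < p.length; i++) {
            result *= termFactor(p[i], present[i]);
        }
        return result;
    }

    /** The same aggregation expressed as a sum of logs, which fits a sum-based scorer. */
    static double logSum(double[] p, boolean[] present) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += Math.log(termFactor(p[i], present[i]));
        }
        return sum;
    }
}
```

For p = {0.5, 0.2} with the first term present and the second absent, `product` returns 0.5 * 0.8 = 0.4 and `Math.exp(logSum(...))` gives the same value up to rounding. The absent term still changes the result, so a scorer that only sums contributions from matching terms would miss the 1 - p_t factor.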
Questions about Lucene
- Is it possible to design a scoring interface that is consistent across ranking frameworks?
- How do contexts work?
NormConverter? NO
- Common Normalization, IDF, etc. TOO
QueryWeight class
- What to pass to score()?
- LM default parameters?
- Factory for DFR (low prio)
- lnu.ltc, LM, DFR+