Return to main page

Notes and Q&A

Notes

  • In Lucene, ranking is per-field.
  • What about deleted docs?
    <rmuir> maxDoc() doesnt reflect deletes
    <rmuir> docFreq() doesnt reflect deletes
    <rmuir> the numDocs() reflects delete
    
    What about the other methods?
  • sumOfNorms can be used as a "sum of lengths", provided the norm reflects the document length (and not the default 1/sqrt(#tokens))
  • Lucene indexes in segments. For ranking we need statistics over the whole index; that is why MockBM25Similarity.avgDocumentLength() climbs to the top of the segment tree via ReaderUtil.getTopLevelContext(context).
  • In Similarity.computeWeight() (soon to be renamed computeStats) we are already seek'ed to the term, so statistics should be computed there.
  • There are three types of boost
    • score + boost: I do not consider this a boost, but rather a sum of similarity scores, one of which happens to come from outside (e.g. PageRank)
    • score * boost
    • score = tf(boost * freq) * idf
  • We prefer manual instantiation (for Similarities, parts thereof). Providers should be written manually.
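The sumOfNorms idea above can be sketched in plain Java (no Lucene API): assuming, hypothetically, that each norm stores the raw token count rather than the default 1/sqrt(#tokens) encoding, the average document length falls out directly. Class and method names here are illustrative.

```java
// Sketch: deriving an average document length from stored norms.
// Assumes (hypothetically) that the norm of each document stores the raw
// token count; with the default 1/sqrt(#tokens) encoding the length is not
// recoverable this way.
public class AvgDocLength {
    // Sum the per-document "lengths" read back from the norms; the running
    // total plays the role of sumOfNorms.
    static double averageLength(int[] normsAsLengths) {
        long sum = 0;
        for (int len : normsAsLengths) sum += len;
        return (double) sum / normsAsLengths.length;
    }

    public static void main(String[] args) {
        int[] lengths = {10, 20, 30};                // three docs, token counts as norms
        System.out.println(averageLength(lengths));  // 20.0
    }
}
```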
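The difference between the last two boost types can be illustrated with a BM25-style saturating tf. This is a toy sketch, not Lucene's implementation; K1 and all input values are illustrative.

```java
// Sketch contrasting two boost semantics: multiplying the final score by the
// boost, versus folding the boost into a saturating term-frequency function.
public class BoostSemantics {
    static final double K1 = 1.2;  // illustrative BM25-style parameter

    // BM25-style term-frequency saturation (length normalization omitted)
    static double tf(double freq) {
        return freq * (K1 + 1) / (freq + K1);
    }

    // score * boost: the boost scales the final score linearly
    static double scoreTimesBoost(double freq, double idf, double boost) {
        return tf(freq) * idf * boost;
    }

    // score = tf(boost * freq) * idf: the boost is folded into the saturating
    // tf, so its effect diminishes as freq grows
    static double boostInsideTf(double freq, double idf, double boost) {
        return tf(boost * freq) * idf;
    }

    public static void main(String[] args) {
        double idf = 2.0, boost = 3.0;
        System.out.println(scoreTimesBoost(10, idf, boost)); // grows linearly in boost
        System.out.println(boostInsideTf(10, idf, boost));   // saturates instead
    }
}
```

With boost = 1 the two formulas coincide; for larger boosts the second one is dampened by the tf saturation.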

Problems

  • Language modeling would require custom aggregation of query terms
    • product instead of weighted sum (this could be solved by taking logs, but the query norm still messes it up)
    • decide which documents contain a term and which do not, because we have to weight them accordingly (p_t or 1 - p_t)
    • two types of aggregation?
      • per field (definitely Similarity-specific)
      • whole query (should be Similarity-specific too, but might be OK if fixed)
  • What about phrases? LATER... sum(DF)
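The product-vs-sum problem above can be sketched with query likelihood under Jelinek-Mercer smoothing (the lambda value and the toy term distributions are illustrative): taking logs turns the per-term product into a sum, and the background model decides what a term absent from the document contributes.

```java
import java.util.Map;

// Sketch of the language-modeling aggregation from the bullets above:
// score(q, d) = sum_t log( lambda * p(t|d) + (1 - lambda) * p(t|C) )
public class QueryLikelihood {
    static final double LAMBDA = 0.5;  // illustrative smoothing parameter

    static double logScore(String[] query,
                           Map<String, Double> docModel,          // p(t|d)
                           Map<String, Double> collectionModel) { // p(t|C)
        double logSum = 0.0;
        for (String t : query) {
            double pDoc = docModel.getOrDefault(t, 0.0);   // 0 if the doc lacks the term
            double pColl = collectionModel.getOrDefault(t, 1e-9);
            // the background probability keeps the log finite for absent terms
            logSum += Math.log(LAMBDA * pDoc + (1 - LAMBDA) * pColl);
        }
        return logSum;
    }

    public static void main(String[] args) {
        Map<String, Double> doc = Map.of("lucene", 0.2, "ranking", 0.1);
        Map<String, Double> coll = Map.of("lucene", 0.01, "ranking", 0.02, "java", 0.05);
        // "java" is absent from the document: only the background model contributes
        System.out.println(logScore(new String[]{"lucene", "java"}, doc, coll));
    }
}
```

This also shows why per-term aggregation is Similarity-specific here: a fixed weighted sum over raw scores cannot reproduce the product without the log transform.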

Questions about Lucene

  • Is it possible to design a scoring interface that is consistent across ranking frameworks?
  • How do contexts work?
  • NormConverter? NO
  • Common Normalization, IDF, etc. TOO
  • QueryWeight class
  • What to pass to score()?
  • LM default parameters?
  • Factory for DFR (low prio)
  • lnu.ltc, LM, DFR+
