Differences between revisions 2 and 3
Revision 2 as of 2006-05-29 07:06:44
Size: 27788
Comment: reformat for fixed width email
Revision 3 as of 2009-09-20 21:47:43
Size: 27788
Editor: localhost
Comment: converted to 1.6 markup
No differences found!

The conversation that started it all:

Hi Doug,

I am interested in taking on #11 at 
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard and was wondering 
if you had some time to elaborate on your ideas.

I would also like to be able to store more information at higher levels, 
such as the document and the index.

If you would like, we can move this discussion to the dev list, but I 
thought we could have a few conversations first and then I could work up 
some interfaces and documentation over the next month or so to present 
to the other developers.

Thanks,
Grant 

Follow ups:

Grant Ingersoll wrote:
> I am interested in taking on #11 at 
> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard and was 
> wondering if you had some time to elaborate on your ideas.

Great!

I have not spent enough time thinking about it to have very elaborate 
ideas!  One group to collaborate with is the folks who've recently been 
porting Lucene to other languages, like Marvin Humphrey (Perl) and David 
Balmain (Ruby), cc'd.  I had originally imagined a java-centric approach 
for extensibility (e.g., storing class names in the index).  Now I think 
that would be a mistake.

We probably should not try to support both arbitrary extensibility and 
cross-language portability in a single file format.  Instead we should 
add a few flags to control what's stored.  We don't want too many, or 
things get messy.  So the art will be deciding which will make the 
biggest difference.  Per-position boosts is one extreme, and 
boolean-only is the other.  We should support those and existing 
options.  Anything more should be carefully justified.

> I would also like to be able to store more information at higher levels, 
> such as the document and the index.

That sounds reasonable.  Per-index properties would be very convenient.

> If you would like, we can move this discussion to the dev list, but I 
> thought we could have a few conversations first and then I could work up 
> some interfaces and documentation over the next month or so to present 
> to the other developers.

Or you could just start drafting something in public on the wiki, 
Lucene3 or somesuch.  That way others can follow along.  You might first 
focus on the features to be added, then on the file formats, then the 
new APIs and finally on back-compatibility.  The point is we want to be 
able to someday drop back-compatibility and have the new design not have 
many remnants of the old.  So first we should specify how we want things 
to look, then figure out how to get there back-compatibly.  Does that 
make sense?

Cheers,

Doug

Hello Doug, hello Grant, thanks for bringing me in...

On May 8, 2006, at 10:25 AM, Doug Cutting wrote:

> Grant Ingersoll wrote:
>> I am interested in taking on #11 at http://wiki.apache.org/jakarta- 
>> lucene/Lucene2Whiteboard and was wondering if you had some time to  
>> elaborate on your ideas.
>
> We probably should not try to support both arbitrary extensibility  
> and cross-language portability in a single file format.  Instead we  
> should add a few flags to control what's stored.  We don't want too  
> many, or things get messy.  So the art will be deciding which will  
> make the biggest difference.  Per-position boosts is one extreme,  
> and boolean-only is the other.  We should support those and  
> existing options.  Anything more should be carefully justified.

I agree with all of the above.  In addition...

I've been playing around with a rich position scheme using  
KinoSearch, and got it to work mechanically without changing any  
search logic, but I decided I wasn't going to get very far in the  
absence of a sophisticated search-time benchmarker.  Writing one of  
those is job one.  I'll probably get around to that eventually, but  
I'm not sure just when.

It would be interesting to see how much better a boolean scorer which  
implements the proximity algorithm described in the Google paper  
performs in terms of precision.  This could also be done using the  
present index format.  Has it been tried?

I originally thought that a fixed-width/variable-width hybrid for the  
rich positions scheme would yield better performance while preserving  
full proximity data rather than truncating as the Googs propose.   
However, thinking about the amount of calculation that will occur per  
position in a boolean scorer, my gut feeling is that the  
decompression overhead of a VInt wouldn't be a major factor.  Again,  
benchmark to confirm.

As discussed on java-dev, integrating freqs, positions, and boosts/ 
norms into one contiguous section of one file ought to improve the  
ability of a Searcher to launch from rest and return a query.  That's  
not terribly important in a persistent Java server setup, but it  
would be handy for say, a help file system, or for quick 'n' dirty  
Perl CGI.

The number of bits per position dedicated to weighting (4-8 out of  
16) in the Google paper is maddeningly small.  However the number of  
bits per document per term for a common term is embarrassingly large  
compared to the 8 Lucene currently has available.  It strikes me that  
it might be helpful to delta encode not just positions, but boosts as  
well.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Marvin Humphrey wrote:
> The number of bits per position dedicated to weighting (4-8 out of  16) 
> in the Google paper is maddeningly small.  However the number of  bits 
> per document per term for a common term is embarrassingly large  
> compared to the 8 Lucene currently has available.  It strikes me that  
> it might be helpful to delta encode not just positions, but boosts as  
> well.

Not sure what you mean here, since boosts are not ordered.

Personally I think the eight-bit floats used by Lucene give plenty of 
precision for this class of computation.  Relevant documents should be 
easily distinguished from non-relevant documents, and fine-differences 
in ranking between relevant documents don't matter.  The only time folks 
have complained about the precision of eight-bit floats in Lucene is 
when they've attempted to overload them with other semantics besides 
relevance (e.g., dates), which is inappropriate.

So I would opt for delta-encoded positions with a one-byte boost.

I think combining frequencies and positions in a single file might be 
useful.  If folks don't want to pay the penalty of pawing through 
positions then they should disable position indexing for that field.

Another thing folks have frequently asked for is per-document boosts 
(i.e., boosts instead of frequencies, and no positions).

Some useful posting options are thus:

a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+

These suggest the following booleans per field:

1. freq
2. document boost
3. position (requires freq)
4. position boost (requires position)

Doug

On May 9, 2006, at 9:07 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> The number of bits per position dedicated to weighting (4-8 out  
>> of  16) in the Google paper is maddeningly small.  However the  
>> number of  bits per document per term for a common term is  
>> embarrassingly large  compared to the 8 Lucene currently has  
>> available.  It strikes me that  it might be helpful to delta  
>> encode not just positions, but boosts as  well.
>
> Not sure what you mean here, since boosts are not ordered.

Within a termdoc[1], each position will probably have the same boost  
by default.  Since Lucene doesn't currently implement boost-per- 
position, that's effectively what we have now.

It seemed wasteful to allocate one byte to each position for boost  
information when all of them are the same.  A scheme similar to what  
is currently used in the .frq file would potentially save us some  
space: left shift the position delta by 1 and use the low bit to  
indicate whether or not there has been a change in the boost since  
the last position.

I originally thought it might make sense to delta encode the boost if  
there was a change.  But I also wasn't necessarily assuming that the  
scoring multiplier associated with each position would be an 8-bit  
float.  I haven't come up with a better idea yet, though.  :)

> Personally I think the eight-bit floats used by Lucene give plenty  
> of precision for this class of computation.

I agree, I like the 8 bit floats.  But going down even to 7 bits  
radically reduces the range that can be covered by that float.  7  
might still be enough, but 4 I think is too few.  1/16 of the  
quantization levels, but fully half of the space requirement --  
feh.   Reminds me of the difference between 16-bit audio and 8-bit  
audio.  Ergo, my objection to that aspect of the Google98 scheme.

What does it mean to have the 8-bit float available per position?

Right now, TermScorer gets the norm/boost information by grabbing a  
byte from the a norms array, then decoding it with a lookup in the  
normDecoder.  Under boost-per-position, we'd have to ask for the  
boost by calling termPositions.getBoost().  I think responsibility  
would fall to a SegmentTermPositions object to iterate through the  
positions and build up that number.

Say we opt for simple summing.  That means that at index time, we  
calculate a total boost for the termdoc, and then divvy it up per  
position in slices proportional to each position's existing boost  
(which would be 1 by default).  I'm afraid that means that our search- 
time sum would have a compounded quantization error and we'd be worse  
off than we are now.

> I think combining frequencies and positions in a single file might  
> be useful.  If folks don't want to pay the penalty of pawing  
> through positions then they should disable position indexing for  
> that field.

While I suspect that merging frequencies and positions is justified  
if only because it cuts down on disk seeks, I'm not sure that  
disabling position indexing is a realistic option for one very common  
use case: a bodytext field that you need to be able to perform phrase  
queries against.

> Some useful posting options are thus:
>
> a. <doc>+
> b. <doc, boost>+
> c. <doc, freq, <position>+ >+
> d. <doc, freq, <position, boost>+ >+
>
> These suggest the following booleans per field:
>
> 1. freq
> 2. document boost
> 3. position (requires freq)
> 4. position boost (requires position)

Delicious!

IMO this discussion should be taking place on java-dev.  I grant  
permission to forward my contributions in full if we all agree and  
somebody wants to post the thread there in one chunk.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

[1] I've taken to using "termdoc" to mean "one term indexing one  
document" and "posting" to mean "one term indexing one document one  
time".

Marvin Humphrey wrote:
> Within a termdoc[1], each position will probably have the same boost  by 
> default.  Since Lucene doesn't currently implement boost-per- position, 
> that's effectively what we have now.
> 
> It seemed wasteful to allocate one byte to each position for boost  
> information when all of them are the same.  A scheme similar to what  is 
> currently used in the .frq file would potentially save us some  space: 
> left shift the position delta by 1 and use the low bit to  indicate 
> whether or not there has been a change in the boost since  the last 
> position.

Right.  That makes sense.  But when boosts change they could go up or 
down, and by large amounts, so it's not clear that we should write the 
difference rather than simply the new value.

> What does it mean to have the 8-bit float available per position?
> 
> Right now, TermScorer gets the norm/boost information by grabbing a  
> byte from the a norms array, then decoding it with a lookup in the  
> normDecoder.  Under boost-per-position, we'd have to ask for the  boost 
> by calling termPositions.getBoost().  I think responsibility  would fall 
> to a SegmentTermPositions object to iterate through the  positions and 
> build up that number.
> 
> Say we opt for simple summing.  That means that at index time, we  
> calculate a total boost for the termdoc, and then divvy it up per  
> position in slices proportional to each position's existing boost  
> (which would be 1 by default).  I'm afraid that means that our search- 
> time sum would have a compounded quantization error and we'd be worse  
> off than we are now.

I'm confused.  If we're sticking the norm into the position boost, then 
we just multiply it into every position boost before they're summed, 
rather than multiplying them into the sum as we do now, right?  So we 
don't have to divide up the norm.  Am I missing something?

My thinking is that one could specify both position boosts and norms, in 
which case both would be multiplied into the score.  Or one could omit 
norms and instead store their values in position boosts, or vice versa.

For a phrase, we might boost it by the average of it's matching term's 
boosts.

>> I think combining frequencies and positions in a single file might  be 
>> useful.  If folks don't want to pay the penalty of pawing  through 
>> positions then they should disable position indexing for  that field.
> 
> While I suspect that merging frequencies and positions is justified  if 
> only because it cuts down on disk seeks, I'm not sure that disabling 
> position indexing is a realistic option for one very common use case: a 
> bodytext field that you need to be able to perform phrase queries against.

Right.  Disabling positions should be an option on a field-by-field basis.

>> Some useful posting options are thus:
>>
>> a. <doc>+
>> b. <doc, boost>+
>> c. <doc, freq, <position>+ >+
>> d. <doc, freq, <position, boost>+ >+
>>
>> These suggest the following booleans per field:
>>
>> 1. freq
>> 2. document boost
>> 3. position (requires freq)
>> 4. position boost (requires position)
> 
> 
> Delicious!
> 
> IMO this discussion should be taking place on java-dev.  I grant  
> permission to forward my contributions in full if we all agree and  
> somebody wants to post the thread there in one chunk.

If I have a chance, I'll do that.  Better yet, we should post a summary 
to the wiki as a design proposal.

Doug

On May 9, 2006, at 12:42 PM, Doug Cutting wrote:

> If we're sticking the norm into the position boost, then we just  
> multiply it into every position boost before they're summed, rather  
> than multiplying them into the sum as we do now, right?  So we  
> don't have to divide up the norm.  Am I missing something?

No, it was my mistake.  I've worked through the problem and found my  
error.  I'm down with 8-bit floats for the position boosts.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

At the Index (Directory ???) level, I think we could make it such that a 
Directory/Index can have Fields (perhaps just Keyword based, not sure 
what it would mean to have tokenized text stored on the index) 
associated with it as well.  Thus we would be able to use existing 
mechanisms for writing Index level metadata.  We would just need a new 
place in the file format for storing these fields.  This may call for 
some marker interfaces (in Java land anyway) such that the Index field 
addition mechanism only allows certain kind of Fields if that makes sense. 

Then the user could do similar things to an Index that they do to a 
Document (i.e. get value for a field, etc.).  I currently simulate this 
in our IR Tools implementation by writing out an XML file that goes in 
the same directory as the index files and stores metadata about the index.

Doug Cutting wrote:
> Marvin Humphrey wrote:
>> The number of bits per position dedicated to weighting (4-8 out of  
>> 16) in the Google paper is maddeningly small.  However the number of  
>> bits per document per term for a common term is embarrassingly large  
>> compared to the 8 Lucene currently has available.  It strikes me 
>> that  it might be helpful to delta encode not just positions, but 
>> boosts as  well.
>
> Not sure what you mean here, since boosts are not ordered.
>
> Personally I think the eight-bit floats used by Lucene give plenty of 
> precision for this class of computation.  Relevant documents should be 
> easily distinguished from non-relevant documents, and fine-differences 
> in ranking between relevant documents don't matter.  The only time 
> folks have complained about the precision of eight-bit floats in 
> Lucene is when they've attempted to overload them with other semantics 
> besides relevance (e.g., dates), which is inappropriate.
>
> So I would opt for delta-encoded positions with a one-byte boost.
>
> I think combining frequencies and positions in a single file might be 
> useful.  If folks don't want to pay the penalty of pawing through 
> positions then they should disable position indexing for that field.
>
> Another thing folks have frequently asked for is per-document boosts 
> (i.e., boosts instead of frequencies, and no positions).
>
> Some useful posting options are thus:
>
> a. <doc>+
> b. <doc, boost>+
> c. <doc, freq, <position>+ >+
> d. <doc, freq, <position, boost>+ >+
>
> These suggest the following booleans per field:
>
> 1. freq
> 2. document boost
> 3. position (requires freq)
> 4. position boost (requires position)
>
> Doug
>

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 

I would also be interested in optionally storing more information about 
the Term in the term dictionary (such as POS or other metadata).   But 
this also might be more added to the Term Vector capabilities instead 
and thus wouldn't cause overhead during indexing/search


Doug Cutting wrote:
> Marvin Humphrey wrote:
>> The number of bits per position dedicated to weighting (4-8 out of  
>> 16) in the Google paper is maddeningly small.  However the number of  
>> bits per document per term for a common term is embarrassingly large  
>> compared to the 8 Lucene currently has available.  It strikes me 
>> that  it might be helpful to delta encode not just positions, but 
>> boosts as  well.
>
> Not sure what you mean here, since boosts are not ordered.
>
> Personally I think the eight-bit floats used by Lucene give plenty of 
> precision for this class of computation.  Relevant documents should be 
> easily distinguished from non-relevant documents, and fine-differences 
> in ranking between relevant documents don't matter.  The only time 
> folks have complained about the precision of eight-bit floats in 
> Lucene is when they've attempted to overload them with other semantics 
> besides relevance (e.g., dates), which is inappropriate.
>
> So I would opt for delta-encoded positions with a one-byte boost.
>
> I think combining frequencies and positions in a single file might be 
> useful.  If folks don't want to pay the penalty of pawing through 
> positions then they should disable position indexing for that field.
>
> Another thing folks have frequently asked for is per-document boosts 
> (i.e., boosts instead of frequencies, and no positions).
>
> Some useful posting options are thus:
>
> a. <doc>+
> b. <doc, boost>+
> c. <doc, freq, <position>+ >+
> d. <doc, freq, <position, boost>+ >+
>
> These suggest the following booleans per field:
>
> 1. freq
> 2. document boost
> 3. position (requires freq)
> 4. position boost (requires position)
>
> Doug
>

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

I think storing a String->String map per index makes sense, but I'm not 
sure it makes sense for this to be implemented as a set of Fields, since 
many field attributes do not make sense here, such as Indexed, 
Tokenized, Stored, Vectored, etc.  The Binary & Compressed attributes 
could make sense however.  So we could try to refactor Field, but much 
of it would not be shared code anyway, since we would probably not use 
the same format for field names.  So I think I would opt for a similar 
yet distinct implementation for index attributes rather than overloading 
Field.

Doug

Grant Ingersoll wrote:
> At the Index (Directory ???) level, I think we could make it such that a 
> Directory/Index can have Fields (perhaps just Keyword based, not sure 
> what it would mean to have tokenized text stored on the index) 
> associated with it as well.  Thus we would be able to use existing 
> mechanisms for writing Index level metadata.  We would just need a new 
> place in the file format for storing these fields.  This may call for 
> some marker interfaces (in Java land anyway) such that the Index field 
> addition mechanism only allows certain kind of Fields if that makes sense.
> Then the user could do similar things to an Index that they do to a 
> Document (i.e. get value for a field, etc.).  I currently simulate this 
> in our IR Tools implementation by writing out an XML file that goes in 
> the same directory as the index files and stores metadata about the index.
> 
> Doug Cutting wrote:
> 
>> Marvin Humphrey wrote:
>>
>>> The number of bits per position dedicated to weighting (4-8 out of  
>>> 16) in the Google paper is maddeningly small.  However the number of  
>>> bits per document per term for a common term is embarrassingly large  
>>> compared to the 8 Lucene currently has available.  It strikes me 
>>> that  it might be helpful to delta encode not just positions, but 
>>> boosts as  well.
>>
>>
>> Not sure what you mean here, since boosts are not ordered.
>>
>> Personally I think the eight-bit floats used by Lucene give plenty of 
>> precision for this class of computation.  Relevant documents should be 
>> easily distinguished from non-relevant documents, and fine-differences 
>> in ranking between relevant documents don't matter.  The only time 
>> folks have complained about the precision of eight-bit floats in 
>> Lucene is when they've attempted to overload them with other semantics 
>> besides relevance (e.g., dates), which is inappropriate.
>>
>> So I would opt for delta-encoded positions with a one-byte boost.
>>
>> I think combining frequencies and positions in a single file might be 
>> useful.  If folks don't want to pay the penalty of pawing through 
>> positions then they should disable position indexing for that field.
>>
>> Another thing folks have frequently asked for is per-document boosts 
>> (i.e., boosts instead of frequencies, and no positions).
>>
>> Some useful posting options are thus:
>>
>> a. <doc>+
>> b. <doc, boost>+
>> c. <doc, freq, <position>+ >+
>> d. <doc, freq, <position, boost>+ >+
>>
>> These suggest the following booleans per field:
>>
>> 1. freq
>> 2. document boost
>> 3. position (requires freq)
>> 4. position boost (requires position)
>>
>> Doug
>>
> 

On May 10, 2006, at 7:31 AM, Grant Ingersoll wrote:

> I would also be interested in optionally storing more information  
> about the Term in the term dictionary (such as POS or other  
> metadata).   But this also might be more added to the Term Vector  
> capabilities instead and thus wouldn't cause overhead during  
> indexing/search

Scanning through the term dictionary is, if I am not mistaken, a  
pretty resource-intensive task.  However, the thing about Term  
Vectors is they're document-centric.  What are some of the uses you  
envision?

What do you mean by POS?  The Wikipedia disambiguation page yields  
some entertaining possibilities: <http://en.wikipedia.org/wiki/POS>.   
I'm guessing part of speech?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

You can, but this then requires more sophisticated querying techniques, 
right?  Such as PrefixQuery or WildcardQuery when you don't care about 
those things.

I am sure it would have perf. implications.  It would have to be 
implemented such that when it is turned off there is negligible perf. 
impact.

Doug Cutting wrote:
> Grant Ingersoll wrote:
>> I would also be interested in optionally storing more information 
>> about the Term in the term dictionary (such as POS or other metadata).
>
> Can't you simply pack this onto the end of the term string?  
> Adding per-term attribute sets could have performance implications.
>
> Doug

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

On May 10, 2006, at 9:54 AM, Grant Ingersoll wrote:

> Yep, Part of Speech.  CNLP is a pseudo academic research lab  
> (meaning we do commercial work too), we often have rich information  
> about terms, phrases, etc. from our NLP analysis, such as  
> categories, co-references, etc.  One of our biggest usages of  
> Lucene is in our QA system, where these kind of things come in  
> handy.  I agree, though, that this is probably not worth the  
> performance hit unless there is a way to make it negligible when it  
> is turned off.  In our QA system, the performance hit I think I  
> would see in Lucene is most likely small potatoes compared to  
> having to lookup and process some of these things later.  I can  
> envision some nice SpanQuery uses that let me identify candidate  
> answers very quickly as compared to having to post-process them  
> from the reconstructed document.  At any rate, there is probably a  
> very small audience for this, so it may not be worth it.

This seems like the kind of app where a Lucene/RDBMS hybrid could  
serve.  Adding arbitrary metadata is an SQL-like feature.  I imagine  
you've explored that possibility, though.  :)

Maybe you could do some hacking with a synonym filter that stored  
part of speech as a Term at the same position.  Is that what you're  
doing with the TokenFilter?

> Another one of the things we need to from time to time is calculate  
> corpus statistics, i.e. the total number of occurrences for a term  
> (usually for all terms).  This involves looping over all the terms  
> and the term docs and summing and storing (perhaps I am missing  
> something in the API?).

No, you're not missing anything.  You have to use a TermDocs object  
and sum termDocs.freq() over all docs for each term.  The Term  
Dictionary only gives you docFreq.

How often does the index get updated?

> This can be done using a TokenFilter, but then we have to manage  
> the memory, data structures, etc. whereas, Lucene could easily  
> store this count along with the term.  This kind of thing could  
> also be stored at the index level, and is part of what I imagined  
> for that functionality.

It's also easy to export.  But exporting's a problem if you require  
frequent updates.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

ConversationsBetweenDougMarvinAndGrant (last edited 2009-09-20 21:47:43 by localhost)