boosting fields

Karl Wettin-3
I don't like how fields are configured.

     Document doc = new Document();
     Field f;
     f = new Field("foo", "bar tzar", Field.Store.NO,
                   Field.Index.TOKENIZED, Field.TermVector.YES);
     f.setBoost(1.5f);
     doc.add(f);
     f = new Field("foo", "blah yada", Field.Store.NO,
                   Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS);
     f.setBoost(2f);
     doc.add(f);

This could lead me to believe I can use a different boost for each of
several fields with the same name within one document. I guess it is
nice that I can set things up so that different subsets of the terms
are stored in a specific way in the term vector, even though I have
never had to use it. But the boosts are stored by field name, right?
There are no constraints in the code. I would at least expect a
warning when indexing the above document.

How about refactoring fields to something like:

[Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>---- {0..*} ->
[FieldValue +store +index +termVector]

instead of as now:

[Document](fieldName)<#>---- {0..1} ->[Field +boost +store +index  
+termVector]

I would not mind if it worked the way the design implies, but it  
doesn't. Right?
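
A minimal sketch of the model proposed above, for illustration only. The class names NamedField, FieldValue and SketchDocument are hypothetical and are not Lucene classes. It assumes the boost is held once per field name while the store, index and term vector settings are held on each value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classes; none of these exist in Lucene.
final class FieldValue {
    final String text;
    final boolean stored;
    final boolean tokenized;
    final boolean termVector;

    FieldValue(String text, boolean stored, boolean tokenized, boolean termVector) {
        this.text = text;
        this.stored = stored;
        this.tokenized = tokenized;
        this.termVector = termVector;
    }
}

final class NamedField {
    final String name;
    float boost = 1.0f;                                 // exactly one boost per field name
    final List<FieldValue> values = new ArrayList<>();  // 0..* values, each with its own settings

    NamedField(String name) {
        this.name = name;
    }
}

final class SketchDocument {
    private final Map<String, NamedField> fields = new LinkedHashMap<>();

    // Returns the single NamedField for this name, creating it on first use.
    NamedField field(String name) {
        return fields.computeIfAbsent(name, NamedField::new);
    }
}
```

With this shape, doc.field("foo").boost can only be set in one place, so two values of "foo" cannot appear to carry different boosts.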

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: boosting fields

Karl Wettin-3

On 25 Apr 2006, at 18:56, karl wettin wrote:

> How about refactoring fields to something like:
>
> [Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>---- {0..*} ->
> [FieldValue +store +index +termVector]
>
> instead of as now:
>
> [Document](fieldName)<#>---- {0..1} ->[Field +boost +store +index  
> +termVector]

Oops.

instead of as now:

[Document](fieldName)<#>---- {0..*} ->[Field +boost +store +index  
+termVector]

Re: boosting fields

Doug Cutting
In reply to this post by Karl Wettin-3
karl wettin wrote:
> This could lead me to believe I can use different boost for fields  with
> the same name within one document.

You can.  The values are multiplied to produce the final boost value for
the field.  This is described in:

http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#setBoost(float)

But you're right, that's not entirely intuitive.
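
A small sketch of the multiplication described above, written against a hypothetical stand-in class (BoostDemo) and not against Lucene's Document. It only illustrates that two same-named fields with boosts 1.5 and 2 end up sharing one effective boost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a document; this is not Lucene's Document class.
final class BoostDemo {
    // Product of the boosts of all field instances that share a name.
    private final Map<String, Float> boostByName = new LinkedHashMap<>();

    void add(String fieldName, float boost) {
        // Each additional instance multiplies into the per-name boost.
        boostByName.merge(fieldName, boost, (a, b) -> a * b);
    }

    float effectiveBoost(String fieldName) {
        return boostByName.getOrDefault(fieldName, 1.0f);
    }

    public static void main(String[] args) {
        BoostDemo doc = new BoostDemo();
        doc.add("foo", 1.5f);
        doc.add("foo", 2f);
        System.out.println(doc.effectiveBoost("foo")); // 1.5 * 2 = 3.0
    }
}
```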

> How about refactoring fields to something like:
>
> [Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>---- {0..*} ->
> [FieldValue +store +index +termVector]

That would be a big, incompatible change to one of Lucene's primary
APIs, no?  Long-term, an API which supports per token boosting will
probably prove useful, as a part of #11 on
http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard.  When this
happens it would probably be worth considering making the change you
suggest, but I'm not sure it would be worth it before that.

Doug


Re: boosting fields

Karl Wettin-3

On 25 Apr 2006, at 19:34, Doug Cutting wrote:

> karl wettin wrote:
>> This could lead me to believe I can use different boost for  
>> fields  with the same name within one document.
>
> You can.  The values are multiplied to produce the final boost  
> value for the field.  This is described in:
>
> http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html#setBoost(float)

That's not really what I was trying to describe, though. In the end
it is the same boost for all values of the field. I would personally
prefer to set it explicitly, once per unique field name. I have a hard
time figuring out why I would want to add multiple boosts and then
normalise them. Is there some feature I missed?

>> How about refactoring fields to something like:
>> [Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>---- {0..*}  
>> -> [FieldValue +store +index +termVector]
>
> That would be a big, incompatible change to one of Lucene's primary  
> APIs, no?

Not if I have got it right in my head. Then it's really just a matter
of handling deprecation. The field methods in Document could stay the
same.

> Long-term, an API which supports per token boosting will probably
> prove useful, as a part of #11 on
> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard.  When this
> happens it would probably be worth considering making the change you
> suggest, but I'm not sure it would be worth it before that.

I've wanted that feature a few times. Let me know if there is  
something I can do to help when the time is right.

Re: boosting fields

Doug Cutting
karl wettin wrote:
>> karl wettin wrote:
>>
>>> This could lead me to believe I can use different boost for  fields  
>>> with the same name within one document.
>>
>> You can.  The values are multiplied to produce the final boost  value
>> for the field.
>
> It's not really the same thing as I tried to describe though.

No, it's not, you're right.

>>> How about refactoring fields to something like:
>>> [Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>---- {0..*}  
>>> -> [FieldValue +store +index +termVector]
>>
>>
>> That would be a big, incompatible change to one of Lucene's primary  
>> APIs, no?
>
> Not if I got it right in my head. Then it's really just a matter of
> handling deprecation. The field-methods in Document could be the same.

If you think you have a simple, back-compatible way to do this, please
submit a patch.  Perhaps it is simpler than I imagined.

>> Long-term, an API which supports per token boosting will probably
>> prove useful, as a part of #11 on
>> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard.
>
> I've wanted that feature a few times. Let me know if there is  something
> I can do to help when the time is right.

The time will be right as soon as someone decides they want to implement
this!  Ideally every part of the index would be pluggable, but the most
important is postings, so probably we should start there.

My idea is that the logic of DocumentWriter.invertDocument() remain much
the same, and that DocumentWriter.addPosition() is replaced with a
method on a pluggable class.  So invertDocument() would keep a
FieldIndexer for each field and call a method like addPosition() for
each token found.  (We might add a boost field to Token that's passed
into this method.)  Then, at the end, invertDocument() would flush all
of the FieldIndexers.  SegmentMerger would need to be changed
similarly.  Implementing FieldIndexers that can sensibly share output
files may be tricky.  We should implement FieldIndexers that are
back-compatible with the existing index format, and also probably a
no-positions version, a no-freqs version and a weight-per-position
version.  TermFreqs and TermPositions should be replaced with a generic
Postings API.  Applications can then downcast the Postings instance
based on the FieldInfo.
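
The flow described above could be sketched roughly as follows. Every name here (FieldIndexer, WeightPerPositionIndexer, InvertSketch) is hypothetical, the "postings" are plain strings in place of a real index format, and the boost passed per token is fixed at 1.0 for simplicity. It is only meant to show one indexer per field, one addPosition() call per token, and a flush at the end.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in point; this is not an existing Lucene interface.
interface FieldIndexer {
    void addPosition(String term, int position, float boost);
    void flush(List<String> out);
}

// One possible plug-in: records a weight for every position.
final class WeightPerPositionIndexer implements FieldIndexer {
    private final List<String> postings = new ArrayList<>();

    public void addPosition(String term, int position, float boost) {
        postings.add(term + "@" + position + "^" + boost);
    }

    public void flush(List<String> out) {
        out.addAll(postings);
        postings.clear();
    }
}

final class InvertSketch {
    // One FieldIndexer per field, one call per token, then flush everything.
    static List<String> invertDocument(Map<String, String[]> tokensByField) {
        Map<String, FieldIndexer> indexers = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> e : tokensByField.entrySet()) {
            FieldIndexer indexer =
                indexers.computeIfAbsent(e.getKey(), k -> new WeightPerPositionIndexer());
            String[] tokens = e.getValue();
            for (int pos = 0; pos < tokens.length; pos++) {
                indexer.addPosition(tokens[pos], pos, 1.0f);
            }
        }
        List<String> out = new ArrayList<>();
        for (FieldIndexer indexer : indexers.values()) {
            indexer.flush(out);
        }
        return out;
    }
}
```

A no-positions or no-freqs variant would be another FieldIndexer implementation that simply ignores the arguments it does not store.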

Doug

Re: boosting fields

Karl Wettin-3

On 26 Apr 2006, at 19:18, Doug Cutting wrote:

> karl wettin wrote:
>>>> How about refactoring fields to something like:
>>>> [Document](fieldName)<#>---- {0..1} ->[Field +boost]<#>----  
>>>> {0..*}  -> [FieldValue +store +index +termVector]
>
> If you think you have a simple, back-compatible way to do this,  
> please submit a patch.  Perhaps it is simpler than I imagined.
>
>>> Long-term, an API which supports per token boosting will
>>> probably prove useful, as a part of #11 on
>>> http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard.
>> I've wanted that feature a few times. Let me know if there is  
>> something I can do to help when the time is right.
>
> The time will be right as soon as someone decides they want to  
> implement this!  Ideally every part of the index would be  
> pluggable, but the most important is postings, so probably we  
> should start there.
>
> My idea is that the logic of DocumentWriter

I would prefer to leave persistence and deprecation out of the
discussion until the rest is solved, as I spend all my spare brain
time on the InstanciatedIndex thingy.

> and also probably a no-positions version, a no-freqs version and a  
> weight-per-position version.  TermFreqs and TermPositions should be  
> replaced with a generic Postings API.  Applications can then  
> downcast the Postings instance based on the FieldInfo.

This is much more interesting from my point of view. Let's start here.

I might be wrong and I really don't know why it is so bad, but I
think casting based on FieldInfo would break the Liskov
substitution principle in a big way.

My own immediate thought is to compromise by allowing boost per term  
in document. Simply remove the norms-methods from the IndexReader and  
add a new one to the TermEnum and fall back on the field boost. How  
would the value be picked up by the scorer?

Boost per position, etc. sounds very expensive.

Re: boosting fields

Doug Cutting
karl wettin wrote:
> My own immediate thought is to compromise by allowing boost per term  in
> document. Simply remove the norms-methods from the IndexReader and  add
> a new one to the TermEnum and fall back on the field boost. How  would
> the value be picked up by the scorer?
>
> Boost per position, etc. sounds very expensive.

Indeed.  It will probably nearly double the size of indexes and also
increase search time.  But it is also very powerful.  Consider the
posting representation Google describes on page 9 of
http://dbpubs.stanford.edu/pub/1998-8.  The font-size stored there is in
effect a weight for each position.

The point is not that every index store this, but that it be possible
for some indexes to store this, or even more information per position,
by extending a public API.

Doug

Rich positions (was "boosting fields")

Marvin Humphrey

On Apr 27, 2006, at 9:41 AM, Doug Cutting wrote:

> karl wettin wrote:
>> My own immediate thought is to compromise by allowing boost per  
>> term  in document. Simply remove the norms-methods from the  
>> IndexReader and  add a new one to the TermEnum and fall back on  
>> the field boost. How  would the value be picked up by the scorer?
>> Boost per position, etc. sounds very expensive.
>
> Indeed.  It will probably nearly double the size of indexes and  
> also increase search time.

I have been considering making a similar change to the KinoSearch  
file format.  Not having to cache norms radically cuts down on the  
time required to launch a fresh Searcher, especially if there aren't  
any deleted docs.  That's a win if you're launching a search app from  
scratch, like if you're running a web search under CGI rather than  
mod_perl.  It's also a win for refreshing a Searcher against a  
frequently updated index.

What I was considering was interleaving the document's score-
multiplier norm byte between the VInts in the .frq file.  That would  
mean more disk i/o for processing terms when the term takes up more  
than a block on the file system, but at least the info would be  
contiguous.

I hadn't considered interleaving the score-multiplier into .prx, but  
that opens many possibilities.  Boost positions that appear near the  
top of the doc.  Boost positions if they occur within certain HTML  
tags.  Good stuff!

Moving away from cached norms was the second of three major changes  
to the file format on my agenda, and the one I was all but certain I  
wouldn't be able to sell to the Lucene community.  The first was  
using bytecounts at the head of Strings.

The third was storing start offsets and end offsets in the ProxFile.  
It rankles that much of the information from tis/frq/prx gets  
duplicated in the term vector files, but highlighting is most  
efficient when you know the offsets, and the primary index stops  
short of storing that information.  Currently, we have this:

     ProxFile (.prx) -->  <TermPositions>TermCount

How about this?

     ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

To get highlighting info now, you retrieve a document's term vector  
information and then extract the offsets information for the precise  
term.  This format reverses the order: first you find the term, then  
you extract the offsets info for a particular doc.
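
To illustrate why offsets next to positions help highlighting, here is a rough sketch. OffsetPosting and HighlightSketch are hypothetical names, the postings for one term in one doc are assumed to be already decoded, sorted by start offset and non-overlapping, and the bracket markers are arbitrary.

```java
import java.util.List;

// Hypothetical decoded posting: a position plus its character offsets.
final class OffsetPosting {
    final int position;
    final int startOffset;
    final int endOffset;

    OffsetPosting(int position, int startOffset, int endOffset) {
        this.position = position;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class HighlightSketch {
    // Wraps each hit in brackets using only the stored offsets;
    // no re-analysis of the text and no term vector lookup is needed.
    static String highlight(String text, List<OffsetPosting> hits) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        for (OffsetPosting hit : hits) {
            sb.append(text, cursor, hit.startOffset)
              .append('[')
              .append(text, hit.startOffset, hit.endOffset)
              .append(']');
            cursor = hit.endOffset;
        }
        return sb.append(text, cursor, text.length()).toString();
    }
}
```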

I haven't implemented this change yet, so I'm not sure how it works  
out.  The current version of KinoSearch stores term vectors in  
the .fdt file, which is a win for locality of reference.  It sure  
would be nice to eliminate all that duplicated data, though.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Rich positions (was "boosting fields")

Doug Cutting
Marvin Humphrey wrote:

> Moving away from cached norms was the second of three major changes  to
> the file format on my agenda, and the one I was all but certain I  
> wouldn't be able to sell to the Lucene community.  The first was  using
> bytecounts at the head of Strings.
>
> The third was storing start offsets and end offsets in the ProxFile.  
> It rankles that much of the information from tis/frq/prx gets  
> duplicated in the term vector files, but highlighting is most  efficient
> when you know the offsets, and the primary index stops  short of storing
> that information.  Currently, we have this:
>
>     ProxFile (.prx) -->  <TermPositions>TermCount
>
> How about this?
>
>     ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

This would at least double the size of the .prx file, the largest file
in Lucene's index.  Yes, it's useful, but not all folks will use it.  So
not all folks should have to pay for it.  One way is to try to make it
arbitrarily extensible, but to some degree, that's going to end up being
language-specific.

So perhaps instead we should simply allocate more bits in the FieldInfo.
  We could allocate bits for WEIGHT_PER_POSITION, OFFSETS_IN_PRX,
NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc.  We can increase the number of
bits there by turning this into a VInt, which would be back-compatible, no?
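
A sketch of the idea, with loud caveats: the bit values assigned to the flags below are invented for illustration (the thread only names the flags), and FieldFlags is a hypothetical class. The VInt layout used is the usual one: seven payload bits per byte, low-order bits first, high bit set when more bytes follow. Under that layout any flag value below 128 is written as the same single byte as before, which is where the back-compatibility would come from, assuming the existing flag byte never has its high bit set.

```java
import java.io.ByteArrayOutputStream;

final class FieldFlags {
    // Illustrative bit assignments only; not taken from any Lucene release.
    static final int WEIGHT_PER_POSITION = 1 << 5;
    static final int OFFSETS_IN_PRX      = 1 << 6;
    static final int NORMS_IN_FRQ        = 1 << 7;
    static final int OMIT_PRX            = 1 << 8;
    static final int OMIT_FREQ           = 1 << 9;

    // Encodes a non-negative int as a VInt: 7 bits per byte, low-order first.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // high bit: more bytes follow
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a VInt from the start of the array.
    static int readVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }
}
```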

Doug

Re: Rich positions (was "boosting fields")

Marvin Humphrey

On Apr 27, 2006, at 12:17 PM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>> Moving away from cached norms was the second of three major  
>> changes  to the file format on my agenda, and the one I was all  
>> but certain I  wouldn't be able to sell to the Lucene community.  
>> The first was  using bytecounts at the head of Strings.
>> The third was storing start offsets and end offsets in the
>> ProxFile.  It rankles that much of the information from
>> tis/frq/prx gets duplicated in the term vector files, but
>> highlighting is most efficient when you know the offsets, and the
>> primary index stops short of storing that information.  Currently,
>> we have this:
>>     ProxFile (.prx) -->  <TermPositions>TermCount
>> How about this?
>>     ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount
>
> This would at least double the size of the .prx file, the largest
> file in Lucene's index.  Yes, it's useful, but not all folks will
> use it.  So not all folks should have to pay for it.

Agreed.  I think it would at least triple the ProxFile, actually.  At  
least, I haven't thought of a compression scheme which could cram  
both start offset and end offset into fewer than two bytes on  
average.  But in theory, it would eliminate the need for the Term  
Vectors files, so if you need those now only for highlighting, it's a  
big gain.

> So perhaps instead we should simply allocate more bits in the  
> FieldInfo.  We could allocate bits for WEIGHT_PER_POSITION,  
> OFFSETS_IN_PRX, NORMS_IN_FRQ, OMIT_PRX, OMIT_FREQ, etc.  We can  
> increase the number of bits there by turning this into a VInt,  
> which would be back-compatible, no?

Using a VInt there sounds good 'n' clever to me.  Supporting all  
those different configs is another question.  "Flexibility is  
overrated." -- David Hansson.

It's charitable of you to include NORMS_IN_FRQ in that list, but in  
my mind, the idea was obsolesced the instant I saw  
WEIGHT_PER_POSITION.  Both enable fast launch of a Searcher.  That's  
the only benefit of NORMS_IN_FRQ, and it comes at the expense of  
increased file size in comparison to the same index without  
NORMS_IN_FRQ.  WEIGHT_PER_POSITION has much more potential.

My primary goal with enabling fast launch is to make it so only the  
largest installations have to worry about running under mod_perl and  
caching Searchers.  Simple installations, whether they are set up by  
a novice or by a sophisticated user who doesn't want to deal with  
mod_perl for a zillion possible reasons, should "just work" for as  
large an index as possible.  The main concern out-of-the-gate is ease  
of use, and I'm happy to trade off increased file size to get it.

To turn the idea on its head... I'm inclined to make  
WEIGHT_PER_POSITION the default behavior, and either deep-six the  
cacheable norms files altogether or "add" them as an "expert  
optimization": decreased file size and improved search speed at the  
expense of less control over scoring and slower startup.

Incidentally, how about calling it BOOST_PER_POSITION instead?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Rich positions (was "boosting fields")

Marvin Humphrey
In reply to this post by Doug Cutting

Now that I think about it, putting the score-multiplier into the  
FreqFile does offer a benefit I hadn't considered before.  It makes  
it possible to tie the score multiplier to a term within a doc,  
rather than a field within a doc.

Say you have a doc with a "body" field that's 1000 terms long, with 3  
instances of "foo", right near the top.  Say you have another doc  
with a "body" that's 1000 terms long, and that it also has 3  
instances of "foo" but they're buried near the end, and therefore not  
as important.

Under the current implementation, these two docs will score  
identically against a query for "body:foo", as the freq is identical  
and so is the output of lengthNorm(1000).  But if you stuff the score  
multiplier into the FreqFile, a sophisticated indexing app could  
assign a higher score multiplier to the term "body:foo" in the first  
doc.

Associating a score multiplier with each position (a.k.a.  
WEIGHT_PER_POSITION, BOOST_PER_POSITION) would achieve the same end,  
but at the expense of much more processing per document, as the score  
would have to be built up position by position.   OTOH, it would  
produce more accurate results for queries where only certain  
positions ought to be considered, such as phrase queries.
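
The two-document example above can be put into numbers with a simplified sketch. It assumes tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), which is my understanding of Lucene's default similarity, and it leaves out idf, query normalization and norm byte quantization. TermBoostSketch and the termBoost parameter are hypothetical.

```java
final class TermBoostSketch {
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Field-level multiplier: only the freq and the field length matter,
    // so where in the doc the term occurs cannot influence the score.
    static float fieldLevelScore(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    // Hypothetical per-term multiplier read next to the freq; the indexing
    // app is free to fold position-based weighting into termBoost.
    static float termLevelScore(int freq, float termBoost) {
        return tf(freq) * termBoost;
    }
}
```

Both docs get fieldLevelScore(3, 1000). With a per-term multiplier, the indexer could store, say, 1.5f * lengthNorm(1000) for the doc where "foo" sits near the top and plain lengthNorm(1000) for the other, and the first doc would then score higher.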

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Rich positions (was "boosting fields")

Karl Wettin-3
In reply to this post by Doug Cutting

On 27 Apr 2006, at 18:41, Doug Cutting wrote:

> karl wettin wrote:
>> Boost per position, etc. sounds very expensive.
>
> Indeed.  It will probably nearly double the size of indexes and
> also increase search time.  But it is also very powerful.  Consider
> the posting representation Google describes on page 9 of
> http://dbpubs.stanford.edu/pub/1998-8.  The font-size stored there
> is in effect a weight for each position.
>
> The point is not that every index store this, but that it be  
> possible for some indexes to store this, or even more information  
> per position, by extending a public API.

Good point.

What will be required in the IndexReader? Is it enough to add
getBoost() in the TermEnum? How would the value be sent to the scorer?

Re: Rich positions (was "boosting fields")

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:
> Incidentally, how about calling it BOOST_PER_POSITION instead?

+1, that is more consistent with other naming.

Doug

Re: Rich positions (was "boosting fields")

Marvin Humphrey
In reply to this post by Karl Wettin-3

On Apr 27, 2006, at 2:35 PM, karl wettin wrote:

> What will be required in the IndexReader? Is it enough to add  
> getBoost() in the TermEnum? How would the value be sent to the scorer?

It wouldn't be the TermEnum, it would be a TermDocs subclass.  If  
we're talking BOOST_PER_POSITION, it would have to be a  
TermPositions, and you would getBoost() per position.  The TermScorer  
would have to accumulate a multiplier for each doc by repeatedly  
calling nextPosition().  That algorithm would replace this line in  
TermScorer:

       score *= normDecoder[norms[doc] & 0xFF];    // normalize for field

If we're talking NORMS_IN_FREQ, then you'd replace that line with one  
call to getBoost() against the TermDocs. (or maybe getNorm?  
getMultiplier?)
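
A sketch of the per-position accumulation described above. BoostedPositions is a hypothetical interface (Lucene's TermPositions has no getBoost()), and averaging the per-position boosts is purely an assumption made for this example; the thread does not say how the boosts would be combined.

```java
// Hypothetical view of one term's positions within the current doc.
interface BoostedPositions {
    int freq();          // number of positions in the current doc
    int nextPosition();  // advances to the next position
    float getBoost();    // boost stored for the current position
}

// Array-backed implementation, for illustration and testing only.
final class ArrayBoostedPositions implements BoostedPositions {
    private final int[] positions;
    private final float[] boosts;
    private int cursor = -1;

    ArrayBoostedPositions(int[] positions, float[] boosts) {
        this.positions = positions;
        this.boosts = boosts;
    }

    public int freq() { return positions.length; }
    public int nextPosition() { return positions[++cursor]; }
    public float getBoost() { return boosts[cursor]; }
}

final class PositionBoostScorer {
    // Builds the doc's multiplier position by position. This would stand in
    // for the norms lookup; the average is an assumed combination rule.
    static float multiplier(BoostedPositions tp) {
        int freq = tp.freq();
        if (freq == 0) {
            return 1f;  // neutral multiplier when there is nothing to read
        }
        float sum = 0f;
        for (int i = 0; i < freq; i++) {
            tp.nextPosition();
            sum += tp.getBoost();
        }
        return sum / freq;
    }
}
```

In the per-doc variant, the loop would collapse to a single getBoost() call per doc.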

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Rich positions (was "boosting fields")

Karl Wettin-3

On 28 Apr 2006, at 00:30, Marvin Humphrey wrote:

>
> On Apr 27, 2006, at 2:35 PM, karl wettin wrote:
>
>> What will be required in the IndexReader? Is it enough to add  
>> getBoost() in the TermEnum? How would the value be sent to the  
>> scorer?
>
> It wouldn't be the TermEnum, it would be a TermDocs subclass.  If  
> we're talking BOOST_PER_POSITION, it would have to be a  
> TermPositions, and you would getBoost() per position.  The  
> TermScorer would have to accumulate a multiplier for each doc by  
> repeatedly calling nextPosition().  That algorithm would replace  
> this line in TermScorer:

Sorry, I did of course mean TermDocs.

>       score *= normDecoder[norms[doc] & 0xFF];    // normalize for field
>
> If we're talking NORMS_IN_FREQ, then you'd replace that line with  
> one call to getBoost() against the TermDocs. (or maybe getNorm?  
> getMultiplier?)

I'll start there.

Considering I don't have to worry about any index format with the  
InstanciatedIndex it should be fairly easy to get it working.

Re: Rich positions (was "boosting fields")

Marvin Humphrey
>>       score *= normDecoder[norms[doc] & 0xFF];    // normalize for field
>>
>> If we're talking NORMS_IN_FREQ, then you'd replace that line with  
>> one call to getBoost() against the TermDocs. (or maybe getNorm?  
>> getMultiplier?)
>
> I'll start there.
>
> Considering I don't have to worry about any index format with the  
> InstanciatedIndex it should be fairly easy to get it working.

Here's the direction I'm headed:   One file, the "PostingsFile",  
which merges the FreqFile, ProxFile, and Boost/Norm for each posting  
into a single contiguous block, with an eye towards aggressively  
minimizing disk seeks.

I've worked up a prototype which is a hybrid of the current Lucene  
design and the version from the Google paper.  The advantage of the  
Google design is that since the postings are fixed width, it is fast  
and easy to either iterate through them or skip over them.  The  
disadvantage of that design is that the fixed width forces truncation  
of certain data -- for instance, all positions above 4096 are encoded  
as 4096, which screws up phrase matching.

For many documents, the fixed width posting format is sufficient, but  
for a minority of cases, important information can't fit.  One answer  
is to use flag bits to indicate all of the following:

   * Whether the Postings are fixed width or whether they
     had to be encoded using a variable width technique.
   * Whether positions and boosts are stored at all (for
     many queries, all you need to know is that a Term is
     present).
   * Whether the Freq is 1 or encoded separately.
   * How the DocDelta is encoded.

Fixed width postings are two bytes wide (as with Google).  Variable
width postings are encoded using VInts.  Non-existent postings take  
up zero bytes.  :)

The header consists of 4 flag bits, 4 bits which optionally  
contribute to the DocDelta, and either an additional byte or an  
additional VInt to complete the DocDelta.  Subsequent positions are
delta encoded, like current Lucene and apparently unlike Google
1998.  The variable width posting format is required whenever the
TermDoc contains at least one ProxDelta which exceeds the maximum  
ProxDelta the fixed width format can encode.  At 12 bits for position  
per posting, that's 4095.
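
A sketch of the fixed width case described above. The 12 bits for the ProxDelta come from the post; giving the remaining 4 bits to a boost is only an assumption for illustration, since the allocation of boost bits is still an open question here. FixedPosting is a hypothetical class, and positions are assumed to be ascending.

```java
final class FixedPosting {
    static final int POSITION_BITS = 12;
    static final int MAX_PROX_DELTA = (1 << POSITION_BITS) - 1;  // 4095

    // True when every delta between consecutive positions fits in 12 bits,
    // i.e. the TermDoc does not need the variable width format.
    static boolean fitsFixedWidth(int[] positions) {
        int previous = 0;
        for (int position : positions) {
            if (position - previous > MAX_PROX_DELTA) {
                return false;
            }
            previous = position;
        }
        return true;
    }

    // Packs one posting into two bytes: 12 bits of ProxDelta, 4 bits of boost.
    static short pack(int proxDelta, int boost4) {
        if (proxDelta < 0 || proxDelta > MAX_PROX_DELTA) {
            throw new IllegalArgumentException("ProxDelta needs variable width: " + proxDelta);
        }
        return (short) ((proxDelta << 4) | (boost4 & 0xF));
    }

    static int proxDelta(short posting) {
        return (posting & 0xFFFF) >>> 4;
    }

    static int boost(short posting) {
        return posting & 0xF;
    }
}
```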

The thing I haven't quite figured out yet is how to allocate bits for  
the Boost.  Google uses 4-8 bits per posting for "capitalization",
"font size", and a flag indicating a "fancy" hit, i.e. something from a
title or anchor.  That leaves them 8-12 bits for position  
information.  If we just copy the full 8 bits of Lucene's current  
byte Norm format, that only leaves 8 bits for the ProxDelta per  
position.  That's not enough -- we'll end up using the variable  
format way too often, and it's terribly redundant.

It's not obvious to me how to distribute lengthNorm information over  
several 4-bit posting slots, though. Past hard experience has taught  
me that a scoring system which isn't normalized for field length  
suffers from poor precision; we definitely want a score
multiplier in there, and more fine-grained than 16 levels.  We do  
have the position of the posting to work with, though, so we can  
weight up-front postings more heavily at least.

A test index using the Reuters corpus and WhiteSpaceAnalyzer went  
from 8 MB to 11 MB under this system.  I haven't yet eliminated the  
norms, but they only take up 40k or so at present.

I found it surprisingly easy to make these changes to KinoSearch; the  
only two classes that had to be modified were SegTermDocs and  
PostingsWriter.  Maybe experimenting with analogous changes to Lucene  
will be just as easy.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Rich positions (was "boosting fields")

Marvin Humphrey

On Apr 29, 2006, at 12:40 AM, Marvin Humphrey wrote:
> One file, the "PostingsFile", which merges the FreqFile, ProxFile,  
> and Boost/Norm for each posting into a single contiguous block,  
> with an eye towards aggressively minimizing disk seeks.

Interpolating the positions between the Freqs is inefficient for a  
simple term query, provided that a score multiplier is available for  
each document and it does not have to be built up posting by  
posting.  However, simple term queries typically do not stress the  
system, and if the cost of scanning through positions is significant,  
at least no disk seeks are required.

Phrase queries should theoretically benefit from having the  
interleaving of positional data and frequency data.  At present,  
fetching freq data and prox data will generally require at least two  
disk seeks per term; if they are interleaved, the number of seeks is  
cut in half, roughly.  It's unlikely all the freq and prox data for a  
common term in a large index will be fetched in a single go, but it  
seems likely that there will continue to be an advantage to having  
freq data and prox data interleaved even then.

If boolean queries do not use positional information except when  
there is a sub-query which is a phrase query, then having positions  
interpolated is a loss.  However, Brin/Page 1998 proposes using  
positional data to improve precision, by scoring documents higher  
when any two terms occur near each other, even though they may not  
have been grouped together by the user.  A particularly sophisticated  
variant might also take into account word ordering in the query phrase.
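
As a rough illustration of using positional data in a boolean query, the sketch below finds the smallest gap between occurrences of two terms in one doc. ProximitySketch is a hypothetical class, both position lists are assumed to be sorted ascending, and how the distance would feed into the score is left open.

```java
final class ProximitySketch {
    // Smallest absolute distance between any position of term A and any
    // position of term B, found with a single merge-style pass.
    // Returns Integer.MAX_VALUE if either list is empty.
    static int minDistance(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int best = Integer.MAX_VALUE;
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            if (a[i] < b[j]) {
                i++;  // advance the list that is behind
            } else {
                j++;
            }
        }
        return best;
    }
}
```

With freq and prox data interleaved, both position lists would arrive in the same read as the freqs, which is the case the paragraph above argues for.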

If we establish the constraint that boolean queries must exploit  
positional data, then it's a clear win for merging the FreqFile and  
the ProxFile.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

