document boost

Mike Grafton
Hello folks,

We're trying to use Lucene's scoring to do a fairly basic thing: give a
document (in this case, we index "articles") a boost based on an integer
value that we know at index time.  We want the document boost to affect the
final document score linearly.

We thought that assigning a document boost based on this value would do the
trick, but the behavior we're seeing doesn't match what we expect given the
online documentation.  In fact, we see that a linear increase in document
boost yields an exponential increase in the 'fieldNorm' component of the
score for each query term that matched the document.  Here's a small table
relating the document boost we pass in to the fieldNorm contribution
returned by Lucene:

boost  fieldNorm  equivalent forms
1.0    0.3125     5/16, 2^-1.678
2.0    20.0       2^4.3219, 2^1 * 10
3.0    256.0      2^8
4.0    1280.0     2^10.3219, 2^7 * 10
5.0    5120.0     2^12.3219, 2^9 * 10
6.0    16384.0    2^14
7.0    40960.0    2^15.3219, 2^12 * 10
8.0    81920.0    2^16.3219, 2^13 * 10
10.0   327680.0   2^18.3219, 2^15 * 10

This example is using a query with two terms against a document that
contains those terms and a few others, in one searchable field.
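
As a rough check on the table above, the following sketch (plain Java, no Lucene dependency) prints what fieldNorm would be if it scaled linearly from the boost-1.0 baseline, next to the values we observed. The class and method names are made up for illustration; only the numbers come from the table.

```java
public class LinearBoostCheck {
    // Boosts and observed fieldNorm values copied from the table above.
    static final float[] BOOSTS = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 10f};
    static final float[] OBSERVED = {
        0.3125f, 20f, 256f, 1280f, 5120f, 16384f, 40960f, 81920f, 327680f
    };

    /** fieldNorm predicted if it grew linearly from the boost-1.0 baseline. */
    static float expectedLinear(float boost) {
        return OBSERVED[0] * boost;
    }

    public static void main(String[] args) {
        for (int i = 0; i < BOOSTS.length; i++) {
            float expected = expectedLinear(BOOSTS[i]);
            // The last column shows how far the observed value is from linear.
            System.out.printf("boost=%.1f  linear=%.4f  observed=%.4f  observed/linear=%.1f%n",
                    BOOSTS[i], expected, OBSERVED[i], OBSERVED[i] / expected);
        }
    }
}
```

Under linear scaling a boost of 10.0 would give 3.125, not 327680.0, which is the size of the gap we are asking about.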

Is this the way document boost is supposed to work?  Or have we
misconfigured something? If we cannot use document boost to affect scoring
linearly, is there some other technique we can use?

By the way, we're using SOLR to access Lucene.  We can give more information
if necessary, such as our SOLR schema.xml, if folks think that would help
explain things.  Let us know what other information we can provide.

Thanks,
Mike

Re: document boost

Mark Miller-3
I would say you definitely misconfigured something. Doubling your doc boost
will approximately double your fieldNorm (I think the precision isn't
perfect).

I don't know what you're doing wrong in such a small test, but your
fieldNorm should *not* be exploding like that.

Can you post some code?

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: document boost

Mike Grafton
Thanks for your help, Mark. 

We can start by posting our SOLR config files, although I'm not sure if that will be helpful (we don't see much in there regarding boosts).  See attached.  How SOLR actually configures and interfaces with Lucene is a bit of an unknown to us, so I'm not sure we can get down to the raw Lucene configuration and interaction.

That being said, in addition to the SOLR config files, see our attached XML which we post to SOLR to add the document to the index.

How do you know that boost should affect fieldNorm linearly? Is there some code you can point us to?  We looked through the Lucene source for a while, but it was kind of hard to track this down.

One note: we're on an old version of Lucene - a nightly build between 2.0.0 and 2.1.0.

Mike

Attachments: schema.xml (15K), solrconfig.xml (17K), solr_post.xml (386 bytes)

Re: document boost

Mark Miller-3
If you look at DocumentsWriter, at line 715 you will see docBoost get set
to the document boost you specified. At line 1376, boost is assigned
docBoost. Then at line 1509 the document boost is multiplied by the field
boost:

    boost *= field.getBoost();

So now you have the default field boost of 1 times the docBoost you passed,
which is still just the docBoost. Then at line 690 you can see where the
norm is saved in the index:

    float norm = fp.boost *
        writer.getSimilarity().lengthNorm(fp.fieldInfo.name, fp.length);

As you can see, it's just the boost times the lengthNorm. There is still no
way for it to be exponential.
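
The chain described above can be sketched in a few lines of standalone Java. This is only an illustration under stated assumptions, not the DocumentsWriter code itself: the class and method names are hypothetical, and lengthNorm is assumed to be 1/sqrt(numTerms), the formula DefaultSimilarity uses.

```java
public class NormSketch {
    /** Length normalization, assumed here to be 1 / sqrt(numTerms). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Index-time norm before it is encoded: docBoost * fieldBoost * lengthNorm. */
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        float boost = docBoost;   // boost starts out as the document boost
        boost *= fieldBoost;      // then is multiplied by the field boost (default 1)
        return boost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        int numTerms = 64; // hypothetical field length, chosen so lengthNorm is 0.125
        for (float docBoost = 1f; docBoost <= 8f; docBoost *= 2f) {
            System.out.println("docBoost=" + docBoost
                    + " norm=" + norm(docBoost, 1f, numTerms));
        }
    }
}
```

Every step is a multiplication, so doubling docBoost doubles the result.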

Now the norm is read from different scorers, but for example, in
TermScorer at line 126 you see:

    return raw * Similarity.decodeNorm(norms[doc]); // normalize for field

Again, just a multiplication. There is no way for that norm to produce
exponential growth.
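
The reason the precision isn't perfect is that the norm is stored in a single byte. The sketch below shows one such lossy encoding, modelled on the 3-bit-mantissa, 5-bit-exponent layout of Lucene's SmallFloat class. Whether the build discussed in this thread used exactly this encoder is an assumption, and the class and method names here are hypothetical.

```java
public class NormByteSketch {
    /** Packs a float into one byte: 3 mantissa bits, 5 exponent bits, truncating. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);            // keep exponent plus top 3 mantissa bits
        if (small <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // zero, negative, or underflow
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1;                           // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    /** Expands the byte back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] norms = {0.125f, 0.3f, 0.6f, 1.25f};
        for (float n : norms) {
            System.out.println(n + " -> " + decode(encode(n)));
        }
    }
}
```

In this sketch 0.125 and 1.25 survive the round trip unchanged, while 0.3 comes back as 0.25. The loss is a truncation of the mantissa, so values can shift slightly but cannot blow up.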

And finally, I have watched Lucene in action and observed the proper
increase. For a setup similar to the one you describe, I can put in a field
boost of 60 and still get a norm of only 20. It just doesn't explode. When I
double the boost, the field norm is doubled, which follows from the code.

Maybe the issue is Solr?

- Mark



Re: document boost

Yonik Seeley-2
In reply to this post by Mike Grafton
Hi Mike, I think this issue probably belongs in the Solr lists since
it looks like you're indexing through it.
I did a really quick test re-adding a Solr example document but adding
a document boost of 10...
the fieldNorm increased by a factor of 10 as expected (explain below).

 <str name="id=SOLR1000-1,internal_docid=26">
5.651948 = (MATCH) fieldWeight(text:solr in 26), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.1972246 = idf(docFreq=2, numDocs=27)
  1.25 = fieldNorm(field=text, doc=26)
</str>
  <str name="id=SOLR1000,internal_docid=12">
0.5651948 = (MATCH) fieldWeight(text:solr in 12), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.1972246 = idf(docFreq=2, numDocs=27)
  0.125 = fieldNorm(field=text, doc=12)
</str>
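
The numbers in the explain output above can be reproduced by hand. The sketch below assumes the DefaultSimilarity formulas, tf = sqrt(termFreq) and idf = ln(numDocs / (docFreq + 1)) + 1, and takes the two fieldNorm values straight from the output. The class and method names are hypothetical.

```java
public class ExplainArithmetic {
    /** Term frequency factor, assumed to be sqrt(termFreq). */
    static float tf(int termFreq) {
        return (float) Math.sqrt(termFreq);
    }

    /** Inverse document frequency, assumed to be ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** fieldWeight as shown in the explain output: tf * idf * fieldNorm. */
    static float fieldWeight(int termFreq, int docFreq, int numDocs, float fieldNorm) {
        return tf(termFreq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // termFreq=2, docFreq=2, numDocs=27 are the values from the explain output.
        float boosted = fieldWeight(2, 2, 27, 1.25f);    // document with boost 10
        float unboosted = fieldWeight(2, 2, 27, 0.125f); // original document
        System.out.println("boosted=" + boosted + " unboosted=" + unboosted
                + " ratio=" + (boosted / unboosted));
    }
}
```

Since tf and idf are identical for the two documents, the factor of 10 between the scores comes entirely from fieldNorm.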

Could you try with the latest version of Solr, and if there is still
an issue, post a bug to Solr's JIRA?

-Yonik

Re: document boost

Mike Grafton
So we upgraded to SOLR 1.2, which uses Lucene 2.1 or so, and the problem
went away.  Thanks for all the help, folks!

Mike
