How to boost the score higher in case user query matches entire field value than just some words within a field

Simon Hu
Hi

I have a text field named prodname in the Solr index. Let's say there are 3 documents in the index, and here are the values of the prodname field:

Doc1: cordless drill
Doc2: cordless drill battery
Doc3: cordless drill charger

Searching for prodname:"cordless drill" will hit all three documents.  So how can I make Doc1 score higher than the other two?

BTW, I am using Solr 1.2.

thanks!

-Simon

Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Sean Timm
Length normalization in the Similarity class will generally favor
shorter fields.  For example, with DefaultSimilarity, the stored length
norm for a two-term field is 0.625, and for a three-term field it is 0.5.
The score is multiplied by the norm.

I say "will generally favor" because the length norm value, which is
calculated as
    (float)(1.0 / Math.sqrt(numTerms))
is stored in the index as a single byte (instead of four bytes), thus
losing precision.  A three-term field and a four-term field, for
example, both end up with a stored norm of 0.5.  This works fine for
searching larger documents such as web pages or news articles, but it
can cause problems when you are searching on short fields such as
product names or article titles.
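
To make the precision loss concrete, here is a minimal, self-contained sketch. It assumes the DefaultSimilarity length norm of 1/sqrt(numTerms) and re-implements the one-byte encoding in plain Java, modeled on Lucene's SmallFloat.floatToByte315 and byte315ToFloat. The class and method names (LengthNormPrecision, encode, decode, storedNorm) are hypothetical, and the encoding is written from memory rather than copied from Lucene, so treat it as an illustration only.

```java
class LengthNormPrecision {
    // Assumption: mirrors Lucene's one-byte norm encoding
    // (3 low bits of "mantissa", zero exponent 15). Not the Lucene source.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);          // drop the low 21 bits
        int zero = (63 - 15) << 3;             // offset of the smallest exponent
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // zero/negative, or underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;                  // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    // Inverse of encode(): expand the byte back into a float.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The norm as it would be read back from the index for a field of numTerms terms.
    static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return decode(encode(exact));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            float exact = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: exact=" + exact + " stored=" + storedNorm(n));
        }
    }
}
```

Under these assumptions the sketch reproduces the figures quoted in this thread: a stored norm of 0.625 for two terms, and 0.5 for both three and four terms.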

To solve this, we wrote our own Similarity class which extends
DefaultSimilarity and maps numTerms 1-10 to precalculated values between
1.5f and 0.3125f.  For numTerms >10, we use the standard formula above.  
If anyone else is interested in this, I can post the code as a patch in
Jira.
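
A rough sketch of the lookup idea, in plain Java so that it runs without Lucene on the classpath. The thread only gives the end points (1.5f and 0.3125f) and, in a later message, the values for three terms (1.0f) and four terms (0.875f); the remaining table entries below are an assumption, chosen as ten consecutive values that survive the one-byte encoding unchanged. The class name ShortFieldLengthNorm is hypothetical. In a real Lucene 2.x setup the same logic would presumably live in an override of lengthNorm(String fieldName, int numTerms) in a class extending DefaultSimilarity.

```java
class ShortFieldLengthNorm {
    // Assumed table: entry i is the norm for a field with i + 1 terms.
    // Only 1.5f, 1.0f, 0.875f and 0.3125f are stated in the thread.
    private static final float[] NORM_TABLE = {
        1.5f, 1.25f, 1.0f, 0.875f, 0.75f, 0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
    };

    static float lengthNorm(int numTerms) {
        if (numTerms >= 1 && numTerms <= NORM_TABLE.length) {
            // Short field: use a precalculated value that is distinct per length.
            return NORM_TABLE[numTerms - 1];
        }
        // Longer field: fall back to the standard formula.
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + " terms -> " + lengthNorm(n));
        }
    }
}
```

Because every table entry is exactly representable in the one-byte norm, each field length from 1 to 10 keeps its own distinct value after indexing, which is the point of the approach.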

-Sean


Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Mark Miller-3
Sean Timm wrote:
> To solve this, we wrote our own Similarity class which extends
> DefaultSimilarity and maps numTerms 1-10 to precalculated values
> between 1.5f and 0.3125f.  For numTerms >10, we use the standard
> formula above.  If anyone else is interested in this, I can post the
> code as a patch in Jira.
>
Does this actually have a good measurable effect for you? Wouldn't it
make more sense to just turn off norms for short fields?

Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Sean Timm
In the example below, Doc1 and Doc2 will both have the same score for
the query "chevrolet tahoe."  We would prefer Doc2 to score higher than
Doc1.  The stored length norm for each is 0.5f.  I presume which one
appears first then falls to the order in which they were added to the
index?  By using our length norm function, Doc2's score will be
multiplied by 1.0f and Doc1's by 0.875f, resulting in the desired behavior.

Doc1: Chevrolet Tahoe Hybrid 2008
Doc2: Chevrolet Tahoe 2008

-Sean


Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Alexander Ramos Jardim
In reply to this post by Simon Hu
The strategy I use is rather simple:

I put the data I want to match in two fields: one tokenized (indexed=true,
stored=false) and one exact-match (indexed=true, stored=true).
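
As a sketch only, the two-field setup could look like the schema.xml fragment below. The field name prodname_exact, the use of copyField, the "text" and "string" field types, and the boost value in the sample query are all assumptions for illustration; none of them come from the post above.

```xml
<!-- Tokenized field used for ordinary word matching. -->
<field name="prodname" type="text" indexed="true" stored="false"/>

<!-- Untokenized copy (hypothetical name): matches only the entire field value. -->
<field name="prodname_exact" type="string" indexed="true" stored="true"/>

<!-- Fill the exact-match field from the same input at index time. -->
<copyField source="prodname" dest="prodname_exact"/>

<!-- Sample query (assumed boost of 10 on the exact-match clause):
     prodname:"cordless drill" OR prodname_exact:"cordless drill"^10 -->
```

With a query along those lines, all three documents still match on prodname, but only Doc1 would also match the exact-match clause and pick up the extra boost. Note that a "string" field is case-sensitive and whitespace-sensitive, so the query has to match the stored value exactly.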

2008/8/20 Simon Hu <[hidden email]>

> Searching for prodname:"cordless drill" will hit all three documents.  So
> how can I make Doc1 score higher than the other two?


--
Alexander Ramos Jardim

Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Jason Rennie-2
In reply to this post by Sean Timm
Count me as interested.  Our "documents" are product descriptions, many
fields of which are very short.  Not sure if it would make a large enough
impact to warrant us rolling our own Solr build, but I'm definitely
interested to see the custom Similarity class.

Thanks,

Jason

On Thu, Aug 21, 2008 at 9:29 AM, Sean Timm <[hidden email]> wrote:

> To solve this, we wrote our own Similarity class which extends
> DefaultSimilarity and maps numTerms 1-10 to precalculated values between
> 1.5f and 0.3125f.  For numTerms >10, we use the standard formula above.  If
> anyone else is interested in this, I can post the code as a patch in Jira.


--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Simon Hu
In reply to this post by Sean Timm
I am definitely interested in trying your Similarity class. Can you please post the patch in Jira?

thanks
-Simon




Re: How to boost the score higher in case user query matches entire field value than just some words within a field

Sean Timm
https://issues.apache.org/jira/browse/LUCENE-1360
