[lucy-user] C library - Scoring mechanism

serkanmulayim@gmail.com
Hi guys,

I have a question about relevance scoring. Is the scoring mechanism tf/idf when a field is indexed with the EasyAnalyzer in the schema? What happens when a query contains multiple terms? Are the tf/idf values summed? How does the scoring mechanism take word positions into account for multi-word queries?

What about fields that use a RegexTokenizer? Is the mechanism the same? Does the type of tokenizer affect scoring? I believe what matters is the generated tokens (not which tokenizer produced them), and perhaps the order of the tokens in a document.

One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? And if I want to go further and come up with my own, how can I do that?

Thanks,
Serkan


Re: [lucy-user] C library - Scoring mechanism

Nick Wellnhofer

On Nov 21, 2017, at 02:09 , [hidden email] wrote:
> I have a question about relevance scoring. Is the scoring mechanism tf/idf when a field is indexed with the EasyAnalyzer in the schema? What happens when a query contains multiple terms? Are the tf/idf values summed?

Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, each term's tf/idf value is multiplied by that term's boost and normalization factor, and the results are summed.
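
To make the summation concrete, here is a minimal Python sketch of the per-term sum described above. It follows the formula on the linked Lucene 3.6 page (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), idf squared in the sum). The function names and argument layout are assumptions made for illustration, not Lucy's API, and coord and queryNorm are treated as 1.0.

```python
import math


def tf(freq):
    # Lucene's DefaultSimilarity: square root of the raw term frequency.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_term_freqs, doc_freqs, num_docs, boosts=None, norm=1.0):
    """Sum tf * idf^2 * boost * norm over the query terms found in one document.

    coord and queryNorm are omitted (treated as 1.0) to keep the sketch short.
    """
    boosts = boosts or {}
    total = 0.0
    for term in query_terms:
        freq = doc_term_freqs.get(term, 0)
        if freq == 0:
            continue  # a term missing from the document contributes nothing
        weight = idf(doc_freqs[term], num_docs) ** 2
        total += tf(freq) * weight * boosts.get(term, 1.0) * norm
    return total
```

A query with several terms simply adds one such product per matching term, so a document that matches more of the query terms, or matches rarer ones, ends up with a higher score.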

> How does the scoring mechanism take word positions into account for multi-word queries?

> What about fields that use a RegexTokenizer? Is the mechanism the same? Does the type of tokenizer affect scoring? I believe what matters is the generated tokens (not which tokenizer produced them), and perhaps the order of the tokens in a document.

If you use the core Tokenizers, neither the type of Tokenizer nor the location of terms in a document affects scoring. However, you can write a custom Tokenizer that sets a different boost value for each Token, for example depending on its location within the document.
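
As a rough illustration of the idea of position-dependent boosts, the following Python sketch tokenizes a string and gives the first few tokens a higher boost. The `Token` class, the function name, and the `lead_tokens` and `lead_boost` parameters are all hypothetical and only model the concept; a real custom Tokenizer would have to be written against Lucy's own Analyzer and Token classes.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical token record; not Lucy's Token class.
    text: str
    position: int
    boost: float = 1.0


def tokenize_with_position_boost(text, lead_tokens=10, lead_boost=2.0):
    """Split text into word tokens and boost those near the start."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        # Tokens within the first `lead_tokens` positions get the higher boost.
        boost = lead_boost if position < lead_tokens else 1.0
        tokens.append(Token(match.group().lower(), position, boost))
    return tokens
```

With `lead_tokens=2`, the first two tokens of a field would carry a boost of 2.0 and every later token the default of 1.0, so a match early in the text would count for more at scoring time.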

> One more thing: if I wanted to change the scoring mechanism for different fields, how could I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? And if I want to go further and come up with my own, how can I do that?

You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override one or more of the methods TF, IDF, Coord, Length_Norm, or Query_Norm.
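
The override pattern can be sketched in Python as follows. The base class here is a hypothetical stand-in that mirrors the method names above in lowercase; it is not Lucy's Similarity class, and its default formulas are assumed to follow Lucene's defaults. The subclass disables length normalization, one of the simplest possible tweaks.

```python
import math


class Similarity:
    """Hypothetical stand-in for a Similarity class; not Lucy's actual API."""

    def tf(self, freq):
        return math.sqrt(freq)

    def idf(self, doc_freq, total_docs):
        return 1.0 + math.log(total_docs / (1 + doc_freq))

    def length_norm(self, num_tokens):
        # By default, a match in a shorter field counts for more.
        return 0.0 if num_tokens == 0 else 1.0 / math.sqrt(num_tokens)

    def term_score(self, freq, doc_freq, total_docs, num_tokens):
        return (self.tf(freq)
                * self.idf(doc_freq, total_docs)
                * self.length_norm(num_tokens))


class NoLengthNormSimilarity(Similarity):
    """Treats every field as equally long, so field length no longer matters."""

    def length_norm(self, num_tokens):
        return 1.0
```

Assigning a different subclass to each field type would then give each field its own scoring behavior, which is the per-field customization asked about above.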

Nick

Re: [lucy-user] C library - Scoring mechanism

serkanmulayim@gmail.com
Thank you very much Nick for your response.

I would like to ask two more questions:
1- Are the tf/idf scores consistent across all segments in a non-optimized index? Or are they calculated separately for each segment (tf would not change, but idf might differ)?
2- (The same question, but for multiple indexes and PolySearcher.) If I use PolySearcher with two or more indexes, will the tf/idf scores be consistent? Or would they be calculated separately for each index?

Regards,
Serkan


Re: [lucy-user] C library - Scoring mechanism

Nick Wellnhofer
On 21/11/2017 18:42, [hidden email] wrote:
> 1- Are the tf/idf scores consistent across all segments in a non-optimized index? Or are they calculated separately for each segment (tf would not change, but idf might differ)?

tf/idf is computed for the whole index.

> 2- (same question but for multiple indexes and polysearcher) If I use polysearcher with 2 or more indexes, will the tf/idf scores be consistent? Or would they be calculated separately for each index?

I don't know off the top of my head. It's possible that indexes are searched
separately and the results are simply merged by normalized score. I'd have to
look at the code to answer the question, but maybe Marvin can chime in.

Nick

Re: [lucy-user] C library - Scoring mechanism

Marvin Humphrey
On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <[hidden email]> wrote:
> On 21/11/2017 18:42, [hidden email] wrote:

>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
>
> I don't know off the top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.

The scores will be consistent.

To calculate IDF for a term accurately across a composite corpus
formed from multiple indexes, you need to know two things:

1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term. (Doc_Freq(field, term))

Both PolySearcher and ClusterSearcher calculate their doc_max on
construction by summing the doc_max totals of all subsearchers.
Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
responses for all subsearchers.

https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
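
The aggregation described above can be modeled in a few lines of Python. Both classes are hypothetical stand-ins written for this illustration (they are not the PolySearcher or ClusterSearcher implementations linked above); they only show doc_max being summed once at construction and doc_freq being summed per request.

```python
class ShardSearcher:
    """Hypothetical subsearcher holding the statistics of one index."""

    def __init__(self, doc_max, doc_freqs):
        self._doc_max = doc_max
        self._doc_freqs = doc_freqs  # {(field, term): number of matching docs}

    def doc_max(self):
        return self._doc_max

    def doc_freq(self, field, term):
        return self._doc_freqs.get((field, term), 0)


class CompositeSearcher:
    """Hypothetical composite that presents several shards as one corpus."""

    def __init__(self, searchers):
        self.searchers = list(searchers)
        # Summed once, at construction time.
        self._doc_max = sum(s.doc_max() for s in self.searchers)

    def doc_max(self):
        return self._doc_max

    def doc_freq(self, field, term):
        # Every subsearcher is asked on each call; this is the slow part.
        return sum(s.doc_freq(field, term) for s in self.searchers)
```

Because IDF is then computed from the composite totals, every shard scores a given term with the same weight.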

This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when
you have to wait for a lot of subsearchers to report Doc_Freq numbers
for N terms. However, the alternative is occasional bizarre search
results.

The best anecdote I ever heard illustrating why it's important to
calculate aggregate IDF consistently was an application searching a
multi-shard index containing news articles split by year.  If you
searched for "iphone", it would be a very common term after the first
release of the Apple iPhone. However, in the years prior to the Apple
iPhone's release, if "iphone" existed in a shard it was likely a typo,
so it would be very rare **and thus heavily weighted**. So the top hit
for "iphone", without consistent IDF calculation, would be a typo'd
article.
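
A small numeric sketch of the anecdote, in Python. The shard sizes and document frequencies are made up for illustration, and the IDF formula is assumed to be the Lucene-style default, 1 + ln(total_docs / (1 + doc_freq)).

```python
import math


def idf(doc_freq, total_docs):
    # Assumed Lucene-style default formula.
    return 1.0 + math.log(total_docs / (1 + doc_freq))


# Made-up statistics per shard: (total documents, documents containing "iphone").
shards = {"2005": (10_000, 1), "2010": (10_000, 2_000)}

# IDF computed separately inside each shard.
per_shard_idf = {year: idf(df, n) for year, (n, df) in shards.items()}

# IDF computed once from the summed totals, as a composite searcher would.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_df, total_docs)
```

With per-shard IDF, the single 2005 document containing the typo gets a much larger term weight than any 2010 article, so it can outrank them after merging. With the global value, both shards use the same weight and the ranking is decided by tf and length normalization instead.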

(A performance improvement on this stratagem is to create a shared
Doc_Freq source. So long as it contains all the common terms across
all shards, it doesn't have to be updated often -- Doc_Freq values
don't change very fast as indexes are updated.)
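
One way such a shared source might look is sketched below in Python. The class, its method names, and the fallback to querying the shards for terms missing from the table are all assumptions made for this illustration; nothing here is part of Lucy.

```python
class SharedDocFreqSource:
    """Hypothetical table of aggregate doc_freq values for common terms."""

    def __init__(self, fetch_from_shards, common_counts=None):
        # fetch_from_shards: callable(field, term) -> int, the slow path that
        # asks every subsearcher and sums the answers.
        self._fetch = fetch_from_shards
        self._counts = dict(common_counts or {})

    def doc_freq(self, field, term):
        key = (field, term)
        if key in self._counts:
            return self._counts[key]  # fast path, no round trips
        return self._fetch(field, term)  # assumed fallback for other terms

    def refresh(self, field, term):
        # Called occasionally; the counts drift slowly as indexes are updated.
        self._counts[(field, term)] = self._fetch(field, term)
```

Query normalization would then consult the table first and wait on the subsearchers only for terms that are not in it.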

Marvin Humphrey

Re: [lucy-user] C library - Scoring mechanism

serkanmulayim@gmail.com
Thank you very much Nick and Marvin. Your replies were really helpful.
