Distributed IDF in Alias


SOLR4189
Hi all,

Can somebody explain this Solr tip from the aliases page of the Reference Guide
(https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/aliases.html)?

"Any alias (standard or routed) that references multiple collections may
complicate relevancy. By default, SolrCloud scores documents on a per shard
basis. With multiple collections in an alias this is always a problem, so if
you have a use case for which BM25 or TF/IDF relevancy is important you will
want to turn on one of the ExactStatsCache implementations"

But the ExactStatsCache documentation
(https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/distributed-requests.html#distributedidf)
says: "This implementation uses global values (across the collection) for
document frequency".

So what does "across the collection" mean? Does it mean that distributed
IDF applies within a single collection (across its shards)? If so, how will it
help in the alias case?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Distributed IDF in Alias

Andrzej Białecki-2
Both descriptions are correct, each in its own context. The description in the Ref Guide section about ExactStatsCache is correct in the sense that it uses collection-wide IDF values for terms when calculating scores for different SHARDS (and merging partial per-shard lists). This means that even if the local IDF (for documents in a particular shard) is biased, the scores will still be comparable across shards, and the documents coming from these partial lists can be merged using their absolute scores - and their rank (ordering) will be the same as if they all came from one big shard.

There’s no such mechanism for adjusting scores across two or more different COLLECTIONS. Usually the IDFs for the same terms will be different in different collections, which means the absolute values of scores for the same terms won’t be comparable. Still, if you insist and use a multi-collection alias, Solr will obey ;) and it will merge these partial lists as if their scores were comparable. The end result will be that some or most of the results will be incorrectly ranked, depending on how different the IDFs are in these collections.
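
To make the effect concrete, here is a minimal Python sketch of the arithmetic only; it is not Solr code. It assumes the IDF form used by Lucene's BM25Similarity and a simplified score of "term-frequency part times IDF". The collection names, document names and all statistics are made up for illustration.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """IDF in the form used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def score(tf_part, doc_freq, doc_count):
    # Simplified score: term-frequency component times IDF.
    return tf_part * bm25_idf(doc_freq, doc_count)

# Hypothetical statistics for one query term: collection -> (doc_freq, doc_count).
stats = {"coll_old": (50_000, 100_000), "coll_new": (5, 100)}

# Hypothetical hits: (document, collection, term-frequency component).
hits = [("old-doc", "coll_old", 2.0), ("new-doc", "coll_new", 1.0)]

def merge(hits, stats_for):
    """Score each hit with the stats chosen by stats_for, then sort by score."""
    scored = [(doc, score(tf, *stats_for(coll))) for doc, coll, tf in hits]
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]

# Each collection scores with its own statistics (what an alias merge sees).
per_collection = merge(hits, lambda coll: stats[coll])

# Every hit scored with the same combined statistics (the "one big index" ranking).
combined = (sum(df for df, _ in stats.values()), sum(n for _, n in stats.values()))
shared = merge(hits, lambda coll: combined)

print(per_collection)  # new-doc first: the term looks rare in the small collection
print(shared)          # old-doc first: it has the stronger term-frequency part
```

With per-collection statistics the weaker match from the small collection wins only because the term looks rare there, which is the misranking described above.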

> On 17 May 2019, at 16:37, SOLR4189 <[hidden email]> wrote:
>
> Hi all,
>
> Can somebody explain me SOLR tip from  here
> <https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/aliases.html>
> :
> /"Any alias (standard or routed) that references multiple collections may
> complicate relevancy. By default, SolrCloud scores documents on a per shard
> basis. With multiple collections in an alias this is always a problem, so if
> you have a use case for which BM25 or TF/IDF relevancy is important you will
> want to turn on one of the ExactStatsCache implementations"/
>
> But there is / "This implementation uses global values (across the
> collection) for document frequency" / in ExactStatsCache documentation (from
> here
> <https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/distributed-requests.html#distributedidf>
> )
>
> So what does it mean "across the collection"? Does it mean that distributed
> IDF is inside the same collection (across shards)? If yes, how it will help
> in the alias case?


Re: Distributed IDF in Alias

SOLR4189
I'm asking because I want to use TRA (Time Routed Aliases). Let's say
Solr opens a new collection every month. At the beginning of the month the new
collection will be almost empty.
So will the IDF be different between the new collection and the previous
month's collection?




Re: Distributed IDF in Alias

Andrzej Białecki
Yes, the IDFs will be different. You could probably implement a custom component that takes term statistics from the previous collections to pre-populate the stats of the current collection, but this is uncharted territory and a lot could go wrong. E.g., if there’s a genuine shift in the term distribution in more recent documents, you probably would not want the old statistics to skew the more recent results; at the very least you would want to apply some weighting factor. At that point, predicting the final term IDFs (and consequently document rankings) becomes quite complicated.
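
As a rough illustration of the weighting idea, the Python sketch below blends hypothetical statistics from a previous collection into those of a new, nearly empty one. The function `blended_idf`, the `weight` parameter and all numbers are assumptions made for this example; nothing like this exists in Solr out of the box, and the IDF form is the one used by Lucene's BM25Similarity.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """IDF in the form used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def blended_idf(current, previous, weight):
    """IDF from the current stats plus down-weighted previous stats.

    current and previous are (doc_freq, doc_count) pairs; weight in [0, 1]
    controls how strongly the previous collection influences the result.
    This is a hypothetical scheme, not an existing Solr feature.
    """
    doc_freq = current[0] + weight * previous[0]
    doc_count = current[1] + weight * previous[1]
    return bm25_idf(doc_freq, doc_count)

current = (5, 100)             # new, almost empty monthly collection
previous = (50_000, 100_000)   # last month's collection

for weight in (0.0, 0.1, 1.0):
    print(weight, round(blended_idf(current, previous, weight), 3))
```

Even a small weight lets the large previous collection dominate the blended value, which hints at why choosing the factor, and predicting the resulting rankings, is hard.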

> On 18 May 2019, at 08:14, SOLR4189 <[hidden email]> wrote:
>
> I ask my question due to I want to use TRA (Time Routed Aliases). Let's say
> SOLR will open new collection every month. In the beginning of month a new
> collection will be empty almost.
> So IDF will be different between new collection and collection of previous
> month?


Re: Distributed IDF in Alias

Erick Erickson
In reply to this post by SOLR4189
In a word, “yes”. For time routed aliases, you also have to be aware of the nature of your data. Take the canonical example of news stories for instance, and let’s assume that every day a new collection is created.

Now a hot news story breaks and the news is flooded with the latest story, “Hurricane hits Florida” for instance. The recent news will contain many more mentions of Florida than older collections do. So the TF/IDF statistics for recent collections will be much different from those of old collections.

In the normal SolrCloud case where routing is done by hashing the <uniqueKey>, the assumption is that the close-to-random distribution of stories will make the stats on individual shards “close enough”.
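
The difference between the two routing styles can be simulated with a small Python sketch. Everything here is an assumption made for illustration: the document ids, the rule for which documents mention the term, the shard count, and the use of MD5 as a stand-in for Solr's actual routing hash.

```python
import hashlib

NUM_DOCS = 10_000
NUM_SHARDS = 4

def has_term(i):
    # Hypothetical data: a burst of mentions in the newest 1,000 documents,
    # plus a thin background of mentions in every 50th document.
    return i >= 9_000 or i % 50 == 0

def hash_route(i):
    # Stand-in for hash-based routing on the uniqueKey (MD5, not Solr's hash).
    digest = hashlib.md5(f"doc-{i}".encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def time_route(i):
    # Documents arrive in order; each quarter of the timeline gets its own index.
    return i * NUM_SHARDS // NUM_DOCS

def df_ratios(route):
    """Fraction of documents containing the term, per shard/collection."""
    counts = [0] * NUM_SHARDS
    matches = [0] * NUM_SHARDS
    for i in range(NUM_DOCS):
        shard = route(i)
        counts[shard] += 1
        matches[shard] += has_term(i)
    return [m / c for m, c in zip(matches, counts)]

def spread(ratios):
    return max(ratios) - min(ratios)

hash_spread = spread(df_ratios(hash_route))
time_spread = spread(df_ratios(time_route))
print("hash routing:", df_ratios(hash_route))
print("time routing:", df_ratios(time_route))
```

Under hash routing every shard sees roughly the same fraction of matching documents, while under time routing the newest index carries almost all of them, so its document frequency for the term differs sharply from the older ones.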

Best,
Erick
