Out of sync deletions causing differing IDF

Out of sync deletions causing differing IDF

Upayavira-2
We have a system that sees a reasonable number of changes on a
daily basis (maybe 60M docs, and around 1M updates per day). Using
SolrCloud, the data is split into 10 shards and those shards are replicated.

What we are finding is that the number of deletions is causing differing
maxDocs across the different replicas, and that is causing significantly
different IDF values between replicas of the same shard, giving
different scores and thus different orders depending upon which replica
we hit.
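
To make the effect concrete, here is a minimal sketch in Python. It assumes the classic TF-IDF form of IDF, 1 + ln(N / (df + 1)); the similarity actually configured may use a different formula. All document counts are hypothetical, and the term's document frequency is assumed equal on both replicas to isolate the effect of the document count.

```python
import math

def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic TF-IDF style inverse document frequency: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))

# Hypothetical replicas of one shard: identical live documents,
# but a different number of deleted documents not yet merged away.
live_docs = 6_000_000
replica_a_max_doc = live_docs + 200_000    # few deletes awaiting merge
replica_b_max_doc = live_docs + 1_500_000  # many deletes awaiting merge

doc_freq = 1_000  # assumed identical on both replicas, for simplicity

# IDF when the document count includes deleted documents (maxDoc-style).
idf_a = classic_idf(doc_freq, replica_a_max_doc)
idf_b = classic_idf(doc_freq, replica_b_max_doc)

# IDF when only live documents are counted: the same on both replicas.
idf_live = classic_idf(doc_freq, live_docs)

print(f"replica A: {idf_a:.4f}  replica B: {idf_b:.4f}  live only: {idf_live:.4f}")
```

In this sketch the two replicas disagree only because of deletes that have not yet been merged out; counting live documents alone removes that source of disagreement.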

I would have expected that, because the data is being indexed
concurrently across replicas, the pattern of delete/merge would be
similar across replicas, but that doesn't seem to be the case in
practice.

We could, of course, optimise the index to merge down to a single
segment. This would clear all deletes out, but would leave us in a worse
place for the future, as now most of our deletes would be concentrated
into a single large segment.

Has anyone seen this sort of thing before, and does anyone have
suggested strategies as to how to encourage IDF values into a similar
range across replicas?

Upayavira
RE: Out of sync deletions causing differing IDF

Markus Jelsma-2
Hello - your similarity should rely on numDocs instead; that solves the problem. I believe it is already fixed in trunk, but I am not sure.
Markus
 

Re: Out of sync deletions causing differing IDF

Erick Erickson
Upayavira:

bq: I would have expected that, because the data is being indexed
concurrently across replicas, that the pattern of delete/merge would be
similar across replicas.

Except for the pesky timing issue. The timers start for autocommit when a
request is received. So the time the autocommit timer expires won't be
the same wall-clock time on all servers and thus may not have the same docs
in the same segments. It would be _really nice_ if they did, because then
we wouldn't have to fall back to full replication so often for recovery.
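
A toy simulation of that timing effect, as a sketch only: it assumes a timer that starts at the first uncommitted update and closes the segment once `max_time` has elapsed, which is a simplification and not Solr's actual commit code. The arrival times and the `segment_boundaries` helper are hypothetical.

```python
def segment_boundaries(arrivals, max_time):
    """Group (doc_id, arrival_time) pairs into segments.

    The timer starts when the first uncommitted update arrives and the
    segment is closed by the first update arriving at or after the deadline.
    """
    segments, current, deadline = [], [], None
    for doc_id, t in arrivals:
        if deadline is not None and t >= deadline:
            segments.append(current)      # timer expired: commit what we have
            current, deadline = [], None
        if deadline is None:
            deadline = t + max_time       # first update after a commit starts the timer
        current.append(doc_id)
    if current:
        segments.append(current)
    return segments

# The same five updates reach two replicas with slightly different delays.
replica_a = [(1, 0.0), (2, 4.0), (3, 9.5), (4, 10.5), (5, 14.0)]
replica_b = [(1, 0.3), (2, 4.1), (3, 10.4), (4, 10.6), (5, 14.2)]

segments_a = segment_boundaries(replica_a, max_time=10.0)
segments_b = segment_boundaries(replica_b, max_time=10.0)

print(segments_a)
print(segments_b)
```

Document 3 lands in the first segment on one replica and in the second on the other, so later merges (and the deletes they clear out) no longer line up.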

I think there's a JIRA out there for trying to coordinate all the commits across
replicas in a shard, but I can't find it on a quick look.

Would distributed IDF help here?
https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
really old, it's in 5.0+)
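
The idea behind distributed IDF can be sketched in Python as follows. This illustrates the general technique (sum the per-shard statistics, then compute one shared IDF), not Solr's implementation; the `ShardStats` type, the figures, and the classic 1 + ln(N / (df + 1)) formula are assumptions made for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class ShardStats:
    doc_count: int  # documents the shard reports (hypothetical figure)
    doc_freq: int   # documents on that shard containing the term

def local_idf(s: ShardStats) -> float:
    """IDF computed from a single shard's own statistics."""
    return 1.0 + math.log(s.doc_count / (s.doc_freq + 1))

def global_idf(stats: list[ShardStats]) -> float:
    """IDF computed once from statistics summed over all shards."""
    total_docs = sum(s.doc_count for s in stats)
    total_df = sum(s.doc_freq for s in stats)
    return 1.0 + math.log(total_docs / (total_df + 1))

shards = [
    ShardStats(doc_count=6_200_000, doc_freq=900),
    ShardStats(doc_count=7_500_000, doc_freq=1_100),
    ShardStats(doc_count=6_000_000, doc_freq=1_000),
]

local = [local_idf(s) for s in shards]
shared = global_idf(shards)

print([f"{x:.4f}" for x in local], f"{shared:.4f}")
```

Every shard then scores with the same shared value. If the per-shard counts still include unmerged deletes, the summed statistics will too, so this may narrow rather than remove differences between replicas.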

Best,
Erick


Re: Out of sync deletions causing differing IDF

Malcolm Upayavira Holmes
Thanks for both of these; we'll give them both a try and see what
difference they make.

Upayavira
