Inconsistent replicas in a shard


Inconsistent replicas in a shard

WebsterHomer
We are using Solr 6.2.0 in SolrCloud mode.

I have a QA solrcloud that has multiple collections. All collections have 2
shards each with two replicas.

I have several replicas where the numDocs in the same shard do not match.
In two collections with three different shards I have one replica with data
and the other has no data. All six replicas appear healthy in the Solr
console.

So how does that happen where two replicas in the same shard have different
amounts of data?

How do you diagnose this when the replicas are active and seemingly healthy?

How do I get the replicas with no data to pull the data from their leader? In
all three cases the replica with data is the leader.

I also see two other collections where the replicas' numDocs don't quite
match. In those two cases the leader has a few more docs than the other
replica.

How do I remedy this situation?

This SolrCloud is a target of CDCR replication, but I'm not sure why that
would matter, since I believe CDCR has the shard leaders communicate and the
followers should just get their updates from their leader as they would
from a normal update.

I'm just lucky that this is not a production solrcloud! Still need to know
how to fix it.

Thanks!

--


This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.

Re: Inconsistent replicas in a shard

Erick Erickson
Shouldn't be happening of course (replicas with different numbers of
docs), at least permanently. It can regularly happen on a _temporary_
basis however. And there are ways you can cause this to happen
permanently. Here's an outline.

> Temporarily out of sync: because commits happen at different wall-clock times, replicas in the same shard can differ for up to the autocommit interval. Ways to check:
>> Stop indexing, wait for CDCR to catch up _plus_ your autocommit interval, then check.
>> Fire a query at each replica that cuts off at some time in the past, add distrib=false, and compare the number of hits returned. The query looks something like "..solr/collection1_shard1_replica1/query?q=*:*&fq=timestamp:[* TO NOW-(2x autocommit interval + CDCR latency)]&distrib=false". This requires a reliable timestamp field, of course.

> Permanently out of sync:
>> If you ever fired a FORCELEADER at a replica, you are risking this.
>> If you stopped the (non-leader) replica and kept indexing, then stopped the leader and started the replica back up, you can also end up here. Solr does the best it can to preserve the data, but a replica that was offline doesn't have the missed updates in its tlog to replay. So if that stale replica is elected leader, it won't have all the updates.
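
The per-replica check in the outline above could be scripted roughly as follows. This is only a sketch, not something from the original thread: the base URL, the core names, the NOW-10MINUTES cutoff and the helper names are all made up for illustration, and it assumes the collection has a reliable "timestamp" field and a /query handler that returns JSON. Pick a cutoff larger than your own autocommit interval plus CDCR latency.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_count_url(solr_base, core, cutoff="NOW-10MINUTES", ts_field="timestamp"):
    """Build a count-only query aimed at one replica core.

    distrib=false keeps Solr from fanning the request out to other
    replicas, so the hit count reflects this core alone.
    """
    params = {
        "q": "*:*",
        "fq": f"{ts_field}:[* TO {cutoff}]",  # ignore docs newer than the cutoff
        "distrib": "false",
        "rows": "0",  # only numFound is needed, not the documents
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/{core}/query?{urlencode(params)}"


def fetch_num_found(url, timeout=10):
    """Run the query and return numFound from the JSON response."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def out_of_sync(counts):
    """True when the replicas of one shard report different hit counts."""
    return len(set(counts.values())) > 1


if __name__ == "__main__":
    # Hypothetical host and core names; substitute the replicas of one shard.
    base = "http://localhost:8983/solr"
    cores = ["collection1_shard1_replica1", "collection1_shard1_replica2"]
    counts = {core: fetch_num_found(build_count_url(base, core)) for core in cores}
    print(counts, "OUT OF SYNC" if out_of_sync(counts) else "in sync")
```

If the counts still differ with a cutoff well in the past and indexing stopped, the mismatch is probably not just commit skew.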


Best,
Erick

In reply to Webster Homer's message of Fri, Oct 6, 2017, 12:04 PM.

Re: Inconsistent replicas in a shard

WebsterHomer
In reply to this post by WebsterHomer
This SolrCloud has had some issues of late. We had a network glitch which
caused a shard leader of one of the collections to write over 5000
zero-length tlogs to its filesystem. Whenever it started up it ran out of
file handles, which killed the IndexWriter and caused lots of unhappy
collections. This may be related to that. No one was alerted to the errors
for several days.

These replicas have been out of sync for a while. Indeed, we did a data load
to one of the collections just today and it stayed bad. I can say that we
have NEVER done a FORCELEADER, unless some internal SolrCloud code does this.

The data load we did had no errors and both replicas are in the active
state. No replica was offline.

Oh, I just went back and looked, and saw that the empty replica in the
collection we just loaded has now caught up and has data. It took a while,
but it now matches its leader. Perhaps all we need to do is run new data
loads to the out-of-whack collections?
