Re: ClusterState says we are the leader, but locally we don't think so

Jon Drews
Following up on this thread. We finally determined that Solr was being
hard killed: the Windows service was not giving Solr enough time to shut
down before terminating it. We fixed this and have not had the issue
since.
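
The message above does not say how the service was changed, so the following is only a minimal sketch of the general idea: ask the process to stop, wait for a grace period, and hard kill it only if it is still running afterwards. The function name `stop_gracefully`, the `request_stop` callback, and the 60-second default are all hypothetical choices for illustration, not anything taken from Solr or from the Windows service in question.

```python
import subprocess
import sys
import time


def stop_gracefully(proc, request_stop, grace_seconds=60.0, poll_interval=0.5):
    """Ask a process to stop and wait before resorting to a hard kill.

    proc          -- a subprocess.Popen handle for the running process
    request_stop  -- hypothetical callback that asks the process to shut
                     down cleanly (for example, by running its stop command)
    grace_seconds -- how long to wait for a clean exit (assumed value)

    Returns True if the process exited on its own within the grace period,
    False if it had to be hard killed.
    """
    request_stop()
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True  # clean exit within the grace period
        time.sleep(poll_interval)
    if proc.poll() is not None:
        return True  # exited just as the grace period ended
    proc.kill()  # last resort: the hard kill this thread is about
    proc.wait()
    return False


if __name__ == "__main__":
    # Stand-in child process that exits cleanly once its stdin is closed.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
    )
    clean = stop_gracefully(child, child.stdin.close, grace_seconds=10)
    print("clean shutdown" if clean else "hard killed")
```

For Solr, `request_stop` might run the stop script that ships with Solr (for example `bin\solr.cmd stop -p 8984` for the port seen in this thread); the right grace period depends on how long the instance actually takes to shut down.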

On Tue, May 31, 2016 at 1:51 PM, Jon Drews <[hidden email]> wrote:

> I forgot to add that this is Apache Solr 5.3.1.
>
> There are three collections: two have one shard each and the other
> has 3-5 shards. There are approximately 200,000 documents across all
> collections.
>
> Jon Drews
> jondrews.com
>
> On Tue, May 31, 2016 at 12:15 PM, Jon Drews <[hidden email]> wrote:
>
>> We have seen the following error on four separate instances of Solr. The
>> result is that all or most shards go into "Down" state and do not recover
>> on restart of Solr.
>>
>> I'm hoping one of you has some insight into what might be causing it as
>> we haven't been able to track down the issue or reproduce it reliably.
>>
>> 2016-05-26 21:00:09.000 ERROR (qtp1450821318-15) [c:log s:20160526
>> r:core_node4 x:log_20160526_replica1] o.a.s.c.SolrCore
>> org.apache.solr.common.SolrException: ClusterState says we are the
>> leader (https://localhost:8984/solr/log_20160526_replica1), but locally
>> we don't think so. Request came from
>> https://localhost:8984/solr/log_20160524_replica1/
>>
>> We were able to recover by using https://github.com/echoma/zkui/ to
>> manually edit the /clusterstate.json and /collections/log/state.json to set
>> shards from "Down" to "Active". After that the error subsided and
>> functionality was restored.
>>
>> A few notes:
>> - All four systems were on either Windows 7 or Windows Server 2012.
>> - All four systems are single servers with embedded ZooKeeper.
>> - SSL was enabled in Solr, but authentication was not.
>> - After the issue, we increased the zkClientTimeout and restarted;
>> however, all shards were still in a Down state and the error persisted.
>> - Migrating the Solr instance to a new Windows install did not solve
>> the issue.
>>
>> Please let me know if you have any ideas as to why this is happening and
>> possible solutions. Thanks!
>>
>
>
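
The manual recovery described in the quoted message (editing state.json so that entries marked Down become Active) can be sketched roughly as below. This assumes the state document is JSON shaped as collection -> "shards" -> shard -> "replicas" -> replica -> "state", with lowercase values such as "down" and "active"; that layout and the helper name `mark_replicas_active` are assumptions for illustration, and the sketch only transforms text that has already been exported from ZooKeeper by some other tool. Forcing states by hand bypasses Solr's own recovery, so it only makes sense as a last resort once the underlying cause has been fixed.

```python
import json


def mark_replicas_active(state_json_text):
    """Return (new_json_text, changed_count) with "down" replicas set to "active".

    Assumes the layout collection -> shards -> replicas -> state; any entry
    that does not match this shape is left untouched.
    """
    state = json.loads(state_json_text)
    changed = 0
    for collection in state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("state") == "down":
                    replica["state"] = "active"
                    changed += 1
    return json.dumps(state, indent=2), changed


if __name__ == "__main__":
    # Hypothetical, heavily trimmed example of a collection state document.
    sample = json.dumps({
        "log": {
            "shards": {
                "20160526": {
                    "state": "active",
                    "replicas": {
                        "core_node4": {
                            "core": "log_20160526_replica1",
                            "state": "down",
                        }
                    },
                }
            }
        }
    })
    new_text, count = mark_replicas_active(sample)
    print(count, "replica(s) updated")
    print(new_text)
```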