Solr moved all replicas from node

Hendrik Haddorp
Hi,

I have two Solr clouds running version 7.6.0, each with 4 nodes and about
500 collections with one shard and a replication factor of 2. The data is
stored in HDFS. I restarted the nodes one by one and always waited for the
replicas to fully recover before restarting the next one. Once the last
node had been restarted, I noticed that Solr started moving replicas to
other nodes. In fact, it moved all replicas off one node, which is now
empty. Is there any way to figure out why Solr decided to do that?
The only problem I can see is that during the recovery the Solr instance
logged a problem with HDFS, claiming that the filesystem was closed. The
recovery seems to have continued just fine after that, though, and the
logs are clean afterwards.
I have now restarted the node and invoked the UTILIZENODE action, which
moved a few replicas back to the node but then failed with this exception:

{
   "responseHeader":{
     "status":500,
     "QTime":40220},
   "Operation utilizenode caused
exception:":"java.lang.IllegalArgumentException:java.lang.IllegalArgumentException:
Comparison method violates its general contract!",
   "exception":{
     "msg":"Comparison method violates its general contract!",
     "rspCode":-1},
   "error":{
     "metadata":[
       "error-class","org.apache.solr.common.SolrException",
       "root-error-class","org.apache.solr.common.SolrException"],
     "msg":"Comparison method violates its general contract!",
     "trace":"org.apache.solr.common.SolrException: Comparison method
violates its general contract!\n\tat
org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:274)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat
java.lang.Thread.run(Thread.java:748)\n",
     "code":500}}

When I invoked it again, it moved a few more replicas but then failed in
the same way. The log has this additional exception:
2019-02-10 00:09:00.539 ERROR
(OverseerThreadFactory-1268-thread-38-processing-n:agent2:9151_solr) [  
] o.a.s.c.a.c.OverseerCollectionMessageHandler Operation utilizenode
failed:java.lang.IllegalArgumentException: Comparison method violates
its general contract!
     at java.util.TimSort.mergeLo(TimSort.java:777)
     at java.util.TimSort.mergeAt(TimSort.java:514)
     at java.util.TimSort.mergeCollapse(TimSort.java:439)
     at java.util.TimSort.sort(TimSort.java:245)
     at java.util.Arrays.sort(Arrays.java:1512)
     at java.util.ArrayList.sort(ArrayList.java:1462)
     at
org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.tryEachNode(MoveReplicaSuggester.java:50)
     at
org.apache.solr.client.solrj.cloud.autoscaling.MoveReplicaSuggester.init(MoveReplicaSuggester.java:38)
     at
org.apache.solr.client.solrj.cloud.autoscaling.Suggester.getSuggestion(Suggester.java:187)
     at
org.apache.solr.cloud.api.collections.UtilizeNodeCmd.call(UtilizeNodeCmd.java:100)
     at
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:259)
     at
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:478)
     at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
     at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)

I'm not quite sure what it compares, but the comparator should be this one:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/MoveReplicaSuggester.java#L98
I'm not sure whether this can happen, but if both replicas are leaders the
result looks wrong to me.
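
To illustrate the suspicion about two leaders, here is a minimal Java sketch. It is not the Solr code: the Replica class and both comparators are hypothetical stand-ins, and the "broken" comparator only assumes a shape in which the first argument is checked for leadership before the second. With two leaders such a comparator returns 1 in both directions, which breaks the rule that sgn(compare(x, y)) must equal -sgn(compare(y, x)).

```java
import java.util.Comparator;

public class ComparatorContractSketch {
    // Hypothetical stand-in for a replica; not Solr's own class.
    static final class Replica {
        final String name;
        final boolean leader;

        Replica(String name, boolean leader) {
            this.name = name;
            this.leader = leader;
        }
    }

    // Assumed shape of a "leaders last" comparator that never
    // considers the case where both arguments are leaders.
    static final Comparator<Replica> LEADER_LAST_BROKEN = (a, b) -> {
        if (a.leader) return 1;
        if (b.leader) return -1;
        return 0;
    };

    // Contract-respecting alternative: Boolean.compare orders false before
    // true and returns 0 when both flags are equal.
    static final Comparator<Replica> LEADER_LAST_FIXED =
        (a, b) -> Boolean.compare(a.leader, b.leader);

    // Checks sgn(compare(x, y)) == -sgn(compare(y, x)) for a single pair.
    static boolean isAntisymmetric(Comparator<Replica> c, Replica x, Replica y) {
        return Integer.signum(c.compare(x, y)) == -Integer.signum(c.compare(y, x));
    }

    public static void main(String[] args) {
        Replica l1 = new Replica("leader1", true);
        Replica l2 = new Replica("leader2", true);
        System.out.println("broken: " + isAntisymmetric(LEADER_LAST_BROKEN, l1, l2)); // prints "broken: false"
        System.out.println("fixed:  " + isAntisymmetric(LEADER_LAST_FIXED, l1, l2));  // prints "fixed:  true"
    }
}
```

Java's sort does not detect every such violation, so whether "Comparison method violates its general contract!" is actually thrown depends on the input being sorted.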

Anyhow, my main issue is that I don't see why Solr suddenly decided to
move all replicas off my node.

regards,
Hendrik

Re: Solr moved all replicas from node

Erick Erickson
What version of Solr? Do you have any of the autoscaling stuff turned
on? What about autoAddReplicas (which does not need Solr 7x)?


Re: Solr moved all replicas from node

Hendrik Haddorp
Solr version is 7.6.0
autoAddReplicas is set to true
/api/cluster/autoscaling returns this:

{
   "responseHeader":{
     "status":0,
     "QTime":1},
   "cluster-preferences":[{
       "minimize":"cores",
       "precision":1}],
   "cluster-policy":[{
       "replica":"<2",
       "shard":"#EACH",
       "node":"#ANY"}],
   "triggers":{
     ".auto_add_replicas":{
       "name":".auto_add_replicas",
       "event":"nodeLost",
       "waitFor":1800,
       "enabled":true,
       "actions":[{
           "name":"auto_add_replicas_plan",
           "class":"solr.AutoAddReplicasPlanAction"},
         {
           "name":"execute_plan",
           "class":"solr.ExecutePlanAction"}]},
     ".scheduled_maintenance":{
       "name":".scheduled_maintenance",
       "event":"scheduled",
       "startTime":"NOW",
       "every":"+1DAY",
       "enabled":true,
       "actions":[{
           "name":"inactive_shard_plan",
           "class":"solr.InactiveShardPlanAction"},
         {
           "name":"execute_plan",
           "class":"solr.ExecutePlanAction"}]}},
   "listeners":{
     ".auto_add_replicas.system":{
       "beforeAction":[],
       "afterAction":[],
       "stage":["STARTED",
         "ABORTED",
         "SUCCEEDED",
         "FAILED",
         "BEFORE_ACTION",
         "AFTER_ACTION",
         "IGNORED"],
       "trigger":".auto_add_replicas",
       "class":"org.apache.solr.cloud.autoscaling.SystemLogListener"},
     ".scheduled_maintenance.system":{
       "beforeAction":[],
       "afterAction":[],
       "stage":["STARTED",
         "ABORTED",
         "SUCCEEDED",
         "FAILED",
         "BEFORE_ACTION",
         "AFTER_ACTION",
         "IGNORED"],
       "trigger":".scheduled_maintenance",
       "class":"org.apache.solr.cloud.autoscaling.SystemLogListener"}},
   "properties":{},
   "WARNING":"This response format is experimental.  It is likely to change in the future."}

I have two Solr clouds that are set up in the same way. When restarting
the nodes, only one of them showed this behavior.
Ideally, I want replicas to be moved when a node is down for a longer
time, but not when I just restart it. I would also like all nodes to end
up with the same number of cores.



Re: Solr moved all replicas from node

Hendrik Haddorp
I opened https://issues.apache.org/jira/browse/SOLR-13240 for the exception.
