Solr 7 not removing a node completely due to too small thread pool

Roger Lehmann
Situation

I'm currently trying to set up SolrCloud in an AWS Autoscaling Group, so
that it can scale dynamically.

I've also added the following cluster policy and triggers to Solr, so that
each node will have one (and only one) replica of each collection:

{
  "set-cluster-policy": [
    {"replica": "<2", "shard": "#EACH", "node": "#EACH"}
  ],
  "set-trigger": [{
    "name": "node_added_trigger",
    "event": "nodeAdded",
    "waitFor": "5s",
    "preferredOperation": "ADDREPLICA"
  },{
    "name": "node_lost_trigger",
    "event": "nodeLost",
    "waitFor": "120s",
    "preferredOperation": "DELETENODE"
  }]
}

This works pretty well. But my problem is that when a node gets removed,
Solr doesn't remove all 19 replicas from that node, and I then have
problems accessing the "nodes" page:

[image: enter image description here] <https://i.stack.imgur.com/QyJrY.png>

In the logs, this exception occurs:

Operation deletenode failed:java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$45/1104948431@467049e2 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@773563df[Running, pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 1]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:194)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
    at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteCore(DeleteReplicaCmd.java:276)
    at org.apache.solr.cloud.api.collections.DeleteReplicaCmd.deleteReplica(DeleteReplicaCmd.java:95)
    at org.apache.solr.cloud.api.collections.DeleteNodeCmd.cleanupReplicas(DeleteNodeCmd.java:109)
    at org.apache.solr.cloud.api.collections.DeleteNodeCmd.call(DeleteNodeCmd.java:62)
    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:292)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:496)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Problem description

So the problem is that the thread pool has a size of only 10; all 10 threads
are busy and nothing gets queued (tasks are handed off synchronously). In
fact, only 10 replicas were removed and the other 9 stayed in place. Manually
sending the API command to delete this node afterwards works fine, since Solr
then only needs to remove the remaining 9 replicas, and everything is good again.
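
The rejection described above can be reproduced outside Solr with a plain java.util.concurrent.ThreadPoolExecutor. The following is only a sketch: it assumes a pool capped at 10 threads with a SynchronousQueue, which is consistent with "pool size = 10, active threads = 10, queued tasks = 0" in the log but is not taken from Solr's source. The class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectionDemo {
    /** Submits {@code tasks} long-running tasks and returns {accepted, rejected}. */
    static int[] submitAll(int tasks) throws InterruptedException {
        // Assumption: at most 10 threads and a zero-capacity hand-off queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, 10, 5L, TimeUnit.SECONDS, new SynchronousQueue<>());
        // Every task blocks on this latch, so all threads stay busy while we submit.
        CountDownLatch release = new CountDownLatch(1);
        int accepted = 0;
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                accepted++;
            } catch (RejectedExecutionException e) {
                // Default AbortPolicy: no idle thread and no queue capacity.
                rejected++;
            }
        }
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return new int[] {accepted, rejected};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = submitAll(19);
        System.out.println("accepted=" + result[0] + " rejected=" + result[1]);
        // prints "accepted=10 rejected=9"
    }
}
```

With 19 tasks, the first 10 each get a thread and the other 9 are rejected, which matches the 10 removed and 9 remaining replicas.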

Question

How can I increase this (small) thread pool size, enable queueing of the
remaining deletion tasks, or both? Another solution might be to retry the
failed task until it succeeds.
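
As a stopgap, the retry idea can be scripted from outside Solr by re-sending the Collections API DELETENODE command, which is the manual step that worked above. This is a rough sketch, not a tested tool: the base URL, the node name, the attempt count and the two-second pause are placeholders, and treating any HTTP 200 as success is an assumption (the response body may still report per-replica failures). It needs Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DeleteNodeRetry {
    /** Builds the DELETENODE request URI for a node name such as "10.0.0.5:8983_solr". */
    static URI deleteNodeUri(String solrBaseUrl, String nodeName) {
        String node = URLEncoder.encode(nodeName, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/admin/collections?action=DELETENODE&node=" + node);
    }

    /** Re-sends DELETENODE up to maxAttempts times; returns true once a call answers HTTP 200. */
    static boolean deleteNode(HttpClient client, String solrBaseUrl, String nodeName, int maxAttempts)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(deleteNodeUri(solrBaseUrl, nodeName)).GET().build();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                return true; // assumption: 200 means the remaining replicas were removed
            }
            Thread.sleep(2000L); // placeholder pause before the next attempt
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder host and node name; replace with values from your cluster.
        boolean ok = deleteNode(HttpClient.newHttpClient(),
                "http://localhost:8983/solr", "10.0.0.5:8983_solr", 3);
        System.out.println(ok ? "node removed" : "still failing");
    }
}
```

Each attempt can only clear as many replicas as the pool accepts, so a node holding more replicas may need more attempts.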

Using Solr 7.7.1 on Ubuntu Server installed with the installation script
from Solr (so I guess it's using Jetty?).

Thanks for your help!

Re: Solr 7 not removing a node completely due to too small thread pool

Roger Lehmann
To be more specific: I currently have 19 collections, and each node holds
exactly one replica per collection. A new node will automatically create
new replicas on itself, one for each existing collection (see
cluster-policy above).
So when a node is removed, all 19 of its replicas need to be deleted. This
can't be done in one go because the thread pool (parallel synchronous
execution) is limited to 10 threads and does not scale up when necessary.
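
For comparison, here is what "queueing" would mean in plain java.util.concurrent terms. This is only an illustration with made-up names: it is not a Solr setting, and nothing in this thread shows that Solr 7.7.1 exposes such an option. A pool of 10 threads backed by an unbounded LinkedBlockingQueue accepts all 19 tasks and holds 9 of them until a thread frees up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueingDemo {
    /** Submits {@code tasks} long-running tasks and returns {accepted, queuedWhileBusy}. */
    static int[] submitAll(int tasks) throws InterruptedException {
        // Illustrative pool: 10 threads plus an unbounded queue for the overflow.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        // Every task blocks on this latch, so all 10 threads stay busy while we submit.
        CountDownLatch release = new CountDownLatch(1);
        int accepted = 0;
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            accepted++; // never rejected: overflow goes into the queue
        }
        int queued = pool.getQueue().size();
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return new int[] {accepted, queued};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = submitAll(19);
        System.out.println("accepted=" + result[0] + " queued=" + result[1]);
        // prints "accepted=19 queued=9"
    }
}
```

The trade-off is that an unbounded queue never pushes back, so a burst of work waits in memory instead of failing fast.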

Re: Solr 7 not removing a node completely due to too small thread pool

Shalin Shekhar Mangar
Thanks Roger. This was reported earlier but missed our attention.

The issue is https://issues.apache.org/jira/browse/SOLR-11208

--
Regards,
Shalin Shekhar Mangar.

Re: Solr 7 not removing a node completely due to too small thread pool

Roger Lehmann
Oh great, thanks for the hint!
I've upvoted this issue, since I think it would be worthwhile to make that
(rather low) thread pool size configurable.

--

Roger Lehmann
Linux-System-Engineer
