Async exceptions during distributed update

Async exceptions during distributed update

Jay Potharaju-2
Hi,
I am seeing the following lines in the error log. My setup has 2 nodes in
the SolrCloud cluster; each node has 3 shards with no replication. From the
error log it seems that all the shards on this box are throwing async
exception errors. The other node in the cluster does not have any errors in
its logs. Any suggestions on how to tackle this error?

Solr setup
Solr: 6.6.3
2 nodes: 3 shards each


ERROR org.apache.solr.servlet.HttpSolrCall  [test_shard3_replica1] ?
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:972)
at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1911)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:78)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source)


Thanks
Jay

Re: Async exceptions during distributed update

Emir Arnautović
Hi Jay,
My first guess would be that there was a major GC pause on the other box, so it did not respond in time. Are your nodes well balanced, i.e. do they serve an equal amount of data?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 May 2018, at 18:11, Jay Potharaju <[hidden email]> wrote:
>
> Hi,
> I am seeing the following lines in the error log. My setup has 2 nodes in
> the SolrCloud cluster; each node has 3 shards with no replication. From the
> error log it seems that all the shards on this box are throwing async
> exception errors. The other node in the cluster does not have any errors in
> its logs. Any suggestions on how to tackle this error?

Re: Async exceptions during distributed update

Jay Potharaju-2
Yes, the nodes are well balanced. I am only using these boxes for indexing
data; they are not serving any traffic at this time. The error indicates
problems with the shards hosted on this box, not with those on the other
box.
I will check the GC logs to see if there were any issues.
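
One way to run a check like this is to scan a GC log for long stop-the-world pauses. The following is only a sketch, assuming a Java 8 style GC log written with -XX:+PrintGCApplicationStoppedTime, which emits "Total time for which application threads were stopped" lines. The function name, the sample lines, and the 0.5 second threshold are made up for illustration.

```python
import re

# Matches the pause duration in lines such as:
#   ...: Total time for which application threads were stopped: 1.0123456 seconds, ...
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_s=0.5):
    """Return (line_number, pause_seconds) for every pause above threshold_s."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match and float(match.group(1)) > threshold_s:
            hits.append((number, float(match.group(1))))
    return hits


# Made-up sample lines; real input would be read from the node's GC log file.
SAMPLE = [
    "2018-05-07T18:10:57.001+0000: 120.100: Total time for which application "
    "threads were stopped: 0.0123456 seconds, Stopping threads took: 0.0000789 seconds",
    "2018-05-07T18:10:58.123+0000: 121.222: Total time for which application "
    "threads were stopped: 1.0123456 seconds, Stopping threads took: 0.0000812 seconds",
]

if __name__ == "__main__":
    for number, seconds in long_pauses(SAMPLE):
        print(f"line {number}: paused {seconds:.3f} s")
```

Running the same scan against the GC logs of both nodes, for the time window of the errors, shows which box was actually stalled.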

Thanks
Jay Potharaju


On Mon, May 7, 2018 at 9:34 AM, Emir Arnautović <
[hidden email]> wrote:

> Hi Jay,
> My first guess would be that there was a major GC pause on the other box,
> so it did not respond in time. Are your nodes well balanced, i.e. do they
> serve an equal amount of data?
>
> Thanks,
> Emir

Re: Async exceptions during distributed update

Emir Arnautović
Node A receives a batch of documents to index. It forwards the documents to the shards that are on node B. Node B is having issues with GC, so it takes a while to respond. Node A sees this as a read timeout and reports it in its logs. So the issue is on node B, not node A.
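
The sequence above can be reproduced in miniature. This is a hypothetical sketch using plain sockets, not Solr code: `node_b`, `forward_update`, and all timings are invented for illustration. It shows only that the timeout is raised (and would be logged) on the forwarding side, even though the stall happens on the receiving side.

```python
import socket
import threading
import time


def node_b(conn, pause_s):
    """Stand-in for node B: read the batch, stall (e.g. a GC pause), then ack."""
    conn.recv(1024)
    time.sleep(pause_s)  # simulated stop-the-world pause
    try:
        conn.sendall(b"OK")
    except OSError:
        pass  # node A already gave up and closed its end
    finally:
        conn.close()


def forward_update(pause_s, read_timeout_s):
    """Stand-in for node A: forward a batch and wait for node B's ack."""
    a_side, b_side = socket.socketpair()
    worker = threading.Thread(target=node_b, args=(b_side, pause_s))
    worker.start()
    a_side.settimeout(read_timeout_s)
    try:
        a_side.sendall(b"batch of documents")
        reply = a_side.recv(1024)
        if reply == b"OK":
            return "node A: update acknowledged"
        return "node A: unexpected reply"
    except socket.timeout:
        # The error surfaces here, on node A, although node B caused it.
        return "node A: read timed out"
    finally:
        a_side.close()
        worker.join()


if __name__ == "__main__":
    print(forward_update(pause_s=0.0, read_timeout_s=1.0))
    print(forward_update(pause_s=0.5, read_timeout_s=0.1))
```

In the second call the simulated node B is healthy apart from the pause, yet the only error message comes from the node A side.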

Emir

> On 7 May 2018, at 18:39, Jay Potharaju <[hidden email]> wrote:
>
> Yes, the nodes are well balanced. I am only using these boxes for indexing
> data; they are not serving any traffic at this time. The error indicates
> problems with the shards hosted on this box, not with those on the other
> box.
> I will check the GC logs to see if there were any issues.

Re: Async exceptions during distributed update

Jay Potharaju-2
Ah thanks for explaining that!

Thanks
Jay Potharaju


On Mon, May 7, 2018 at 9:45 AM, Emir Arnautović <
[hidden email]> wrote:

> Node A receives a batch of documents to index. It forwards the documents
> to the shards that are on node B. Node B is having issues with GC, so it
> takes a while to respond. Node A sees this as a read timeout and reports
> it in its logs. So the issue is on node B, not node A.
>
> Emir

Re: Async exceptions during distributed update

Jay Potharaju-2
I didn't see any OOM errors in the logs on either node. I saw a GC pause
of 1 second on the box that was throwing the error, but nothing on the
other node. Any other recommendations?


Thanks
Jay Potharaju


On Mon, May 7, 2018 at 9:48 AM, Jay Potharaju <[hidden email]> wrote:

> Ah thanks for explaining that!

Re: Async exceptions during distributed update

Emir Arnautović
How do you send documents? Large batches? Complex analysis? Do you send all
batches to the same node? How do you commit? Do you delete by query while
indexing?
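
For context, batch size and commit behaviour are normally decided in the indexing client. The sketch below shows one possible client-side shape, assuming documents are posted as JSON to Solr's /update handler with the commitWithin request parameter instead of explicit commits. The helper names, the collection name "test", the field title_s, the batch size of 1000, and the 10 second commitWithin are all illustrative assumptions, not details from this thread.

```python
import json
import urllib.request


def batches(docs, size):
    """Yield consecutive slices of docs with at most `size` documents each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def build_update_request(base_url, collection, batch, commit_within_ms):
    """Build (but do not send) a JSON update request for one batch."""
    url = f"{base_url}/{collection}/update?commitWithin={commit_within_ms}"
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical documents; real ones would come from the indexing job.
    docs = [{"id": str(i), "title_s": f"doc {i}"} for i in range(2500)]
    for batch in batches(docs, size=1000):
        request = build_update_request(
            "http://localhost:8983/solr", "test", batch, commit_within_ms=10000
        )
        print(request.full_url, len(batch))
        # urllib.request.urlopen(request, timeout=60) would actually send it.
```

Keeping batching and commit settings in one place like this makes it easier to answer the questions above and to experiment with smaller batches.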

Emir

On Tue, May 8, 2018, 12:30 AM Jay Potharaju <[hidden email]> wrote:

> I didn't see any OOM errors in the logs on either node. I saw a GC pause
> of 1 second on the box that was throwing the error, but nothing on the
> other node. Any other recommendations?
> >> >> doHandle(ContextHandler.java:1180)
> >> >>> at org.eclipse.jetty.servlet.ServletHandler.doScope(
> >> >> ServletHandler.java:512)
> >> >>> at
> >> >>> org.eclipse.jetty.server.session.SessionHandler.
> >> >> doScope(SessionHandler.java:185)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.ContextHandler.
> >> >> doScope(ContextHandler.java:1112)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.ScopedHandler.handle(
> >> >> ScopedHandler.java:141)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
> >> >> ContextHandlerCollection.java:213)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.HandlerCollection.
> >> >> handle(HandlerCollection.java:119)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> >> >> HandlerWrapper.java:134)
> >> >>> at
> >> >>> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(
> >> >> RewriteHandler.java:335)
> >> >>> at
> >> >>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(
> >> >> HandlerWrapper.java:134)
> >> >>> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> >> >>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> >> >>> at
> >> >>> org.eclipse.jetty.server.HttpConnection.onFillable(
> >> >> HttpConnection.java:251)
> >> >>> at
> >> >>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(
> >> >> AbstractConnection.java:273)
> >> >>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> >> >>> at
> >> >>> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(
> >> >> SelectChannelEndPoint.java:93)
> >> >>> at
> >> >>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
> >> >> QueuedThreadPool.java:671)
> >> >>> at
> >> >>> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
> >> >> QueuedThreadPool.java:589)
> >> >>> at java.lang.Thread.run(Unknown Source)
> >> >>>
> >> >>>
> >> >>> Thanks
> >> >>> Jay
> >> >>
> >> >>
> >>
> >>
> >
>

Re: Async exceptions during distributed update

Jay Potharaju-2
The updates are pushed in real time, not batched. There is no complex analysis, and
everything is committed using the autocommit settings in Solr.

Thanks
Jay Potharaju


On Mon, May 7, 2018 at 4:00 PM, Emir Arnautović <
[hidden email]> wrote:

> How do you send documents? Large batches? Complex analysis? Do you send all
> batches to the same node? How do you commit? Do you delete by query while
> indexing?
>
> Emir
>

Re: Async exceptions during distributed update

Jay Potharaju-2
There are some deletes by query. I have not had any issues with DBQ;
I currently have Solr 5.3 running in production.

Thanks
Jay Potharaju



Re: Async exceptions during distributed update

Emir Arnautović
In reply to this post by Jay Potharaju-2
How many concurrent updates can be sent? Do you always send updates to the
same node? Do you use SolrJ?

Emir

On Tue, May 8, 2018, 1:02 AM Jay Potharaju <[hidden email]> wrote:

> The updates are pushed in real time not batched. No complex analysis and
> everything is committed using autocommit settings in solr.
>
> Thanks
> Jay Potharaju
>
>

Re: Async exceptions during distributed update

Shawn Heisey-2
In reply to this post by Jay Potharaju-2
On 5/7/2018 5:05 PM, Jay Potharaju wrote:
> There are some deletes by query. I have not had any issues with DBQ,
> currently have 5.3 running in production.

Here's the big problem with DBQ.  Imagine this sequence of events with
these timestamps:

13:00:00: A commit for change visibility happens.
13:00:00: A segment merge is triggered by the commit.
(It's a big merge that takes exactly 3 minutes.)
13:00:05: A deleteByQuery is sent.
13:00:15: An update to the index is sent.
13:00:25: An update to the index is sent.
13:00:35: An update to the index is sent.
13:00:45: An update to the index is sent.
13:00:55: An update to the index is sent.
13:01:05: An update to the index is sent.
13:01:15: An update to the index is sent.
13:01:25: An update to the index is sent.
{time passes, more updates might be sent}
13:03:00: The merge finishes.

Here's what would happen in this scenario:  The DBQ and all of the
update requests sent *after* the DBQ will block until the merge
finishes.  That means that it's going to take up to three minutes for
Solr to respond to those requests.  If the client that is sending the
request is configured with a 60 second socket timeout, which inter-node
requests made by Solr are by default, then it is going to experience a
timeout error.  The request will probably complete successfully once the
merge finishes, but the connection is gone, and the client has already
received an error.
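
To make the arithmetic in that scenario concrete, here is a small Python sketch. It assumes, as in the timeline, that the merge ends 180 seconds after 13:00:00, that the DBQ and every update sent after it stay blocked until then, and that the sender uses a 60-second socket timeout. The names are illustrative only and do not come from Solr.

```python
# Simplified model of the timeline: all times are seconds after 13:00:00.
MERGE_END = 180        # the merge finishes at 13:03:00
SOCKET_TIMEOUT = 60    # socket timeout assumed for the sender

def blocked_seconds(sent_at, merge_end=MERGE_END):
    """How long a request sent at `sent_at` waits for the merge to finish."""
    return max(0, merge_end - sent_at)

def times_out(sent_at, merge_end=MERGE_END, timeout=SOCKET_TIMEOUT):
    """True if the wait exceeds the sender's socket timeout."""
    return blocked_seconds(sent_at, merge_end) > timeout

# The DBQ at 13:00:05, then one update every 10 seconds until 13:01:25.
SENT_AT = [5] + list(range(15, 86, 10))

for sent_at in SENT_AT:
    stamp = f"13:{sent_at // 60:02d}:{sent_at % 60:02d}"
    print(stamp, "waits", blocked_seconds(sent_at), "s,",
          "times out" if times_out(sent_at) else "gets a response")
```

In this simplified model every request in the timeline waits between 95 and 175 seconds, so all nine exceed the 60-second timeout. Only a request sent at 13:02:00 or later would get its response in time.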

Now imagine what happens if an optimize (forced merge of the entire
index) is requested on an index that's 50GB.  That optimize may take 2-3
hours, possibly longer.  A deleteByQuery started on that index after the
optimize begins (and any updates requested after the DBQ) will pause
until the optimize is done.  A pause of 2 hours or more is a BIG problem.

This is why deleteByQuery is not recommended.

If the deleteByQuery were changed into a two-step process involving a
query to retrieve ID values and then one or more deleteById requests,
then none of that blocking would occur.  The deleteById operation can
run at the same time as a segment merge, so neither it nor subsequent
update requests will have the significant pause.  From what I
understand, you can even do commits in this scenario and have changes be
visible before the merge completes.  I haven't verified that this is the
case.
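
A client can already do the two-step process itself. The following Python sketch is one possible shape for it, not a tested recipe. It assumes the uniqueKey field is named "id", that the collection is reachable at a base URL such as http://localhost:8983/solr, and that plain start/rows paging is acceptable (for large result sets cursorMark paging is probably the better choice). The function names are hypothetical. No commit is sent; that is left to the autocommit settings.

```python
import json
from typing import Callable, List, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

Fetch = Callable[[str, Optional[dict]], dict]

def http_json(url: str, body: Optional[dict] = None) -> dict:
    """GET the URL, or POST `body` as JSON when given, and parse the JSON reply."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    request = Request(url, data=data, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.load(response)

def collect_ids(base_url: str, collection: str, query: str,
                fetch: Fetch, page_size: int = 1000) -> List[str]:
    """Step 1: run the query and gather the IDs of all matching documents."""
    ids: List[str] = []
    start = 0
    while True:
        params = urlencode({"q": query, "fl": "id", "start": start,
                            "rows": page_size, "wt": "json"})
        result = fetch(f"{base_url}/{collection}/select?{params}", None)["response"]
        docs = result["docs"]
        ids.extend(str(doc["id"]) for doc in docs)  # "id" is the assumed uniqueKey
        start += len(docs)
        if not docs or start >= result["numFound"]:
            return ids

def delete_matching(base_url: str, collection: str, query: str,
                    fetch: Fetch = http_json,
                    page_size: int = 1000, batch_size: int = 500) -> int:
    """Step 2: send the collected IDs as delete-by-ID requests, in batches."""
    ids = collect_ids(base_url, collection, query, fetch, page_size)
    for i in range(0, len(ids), batch_size):
        # In Solr's JSON update format, a list under "delete" is a list of IDs.
        fetch(f"{base_url}/{collection}/update", {"delete": ids[i:i + batch_size]})
    return len(ids)
```

A call would look like delete_matching("http://localhost:8983/solr", "test", "type:old"), where both the collection name and the query are placeholders. The fetch parameter is injectable so the logic can be exercised without a running Solr node.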

Experienced devs: Can we fix this problem with DBQ?  On indexes with a
uniqueKey, can DBQ be changed to use the two-step process I mentioned?

Thanks,
Shawn


Re: Async exceptions during distributed update

Jay Potharaju-2
Thanks for explaining that, Shawn!
Emir, I use a PHP library called Solarium to send updates/deletes to Solr. Each request is sent to any of the available nodes in the cluster.

> On May 7, 2018, at 5:02 PM, Shawn Heisey <[hidden email]> wrote:
>
>> On 5/7/2018 5:05 PM, Jay Potharaju wrote:
>> There are some deletes by query. I have not had any issues with DBQ,
>> currently have 5.3 running in production.
>
> Here's the big problem with DBQ.  Imagine this sequence of events with
> these timestamps:
>
> 13:00:00: A commit for change visibility happens.
> 13:00:00: A segment merge is triggered by the commit.
> (It's a big merge that takes exactly 3 minutes.)
> 13:00:05: A deleteByQuery is sent.
> 13:00:15: An update to the index is sent.
> 13:00:25: An update to the index is sent.
> 13:00:35: An update to the index is sent.
> 13:00:45: An update to the index is sent.
> 13:00:55: An update to the index is sent.
> 13:01:05: An update to the index is sent.
> 13:01:15: An update to the index is sent.
> 13:01:25: An update to the index is sent.
> {time passes, more updates might be sent}
> 13:03:00: The merge finishes.
>
> Here's what would happen in this scenario:  The DBQ and all of the
> update requests sent *after* the DBQ will block until the merge
> finishes.  That means that it's going to take up to three minutes for
> Solr to respond to those requests.  If the client that is sending the
> request is configured with a 60 second socket timeout, which inter-node
> requests made by Solr are by default, then it is going to experience a
> timeout error.  The request will probably complete successfully once the
> merge finishes, but the connection is gone, and the client has already
> received an error.
>
> Now imagine what happens if an optimize (forced merge of the entire
> index) is requested on an index that's 50GB.  That optimize may take 2-3
> hours, possibly longer.  A deleteByQuery started on that index after the
> optimize begins (and any updates requested after the DBQ) will pause
> until the optimize is done.  A pause of 2 hours or more is a BIG problem.
>
> This is why deleteByQuery is not recommended.
>
> If the deleteByQuery were changed into a two-step process involving a
> query to retrieve ID values and then one or more deleteById requests,
> then none of that blocking would occur.  The deleteById operation can
> run at the same time as a segment merge, so neither it nor subsequent
> update requests will have the significant pause.  From what I
> understand, you can even do commits in this scenario and have changes be
> visible before the merge completes.  I haven't verified that this is the
> case.
>
> Experienced devs: Can we fix this problem with DBQ?  On indexes with a
> uniqueKey, can DBQ be changed to use the two-step process I mentioned?
>
> Thanks,
> Shawn
>
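The two-step approach Shawn describes (a query to retrieve ID values, then one or more deleteById requests) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Solr's or Solarium's own implementation: the helper names, the `id` uniqueKey field, the `test` collection name, and the batch size are all hypothetical. The code only builds the `/select` URL and the JSON bodies for `/update`; sending them is left to whatever HTTP client is in use.

```python
import json
from urllib.parse import urlencode


def build_id_query(base_url, collection, query, cursor_mark="*", rows=500, id_field="id"):
    """Build a /select URL that returns only the uniqueKey values matching `query`.

    `id_field` is assumed to be the collection's uniqueKey. cursorMark paging
    needs a sort that includes the uniqueKey, hence the sort parameter.
    """
    params = {
        "q": query,
        "fl": id_field,
        "rows": rows,
        "sort": f"{id_field} asc",
        "cursorMark": cursor_mark,
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def build_delete_by_id_bodies(ids, batch_size=500):
    """Split `ids` into batches and return one JSON delete-by-ID body per batch."""
    ids = list(ids)
    return [
        json.dumps({"delete": ids[i:i + batch_size]})
        for i in range(0, len(ids), batch_size)
    ]


# Hypothetical usage: the IDs would come from paging through the query above.
url = build_id_query("http://localhost:8983/solr", "test", "status:expired")
bodies = build_delete_by_id_bodies(["doc1", "doc2", "doc3"], batch_size=2)
```

Each body would be POSTed to the collection's `/update` handler with a JSON content type. Batching keeps individual requests small; the right batch size depends on the index and is a guess here.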

Re: Async exceptions during distributed update

Jay Potharaju-2
In reply to this post by Shawn Heisey-2
I have about 3-5 updates per second.
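
To put that rate next to Shawn's hypothetical timeline, here is a rough back-of-the-envelope sketch. It assumes his example timestamps (DBQ sent at 13:00:05, merge finishing at 13:03:00), a 60 second socket timeout, a uniform update rate, and that every request sent after the DBQ blocks until the merge ends. None of these numbers are measurements from this cluster.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def wait_seconds(sent, merge_end):
    """Seconds a request sent at `sent` waits if it blocks until `merge_end`."""
    delta = datetime.strptime(merge_end, FMT) - datetime.strptime(sent, FMT)
    return max(0, int(delta.total_seconds()))


def times_out(sent, merge_end, socket_timeout_s=60):
    """True if the blocked request waits longer than the socket timeout."""
    return wait_seconds(sent, merge_end) > socket_timeout_s


MERGE_END = "13:03:00"  # from the hypothetical timeline
DBQ_SENT = "13:00:05"   # from the hypothetical timeline

# Requests sent more than 60 s before the merge ends will time out.
timeout_window_s = wait_seconds(DBQ_SENT, MERGE_END) - 60

# At 3-5 updates per second, estimate how many requests fall in that window.
low, high = 3 * timeout_window_s, 5 * timeout_window_s
```

Under those assumptions the window in which a request is doomed to time out is 115 seconds long, which at 3-5 updates per second is a few hundred failed requests from a single three-minute merge.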



Re: Async exceptions during distributed update

Emir Arnautović
Hi Jay,
This is a low ingestion rate. What is the size of your index, and what is the heap size? I am guessing that this is not a huge index, so I am leaning toward what Shawn mentioned: some combination of DBQ/merge/commit/optimise that is blocking indexing. It is strange, though, that it happens on only one node if you are sending updates randomly to both nodes. Do you monitor your hosts and Solr? Do you see anything different at the times when the timeouts happen?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: Async exceptions during distributed update

Jay Potharaju-2
Hi Emir,
I was seeing this error for as long as the indexing was running. Once I
stopped the indexing, the errors also stopped. Yes, we do monitor both the
hosts and Solr, but have not seen anything out of the ordinary except for a
small network blip. In my experience Solr generally recovers after a network
blip with a few errors from the streaming Solr client, but I have never seen
this error before.

Thanks
Jay




Re: Async exceptions during distributed update

Emir Arnautović
Hi Jay,
A network blip might be the cause, but it could also be a consequence of this issue. You could try avoiding DBQ while indexing to see whether it is the cause. You can also take a thread dump on “the other” node and check for blocked threads; that may give you more clues about what’s going on.
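
A thread dump (for example the text output of the JDK's `jstack <pid>`) can be scanned for blocked threads with a few lines of script. The following is a minimal sketch that assumes the usual HotSpot dump layout, where a quoted thread name line is followed by a `java.lang.Thread.State:` line. The sample dump and its thread names are made up for illustration and are not taken from this cluster.

```python
import re

THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def threads_by_state(dump_text):
    """Map each thread state to the names of the threads in that state."""
    result = {}
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # Remember the thread name until its state line shows up.
            current = header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            result.setdefault(state.group("state"), []).append(current)
            current = None
    return result


# Hypothetical two-thread dump, shaped like jstack output.
SAMPLE_DUMP = "\n".join([
    '"qtp-hypothetical-21" #21 prio=5 os_prio=0 tid=0x01 nid=0x11 waiting for monitor entry [0x02]',
    '   java.lang.Thread.State: BLOCKED (on object monitor)',
    '        at example.Hypothetical.update(Hypothetical.java:1)',
    '',
    '"qtp-hypothetical-22" #22 prio=5 os_prio=0 tid=0x03 nid=0x12 runnable [0x04]',
    '   java.lang.Thread.State: RUNNABLE',
])

states = threads_by_state(SAMPLE_DUMP)
blocked = states.get("BLOCKED", [])
```

Many threads listed under `BLOCKED` or `WAITING` with update-related stack frames at the time of the timeouts would be consistent with the blocking scenario Shawn described, though the stack frames themselves still need to be read to confirm it.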

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: Async exceptions during distributed update

Jay Potharaju-2
Hi,
I restarted both my Solr servers but I am seeing the async error again. In
the older 5.x version of SolrCloud, Solr would normally recover gracefully
from network errors, but Solr 6.6.3 does not seem to be doing that. At this
time I am doing only a small percentage of deleteByQuery operations; it's
mostly indexing of documents.
I have not noticed any network blip like last time. Any suggestions, or is
anyone else having the same issue on Solr 6.6.3?

I am again seeing the following two errors back to back.

ERROR org.apache.solr.update.StreamingSolrClients

org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update: Read timed out

Thanks
Jay




Re: Async exceptions during distributed update

Jay Potharaju-2
Adding some more context to my last email:
Solr: 6.6.3
2 nodes: 3 shards each
No replication.
Can someone answer the following questions?
1) Any ideas on why the following errors keep happening? AFAIK the StreamingSolrClients error is caused by timeouts when connecting to other nodes. The async errors are also network related, as Emir explained earlier in this thread. There were no network issues this time, but the error has come back and is filling up my logs.
2) Is anyone using Solr 6.6.3 in production, and what has their experience been so far?
3) Is there any good documentation or blog post that explains the inner workings of SolrCloud networking?

Thanks
Jay
org.apache.solr.update.StreamingSolrClients

org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during

