Updates blocked in Tlog solr cloud?


Updates blocked in Tlog solr cloud?

weiwang19
Hi,

I am puzzled by a problem in SolrCloud with tlog replicas and would
appreciate your insights. Our SolrCloud cluster has two shards, and each
shard has 5 tlog replicas. When one of the non-leader replicas had a
hardware issue and became unreachable, updates to the whole cloud stopped.
We are on Solr 7.6 and use the SolrJ client to send updates only to
leaders. To my understanding, with the tlog replica type, the leader only
forwards update requests to replicas for the transaction log update, and
each replica periodically pulls segments from the leader. When one replica
fails to respond, why are update requests to the cloud blocked? Does the
leader need to wait for a response from each replica before it can inform
the client that the update is successful?

Best,
Wei

Re: Updates blocked in Tlog solr cloud?

Erick Erickson
How long were updates blocked, and how did the tlog replica on the bad hardware go down?

Solr has to wait for an ack back from the tlog follower to be certain that the follower has all the documents in case it has to switch to that replica to become the leader. If the update to the follower times out, the leader will put it into a recovering state.

So I’d expect the collection to queue up indexing until the request to the follower on the bad hardware timed out. Did you wait at least that long?

Best,
Erick
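
The wait-for-ack behavior described above can be illustrated with a toy model. This is only a sketch in plain Java with no Solr classes: the class name, the State enum, the follower names and the timeout value are all hypothetical, and Solr's real update path is considerably more involved. It shows why a single unreachable follower holds up an update until the timeout expires, after which that follower is treated as recovering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class LeaderAckSketch {

    /** Hypothetical follower states for this toy model (not Solr's own types). */
    enum State { ACTIVE, RECOVERING }

    /**
     * Forwards one update to every follower and waits for each ack, at most timeoutMs.
     * followerDelaysMs maps a follower name to its simulated response delay.
     * Followers that have not acked when the timeout expires are marked RECOVERING.
     */
    static Map<String, State> forwardUpdate(Map<String, Long> followerDelaysMs, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, followerDelaysMs.size()));
        try {
            List<Callable<Boolean>> sends = new ArrayList<>();
            for (long delay : followerDelaysMs.values()) {
                // Each task stands in for "send to follower, wait for its ack".
                sends.add(() -> { Thread.sleep(delay); return true; });
            }
            // Blocks until every task finishes or the timeout expires;
            // tasks still running at that point are cancelled.
            List<Future<Boolean>> acks = pool.invokeAll(sends, timeoutMs, TimeUnit.MILLISECONDS);
            Map<String, State> result = new LinkedHashMap<>();
            int i = 0;
            for (String name : followerDelaysMs.keySet()) {
                result.put(name, acks.get(i++).isCancelled() ? State.RECOVERING : State.ACTIVE);
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for acks", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Map<String, Long> delays = new LinkedHashMap<>();
        delays.put("replica-a", 10L);      // healthy follower
        delays.put("replica-b", 60_000L);  // follower on bad hardware
        long start = System.nanoTime();
        Map<String, State> states = forwardUpdate(delays, 500L);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println(states + " after about " + elapsedMs + " ms");
    }
}
```

In this model the caller is blocked for roughly the full timeout even though the healthy follower answers almost immediately. In a real deployment the corresponding timeouts are not constants in code; to my knowledge they are set through distribUpdateConnTimeout and distribUpdateSoTimeout in solr.xml, which is worth checking against the reference guide for your version.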


Re: Updates blocked in Tlog solr cloud?

weiwang19
Hi Erick,

I observed that the update request rate dropped from 20 per second to 3 per
second for about 8 minutes. After that there was a huge burst of updates.
This looks like a good match for the queue-up behavior you mentioned. But I
don't think the timeout took that long. Is there a configurable setting for
the timeout?
Also, the bad tlog replica was not reachable at the time, so we issued a
DELETEREPLICA command through the Collections API to remove it from the
cloud.

Thanks,
Wei


Re: Updates blocked in Tlog solr cloud?

weiwang19
An update with another observation: after the follower replica became
unresponsive, I noticed multiple commits on the leader within two minutes,
followed by this OOM error on the leader:

o.a.s.s.HttpSolrCall null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Direct buffer memory
    at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:662)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:530)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    ....


The commits are not in line with our autocommit interval. I am wondering
whether they could be caused by the leader-initiated recovery process. Will
the tlog leader do extra commits for the replica to sync up during
recovery?


Best,

Wei
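
For anyone chasing the same "Direct buffer memory" error: it refers to off-heap NIO buffers, which the JVM caps separately from the heap (the -XX:MaxDirectMemorySize flag). The sketch below is not a diagnosis of the problem above, just one way to watch that pool using the standard BufferPoolMXBean. The class and method names are made up for illustration, and against a running Solr node the same MXBean would have to be read over JMX rather than in-process.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectBufferProbe {

    /** Bytes currently used by the JVM's "direct" buffer pool, or -1 if the pool is not reported. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Allocate 1 MiB off-heap so the pool usage visibly changes.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        long after = directMemoryUsed();
        System.out.println("direct pool grew by " + (after - before)
                + " bytes for a buffer of capacity " + buf.capacity());
    }
}
```

Sampling this value while a follower is unreachable would at least show whether direct memory climbs during the period when updates are queued.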