Solr Cloud in recovering state & down state for long


Solr Cloud in recovering state & down state for long

GaneshSe
Hi

We are using a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
ensemble in AWS. There are about 60 collections at any point in time. We
have a max heap of 8 GB per JVM.

The problem: we are seeing a few replicas of some collections in the
"recovering" state and a few in the "down" state. Since we have 2 replicas
for each shard, the system is still functional, even though a few replicas
are unhealthy. Currently less than 50% of the heap is used and there is
free physical memory available as well. GC seems to be fine now.

We think the issue may have started when we accidentally tried to read the
ZooKeeper transaction logs (to see the count of ZK transactions; we
understand now that this is not a good practice) during a Solr data load.
The load failed at that time because Solr could not find the leader, with
the error "*Cannot talk to ZooKeeper - Updates are disabled*". We stopped
reading the logs, but this changed the Solr leader; since then we have been
able to load just fine, but the leader remains switched. Detailed *error
message 1 <https://pastebin.com/embed_iframe/wcp3L9nk>*

As stated above, we still have a few collection replicas in the recovering
and down states. In the past we have seen them come back to normal by
restarting the Solr server, but we want to understand: is there any way to
get them back to normal (all synced up with ZooKeeper) through the command
line or admin API? Another question: being in this state, can it cause data
issues? How do we check for that (a count per replica with distrib=false)?
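
For example, this is roughly the kind of check we have in mind: query each
core of a shard directly with distrib=false and compare the counts while
indexing is idle (host and core names below are just placeholders for ours):

    import json
    from urllib.request import urlopen

    # Placeholder core URLs for the two replicas of one shard; the real
    # names can be read from the CLUSTERSTATUS API or the admin UI.
    replicas = [
        "http://solr-node1:8983/solr/mycoll_shard1_replica_n1",
        "http://solr-node2:8983/solr/mycoll_shard1_replica_n2",
    ]

    for core_url in replicas:
        # distrib=false keeps the query on the local core, no fan-out.
        url = core_url + "/select?q=*:*&rows=0&distrib=false&wt=json"
        with urlopen(url) as resp:
            print(core_url, json.load(resp)["response"]["numFound"])

If the two counts differ while nothing is indexing, that would point to a
data problem on the lagging replica.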

We predominantly use Solr realtime GET by key in our application.
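
For reference, our reads look something like this (collection name and key
are made up):

    import json
    from urllib.request import urlopen

    # Real-time GET by key via the /get handler (placeholder names).
    url = "http://solr-node1:8983/solr/mycoll/get?id=12345&wt=json"
    with urlopen(url) as resp:
        print(json.load(resp)["doc"])  # None if the id does not exist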

Regards,
Ganesh

Re: Solr Cloud in recovering state & down state for long

Shawn Heisey-2
On 10/2/2018 8:55 PM, Ganesh Sethuraman wrote:
> We are using a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
> ensemble in AWS. There are about 60 collections at any point in time. We
> have a max heap of 8 GB per JVM.

Let's focus for right now on a single Solr machine, rather than the
whole cluster.  How many shard replicas (cores) are on one server?  How
much disk space does all the index data take? How many documents
(maxDoc, which includes deleted docs) are in all those cores?  What is
the total amount of RAM on the server? Is there any other software
besides Solr running on each server?

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
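
If it helps with gathering those numbers, something along these lines
against the CoreAdmin STATUS API should total them up per node (host/port
are examples; a core that failed to load may have no index section, hence
the defaults):

    import json
    from urllib.request import urlopen

    # Rough per-node totals from CoreAdmin STATUS (example host/port).
    url = "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
    with urlopen(url) as resp:
        cores = json.load(resp)["status"]

    idx = [c.get("index", {}) for c in cores.values()]
    max_doc = sum(i.get("maxDoc", 0) for i in idx)
    size_gb = sum(i.get("sizeInBytes", 0) for i in idx) / 1e9
    print("cores: %d  maxDoc: %d  index: %.1f GB"
          % (len(cores), max_doc, size_gb))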

> As stated above, we still have a few collection replicas in the recovering
> and down states. In the past we have seen them come back to normal by
> restarting the Solr server, but we want to understand: is there any way to
> get them back to normal (all synced up with ZooKeeper) through the command
> line or admin API? Another question: being in this state, can it cause data
> issues? How do we check for that (a count per replica with distrib=false)?

As long as you have at least one replica operational on every shard, you
should be OK.  But if you only have one replica operational, then you're
in a precarious state, where one additional problem could result in
something being unavailable.

If all is well, SolrCloud should not have replicas stay in down or
recovering state for very long, unless they're really large, in which
case it can take a while to copy the data from the leader.  If that
state persists for a long time, there's probably something going wrong
with your Solr install.  Usually restarting Solr is the only way to
recover persistently down replicas.  If it happens again after restart,
then the root problem has not been dealt with, and you're going to need
to figure it out.

The log snippet you shared only covers a timespan of less than one
second, so it's not very helpful in making any kind of determination. 
The "session expired" message sounds like what happens when the
zkClientTimeout value is exceeded.  Internally, this value defaults to
15 seconds, and typical example configs set it to 30 seconds ... so when
the session expires, it means there's a SERIOUS problem.  For computer
software, 15 or 30 seconds is a relative eternity.  A properly running
system should NEVER exceed that timeout.
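
For reference, this timeout normally comes from the <solrcloud> section of
solr.xml; the stock setting looks roughly like the snippet below (30000 ms
shown as an example).  Raising it is rarely the right fix, because the long
pause it papers over is the real problem:

    <solrcloud>
      <!-- ZooKeeper session timeout in milliseconds (example value) -->
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    </solrcloud>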

Can you share your solr log when the problem happens, covering a
timespan of at least a few minutes (and ideally much longer), as well as
a gc log from a time when Solr was up for a long time?  Hopefully the
solr.log and gc log will cover the same timeframe.  You'll need to use a
file sharing site for the GC log, since it's likely to be a large file. 
I would suggest compressing it.  If the solr.log is small enough, you
could use a paste website for that, but if it's large, you'll need to
use a file sharing site.  Attachments to list email are almost never
preserved.

Thanks,
Shawn


Re: Solr Cloud in recovering state & down state for long

GaneshSe
On Tue, Oct 2, 2018 at 11:46 PM Shawn Heisey <[hidden email]> wrote:

> On 10/2/2018 8:55 PM, Ganesh Sethuraman wrote:
> > We are using a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
> > ensemble in AWS. There are about 60 collections at any point in time. We
> > have a max heap of 8 GB per JVM.
>
> Let's focus for right now on a single Solr machine, rather than the
> whole cluster.  How many shard replicas (cores) are on one server?  How
> much disk space does all the index data take? How many documents
> (maxDoc, which includes deleted docs) are in all those cores?  What is
> the total amount of RAM on the server? Is there any other software
> besides Solr running on each server?
>
We have 471 replicas (cores) on each server: about 60 collections, each
with 8 shards and 2 replicas. A couple of them have just 2 shards and are
small. Note that only about 30 collections are actively used; old
collections are periodically deleted.
470 GB of index data per node.
Max docs per collection is about 300M; the average per collection is about
50M docs.
256 GB RAM (24 vCPUs) on each of the two AWS instances.
No other software running on the box.

> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

>
> > As stated above, we still have a few collection replicas in the
> > recovering and down states. In the past we have seen them come back to
> > normal by restarting the Solr server, but we want to understand: is there
> > any way to get them back to normal (all synced up with ZooKeeper) through
> > the command line or admin API? Another question: being in this state, can
> > it cause data issues? How do we check for that (a count per replica with
> > distrib=false)?
>
> As long as you have at least one replica operational on every shard, you
> should be OK.  But if you only have one replica operational, then you're
> in a precarious state, where one additional problem could result in
> something being unavailable.
>
Thanks for the info.

> If all is well, SolrCloud should not have replicas stay in down or
> recovering state for very long, unless they're really large, in which
> case it can take a while to copy the data from the leader.  If that
> state persists for a long time, there's probably something going wrong
> with your Solr install.  Usually restarting Solr is the only way to
> recover persistently down replicas.  If it happens again after restart,
> then the root problem has not been dealt with, and you're going to need
> to figure it out.
>
OK. Based on the point above, it looks like restarting is the only option;
there is no other way to sync with ZK. Thanks for that.

> The log snippet you shared only covers a timespan of less than one
> second, so it's not very helpful in making any kind of determination.
> The "session expired" message sounds like what happens when the
> zkClientTimeout value is exceeded.  Internally, this value defaults to
> 15 seconds, and typical example configs set it to 30 seconds ... so when
> the session expires, it means there's a SERIOUS problem.  For computer
> software, 15 or 30 seconds is a relative eternity.  A properly running
> system should NEVER exceed that timeout.
>
I don't think we have a memory issue (the GC log for a busy day is posted
below). Solr went out of sync with ZK because of the manual ZK transaction
log parsing/checking on the server (we did that on Sept 17 at 16:00 UTC, as
you can see in the log), which resulted in a ZK timeout. Since then Solr
has not returned to normal. Is there a possibility of Solr query (real-time
GET) response times increasing due to the Solr servers being in the
recovering/down state?

Here is the full Solr Log file (Note that it is in INFO mode):
https://raw.githubusercontent.com/ganeshmailbox/har/master/SolrLogFile
Here is the GC Log:
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMTAvMy8tLTAxX3NvbHJfZ2MubG9nLjUtLTIxLTE5LTU3


> Can you share your solr log when the problem happens, covering a
> timespan of at least a few minutes (and ideally much longer), as well as
> a gc log from a time when Solr was up for a long time?  Hopefully the
> solr.log and gc log will cover the same timeframe.  You'll need to use a
> file sharing site for the GC log, since it's likely to be a large file.
> I would suggest compressing it.  If the solr.log is small enough, you
> could use a paste website for that, but if it's large, you'll need to
> use a file sharing site.  Attachments to list email are almost never
> preserved.
>
> Thanks,
> Shawn
>
>

Re: Solr Cloud in recovering state & down state for long

GaneshSe
1. Do the GC and Solr logs help explain why the Solr replicas continue to
be in the recovering/down state? Our assumption is that the ZK transaction
log reading we did on Sept 17 at 16:00 hrs might have caused the issue. Is
that correct?
2. Can this state cause slowness for Solr read queries?
3. Is there any way to get notified (e.g. by email) if any replica on the
servers gets into recovery mode? A rough monitoring sketch follows below.
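
For question 3, we were thinking of something along these lines, run from
cron and wired into our email/alerting (an untested sketch; the host below
is a placeholder):

    import json
    from urllib.request import urlopen

    # Report every replica whose state is not "active" (recovering, down, ...).
    url = ("http://solr-node1:8983/solr/admin/collections"
           "?action=CLUSTERSTATUS&wt=json")
    with urlopen(url) as resp:
        collections = json.load(resp)["cluster"]["collections"]

    for coll_name, coll in collections.items():
        for shard_name, shard in coll["shards"].items():
            for replica_name, replica in shard["replicas"].items():
                if replica["state"] != "active":
                    # Hook an email/pager call in here instead of printing.
                    print("NOT ACTIVE: %s/%s/%s state=%s node=%s" % (
                        coll_name, shard_name, replica_name,
                        replica["state"], replica.get("node_name", "?")))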



Re: Solr Cloud in recovering state & down state for long

Shawn Heisey-2
On 10/5/2018 5:15 AM, Ganesh Sethuraman wrote:
> 1. Do the GC and Solr logs help explain why the Solr replicas continue to
> be in the recovering/down state? Our assumption is that the ZK transaction
> log reading we did on Sept 17 at 16:00 hrs might have caused the issue. Is
> that correct?
> 2. Can this state cause slowness for Solr read queries?
> 3. Is there any way to get notified (e.g. by email) if any replica on the
> servers gets into recovery mode?

Seeing the GC log and Solr log will allow us to look for problems.  It
won't solve anything by itself, but it lets us examine the situation and
see if there is any evidence pointing to the root issue and maybe a solution.

If you're running with a heap that's too small, you can get into a
situation where you never actually run out of memory, but the amount of
available memory is so small that Java must continually run full garbage
collections to keep enough of it free for the program to stay running. 
This can happen to ANY java program, including your ZK servers.

If that happens, the program itself will only be running a small
percentage of the time, and there will be extremely long pauses where
very little happens other than garbage collection, and then when the
program starts running again, it realizes that its timeouts have been
exceeded, which, in SolrCloud, will initiate recovery operations ... and
that will probably keep the GC pause storm happening.

With an 8 GB heap and likely billions of documents being handled by one
Solr instance, that low-memory situation I just described seems very
possible.  The solution is to make the heap bigger.  Your Solr install
is very large ... it seems unlikely to me that 8GB would be enough. 
Solr is not typically a memory-hog kind of application if what it is
asked to do is small.  When it is asked to do a bigger job, more memory
will be required.

Running without sufficient system memory to effectively cache the
indexes that are actively used can also cause performance problems. 
This is memory *NOT* allocated to programs like Solr, which the OS is
free to use for caching purposes.  With a busy enough server,
performance problems caused by that can spiral and lead to SolrCloud
recovery issues.

Thanks,
Shawn


Re: Solr Cloud in recovering state & down state for long

GaneshSe
Reading the ZK transaction log could be the issue, as ZK seems to be
sensitive to this (
https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#The+Log+Directory
):

> incorrect placement of transaction log
> The most performance critical part of ZooKeeper is the transaction log.
> ZooKeeper syncs transactions to media before it returns a response. A
> dedicated transaction log device is key to consistent good performance.
> Putting the log on a busy device will adversely affect performance. If you
> only have one storage device, put trace files on NFS and increase the
> snapshotCount; it doesn't eliminate the problem, but it should mitigate it.
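
If placement is the problem, something like this in zoo.cfg would put the
transaction log on its own device (the paths here are made up):

    # zoo.cfg -- snapshots and transaction log on separate devices
    dataDir=/data/zookeeper
    dataLogDir=/zk-txnlog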


I am not sure the Solr log and GC log links were visible in my previous
mail. Re-posting them here for your reference:

Here is the full Solr Log file (Note that it is in INFO mode):
https://raw.githubusercontent.com/ganeshmailbox/har/master/SolrLogFile
Here is the GC Log:
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMTAvMy8tLTAxX3NvbHJfZ2MubG9nLjUtLTIxLTE5LTU3

Thanks
Ganesh


Re: Solr Cloud in recovering state & down state for long

Shawn Heisey-2
On 10/5/2018 9:15 AM, Ganesh Sethuraman wrote:
> I am not sure the Solr log and GC log links were visible in my previous
> mail. Re-posting them here for your reference:
>
> Here is the full Solr Log file (Note that it is in INFO mode):
> https://raw.githubusercontent.com/ganeshmailbox/har/master/SolrLogFile
> Here is the GC Log:
> http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMTAvMy8tLTAxX3NvbHJfZ2MubG9nLjUtLTIxLTE5LTU3

The GC log shows pretty good performance.  The note at the top talks
about consecutive full GCs, but the peak usage on the heap isn't close
to max heap, so I don't know why that would be happening.  It also says
that there's a lot of application waiting for resources ... which can be
caused by not having enough memory for caching purposes.  The solution
there would be to add total memory to the system ... no config changes
are likely to help.

Even though the GC log doesn't seem to indicate extreme memory pressure,
I would still suggest that you make the heap a little bit bigger.  Maybe
10GB instead of 8GB.  See if that helps at all.  It might not.
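
If you try that, the heap is normally set in the solr.in.sh include file
(solr.in.cmd on Windows), assuming Solr is started through the bin/solr
scripts; for example:

    # solr.in.sh (path varies by install, e.g. /etc/default/solr.in.sh)
    SOLR_HEAP="10g"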

There are a TON of errors and warnings in the solr log, things that are
very strange and may indicate other problems going on.

Thanks,
Shawn