SolrCloud breaks and does not recover


Björn Häuser
Hey there,

we are running a SolrCloud cluster with 4 nodes, all with the same
config. Each node has 8 GB of memory, 6 GB of which is assigned to the
JVM. This is maybe too much, but it worked fine for a long time.

We currently run with 2 shards, 2 replicas, and 11 collections. The
complete data dir is about 5.3 GB.
I think we should give some JVM heap back to the OS.

We are running Solr 5.2.1. As I could not see any SolrCloud-related
bug fixes in the release notes for 5.3.0 and 5.3.1, we did not bother
to upgrade first.

One of our nodes (node A) reports these errors:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
version (expected 2, but 101) or the data in not in 'javabin' format

Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171

And shortly after (4 seconds later) this happens on a *different* node (node B):

Stopping recovery for core=suggestion coreNodeName=core_node2

No stacktrace for this one, but it happens for all 11 collections.

Six seconds after that, node C reports these errors:

org.apache.solr.common.SolrException:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/customers/params.json

Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8

This also happens for all 11 collections.

And then different errors happen:

OverseerAutoReplicaFailoverThread had an error in its thread work
loop.:org.apache.solr.common.SolrException: Error reading cluster
properties

cancelElection did not find election node to remove
/overseer_elect/election/6507903311068798704-10.41.199.192:9004_solr-n_0000000112

At that point the cluster is broken and stops responding to most
queries. At the same time, ZooKeeper looks okay.

The cluster cannot self-heal from that situation, and we are forced
to take manual action: restarting node after node and hoping that
SolrCloud eventually recovers, which sometimes takes several minutes
and several restarts of various nodes.

We can provide more log data if needed.

Is there anywhere we can start digging to find the underlying cause
of this problem?

Thanks in advance
Björn

Re: SolrCloud breaks and does not recover

Erick Erickson
Without more data, I'd guess one of two things:

1> you're seeing stop-the-world GC pauses that cause Zookeeper to
think the node is unresponsive, which puts a node into recovery and
things go bad from there.

2> Somewhere in your Solr logs you'll see OutOfMemory errors, which
can also cascade into a bunch of problems.

In general it's an anti-pattern to allocate such a large portion of
your physical memory to the JVM, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
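
For example (a sketch only, assuming the stock bin/solr start scripts;
size the numbers to your own measurements), the heap can be capped in
solr.in.sh so the rest of the RAM stays available to the OS page cache
that MMapDirectory relies on:

    # solr.in.sh (values illustrative, not a recommendation)
    SOLR_HEAP="2g"
    # equivalent long form:
    # SOLR_JAVA_MEM="-Xms2g -Xmx2g"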



Best,
Erick



Re: SolrCloud breaks and does not recover

Björn Häuser
Hi!

Thank you for your super fast answer.

I can provide more data; the question is which data :-)

These are the config parameters Solr runs with (taken from the admin UI):
https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740

These are the log files:

https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b

I think your first observation is correct: SolrCloud loses the
connection to ZooKeeper because the connection times out.

But why isn't SolrCloud able to recover by itself?

Thanks
Björn


Re: SolrCloud breaks and does not recover

Erick Erickson
The GC logs don't really show anything interesting; if GC were the
problem, there would be 15+ second pauses. The ZooKeeper log isn't
actually very interesting either. As far as OOM errors, I was thinking
of the _solr_ logs.

As to why the cluster doesn't self-heal, a couple of things:

1> Once you hit an OOM, all bets are off. The JVM needs to be
bounced. Many installations have kill scripts that bounce the
JVM. So it's explainable if you have OOM errors.

2> The system may be _trying_ to recover, but if you're
still ingesting data it may get into a resource-starved
situation where it makes progress but never catches up.

Again, though, this seems like very little memory for the situation
you describe; I suspect you're memory-starved to a point where you
can't really run. But that's a guess.

When you run, how much JVM memory are you using? The admin
UI should show that.

But the pattern of 8 GB physical memory and 6 GB for Java is a red
flag as per Uwe's blog post: you may be swapping a lot (OS memory),
and that may be slowing things down enough to make sessions drop.
Grasping at straws here, but "top" or similar should tell you what the
system is doing.
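
For instance (plain Linux commands, nothing Solr-specific):

    free -m      # overall RAM vs. swap usage
    vmstat 5     # nonzero si/so columns mean active swapping
    top          # press Shift+M to sort processes by resident memory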

Best,
Erick


Re: SolrCloud breaks and does not recover

Björn Häuser
Hi,

thank you for your answer.

1> No OOM hit; the logs do not contain any hint of that. Also, Solr
wasn't restarted automatically. But the GC log has some pauses which
are longer than 15 seconds (found with the grep shown after this list).

2> So, if we need to recover a system, we need to stop ingesting data into it?

3> The JVMs currently use a little more than 1 GB of heap, with a
now-changed max heap of 3 GB. Currently thinking of lowering the heap
to 1.5 or 2 GB (following Uwe's post).
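
For 1>, a sketch of how such pauses can be spotted, assuming
-XX:+PrintGCDetails-style output (each collection ends with a
"[Times: user=... sys=..., real=... secs]" line) and the default
solr_gc.log file name; adjust both to your setup:

    # list the wall-clock GC pause times, worst ones last
    grep -o "real=[0-9.]*" solr_gc.log | sort -t= -k2 -g | tail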

Also, RES is 4.1 GB and VIRT is 12.5 GB. Swap is more or less unused
(40 MB of the 1 GB assigned swap). According to our server monitoring
an I/O spike happens sometimes, but again not that much.

What I am going to do:

1.) Make sure that in case of failure we stop ingesting data into SolrCloud.
2.) Lower the heap to 2 GB.
3.) Make sure that ZooKeeper can fsync its write-ahead log fast enough
(< 1 sec); see the sketch below.
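
For 3.), a minimal zoo.cfg sketch (the paths are hypothetical): the
point is to give the transaction log its own fast disk so fsyncs don't
compete with snapshot writes or other I/O:

    # zoo.cfg
    dataDir=/var/zookeeper/data           # snapshots
    dataLogDir=/mnt/fast-disk/zk-txnlog   # write-ahead (transaction) log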

Thanks
Björn


Re: SolrCloud breaks and does not recover

Rallavagu
One more item to look into is increasing the ZooKeeper timeout in
Solr's solr.xml. This would help with timeouts caused by long GC pauses.
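
Something like this in solr.xml (the 30 seconds is only illustrative,
and the value has to stay within the ZooKeeper server's
maxSessionTimeout, which defaults to 20 * tickTime):

    <solr>
      <solrcloud>
        <!-- ZK session timeout in milliseconds -->
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
        <!-- other solrcloud settings omitted -->
      </solrcloud>
    </solr>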


Re: SolrCloud breaks and does not recover

Pushkar Raste
In reply to this post by Björn Häuser
Hi,
To minimize GC pauses, try using G1GC and turn on the
'ParallelRefProcEnabled' JVM flag. G1GC works much better for heaps
larger than 4 GB. Lowering 'InitiatingHeapOccupancyPercent' will also
help to avoid long GC pauses, at the cost of more short pauses.
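
A sketch of what that could look like in solr.in.sh; the 45% threshold
is just an example starting point, not a tuned value:

    # solr.in.sh
    GC_TUNE="-XX:+UseG1GC \
      -XX:+ParallelRefProcEnabled \
      -XX:InitiatingHeapOccupancyPercent=45"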
