solr cloud going down repeatedly


Jakov Sosic
Hi guys.

I have a SolrCloud cluster coordinated by 3 ZooKeeper VMs running 3.4.5,
backported from Ubuntu 14.04 LTS to 12.04 LTS.

They orchestrate 4 Solr nodes, which have 2 cores. Each core is
sharded, so there is 1 shard on each of the Solr nodes.

Solr runs under Tomcat 7 and Ubuntu's latest OpenJDK 7.

The Solr version is 4.2.1.

Each node has around 7GB of data, and the JVM is set to run with an 8GB
heap. All Solr nodes have 16GB of RAM.


A few weeks back we started having issues with this installation. Tomcat
was filling up catalina.out with the following messages:

SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:


The only solution was to restart all 4 Tomcats on the 4 Solr nodes. After
that the issue would go away, but it would recur approximately a week
after the restart.

The last time this happened was yesterday, and I succeeded in recording
some of what was happening on the boxes via Zabbix and atop.


Basically, at 15:35 the load on the machine went berserk, jumping from
around 0.5 to 30+.

Zabbix and atop didn't notice any heavy IO, and all the other processes
were practically idle; only the JVM (Tomcat) exploded, with CPU usage
increasing from the usual ~80% to ~750%.

These are parts of the atop recordings from one of the nodes. Note that
they are 10 minutes apart:

(15:28:42)
CPL | avg1    0.12  |               | avg5    0.36  | avg15   0.38  |

(15:38:42)
CPL | avg1    8.54  |               | avg5    3.62  | avg15   1.61  |

(15:48:42)
CPL | avg1   30.14  |               | avg5   27.09  | avg15  14.73  |



This is the status of Tomcat at the last point (15:48:42):
28891  tomcat7  tomcat7  411  8.68s  70m14s  209.9M  204K  0K  5804K  --  -  S  5  704%  java


I noticed similar behavior on the other Solr nodes. At 17:41 the
on-call person decided to hard reset all the Solr nodes, and the cloud
came back up and ran normally after that.

These are the logs that I found on the first node:

Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.OverseerCollectionProcessor run
WARNING: Overseer cannot talk to ZK
Aug 17, 2014 3:46:12 PM
org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader

Then a bunch of:

Aug 17, 2014 3:46:42 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:

until the server was rebooted.
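
To line these messages up against the atop samples, one option is to count them per minute. The following is only a minimal sketch in Python, assuming the default java.util.logging layout shown above (a timestamp header, then the level and message on a later line) and an English locale for the month and AM/PM names. The function name `count_per_minute` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Header of a java.util.logging record as it appears above, e.g.
# "Aug 17, 2014 3:44:58 PM org.apache.solr.common.SolrException log"
HEADER = re.compile(r"^([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\b")
STAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"


def count_per_minute(lines, needle):
    """Count message lines containing `needle`, grouped by minute.

    The timestamp of a message is taken from the most recent header line
    seen before it. Lines before the first header are ignored.
    """
    counts = Counter()
    current = None  # timestamp of the most recent header line
    for line in lines:
        match = HEADER.match(line)
        if match:
            current = datetime.strptime(match.group(1), STAMP_FORMAT)
        elif current is not None and needle in line:
            counts[current.replace(second=0)] += 1
    return counts
```

Feeding it the lines of catalina.out with "no servers hosting shard" as the needle gives a per-minute count that can be compared with the load averages recorded by atop.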


On the other nodes I can see:
node2:

Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myappcore=myapp
Aug 17, 2014 3:44:58 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.103:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:46:24 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: IOException occured
when talking to server at: http://node1:8080/solr/myapp

node4:

Aug 17, 2014 3:44:06 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myapp2core=myapp2
Aug 17, 2014 3:44:09 PM org.apache.solr.cloud.RecoveryStrategy close
WARNING: Stopping recovery for
zkNodeName=10.100.254.105:8080_solr_myappcore=myapp
Aug 17, 2014 3:45:37 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props




My impression is that the garbage collector is at fault here.

This is the command line of Tomcat:

/usr/lib/jvm/java-7-openjdk-amd64/bin/java
-Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties
-Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC -DnumShards=2
-Djetty.port=8080
-DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181
-javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath
/usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar
-Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7
-Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp
org.apache.catalina.startup.Bootstrap start


So, I am using ConcMarkSweepGC (CMS).

Do you have any suggestions on how I can debug this further and
potentially eliminate the issue causing the downtime?

Re: solr cloud going down repeatedly

Shawn Heisey-4
On 8/18/2014 11:30 AM, Jakov Sosic wrote:

> My impression is that garbage collector is at fault here.
>
> This is the cmdline of tomcat:
>
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java
> -Djava.util.logging.config.file=/var/lib/tomcat7/conf/logging.properties
> -Djava.awt.headless=true -Xmx8192m -XX:+UseConcMarkSweepGC
> -DnumShards=2 -Djetty.port=8080
> -DzkHost=10.215.1.96:2181,10.215.1.97:2181,10.215.1.98:2181
> -javaagent:/opt/newrelic/newrelic.jar -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.port=9010
> -Dcom.sun.management.jmxremote.local.only=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
> -Djava.endorsed.dirs=/usr/share/tomcat7/endorsed -classpath
> /usr/share/tomcat7/bin/bootstrap.jar:/usr/share/tomcat7/bin/tomcat-juli.jar
> -Dcatalina.base=/var/lib/tomcat7 -Dcatalina.home=/usr/share/tomcat7
> -Djava.io.tmpdir=/tmp/tomcat7-tomcat7-tmp
> org.apache.catalina.startup.Bootstrap start

With an 8GB heap and "UseConcMarkSweepGC" as your only GC tuning, I can
pretty much guarantee that you'll see occasional GC pauses of 10-15
seconds, because I saw exactly that happening with my own setup.

This is what I use now:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

I can't claim that my problem is 100% solved, but collections that go
over one second are *very* rare now, and I'm pretty sure they are all
under two seconds.

Thanks,
Shawn


Re: solr cloud going down repeatedly

Jakov Sosic
On 08/18/2014 08:38 PM, Shawn Heisey wrote:

> With an 8GB heap and "UseConcMarkSweepGC" as your only GC tuning, I can
> pretty much guarantee that you'll see occasional GC pauses of 10-15
> seconds, because I saw exactly that happening with my own setup.
>
> This is what I use now:
>
> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>
> I can't claim that my problem is 100% solved, but collections that go
> over one second are *very* rare now, and I'm pretty sure they are all
> under two seconds.

Thank you for your comment.

How did you test these settings? I mean, that's a lot of tuning and I
would like to set up some test environment to be certain this is what I
want...


Re: solr cloud going down repeatedly

Shawn Heisey-4
On 8/19/2014 3:12 AM, Jakov Sosic wrote:
> Thank you for your comment.
>
> How did you test these settings? I mean, that's a lot of tuning and I
> would like to set up some test environment to be certain this is what
> I want...

I included a section on tools when I wrote this page:

http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
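
For a rough before/after comparison in a test environment, the pauses can also be pulled straight out of the GC log. This is only a sketch in Python, not one of the tools described on that page. It assumes the JVM was started with GC logging that includes `-XX:+PrintGCApplicationStoppedTime` (for example together with `-Xloggc:<file>`), neither of which is on the command line quoted earlier. The function name `pause_summary` and the 1-second default threshold are illustrative choices.

```python
import re

# Line written by HotSpot when -XX:+PrintGCApplicationStoppedTime is set, e.g.
# "Total time for which application threads were stopped: 0.0123456 seconds"
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)


def pause_summary(lines, threshold_seconds=1.0):
    """Summarise stop-the-world pauses found in GC log lines.

    Returns the number of pauses, the longest pause in seconds, and how
    many pauses were longer than `threshold_seconds`.
    """
    pauses = []
    for line in lines:
        match = STOPPED.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return {
        "count": len(pauses),
        "max": max(pauses, default=0.0),
        "over_threshold": sum(1 for p in pauses if p > threshold_seconds),
    }
```

Running it over a GC log from the same workload with the old and the new settings gives two summaries to compare. Note that the "stopped" lines cover every safepoint pause, not only garbage collections, so the counts are an upper bound on GC pauses.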

Thanks,
Shawn


Re: solr cloud going down repeatedly

Jakov Sosic
On 08/19/2014 04:58 PM, Shawn Heisey wrote:

> On 8/19/2014 3:12 AM, Jakov Sosic wrote:
>> Thank you for your comment.
>>
>> How did you test these settings? I mean, that's a lot of tuning and I
>> would like to set up some test environment to be certain this is what
>> I want...
>
> I included a section on tools when I wrote this page:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems

Thanks.

We ended up using cron to restart Tomcat every 7 days, one Solr node
per day. That way we avoid the GC pauses.

Until we figure things out in our dev environment and test the GC
optimizations, we will keep it this way.


Re: solr cloud going down repeatedly

Shawn Heisey-4
On 8/25/2014 4:23 AM, Jakov Sosic wrote:
> we ended up using cron to restart Tomcats every 7 days, each solr node
> per day... that way we avoid GC pauses.
>
> Until we figure things out in our dev environment and test GC
> optimizations, we will keep it this way.

If it's only doing a long GC pause once a week, I think I'd prefer to go
ahead and let it do the long GC pause.  It would be less of an
interruption than restarting Solr.

Or is it getting into a mode after several days where it goes crazy and
has a lot of major GC storms?  If that's the case, is it happening even
with the GC tuning parameters I gave you before?  I run my Solr
instances for months without issues.  Right now, my production Solr
instances have been running for 25 days.

Thanks,
Shawn