Zookeeper timeout issue -

Ashish Bisht
Hi,

We are facing an issue with Solr/ZooKeeper where the ZooKeeper connection
times out after 10000 ms. The error is below.

SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper <server1>:9181,<server2>:9182,<server2>:9183 within 10000 ms.
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:184)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:121)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:111)
at org.apache.solr.common.cloud.ZkStateReader.<init>(ZkStateReader.java:295)

We are not getting any errors in the ZooKeeper logs, except for the entries
below:
2018-12-19 04:35:22,305 [myid:2] - INFO [SessionTracker:ZooKeeperServer@354] - Expiring session 0x200830234de3127, timeout of 10000ms exceeded
2018-12-19 05:35:38,304 [myid:2] - INFO [SessionTracker:ZooKeeperServer@354] - Expiring session 0x200b4f912730086, timeout of 10000ms exceeded
2018-12-19 05:56:58,302 [myid:2] - INFO [SessionTracker:ZooKeeperServer@354] - Expiring session 0x100b4f9125e00bf, timeout of 10000ms exceeded
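
To see how often sessions expire and on which ZooKeeper node, the "Expiring session" entries can be pulled out of the server log with a short script. This is a minimal sketch in Python that assumes the log format shown above; the function names are made up for illustration, and wrapped continuation lines are joined before matching.

```python
import re
from datetime import datetime

# A log entry starts with a timestamp such as "2018-12-19 04:35:22,305".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

EXPIRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} \[myid:(?P<myid>\d+)\]"
    r".*Expiring session (?P<session>0x[0-9a-fA-F]+), "
    r"timeout of (?P<timeout>\d+)ms exceeded"
)

def iter_entries(log_text):
    """Yield one string per log entry, joining wrapped continuation lines."""
    current = []
    for line in log_text.splitlines():
        if TIMESTAMP.match(line) and current:
            yield " ".join(current)
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        yield " ".join(current)

def find_expired_sessions(log_text):
    """Return (timestamp, myid, session id, timeout in ms) per expiry entry."""
    events = []
    for entry in iter_entries(log_text):
        m = EXPIRY.match(entry)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((ts, int(m.group("myid")),
                           m.group("session"), int(m.group("timeout"))))
    return events

# Two of the entries from the post, wrapped as they appeared in the mail.
sample = """\
2018-12-19 04:35:22,305 [myid:2] - INFO
[SessionTracker:ZooKeeperServer@354] - Expiring session 0x200830234de3127,
timeout of 10000ms exceeded
2018-12-19 05:35:38,304 [myid:2] - INFO
[SessionTracker:ZooKeeperServer@354] - Expiring session 0x200b4f912730086,
timeout of 10000ms exceeded
"""

events = find_expired_sessions(sample)
for ts, myid, session, timeout in events:
    print(ts, "myid", myid, session, timeout, "ms")
```

Lining these timestamps up against the load test window and the Solr GC log may show whether the expirations coincide with pauses on the client side.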


During the issue the thread count climbs, and we see the thread below in
the WebLogic server.

Name: Connection evictor
State: TIMED_WAITING
Total blocked: 0  Total waited: 1

Stack trace:
java.lang.Thread.sleep(Native Method)
org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
java.lang.Thread.run(Thread.java:748)

What could be going wrong here?

Regards
Ashish





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Zookeeper timeout issue -

Jan Høydahl / Cominvent
Which version of Solr?
Why do you mention Weblogic?
How is Solr deployed? What kind of servers, how much RAM and heap, how many documents in the collection, how many shards, etc.?
Do you run other software on the same server? Have you noticed any hiccups or GC activity?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 19 Dec 2018, at 13:37, AshB <[hidden email]> wrote:
>
> We are facing an issue with Solr/ZooKeeper where the ZooKeeper connection
> times out after 10000 ms. [...]

Re: Zookeeper timeout issue -

Dominique Bejean
In reply to this post by Ashish Bisht
Hi,

What is the scenario? High query activity? High update activity?

Regards.

Dominique


On Wed, 19 Dec 2018 at 13:44, AshB <[hidden email]> wrote:

> We are facing an issue with Solr/ZooKeeper where the ZooKeeper connection
> times out after 10000 ms. [...]

Re: Zookeeper timeout issue -

Ashish Bisht
In reply to this post by Jan Høydahl / Cominvent
Hi Jan,

Setup Details

Solr Version: 7.4.0
WebLogic hosts REST services, one of which is the search service.

Mach-1: 20 GB RAM
Apps running: Oracle DB, WebLogic Server (services deployed to call Solr),
one Solr node, one ZooKeeper node
Mach-2: 20 GB RAM
Apps running: one Solr node, two ZooKeeper nodes

Solr collection details: ~8k docs, ~140 MB on disk, one shard with two
replicas, on Mach-1 and Mach-2.

We ran a JMeter load test with 50 users and 30 iterations, i.e. 1500
requests. Each request calls Solr three times because of our requirements.
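
The load figures above work out as follows. This is only the arithmetic from the description, written in Python; the variable names are made up for illustration.

```python
users = 50
iterations_per_user = 30
solr_calls_per_request = 3  # each service request calls Solr three times

service_requests = users * iterations_per_user
solr_calls = service_requests * solr_calls_per_request

print("service requests:", service_requests)
print("Solr calls:", solr_calls)
```

So the 1500 requests seen by the REST service translate into 4500 calls to Solr over the course of the test.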

What we noticed is that when the load on Mach-1 rises to ~12 and memory
utilization climbs, some requests time out.

Is this expected from ZooKeeper when the load is too high?

-Ashish






Re: Zookeeper timeout issue -

Ashish Bisht
In reply to this post by Dominique Bejean
Hi Dominique,

Yes, we are load testing with 50 users. We tried changing the timeout, but
the change is not taking effect.
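
One thing worth ruling out when a timeout change seems to have no effect: the ZooKeeper server bounds the session timeout a client asks for. With minSessionTimeout and maxSessionTimeout unset, the granted value is kept between 2 and 20 times tickTime. The sketch below, in Python, shows that clamping; it assumes tickTime=2000 as in the sample configuration shipped with ZooKeeper, and the function name is made up for illustration. It is only one possible explanation and would not by itself account for a value that stays at exactly 10000 ms.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms=2000,
                               min_ms=None, max_ms=None):
    """Session timeout the ZooKeeper server grants for a requested value."""
    # Server-side defaults when minSessionTimeout / maxSessionTimeout are unset.
    if min_ms is None:
        min_ms = 2 * tick_time_ms
    if max_ms is None:
        max_ms = 20 * tick_time_ms
    # Clamp the client's request into [min_ms, max_ms].
    return max(min_ms, min(requested_ms, max_ms))

for requested in (1000, 10000, 60000):
    print(requested, "->", negotiated_session_timeout(requested))
```

Under these assumptions a request for 60000 ms is granted 40000 ms, while 10000 ms passes through unchanged.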

Regards
Ashish




Re: Zookeeper timeout issue -

Aman Tandon
As Jan mentioned, also look at GC activity and memory issues, and check the
threads to see whether any are pending or waiting too long.
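
A quick way to spot a thread family that keeps growing is to group the threads in a dump by name and state. This is a minimal sketch in Python, assuming a jconsole-style text dump with "Name:" and "State:" lines as posted in this thread; the function name is made up, and one of the sample threads is hypothetical and labeled as such.

```python
import re
from collections import Counter

def summarize_threads(dump_text):
    """Count threads per (name family, state) in a jconsole-style dump."""
    counts = Counter()
    name = None
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith("Name:"):
            name = line[len("Name:"):].strip()
        elif line.startswith("State:") and name is not None:
            # "WAITING on java.util...@396cda76" -> "WAITING"
            parts = line[len("State:"):].split()
            state = parts[0] if parts else "UNKNOWN"
            # Collapse numeric IDs so threads that differ only by a counter
            # fall into the same family.
            family = re.sub(r"\d+", "N", name)
            counts[(family, state)] += 1
            name = None
    return counts

sample_dump = """\
Name: Connection evictor
State: TIMED_WAITING
Total blocked: 0  Total waited: 1

Name: zkConnectionManagerCallback-9207-thread-1
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@396cda76
Total blocked: 0  Total waited: 1

Name: zkConnectionManagerCallback-9208-thread-1
State: WAITING
Total blocked: 0  Total waited: 1
"""
# The 9208 thread above is hypothetical, added only to illustrate grouping.

counts = summarize_threads(sample_dump)
for (family, state), n in counts.most_common():
    print(n, family, state)
```

Comparing the counts from dumps taken before, during and after the load test shows which families grow and never shrink.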

On Fri, Dec 28, 2018, 16:14 AshB <[hidden email]> wrote:

> Hi Dominique,
>
> Yes, we are load testing with 50 users. We tried changing the timeout, but
> the change is not taking effect.

Re: Zookeeper timeout issue -

Shawn Heisey-2
In reply to this post by Ashish Bisht
On 12/27/2018 10:18 AM, AshB wrote:
> Mach-1: 20 GB RAM
> Apps running: Oracle DB, WebLogic Server (services deployed to call Solr),
> one Solr node, one ZooKeeper node
> Mach-2: 20 GB RAM
> Apps running: one Solr node, two ZooKeeper nodes

Hopefully you're aware that this setup is not fault tolerant. If you
lose machine 2, your ZooKeeper ensemble loses quorum and Solr will go
read-only. You must have three separate machines for ZooKeeper fault
tolerance. That is how ZooKeeper is designed. See the note here:

https://zookeeper.apache.org/doc/r3.4.13/zookeeperAdmin.html#sc_zkMulitServerSetup
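
The quorum point above can be checked with a few lines of arithmetic: ZooKeeper needs a strict majority of the ensemble to be up. This is a minimal sketch in Python, assuming the node layout described in the thread (one ZooKeeper node on Mach-1, two on Mach-2); the function names and the three-host layout are made up for illustration.

```python
def has_quorum(alive, ensemble_size):
    """A ZooKeeper ensemble needs a strict majority of its nodes alive."""
    return alive > ensemble_size // 2

def survives_loss_of(machine, layout):
    """True if quorum is kept after every node on `machine` goes down."""
    total = sum(layout.values())
    return has_quorum(total - layout[machine], total)

# ZooKeeper nodes per machine, as described in the thread.
current_layout = {"mach-1": 1, "mach-2": 2}

# Hypothetical alternative: one node on each of three separate machines.
three_hosts = {"host-a": 1, "host-b": 1, "host-c": 1}

for name, layout in (("current", current_layout), ("three hosts", three_hosts)):
    for machine in layout:
        print(name, "- lose", machine, "-> quorum kept:",
              survives_loss_of(machine, layout))
```

With the current layout, losing Mach-1 leaves two of three nodes and quorum holds, but losing Mach-2 leaves one of three and quorum is lost. With one node per machine, any single machine can fail.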

Since version 5.0, Solr is no longer supported in third-party containers
like weblogic.  It can still be done, but you're on your own.

https://wiki.apache.org/solr/WhyNoWar

> What we noticed is that when the load on Mach-1 rises to ~12 and memory
> utilization climbs, some requests time out.

When the system load goes high like that, what program is consuming CPU?

A screenshot of a detailed process listing from each server might be
helpful.  Here's how to gather that:

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

Thanks,
Shawn


Re: Zookeeper timeout issue -

Ashish Bisht
Hi Shawn,

Answers to your questions.

1. Yes, we are aware of the lack of fault tolerance in our architecture, but
this is our dev environment, so we are running SolrCloud on a limited number
of machines.

2. Solr runs as a separate application; it is not deployed on WebLogic. We
use WebLogic for REST services, which in turn connect to ZooKeeper <--> Solr.

3. We used jconsole to monitor the Solr, ZooKeeper and WebLogic processes.
In the WebLogic process it looks like threads are getting stuck. One such
thread related to ZooKeeper is shown below.

Name: zkConnectionManagerCallback-9207-thread-1
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@396cda76
Total blocked: 0  Total waited: 1

Stack trace:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

I have attached a file containing snapshots of the processes.

Also attached, from the load activity:
GCeasy-report-gc.pdf <http://lucene.472066.n3.nabble.com/file/t493329/GCeasy-report-gc.pdf> (Solr GC log report)
TimoutIssue.docx <http://lucene.472066.n3.nabble.com/file/t493329/TimoutIssue.docx>


Further, we restarted the WebLogic server and ran the test again with less load, and the test went fine. But when we send more requests, the thread count rises and does not come back down, and we see the same zkConnectionManagerCallback threads in the waiting state.



