TimeoutException, IOException, Read timed out

Fengtan
Hi,

We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
Each VM runs RHEL 7 with 16 GB of RAM, 8 CPUs and OpenJDK 1.8.0_131; each
VM has one Solr and one ZK instance.
The cluster hosts 1,000 collections; each collection has 1 shard and
between 500 and 50,000 documents.
Documents are indexed incrementally every day; the Solr client mostly does
searching.
Solr runs with -Xms7g -Xmx7g.

Everything had been working fine for about one month, but a few days ago we
started seeing Solr timeouts: https://pastebin.com/raw/E2prSrQm

Also, we have always seen these warnings:
  PERFORMANCE WARNING: Overlapping onDeckSearchers=2


We are not sure what is causing the timeouts, although we have identified a
few things that could be improved:

1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
-- we are aware that explicit commits are expensive
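
For reference, here is roughly the kind of chain we have in mind for
solrconfig.xml -- an untested sketch, with the chain name and statusCode
taken from the documentation rather than from our setup:
  <updateRequestProcessorChain name="ignore-commit-from-client" default="true">
    <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
      <!-- return 200 instead of an error when a client commit is ignored -->
      <int name="statusCode">200</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.DistributedUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>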

2) Drop the 1,000 collections and use a single one instead (all our
collections use the same schema/solrconfig.xml) since stability problems
are expected when the number of collections reaches the low hundreds
<https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>. The
downside is that the new collection would contain 1,000,000 documents, which
may bring new challenges.
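
If we went this route, the rough idea would be to add a field identifying
the original collection and filter on it at query time -- an untested
sketch, where the field name, collection name and values are placeholders:
  <field name="source_collection" type="string" indexed="true" stored="true"
         docValues="true"/>
  curl "http://localhost:8983/solr/merged/select?q=title:foo&fq=source_collection:site_042"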

3) Tune the GC and possibly switch from CMS to G1, as it seems to bring
better performance according to this
<https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
this
<https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
and this
<http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
The downside is that Lucene explicitly discourages the use of G1
<https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>,
so we are not sure what to expect. We use the default GC settings:
  -XX:NewRatio=3
  -XX:SurvivorRatio=4
  -XX:TargetSurvivorRatio=90
  -XX:MaxTenuringThreshold=8
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=4
  -XX:+CMSScavengeBeforeRemark
  -XX:PretenureSizeThreshold=64m
  -XX:+UseCMSInitiatingOccupancyOnly
  -XX:CMSInitiatingOccupancyFraction=50
  -XX:CMSMaxAbortablePrecleanTime=6000
  -XX:+CMSParallelRemarkEnabled
  -XX:+ParallelRefProcEnabled
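
If we did try G1, we would probably start from something close to the
settings suggested in the second link above -- untested on our side, so
treat these values as a starting point rather than a recommendation:
  -XX:+UseG1GC
  -XX:+ParallelRefProcEnabled
  -XX:G1HeapRegionSize=8m
  -XX:MaxGCPauseMillis=250
  -XX:InitiatingHeapOccupancyPercent=75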

4) Tune the caches, possibly by increasing autowarmCount on filterCache --
our current config is:
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"
autowarmCount="0"/>

5) Tweak the timeout settings, although this would not fix the underlying
issue


Do any of these options seem relevant? Is there anything else that might
address the timeouts?

Thanks

Re: TimeoutException, IOException, Read timed out

Erick Erickson
<1> It's not that explicit commits are expensive, it's that they happen
too frequently. An explicit commit and an internal autocommit have exactly
the same cost. Your "overlapping ondeck searchers" warning is definitely an
indication that commits are being issued from somewhere too quickly
and are piling up.

<2> Likely a good thing: each collection increases overhead. And
1,000,000 documents is quite small in Solr terms unless the
individual documents are enormous. I'd do this for a number of
reasons.

<3> Certainly an option, but I'd put that last. Fix the commit problem first ;)

<4> If you do this, make the autowarm count quite small. That said,
it will be of very little use if you have frequent commits. Let's say
you commit every second: the autowarming will warm the caches, which will
then be thrown out a second later, and it will increase the time it takes
to open a new searcher.
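
Something in this neighborhood is what I mean by "quite small" -- the
number is purely illustrative, not a recommendation:
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="16"/>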

<5> Yeah, this would probably just be a band-aid.

If I were prioritizing these, I'd do
<1> first. If you control the client, just don't call commit. If you
do not control the client, then what you've outlined is fine. Tip: set
your soft commit settings to be as long as you can stand. If you must
have very short intervals, consider disabling your caches completely.
Here's a long article on commits....
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
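
As a concrete illustration (the intervals are made up -- pick whatever
you can stand), the relevant bits of solrconfig.xml look roughly like:
  <autoCommit>
    <!-- hard commit: flushes to disk, does not open a new searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commit: controls visibility of new documents -->
    <maxTime>300000</maxTime>
  </autoSoftCommit>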

<2> Actually, this and <1> are pretty close in priority.

Then re-evaluate. Fixing the commit issue may buy you quite a bit of
time. Having 1,000 collections is pushing the boundaries presently.
Each collection will establish watchers on the bits it cares about in
ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
A Good Thing.

Frankly, between these two things I'd pretty much expect your problems
to disappear. It wouldn't be the first time I've been totally wrong, but
that's where I'd start ;)

Best,
Erick


Re: TimeoutException, IOException, Read timed out

Emir Arnautović
Hi Fengtan,
I would just add that when merging collections, you might want to use document routing (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting) - since you are currently keeping separate collections, I guess you have a “collection ID” to use as a routing key. This will enable you to have one collection but query only the shard(s) with data from one “collection”.
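
For example, with the default compositeId router you would prefix the document id with the routing key at index time and pass the same key at query time -- a sketch, where the collection name, ids and field are made up:
  # index: the part before "!" determines which shard the document lands on
  curl "http://localhost:8983/solr/merged/update" -H "Content-Type: application/json" \
    -d '[{"id":"site_042!doc123","title":"foo"}]'
  # query: restrict the request to the shard(s) holding that routing key
  curl "http://localhost:8983/solr/merged/select?q=title:foo&_route_=site_042!"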

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: TimeoutException, IOException, Read timed out

Fengtan
Thanks Erick and Emir -- we are going to start with <1> and possibly <2>.


Re: TimeoutException, IOException, Read timed out

Fengtan
I am happy to report that <1> fixed these:
  PERFORMANCE WARNING: Overlapping onDeckSearchers=2

We still occasionally see timeouts, so we may have to explore <2>.