SolrCloud 6.6 stability challenges

SolrCloud 6.6 stability challenges

Rick Dig
hello all,
we are trying to run solrcloud 6.6 in a production setting.
here's our config and issue
1) 3 nodes, 1 shard, replication factor 3
2) all nodes are 16GB RAM, 4 core
3) Our production load is about 2000 requests per minute
4) index is fairly small, index size is around 400 MB with 300k documents
5) autocommit is currently set to 5 minutes (even though ideally we would
like a smaller interval).
6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
7) all of this runs perfectly ok when indexing isn't happening. as soon as
we start "nrt" indexing one of the follower nodes goes down within 10 to 20
minutes. from this point on the nodes never recover unless we stop
indexing.  the master usually is the last one to fall.
8) there are maybe 5 to 7 processes indexing at the same time with document
batch sizes of 500.
9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
10) no cpu and / or oom issues that we can see.
11) cpu load does go fairly high 15 to 20 at times.
any help or pointers appreciated

thanks
rick
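For reference, the commit setup described in item 5 (and confirmed later in the thread as maxTime 300000 with openSearcher true) corresponds to a solrconfig.xml fragment along these lines; a sketch of the stated values, not the actual file:

```xml
<!-- Sketch only: autocommit as described (5 minutes, searcher opened on commit) -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog/>
  <autoCommit>
    <maxTime>300000</maxTime>        <!-- 5 minutes, in ms -->
    <openSearcher>true</openSearcher>
  </autoCommit>
</updateHandler>
```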

Re: SolrCloud 6.6 stability challenges

Emir Arnautović
Hi Rick,
Do you see any errors in logs? Do you have any monitoring tool? Maybe you can check heap and GC metrics around the time the incident happened. It is not a large heap, but a major GC could cause a pause long enough to trigger a snowball effect and end up with the node in a recovering state.
What indexing rate do you observe? Why do you have max warming searchers set to 5 (is that what you meant by autowarmingsearchers?) when you commit every 5 minutes? Why did you increase it - did you see errors with the default of 2? Are you perhaps committing after every bulk?
Do you see similar behaviour when you just do indexing, without queries?
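If no monitoring tool is in place, long stop-the-world pauses can be pulled straight out of Solr's GC log; a minimal Python sketch, assuming the `-XX:+PrintGCApplicationStoppedTime` flag is enabled (it is in Solr's default GC logging options, to my knowledge). The sample lines are made up for illustration:

```python
import re

# Hypothetical excerpt of a CMS GC log; replace SAMPLE with the contents
# of solr_gc.log from the node that went into recovery.
SAMPLE = """\
2017-10-02T12:18:59.101+0000: 5804.321: Total time for which application threads were stopped: 0.0031245 seconds, Stopping threads took: 0.0000891 seconds
2017-10-02T12:19:06.877+0000: 5812.097: Total time for which application threads were stopped: 4.2190553 seconds, Stopping threads took: 0.0001102 seconds
"""

PAUSE_RE = re.compile(r"threads were stopped: ([0-9.]+) seconds")

def long_pauses(log_text, threshold_s=1.0):
    """Return all stop-the-world pauses longer than threshold_s seconds."""
    return [float(m.group(1)) for m in PAUSE_RE.finditer(log_text)
            if float(m.group(1)) > threshold_s]

print(long_pauses(SAMPLE))  # → [4.2190553]
```

Any pause approaching the ZooKeeper session or inter-node request timeouts is a candidate trigger for the recovery snowball described above.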

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Nov 2017, at 05:15, Rick Dig <[hidden email]> wrote:
>
> hello all,
> we are trying to run solrcloud 6.6 in a production setting.
> here's our config and issue
> 1) 3 nodes, 1 shard, replication factor 3
> 2) all nodes are 16GB RAM, 4 core
> 3) Our production load is about 2000 requests per minute
> 4) index is fairly small, index size is around 400 MB with 300k documents
> 5) autocommit is currently set to 5 minutes (even though ideally we would
> like a smaller interval).
> 6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
> 7) all of this runs perfectly ok when indexing isn't happening. as soon as
> we start "nrt" indexing one of the follower nodes goes down within 10 to 20
> minutes. from this point on the nodes never recover unless we stop
> indexing.  the master usually is the last one to fall.
> 8) there are maybe 5 to 7 processes indexing at the same time with document
> batch sizes of 500.
> 9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
> 10) no cpu and / or oom issues that we can see.
> 11) cpu load does go fairly high 15 to 20 at times.
> any help or pointers appreciated
>
> thanks
> rick


Re: SolrCloud 6.6 stability challenges

Rick Dig

hi Emir,
thanks for the response - 
a) we see this once in a while when the node goes down, nothing at other times. 
ERROR - 2017-10-02 12:19:07.222; [c:rbconfig s:shard1 r:core_node4 x:rbconfig_shard1_replica4] org.apache.solr.common.SolrException; org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:972)
at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1911)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:78)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) 

b) gc log attached. looks ok as far as i can tell

c) we actually were running default 2 maxWarmingSearchers, just experimented with 5, failing in both cases. 

d) making sure we don't commit every bulk.

e) we are able to index 500 documents in around 20 seconds.

f) when we index without queries it works just fine without any issues. queries work fine as well by themselves. just the two together causes nodes to go down.





Attachment: solr_gc.log.0.current.zip (462K)

Re: SolrCloud 6.6 stability challenges

Amrit Sarkar
In reply to this post by Emir Arnautović
Pretty much what Emir has stated. I want to know more about this statement of yours:

all of this runs perfectly ok when indexing isn't happening. as soon as
> we start "nrt" indexing one of the follower nodes goes down within 10 to 20
> minutes.


When you say "NRT" indexing, what is the commit strategy? With
autoCommit set so high, are you committing after each batch, and if so,
with what frequency?
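For what it's worth, the "don't commit per batch" pattern looks roughly like this on the client side; a hypothetical sketch, where the chunking helper and request params are illustrative, not Rick's actual indexer:

```python
def batches(docs, size=500):
    """Yield fixed-size batches; post each to /update WITHOUT commit=true,
    leaving commits to Solr's autoCommit (or a commitWithin parameter)."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical per-request params: no explicit commit; commitWithin (ms)
# is optional and shown only as an illustration.
params = {"wt": "json", "commitWithin": 300000}

print([len(b) for b in batches(range(1200))])  # → [500, 500, 200]
```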

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2


Re: SolrCloud 6.6 stability challenges

Rick Dig
not committing after the batch. made sure we have that turned off.
maxTime is set to 300000 (300 seconds), openSearcher is set to true.



Re: SolrCloud 6.6 stability challenges

Emir Arnautović
Hi Rick,
I quickly looked at the GC logs and didn’t see obvious issues. You mentioned that a batch of 500 documents takes ~20s to process. With 5-7 indexing threads that is ~150 documents/s. Are those big documents?
With 2000 queries/min (~33 queries/s - what sort of queries?) and 5-7 indexing threads, you might be overloading 4 cores.
Do you have dedicated ZK nodes? Do you see the same issues with fewer indexing threads?
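The back-of-envelope numbers above can be reproduced; the figures come from the thread, and the thread count of 6 is an assumed midpoint of "5 to 7 processes":

```python
# Rough throughput check from the numbers reported in the thread.
batch_docs, batch_secs = 500, 20      # 500-doc batch indexed in ~20 s
threads = 6                           # assumed midpoint of 5-7 processes
index_rate = batch_docs / batch_secs * threads   # aggregate docs/s

queries_per_min = 2000                # stated production query load
qps = queries_per_min / 60

print(index_rate, round(qps, 1))  # → 150.0 33.3
```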

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: SolrCloud 6.6 stability challenges

Shawn Heisey-2
In reply to this post by Rick Dig

My two cents to add to what you've already seen:

With 300K documents and 400MB of index size, an 8GB heap seems very
excessive, even with complex queries.  What evidence do you have that
you need a heap that size?  Are you just following a best practice
recommendation you saw somewhere to give half your memory to Java?

This is a *tiny* index by both document count and size.  Each document
cannot be very big.

Your GC log doesn't show any issues that concern me.  There are a few
slow GCs, but when you index, that's probably to be expected, especially
with an 8GB heap.

What exactly do you mean by "one of the follower nodes goes down"?  When
this happens, are there error messages at the time of the event?  What
symptoms are there pertaining to that specific node?

A query load of 2000 per minute is about 33 per second.  Are these
queries steady for the full minute, or is it bursty?  33 qps is high,
but not insane, and with such a tiny index, is probably well within
Solr's capabilities.

There should be no reason to *ever* increase maxWarmingSearchers.  If
you see the warning about this, the fix is to reduce your commit
frequency, not increase the value.  Increasing the value can lead to
memory and performance problems.  The fact that this value is even being
discussed, and that the value has been changed on your setup, has me
thinking that there may be more commits happening than the
every-five-minute autocommit.

For automatic commits, I have some recommendations for everyone to start
with, and then adjust if necessary:  autoCommit: maxTime of 60000,
openSearcher false.  autoSoftCommit, maxTime of 120000.  Neither one
should have maxDocs configured.
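As a sketch, that starting-point recommendation looks like this inside solrconfig.xml's updateHandler section:

```xml
<!-- Suggested starting point: frequent invisible hard commits,
     searcher visibility handled by soft commits -->
<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit every 60 s -->
  <openSearcher>false</openSearcher> <!-- do not open a searcher on hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>120000</maxTime>          <!-- new searcher (visibility) every 120 s -->
</autoSoftCommit>
```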

It should take far less than 20 seconds to index a 500 document batch,
especially when they are small enough for 300K of them to produce a
400MB index.  There are only a few problems I can imagine right now that
could cause such slow indexing, having no real information to go on:  1)
The analysis chains in your schema are exceptionally heavy and take a
long time to run.  2) There is a performance issue happening that we
have not yet figured out.  3) Your indexing request includes a commit,
and the commit is happening very slowly.

Here is a log entry on one of my indexes showing 1000 documents being
added in 777 milliseconds.  The index that this is happening on is about
40GB in size, with about 30 million documents.  I have redacted part of
the uniqueKey values in this log, to hide the sources of our data:

2017-11-04 09:30:14.325 INFO  (qtp1394336709-42397) [   x:spark6live]
o.a.s.u.p.LogUpdateProcessorFactory [spark6live]  webapp=/solr
path=/update params={wt=javabin&version=2}{add=[REDACTEDsix557224
(1583127266377859072), REDACTEDsix557228 (1583127266381004800),
REDACTEDtwo979483 (1583127266381004801), REDACTEDtwo979488
(1583127266382053376), REDACTEDtwo979490 (1583127266383101952),
REDACTEDsix557260 (1583127266383101953), REDACTEDsix557242
(1583127266384150528), REDACTEDsix557258 (1583127266385199104),
REDACTEDsix557247 (1583127266385199105), REDACTEDsix557276
(1583127266394636288), ... (1000 adds)]} 0 777

The rate I'm getting here of 1000 docs in 777 milliseconds is a rate
that I consider to be pretty slow, especially because my indexing is
single-threaded.  But it works for us.  What you're seeing where 500
documents takes 20 seconds is slower than I've EVER seen, except in
situations where there's a serious problem.  On a system in good health,
with multiple threads indexing, Solr should be able to index several
thousand documents every second.
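The add count and QTime at the end of such a LogUpdateProcessorFactory line can be parsed out to get a docs/sec figure; a sketch against an abridged, hypothetical copy of the line quoted above:

```python
import re

# Abridged stand-in for the LogUpdateProcessorFactory entry above;
# the trailing "0 777" is the status code and QTime in milliseconds.
LOG_LINE = "... {add=[doc1 (1583127266377859072), ... (1000 adds)]} 0 777"

m = re.search(r"\((\d+) adds\)\]\}\s+\d+\s+(\d+)$", LOG_LINE)
adds, qtime_ms = int(m.group(1)), int(m.group(2))
rate = adds / (qtime_ms / 1000)  # documents per second

print(adds, qtime_ms, round(rate))  # → 1000 777 1287
```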

Is the indexing program running on the same machine as Solr, or on
another machine?  For best results, it should be on a different machine,
accessing Solr via HTTP.  This is so that whatever load the indexing
program creates does not take CPU, memory, and I/O resources away from Solr.

What OS is Solr running on?  If more information is needed, it will be a
good idea to know precisely how to gather that information.

Overall, based on the information currently available, you should not be
having the problems you are.  So there must be something about your
setup that's not configured correctly beyond the information we've
already got.  It could be directly Solr-related, or something else
indirectly causing problems.  I do not yet know exactly what information
we might need to help.

Can you share an entire solr.log file that covers enough time so that
there is both indexing and querying happening?  If it also covers that
node going down, that would be even better.  You'll probably need to use
a file-sharing website to share the log -- I'm surprised your GC log
made it to the list.

Thanks,
Shawn

Re: SolrCloud 6.6 stability challenges

Erick Erickson
Check the leader and follower logs for anything like "leader initiated
recovery" (LIR). One thing I have seen where followers go into
recovery is if, for some reason, the time it takes to respond to an
update exceeds the timeout. The scenario is this:
> leader sends an update
> follower fails to respond for _any_ reason within the timeout
> leader says "sick follower, make it recover"

In the particular case I'm thinking of, indexing the packet took
minutes. I strongly doubt that your documents are pathological enough
to hit this, but there's at least a chance that updates are queueing
up on the follower and timing out.

Best,
Erick
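If update responses are indeed timing out as in that scenario, the relevant knobs are the distributed-update timeouts in solr.xml; a sketch, with example values that are illustrative rather than recommendations:

```xml
<!-- solr.xml sketch: timeouts for leader-to-replica update forwarding -->
<solrcloud>
  <int name="distribUpdateConnTimeout">60000</int>  <!-- connect timeout, ms -->
  <int name="distribUpdateSoTimeout">600000</int>   <!-- read timeout, ms -->
</solrcloud>
```

Raising these only masks the symptom, though; the "Read timed out" error earlier in the thread is exactly the failure mode this scenario describes.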



Re: SolrCloud 6.6 stability challenges

Rick Dig
In reply to this post by Shawn Heisey-2
hi Shawn, all,
answers inline.
also, another discovery, not sure if completely useful. even when we
increase the autocommit values to say an hour, the nodes go "down" in 10-15
minutes. so either we are doing something wrong with autocommit settings
and commits are continuing to happen frequently (how do we confirm that
this isn't the case?) or they don't seem to matter to our scenario.
but like i mentioned before, querying and indexing work completely fine
when the other thing isn't running.

also, roughly 60% of the queries are "autocomplete" on a N-gram multivalued
field, which uses payload parsing as well as highlighting.

thanks



> My two cents to add to what you've already seen:
>
> With 300K documents and 400MB of index size, an 8GB heap seems very
> excessive, even with complex queries.  What evidence do you have that you
> need a heap that size?  Are you just following a best practice
> recommendation you saw somewhere to give half your memory to Java?
>
*no hard evidence per se but pretty much a best practice recommendation. we
were allocating half of the memory available on the machine to the heap.*


>
> This is a *tiny* index by both document count and size.  Each document
> cannot be very big.
>
> Your GC log doesn't show any issues that concern me.  There are a few slow
> GCs, but when you index, that's probably to be expected, especially with an
> 8GB heap.
>
> What exactly do you mean by "one of the follower nodes goes down"?  When
> this happens, are there error messages at the time of the event?  What
> symptoms are there pertaining to that specific node?
>
*When a "node" goes down - the jvm continues to run but zookeeper shows the
node as "down" and queries stop being routed to that node.*
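That "down" state as seen by ZooKeeper can be confirmed via the Collections API's CLUSTERSTATUS action; a sketch that flags non-active replicas, using a made-up trimmed response (in practice, fetch /solr/admin/collections?action=CLUSTERSTATUS&wt=json from the cluster):

```python
# Hypothetical trimmed CLUSTERSTATUS response for the rbconfig collection.
SAMPLE = {
    "cluster": {"collections": {"rbconfig": {"shards": {"shard1": {"replicas": {
        "core_node1": {"state": "active", "leader": "true"},
        "core_node4": {"state": "down"},
    }}}}}}
}

def down_replicas(status, collection):
    """Return the names of all replicas that are not in the active state."""
    shards = status["cluster"]["collections"][collection]["shards"]
    return [name
            for shard in shards.values()
            for name, replica in shard["replicas"].items()
            if replica["state"] != "active"]

print(down_replicas(SAMPLE, "rbconfig"))  # → ['core_node4']
```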



>
> A query load of 2000 per minute is about 33 per second.  Are these queries
> steady for the full minute, or is it bursty?  33 qps is high, but not
> insane, and with such a tiny index, is probably well within Solr's
> capabilities.
>
*the query volumes vary but we have never seen them go beyond 2000 per
minute. bursts are possible but haven't seen any massive bursts.*


>
> There should be no reason to *ever* increase maxWarmingSearchers.  If you
> see the warning about this, the fix is to reduce your commit frequency, not
> increase the value.  Increasing the value can lead to memory and
> performance problems.  The fact that this value is even being discussed,
> and that the value has been changed on your setup, has me thinking that
> there may be more commits happening than the every-five-minute autocommit.
>
*this was just a trial and error change, which we have since reverted,
because we have been struggling to understand the root cause.*

>
> For automatic commits, I have some recommendations for everyone to start
> with, and then adjust if necessary:  autoCommit: maxTime of 60000,
> openSearcher false.  autoSoftCommit, maxTime of 120000.  Neither one should
> have maxDocs configured.
>
> It should take far less than 20 seconds to index a 500 document batch,
> especially when they are small enough for 300K of them to produce a 400MB
> index.  There are only a few problems I can imagine right now that could
> cause such slow indexing, having no real information to go on:  1) The
> analysis chains in your schema are exceptionally heavy and take a long time
> to run.  2) There is a performance issue happening that we have not yet
> figured out.  3) Your indexing request includes a commit, and the commit is
> happening very slowly.
> *My apologies; the 20-second time for a batch is somewhat misleading,
> because it includes the time taken by the application logic to construct
> the data.*




> Here is a log entry on one of my indexes showing 1000 documents being
> added in 777 milliseconds.  The index that this is happening on is about
> 40GB in size, with about 30 million documents.  I have redacted part of the
> uniqueKey values in this log, to hide the sources of our data:
>
> 2017-11-04 09:30:14.325 INFO  (qtp1394336709-42397) [   x:spark6live]
> o.a.s.u.p.LogUpdateProcessorFactory [spark6live]  webapp=/solr
> path=/update params={wt=javabin&version=2}{add=[REDACTEDsix557224
> (1583127266377859072), REDACTEDsix557228 (1583127266381004800),
> REDACTEDtwo979483 (1583127266381004801), REDACTEDtwo979488
> (1583127266382053376), REDACTEDtwo979490 (1583127266383101952),
> REDACTEDsix557260 (1583127266383101953), REDACTEDsix557242
> (1583127266384150528), REDACTEDsix557258 (1583127266385199104),
> REDACTEDsix557247 (1583127266385199105), REDACTEDsix557276
> (1583127266394636288), ... (1000 adds)]} 0 777
>
> The rate I'm getting here of 1000 docs in 777 milliseconds is a rate that
> I consider to be pretty slow, especially because my indexing is
> single-threaded.  But it works for us.  What you're seeing where 500
> documents takes 20 seconds is slower than I've EVER seen, except in
> situations where there's a serious problem.  On a system in good health,
> with multiple threads indexing, Solr should be able to index several
> thousand documents every second.
>
> Is the indexing program running on the same machine as Solr, or on another
> machine?  For best results, it should be on a different machine, accessing
> Solr via HTTP.  This is so that whatever load the indexing program creates
> does not take CPU, memory, and I/O resources away from Solr.
>
*The indexing programs run on a different machine, and Solr is accessed via
HTTP.*


>
> What OS is Solr running on?  If more information is needed, it will be a
> good idea to know precisely how to gather that information.
>
*Solr is running on Ubuntu.*

>
> Overall, based on the information currently available, you should not be
> having the problems you are.  So there must be something about your setup
> that's not configured correctly beyond the information we've already got.
> It could be directly Solr-related, or something else indirectly causing
> problems.  I do not yet know exactly what information we might need to help.
>
> Can you share an entire solr.log file that covers enough time so that
> there is both indexing and querying happening?  If it also covers that node
> going down, that would be even better.  You'll probably need to use a
> file-sharing website to share the log -- I'm surprised your GC log made it
> to the list.
> *We will upload a log and share it with the group.*



> Thanks,
> Shawn
>

Re: SolrClould 6.6 stability challenges

Rick Dig
In reply to this post by Emir Arnautović
Hi Emir,
The average document size is less than 1.5 KB.
It is actually 2000 queries/min; the queries are primarily autocomplete plus
highlighting (on a multivalued field with different payloads), search, and
faceting.
What should we watch for that would indicate we are overloading the CPU
cores? (CPU peaks at 75%, but as I mentioned earlier, we have seen "load" go
up to 20; not sure whether this has an impact.)
Yes, we have dedicated ZK nodes.
Yes, we predictably encounter this issue even with just one indexing thread.
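As a rough rule of thumb (not stated in the thread itself): a Unix load average meaningfully above the core count means runnable tasks are queuing for CPU. A minimal sketch of that ratio, using the figures quoted above:

```python
import os

def load_pressure(load_avg, cores=None):
    """Ratio of load average to CPU cores; values above 1.0 mean
    runnable tasks are waiting for a core."""
    cores = cores or os.cpu_count() or 1
    return load_avg / cores

# With the numbers from this thread: load ~20 on a 4-core box.
print(load_pressure(20, cores=4))  # 5.0 -> heavily oversubscribed
```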

thanks


On Sun, Nov 5, 2017 at 3:12 PM, Emir Arnautović <
[hidden email]> wrote:

> Hi Rick,
> I quickly looked at GC logs and didn’t see obvious issues. You mentioned
> that batch processing takes ~20s for 500 documents. With 5-7 indexing
> threads that is ~150 documents/s. Are those big documents?
> With 200 queries/min (~3-4 queries/s - what sort of queries?) and 5-7
> indexing threads, you might be overloading 4 cores.
> Do you have dedicated ZK nodes? Do you see the same issues with less
> indexing threads?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 4 Nov 2017, at 14:25, Rick Dig <[hidden email]> wrote:
> >
> > We are not committing after the batch; we made sure that is turned off.
> > maxTime is set to 300000 (300 seconds), and openSearcher is set to true.
> >
> >
> > On Sat, Nov 4, 2017 at 6:50 PM, Amrit Sarkar <[hidden email]>
> wrote:
> >
> >> Pretty much what Emir has stated. I want to know, when you said:
> >>
> >>> all of this runs perfectly ok when indexing isn't happening. as soon as
> >>> we start "nrt" indexing one of the follower nodes goes down within 10 to
> >>> 20 minutes.
> >>
> >> When you say "NRT" indexing, what is the commit strategy during
> >> indexing? With auto-commit set so high, are you committing after each
> >> batch? If yes, what is the number?
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Sat, Nov 4, 2017 at 2:47 PM, Emir Arnautović <
> >> [hidden email]> wrote:
> >>
> >>> Hi Rick,
> >>> Do you see any errors in the logs? Do you have any monitoring tool?
> >>> Maybe you can check heap and GC metrics around the time the incident
> >>> happened. It is not a large heap, but a major GC could cause a pause
> >>> large enough to trigger a snowball effect and end with the node in a
> >>> recovery state.
> >>> What indexing rate do you observe? Why do you have maxWarmingSearchers
> >>> set to 5 (did you mean this by autowarmingsearchers?) when you commit
> >>> every 5 min? Why did you increase it; did you see errors with the
> >>> default of 2? Maybe you commit on every bulk?
> >>> Do you see similar behaviour when you just do indexing without queries?
> >>>
> >>> Thanks,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>>