SolrCloud scaling/optimization for high request rate

Re: Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk

Hi Ere,

Thanks for your advice! I'm aware of the performance problems with deep paging, but unfortunately that's not the case here: the rows parameter is always 24, and subsequent pages are hardly ever requested, from what I see in the logs.
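For anyone who does run into the deep-paging problem Ere warns about, Solr's cursorMark parameter is the usual workaround for large start offsets. A minimal sketch of building such requests (the query and the uniqueKey field name "id" are hypothetical):

```python
from urllib.parse import urlencode

def cursor_page_params(q: str, cursor: str = "*", rows: int = 24) -> str:
    """Build the query string for one cursorMark page. The sort must be
    deterministic and include the uniqueKey field (assumed to be 'id')."""
    return urlencode({
        "q": q,
        "rows": rows,
        "sort": "score desc, id asc",
        "cursorMark": cursor,  # pass nextCursorMark from the previous response
        "wt": "json",
    })

first_page = cursor_page_params("laptop")
# Next page: cursor_page_params("laptop", cursor=response["nextCursorMark"])
```

Unlike start=N, each cursor request only collects rows documents per shard, so cost stays flat no matter how deep you page.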


On 29.10.18 11:19, Ere Maijala wrote:
Hi Sofiya,

You've already received a lot of ideas, but I think this wasn't yet mentioned: you didn't specify the number of rows your queries fetch or whether you're using deep paging in the queries. Both can be real performance killers in a sharded index, because a large set of records has to be fetched from all shards. This consumes a relatively high amount of memory, and even if the servers are able to handle a certain number of these queries simultaneously, you'd run into garbage collection trouble with more queries being served. So just one more thing to be aware of!

Regards,
Ere

Sofiya Strochyk kirjoitti 26.10.2018 klo 18.55:
Hi everyone,

We have a SolrCloud setup with the following configuration:

  * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
    E5-1650v2, 12 cores, with SSDs)
  * One collection, 4 shards, each has only a single replica (so 4
    replicas in total), using compositeId router
  * Total index size is about 150M documents/320GB, so about 40M/80GB
    per node
  * Zookeeper is on a separate server
  * Documents consist of about 20 fields (most of them are both stored
    and indexed); average document size is about 2 kB
  * Queries are mostly 2-3 words in the q field, with 2 fq parameters,
    with complex sort expression (containing IF functions)
  * We don't use faceting due to performance reasons but need to add it
    in the future
  * The majority of documents are reindexed twice a day, as fast as
    Solr allows, in batches of 1000-10000 docs. Some documents are
    also deleted (by id, not by query)
  * autoCommit is set to maxTime of 1 minute with openSearcher=false and
    autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits
    from clients are ignored.
  * Heap size is set to 8GB.
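For reference, the commit policy above maps to solrconfig.xml roughly as follows. This is a sketch inferred from the description; the processor chain for ignoring client commits is one common approach, shown here as an assumption rather than the actual configuration:

```xml
<!-- Hard commit every 60s: flushes to disk, does not open a new searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit every 30 min: opens a new searcher, making changes visible -->
<autoSoftCommit>
  <maxTime>1800000</maxTime>
</autoSoftCommit>

<!-- One way to ignore explicit commits sent by clients -->
<updateRequestProcessorChain name="ignore-client-commits" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```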

Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second. Whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (It also seems that the request rate reported by Solr in the admin metrics is 2x higher than the real one, because for every query each shard receives 2 requests: one to obtain IDs and a second one to get data by IDs; so the target rate in Solr's metrics would be 1000 qps.)

During high request load, CPU usage increases dramatically on the SOLR nodes. It doesn't reach 100% but averages at 50-70% on 3 servers and about 93% on 1 server (random server each time, not the smallest one).

The documentation mentions replication to spread the load between the servers. We tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." They then proceeded with replication, which takes hours, and the leader handles all requests single-handedly during that time. Both leaders and replicas also started encountering OOM errors (heap space) for an unknown reason. Heap dump analysis shows that most of the memory is consumed by the [J (array of long) type; my best guess is that it is the "_version_" field, but it's still unclear why this happens. Also, even though request rate and CPU usage drop by half with replication, it doesn't seem to affect the mean_ms, stddev_ms or p95_ms numbers (p75_ms is much smaller on nodes with replication, but still not as low as under a load of <100 requests/s).

Garbage collection is much more active during high load as well; full GC happens almost exclusively during those times. We have tried tuning GC options as suggested here <https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector>, but it didn't change things.

My questions are

  * How do we increase throughput? Is replication the only solution?
  * if yes - then why doesn't it affect response times, considering that
    CPU is not 100% used and index fits into memory?
  * How to deal with OOM and replicas going into recovery?
  * Is memory or CPU the main problem? (When searching on the internet,
    I never see CPU as the main bottleneck for Solr, but our case might
    be different)
  * Or do we need smaller shards? Could segments merging be a problem?
  * How to add faceting without search queries slowing down too much?
  * How to diagnose these problems and narrow down to the real reason in
    hardware or setup?

Any help would be much appreciated.

Thanks!

-- 
Sofiia Strochyk

[hidden email]
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>




Re: SolrCloud scaling/optimization for high request rate

Erick Erickson
In reply to this post by Sofiya Strochyk
Speaking of your caches... Either it's a problem with the metrics
reporting or your warmup times are very, very long. 11 seconds and, er,
52 seconds! My guess is that you have your autowarm counts set to a
very high number and are consuming a lot of CPU time every time a
commit happens. Which will only happen when indexing. I usually start
autowarms for these caches at < 20.

Quick note on autowarm: These caches are a map with the key being the
query and the value being some representation of the docs that satisfy
it. Autowarming just replays the most recently used N of these.
documentCache can't be autowarmed, so we can ignore it.

So in your case, the main value of the queryResultCache is to read
into memory all of the parts of the index to satisfy them, including,
say, the sort structures (docValues), the index terms and, really,
whatever is necessary. Ditto for the filterCache.

The queryResultCache was originally intended to support paging; it
only holds a few doc IDs per query, so memory-wise it's pretty
insignificant. Your hit ratio indicates you're not paging. All that
said, the autowarm bit is much more important, so I wouldn't disable
it entirely.

Each filterCache entry is bounded by maxDoc/8 size-wise (plus some
extra, but that's the number that usually counts). It may be smaller
for sparse result sets but we can ignore that for now. You usually
want this as small as possible and still get a decent hit ratio.
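To put the maxDoc/8 bound in perspective for this index (~40M docs per shard, per the original post), here is a back-of-the-envelope calculation; the cache size of 512 is a hypothetical illustration, not the actual configuration:

```python
def filter_cache_bytes(max_doc: int, entries: int) -> int:
    """Upper bound on filterCache memory: each entry is a bitset of
    maxDoc bits, i.e. maxDoc/8 bytes (ignoring sparse-set savings)."""
    return (max_doc // 8) * entries

per_entry = filter_cache_bytes(40_000_000, 1)      # 5 MB per cached filter
full_cache = filter_cache_bytes(40_000_000, 512)   # ~2.5 GB if fully populated
```

Against an 8 GB heap, a fully populated cache of that hypothetical size would account for a large fraction of memory, which is why keeping it as small as possible while retaining a decent hit ratio matters.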

The entire purpose of autowarm is so that the _first_ query that's run
after a commit (hard with openSearcher=true or soft) isn't noticeably
slower due to having to initially load parts of the index into memory.
As the autowarm count goes up you pretty quickly hit diminishing
returns.

Now, all that may not be the actual problem, but here's a quick way to test:

turn your autowarm counts off. What you should see is a correlation
between when a commit happens and a small spike in response time for
the first few queries, but otherwise a better query response profile.
If that's true, try gradually increasing the autowarm count 10 at a
time. My bet: If this is germane, you'll pretty soon see no difference
in response times as you increase your autowarm count. I.e. there'll
be no noticeable difference between 20 and 30 for instance. And your
autowarm times will be drastically smaller. And most of the CPU you're
expending on autowarming will be freed up to actually satisfy user
queries.
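As a concrete starting point for that experiment, the cache entries in solrconfig.xml would look roughly like this; the sizes and cache classes below are placeholders, not recommendations:

```xml
<!-- Step 1: set autowarmCount="0" and watch for a small post-commit spike -->
<!-- Step 2: raise it in steps of ~10 until response times stop improving -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```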

If any of that speculation is borne out, you have something that'll
help. Or you have another blind alley ;)

Best
Erick

On Mon, Oct 29, 2018 at 8:00 AM Sofiya Strochyk <[hidden email]> wrote:

>
> Hi Deepak and thanks for your reply,
>
>
> On 27.10.18 10:35, Deepak Goel wrote:
>
>
> Last, what is the nature of your requests? Are the queries the same, or are they very random? Random queries would need more tuning than if the queries are the same.
>
> The search term (q) is different for each query, and filter query terms (fq) are repeated very often (so we have a very low cache hit ratio for the query result cache, and a very high hit ratio for the filter cache).
>
> --
> Sofiia Strochyk
>
>
>
> [hidden email]
>
> www.interlogic.com.ua
>
>

Re: SolrCloud scaling/optimization for high request rate

Deepak Goel
In reply to this post by Sofiya Strochyk
I would then suspect performance is choking in the memory/disk layer. Can you please check the performance there?


Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk
In reply to this post by Erick Erickson

Sure, I can test that; I'll set it to zero now :)

We never tried a small number for the autowarming parameter, but it ran with zero (the default value) for a while before being changed to 64, and startup after a commit was a bit slow. Overall, though, there was rather little difference between 0 and 64, so the spike after a commit could be related just to the heavy searcher-opening operation, which autowarming can't affect.


On 29.10.18 17:20, Erick Erickson wrote:

Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk
In reply to this post by Deepak Goel

Could you please clarify what the memory/disk layer is? Do you mean swapping from memory to disk, reading from disk into memory, or something else?


On 29.10.18 17:20, Deepak Goel wrote:
I would then suspect performance is choking in memory disk layer. can you please check the performance?



Re: SolrCloud scaling/optimization for high request rate

Erick Erickson
Sofiya:

The interval between when a commit happens and when all the autowarm
queries are finished is 52 seconds for the filterCache. I've rarely
seen warming take that long unless something's very unusual. I'd
actually be very surprised if you're really only firing 64 autowarm
queries and it's taking almost 52 seconds.

However, if it really is taking that long for that few queries, then
most of what I think I know about this subject is probably wrong
anyway. I guess it could just be CPU starvation in that case. At any
rate, it's an anomaly that should get an explanation.

I suppose the other approach would be to measure the autowarm time
when there's not much querying going on. If you get autowarm times of,
say, 40 seconds when there's no querying going on, then it's not the
number of autowarm queries but "something about your queries" that's
slowing things down; at least that gives you a place to start looking.

And I should have mentioned that when I think about excessive autowarm
counts, I've seen them in the thousands; 64 isn't really excessive,
even if diminishing returns start around 10-20...

Good Luck!
Erick


Re: SolrCloud scaling/optimization for high request rate

Deepak Goel
In reply to this post by Sofiya Strochyk
Yes. Swapping from disk to memory & vice versa


Deepak
"The greatness of a nation can be judged by the way its animals are treated. Please consider stopping the cruelty by becoming a Vegan"


"Plant a Tree, Go Green"




Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk

My swappiness is set to 10, swap is almost unused (used space is on the scale of a few MB), and there is no swap IO.

There is disk IO like this, though:

https://upload.cc/i1/2018/10/30/43lGfj.png
https://upload.cc/i1/2018/10/30/T3u9oY.png

However, CPU iowait is still zero, so I'm not sure whether the disk IO is introducing any kind of delay...

On 30.10.18 10:21, Deepak Goel wrote:
Yes. Swapping from disk to memory & vice versa



Re: SolrCloud scaling/optimization for high request rate

Shawn Heisey-2
In reply to this post by Erick Erickson
On 10/29/2018 8:56 PM, Erick Erickson wrote:
> The interval between when a commit happens and when all the autowarm
> queries are finished is 52 seconds for the filterCache. I've rarely
> seen warming take that long unless something's very unusual. I'd
> actually be very surprised if you're really only firing 64 autowarm
> queries and it's taking almost 52 seconds.

I wouldn't be surprised.  The servers I used to manage had a filterCache
autowarmCount of *four*.  Commits sometimes still took up to 15 seconds
to happen, and almost all of that time was filterCache warming.  Before
I reduced autowarmCount, it could take a VERY long time on occasion for
commits.  Some users had such large filters that I was forced to
increase the max allowed header size for Solr's container to 32K, so
that the occasional query with a 20K URL size could be handled.
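As an aside, an alternative to raising header limits is to send such queries as an HTTP POST, so the large filter travels in the request body rather than the URL. A sketch under assumed values (the host, collection name, and synthetic filter below are made up):

```python
from urllib.parse import urlencode
from urllib.request import Request

# A very large fq that would overflow typical URL/header limits as a GET
huge_fq = "id:(" + " OR ".join(str(i) for i in range(2000)) + ")"

body = urlencode({"q": "*:*", "fq": huge_fq,
                  "rows": "24", "wt": "json"}).encode()

# urllib switches to POST automatically when a body is supplied
req = Request("http://localhost:8983/solr/mycollection/select",
              data=body,
              headers={"Content-Type": "application/x-www-form-urlencoded"})
# urlopen(req) would send it; the filter never appears in the URL or headers.
```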

Thanks,
Shawn


Re: SolrCloud scaling/optimization for high request rate

Deepak Goel
In reply to this post by Sofiya Strochyk
Please see inline...





On Tue, Oct 30, 2018 at 5:21 PM Sofiya Strochyk <[hidden email]> wrote:

My swappiness is set to 10, swap is almost not used (used space is on scale of a few MB) and there is no swap IO.

There is disk IO like this, though:

https://upload.cc/i1/2018/10/30/43lGfj.png
https://upload.cc/i1/2018/10/30/T3u9oY.png 

****** 
The time window covered by the data is too short. Can you provide larger timeframes?
******


However, CPU iowait is still zero, so I'm not sure whether the disk IO is introducing any kind of delay...

******
Can you provide graphs for CPU iowait too? (for larger timeframes)
******

Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk

Sure, here is IO for the bigger machine:

https://upload.cc/i1/2018/10/30/tQovyM.png

for the smaller machine:

https://upload.cc/i1/2018/10/30/cP8DxU.png

CPU utilization including iowait:

https://upload.cc/i1/2018/10/30/eSs1YT.png

iowait only:

https://upload.cc/i1/2018/10/30/CHgx41.png



Re: SolrCloud scaling/optimization for high request rate

Shawn Heisey-2
In reply to this post by Sofiya Strochyk
On 10/29/2018 7:24 AM, Sofiya Strochyk wrote:
> Actually the smallest server doesn't look bad in terms of performance,
> it has been consistently better that the other ones (without
> replication) which seems a bit strange (it should be about the same or
> slightly worse, right?). I guess the memory being smaller than index
> doesn't cause problems due to the fact that we use SSDs.

SSD, while fast, is nowhere near as fast as main memory. As I said, the
memory numbers might cause performance problems, or they might not. 
Glad you're in the latter category.

> What if we are sending requests to machine which is part of the
> cluster but doesn't host any shards? Does it handle the initial
> request and merging of the results, or this has to be handled by one
> of the shards anyway?
> Also i was thinking "more shards -> each shard searches smaller set of
> documents -> search is faster". Or is the overhead for merging results
> bigger than overhead from searching larger set of documents?

If every shard is on its own machine, many shards might not be a
performance bottleneck with a high query rate.  The more shards you
have, the more the machine doing the aggregation must do to produce results.

SolrCloud complicates the situation further.  It normally does load
balancing of all requests that come in across the cloud.  So the machine
handling the request might not be the machine where you SENT the request.

>> Very likely the one with a higher load is the one that is aggregating
>> shard requests for a full result.
> Is there a way to confirm this? Maybe the aggregating shard is going
> to have additional requests in its solr.log?

The logfiles on your servers should be verbose enough to indicate what
machines are handling which parts of the request.

>> Most Solr performance issues are memory related.  With an extreme
>> query rate, CPU can also be a bottleneck, but memory will almost
>> always be the bottleneck you run into first.
> This is the advice i've seen often, but how exactly can we run out of
> memory if total RAM is 128, heap is 8 and index size is 80. Especially
> since node with 64G runs just as fine if not better.

Even when memory is insufficient, "running out" of memory generally
doesn't happen unless the heap is too small. Java will work within the
limits imposed by the system if it can. For the OS disk cache, the OS
tries to be as smart as it can about which data stays in the cache and
which data is discarded.

>> A lot of useful information can be obtained from the GC logs that
>> Solr's built-in scripting creates.  Can you share these logs?
>>
>> The screenshots described here can also be very useful for
>> troubleshooting:
>>
>> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
> I have attached some GC logs and screenshots, hope these are helpful
> (can only attach small files)

Only one attachment made it to the list.  I'm surprised that ANY of them
made it -- usually they don't.  Generally you need to use a file sharing
website and provide links.  Dropbox is one site that works well.  Gist
might also work.

The GC log that made it through (solr_gc.log.7.1) is only two minutes
long.  Nothing useful can be learned from a log that short.  It is also
missing the information at the top about the JVM that created it, so I'm
wondering if you edited the file so it was shorter before including it.

Thanks,
Shawn


Re: Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk

The logfiles on your servers should be verbose enough to indicate what machines are handling which parts of the request.
Yes, generally I see the following entries in the logs:
  1. df=_text_&distrib=false&fl=_id&fl=score&shards.purpose=4&start=0&fsv=true&sort=<sort expression>fq=<fq expression>&shard.url=<shard IP and path>&rows=24&version=2&q=<q expression>&NOW=1540984948280&isShard=true&wt=javabin
  2. df=_text_&distrib=false&fl=<full list of fields>&shards.purpose=64&start=0&fq=<fq expression>&shard.url=<shard IP and path>&rows=24&version=2&q=<q expression>&NOW=1540984948280&ids=<list of IDs>&isShard=true&wt=javabin
  3. q=<q expression>&fl=<full list of fields>&start=0&sort=<sort expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
Request type #3 (the full request) is seen only once across all shards, and I suppose it is the original/aggregated request. The shard is different every time, so load balancing is working.
Request #1 (get IDs by query) is always present for one replica of each shard.
Request #2 (get fields by IDs) is, however, sometimes missing even though request #1 had a non-zero number of hits for that shard. I don't know whether this indicates a problem or is working as expected?
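For what it's worth, that behavior follows from how the two-phase distributed search merges results: the aggregator only sends the phase-2 "get fields by IDs" request to shards whose documents made the global top N. A simplified sketch of the merge logic (not Solr's actual code; shard names and scores are invented):

```python
import heapq

def two_phase_fetch_plan(shard_hits, rows):
    """shard_hits maps shard -> [(doc_id, score), ...] from phase 1.
    Returns which IDs to fetch from which shard in phase 2."""
    merged = heapq.nlargest(
        rows,
        ((score, shard, doc_id)
         for shard, hits in shard_hits.items()
         for doc_id, score in hits),
    )
    plan = {}
    for score, shard, doc_id in merged:
        plan.setdefault(shard, []).append(doc_id)
    return plan

# shard2 has a hit, but its document doesn't make the global top 2,
# so it receives no phase-2 request -- i.e. the observed gap is expected.
plan = two_phase_fetch_plan({"shard1": [("a", 9.0), ("b", 8.0)],
                             "shard2": [("c", 1.0)]}, rows=2)
```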
> Only one attachment made it to the list.  I'm surprised that ANY of them made it -- usually they don't.  Generally you need to use a file sharing website and provide links.  Dropbox is one site that works well.  Gist might also work.
>
> The GC log that made it through (solr_gc.log.7.1) is only two minutes long.  Nothing useful can be learned from a log that short.  It is also missing the information at the top about the JVM that created it, so I'm wondering if you edited the file so it was shorter before including it.
>
> Thanks,
> Shawn
You are right, sorry, I didn't know this :)
(there is a 1MB limitation on attachments, which is why I trimmed the log)
Here are the full GC logs: 1 2
and images: 1 2 3

--
Sofiia Strochyk
[hidden email]
InterLogic
www.interlogic.com.ua

Re: Re: SolrCloud scaling/optimization for high request rate

Toke Eskildsen-2
On Wed, 2018-10-31 at 13:42 +0200, Sofiya Strochyk wrote:
> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json

Not much to see here, perhaps because you are not allowed to share it?

Maybe we can try and isolate the cause? Could you try different runs,
where you change different components and tell us roughly how that
affects performance?

1) Only request simple sorting by score
2) Reduce rows to 0
3) Increase rows to 100
4) Set fl=id only

- Toke Eskildsen, Royal Danish Library



Re: Re: SolrCloud scaling/optimization for high request rate

Toke Eskildsen-2
So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.


What we have to work with is the redacted query
q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
and an earlier mention that sorting was complex.

My suggestions were to try

1) Only request simple sorting by score

If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.

2) Reduce rows to 0
3) Increase rows to 100

This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?

4) Set fl=id only

This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.

- Toke Eskildsen, Royal Danish Library

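Toke's four isolation runs above lend themselves to a small harness where each variant changes exactly one parameter against the baseline. A rough sketch, assuming a standard Solr /select endpoint; `base` stands in for the redacted production query, so all values below are placeholders:

```python
# Build one URL per test variant so each run differs from the baseline
# in exactly one parameter, matching suggestions 1-4 above.
from urllib.parse import urlencode

def test_variants(base_params):
    """Yield (label, params) pairs, each changing one variable at a time."""
    variants = {
        "baseline": {},
        "1_simple_sort": {"sort": "score desc"},
        "2_rows_0": {"rows": "0"},
        "3_rows_100": {"rows": "100"},
        "4_fl_id_only": {"fl": "id"},
    }
    for label, override in variants.items():
        yield label, {**base_params, **override}

base = {"q": "foo bar", "fq": "type:x", "sort": "<complex sort>",
        "rows": "24", "fl": "id,title,price", "wt": "json"}
urls = {label: "/solr/collection/select?" + urlencode(params)
        for label, params in test_variants(base)}
```

Each URL would then be fed to a load-testing tool so the variants see comparable traffic and cache conditions.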


Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk

Hi Toke,

sorry for the late reply. The query I wrote here is edited to hide production details, but I can post additional info if that helps.

I have tested all of the suggested changes, and none of them seems to make a noticeable difference (response time and other metrics fluctuate over time, and the changes caused by the different parameters are smaller than the fluctuations). What this probably means is that the heaviest task is retrieving IDs by query rather than fields by ID. I've also checked the QTime logged for these types of operations, and it is much higher for "get IDs by query" than for "get fields by IDs list". What could be done about this?
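As an aside, that per-phase QTime comparison can be pulled mechanically from the Solr request logs. A sketch under the assumption of a typical log line carrying `shards.purpose` and `QTime` (the exact line format varies by Solr version and logging config):

```python
# Group logged QTime values by shards.purpose to compare the cost of
# the two distributed phases: 4 = "get IDs by query", 64 = "get fields
# by IDs". The regex assumes both tokens appear on the same log line.
import re
from collections import defaultdict
from statistics import median

LINE = re.compile(r"shards\.purpose=(\d+).*?QTime=(\d+)")

def qtime_by_purpose(log_lines):
    """Return the median QTime for each shards.purpose value seen."""
    buckets = defaultdict(list)
    for line in log_lines:
        match = LINE.search(line)
        if match:
            buckets[match.group(1)].append(int(match.group(2)))
    return {purpose: median(times) for purpose, times in buckets.items()}
```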

On 05.11.18 14:43, Toke Eskildsen wrote:
So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.


What we have to work with is the redacted query
q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
and an earlier mention that sorting was complex.

My suggestions were to try

1) Only request simple sorting by score

If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.

2) Reduce rows to 0
3) Increase rows to 100

This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?

4) Set fl=id only

This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.

- Toke Eskildsen, Royal Danish Library



--
Sofiia Strochyk
[hidden email]
InterLogic
www.interlogic.com.ua

Re: SolrCloud scaling/optimization for high request rate

Toke Eskildsen-2
On Tue, 2018-11-06 at 16:38 +0200, Sofiya Strochyk wrote:
> I have tested all of the suggested changes none of these seem to make
> a noticeable difference (usually response time and other metrics
> fluctuate over time, and the changes caused by different parameters
> are smaller than the fluctuations). What this probably means is that
> the heaviest task is retrieving IDs by query and not fields by ID.

Barring anything overlooked, I agree on the query thing.

Were I to sit at the machine, I would try removing part of the query
until performance were satisfactory. Hopefully that would unearth very
few problematic parts, such as regexp, function or prefix-wildcard
queries. There might be ways to replace or tune those.

- Toke Eskildsen, Royal Danish Library

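Toke's clause-removal approach can be mechanized: time the query with each clause dropped in turn and see which removal saves the most. A sketch where `time_query` is a hypothetical stand-in you would implement to run the query against the real Solr instance and return a latency:

```python
# Ablation over query clauses: measure the full query, then re-measure
# with each clause removed, and report the clause whose absence yields
# the biggest speedup (a likely regexp/function/wildcard suspect).

def slowest_clause(clauses, time_query):
    """Return (clause, ms_saved) for the clause whose removal helps most.

    clauses:    list of query/filter fragments
    time_query: callable taking a list of clauses, returning latency in ms
    """
    full_time = time_query(clauses)
    savings = {}
    for clause in clauses:
        reduced = [c for c in clauses if c is not clause]
        savings[clause] = full_time - time_query(reduced)
    return max(savings.items(), key=lambda item: item[1])
```

In practice each measurement should be repeated and averaged, since (as noted earlier in the thread) response times fluctuate more than single-run differences.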


Re: SolrCloud scaling/optimization for high request rate

Ere Maijala
In reply to this post by Sofiya Strochyk
Sofiya,

Do you have docValues enabled for the id field? Apparently that can make
a significant difference. I'm failing to find the relevant references
right now, but just something worth checking out.

Regards,
Ere

Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:

> Hi Toke,
>
> sorry for the late reply. The query i wrote here is edited to hide
> production details, but I can post additional info if this helps.
>
> I have tested all of the suggested changes none of these seem to make a
> noticeable difference (usually response time and other metrics fluctuate
> over time, and the changes caused by different parameters are smaller
> than the fluctuations). What this probably means is that the heaviest
> task is retrieving IDs by query and not fields by ID. I've also checked
> QTime logged for these types of operations, and it is much higher for
> "get IDs by query" than for "get fields by IDs list". What could be done
> about this?
>
> On 05.11.18 14:43, Toke Eskildsen wrote:
>> So far no answer from Sofiya. That's fair enough: My suggestions might
>> have seemed random. Let me try to qualify them a bit.
>>
>>
>> What we have to work with is the redacted query
>> q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
>> expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
>> and an earlier mention that sorting was complex.
>>
>> My suggestions were to try
>>
>> 1) Only request simple sorting by score
>>
>> If this improves performance substantially, we could try and see if
>> sorting could be made more efficient: Reducing complexity, pre-
>> calculating numbers etc.
>>
>> 2) Reduce rows to 0
>> 3) Increase rows to 100
>>
>> This measures one aspect of retrieval. If there is a big performance
>> difference between these two, we can further probe if the problem is
>> the number or size of fields - perhaps there is a ton of stored text,
>> perhaps there is a bunch of DocValued fields?
>>
>> 4) Set fl=id only
>>
>> This is a variant of 2+3 to do a quick check if it is the resolving of
>> specific field values that is the problem. If using fl=id speeds up
>> substantially, the next step would be to add fields gradually until
>> (hopefully) there is a sharp performance decrease.
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>
>
> --
> Sofiia Strochyk
>
>
> [hidden email] <mailto:[hidden email]>
> InterLogic
> www.interlogic.com.ua <https://www.interlogic.com.ua>
>
> Facebook icon <https://www.facebook.com/InterLogicOfficial> LinkedIn
> icon <https://www.linkedin.com/company/interlogic>
>

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk
In reply to this post by Toke Eskildsen-2

Thanks for your suggestions. I'll check if the filter queries or the main query tokenizers/filters might have anything to do with this, but I'm afraid query optimization can only get us so far. Since we will have to add facets later, the queries will only become heavier, and there has to be a way to scale this setup and deal with both higher load and more complex queries.


On 08.11.18 10:53, Toke Eskildsen wrote:
On Tue, 2018-11-06 at 16:38 +0200, Sofiya Strochyk wrote:
I have tested all of the suggested changes none of these seem to make
a noticeable difference (usually response time and other metrics
fluctuate over time, and the changes caused by different parameters
are smaller than the fluctuations). What this probably means is that
the heaviest task is retrieving IDs by query and not fields by ID. 
Barring anything overlooked, I agree on the query thing.

Were I to sit at the machine, I would try removing part of the query
until performance were satisfactory. Hopefully that would unearth very
few problematic parts, such as regexp, function or prefix-wildcard
queries. There might be ways to replace or tune those.

- Toke Eskildsen, Royal Danish Library



--
Sofiia Strochyk
[hidden email]
InterLogic
www.interlogic.com.ua

Re: **SPAM** Re: SolrCloud scaling/optimization for high request rate

Sofiya Strochyk
In reply to this post by Ere Maijala

Thanks for the suggestion, Ere. It looks like they are actually enabled: in the schema file the field is only marked as stored (<field name="_id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>), but the admin UI shows DocValues as enabled, so I guess this is the default. Is the solution to add "docValues=false" in the schema?

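For reference: if the schema version is 1.6 or later, docValues defaults to true for primitive field types such as string, which would explain the admin UI reading. Note that Ere's point was that having docValues *enabled* can help, so rather than adding docValues="false", making the current default explicit would look like the sketch below (an assumption to verify against your schema version; changing the value either way requires a full reindex):

```xml
<!-- Explicit docValues on the uniqueKey field instead of relying on
     the schemaVersion >= 1.6 default of docValues="true" for primitive
     types. Changing this value requires reindexing the collection. -->
<field name="_id" type="string" indexed="true" stored="true"
       multiValued="false" required="true" docValues="true"/>
```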

On 12.11.18 10:43, Ere Maijala wrote:
Sofiya,

Do you have docValues enabled for the id field? Apparently that can make a significant difference. I'm failing to find the relevant references right now, but just something worth checking out.

Regards,
Ere

Sofiya Strochyk kirjoitti 6.11.2018 klo 16.38:
Hi Toke,

sorry for the late reply. The query i wrote here is edited to hide production details, but I can post additional info if this helps.

I have tested all of the suggested changes none of these seem to make a noticeable difference (usually response time and other metrics fluctuate over time, and the changes caused by different parameters are smaller than the fluctuations). What this probably means is that the heaviest task is retrieving IDs by query and not fields by ID. I've also checked QTime logged for these types of operations, and it is much higher for "get IDs by query" than for "get fields by IDs list". What could be done about this?

On 05.11.18 14:43, Toke Eskildsen wrote:
So far no answer from Sofiya. That's fair enough: My suggestions might
have seemed random. Let me try to qualify them a bit.


What we have to work with is the redacted query
q=<q expression>&fl=<full list of fields>&start=0&sort=<sort
expression>&fq=<fq expression>&rows=24&version=2.2&wt=json
and an earlier mention that sorting was complex.

My suggestions were to try

1) Only request simple sorting by score

If this improves performance substantially, we could try and see if
sorting could be made more efficient: Reducing complexity, pre-
calculating numbers etc.

2) Reduce rows to 0
3) Increase rows to 100

This measures one aspect of retrieval. If there is a big performance
difference between these two, we can further probe if the problem is
the number or size of fields - perhaps there is a ton of stored text,
perhaps there is a bunch of DocValued fields?

4) Set fl=id only

This is a variant of 2+3 to do a quick check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.

- Toke Eskildsen, Royal Danish Library



--
Sofiia Strochyk

[hidden email]
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>

Facebook icon <https://www.facebook.com/InterLogicOfficial> LinkedIn icon <https://www.linkedin.com/company/interlogic>



--
Sofiia Strochyk
[hidden email]
InterLogic
www.interlogic.com.ua