SolrCloud performance


SolrCloud performance

Chuming Chen
Hi All,

I am running SolrCloud 7.4 with 4 shards and 4 nodes (JVM "-Xms20g -Xmx40g"); each shard has 32 million documents and is about 32 GB in size.

For a given query (I use the complexphrase query parser), the first request typically takes a couple of seconds to return the first 20 docs. However, fetching the following page, sorting by a field, or even re-running the same query takes a lot longer to return results. I can see my 4 Solr nodes running hot at more than 100% CPU.

My understanding is that Solr has a query cache, so running the same query again should be faster.

What could be wrong here? How do I debug this? I checked solr.log on all nodes and didn't see anything unusual. The most frequent log entries look like this.

INFO  - 2018-11-02 19:32:55.189; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/metrics params={wt=javabin&version=2&key=solr.core.patternmatch.shard3.replica_n8:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:QUERY./select.requests&key=solr.core.patternmatch.shard1.replica_n1:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:QUERY./select.requests} status=0 QTime=7
INFO  - 2018-11-02 19:32:55.192; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/metrics params={wt=javabin&version=2&key=solr.jvm:os.processCpuLoad&key=solr.node:CONTAINER.fs.coreRoot.usableSpace&key=solr.jvm:os.systemLoadAverage&key=solr.jvm:memory.heap.used} status=0 QTime=1

Thank you for your kind help.

Chuming




Re: SolrCloud performance

Shawn Heisey-2
On 11/2/2018 1:38 PM, Chuming Chen wrote:
> I am running SolrCloud 7.4 with 4 shards and 4 nodes (JVM "-Xms20g -Xmx40g"); each shard has 32 million documents and is about 32 GB in size.

A 40GB heap is probably completely unnecessary for an index of that
size.  Does each machine have one replica on it, or two?  If you are
trying for high availability, there will need to be at least two shard
replicas per machine.

The values of -Xms and -Xmx should normally be set the same.  Java will
always tend to allocate the entire max heap it has been allowed, so it's
usually better to just let it have the whole amount right up front.
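
In case it helps, a minimal sketch of how that might look in solr.in.sh
(the 20g figure below is only a placeholder; size the heap from real GC data):

    # sets both -Xms and -Xmx to the same value
    SOLR_HEAP="20g"
    # or, equivalently, spell the flags out yourself:
    #SOLR_JAVA_MEM="-Xms20g -Xmx20g"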

> For a given query (I use the complexphrase query parser), the first request typically takes a couple of seconds to return the first 20 docs. However, fetching the following page, sorting by a field, or even re-running the same query takes a lot longer to return results. I can see my 4 Solr nodes running hot at more than 100% CPU.

Can you obtain a screenshot of a process listing as described at the
following URL, and provide the image using a file sharing site?

https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

There are separate instructions there for Windows and for Linux/UNIX
operating systems.

Also useful are the GC logs that are written by Java when Solr is
started using the included scripts.  I'm looking for logfiles that cover
several days of runtime.  You'll need to share them with a file sharing
website -- files will not normally make it to the mailing list if
attached to a message.
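
In case it helps, a rough sketch of how to bundle those up, assuming the
stock bin/solr start scripts (which, as far as I recall, write rotating
solr_gc.log* files into the logs directory):

    # adjust the path to wherever your install writes its logs (SOLR_LOGS_DIR)
    cd /path/to/solr/server/logs
    tar czf gc-logs.tar.gz solr_gc.log*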

Getting a copy of the solrconfig.xml in use on your collection can also
be helpful.

> My understanding is that Solr has a query cache, so running the same query again should be faster.

If the query is absolutely identical in *every* way, then yes, it can be
satisfied from Solr's caches, if their size is sufficient.  If you change
ANYTHING, including things like rows or start, filters, sorting, facets,
and other parameters, then the query probably cannot be satisfied
completely from cache.  At that point, Solr is very reliant on how much
memory has NOT been allocated to programs: there must be enough left
over for the operating system to cache the Solr index data effectively.
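
A quick way to check this for yourself, if you want (just a sketch;
"patternmatch" is the collection name taken from your log, and the field
name and phrase are made up):

    # run the exact same request twice and compare QTime in the responses;
    # the second run can be answered from queryResultCache if the caches
    # are big enough
    curl -sG 'http://localhost:8983/solr/patternmatch/select' \
         --data-urlencode 'q={!complexphrase}text:"some phrase"' \
         -d rows=20 -d start=0
    # changing only start (paging) usually misses queryResultCache unless
    # queryResultWindowSize in solrconfig.xml is large enough to cover it
    curl -sG 'http://localhost:8983/solr/patternmatch/select' \
         --data-urlencode 'q={!complexphrase}text:"some phrase"' \
         -d rows=20 -d start=20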

> What could be wrong here? How do I debug this? I checked solr.log on all nodes and didn't see anything unusual. The most frequent log entries look like this.
>
> INFO  - 2018-11-02 19:32:55.189; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/metrics params={wt=javabin&version=2&key=solr.core.patternmatch.shard3.replica_n8:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:QUERY./select.requests&key=solr.core.patternmatch.shard1.replica_n1:INDEX.sizeInBytes&key=solr.core.patternmatch.shard1.replica_n1:UPDATE./update.requests&key=solr.core.patternmatch.shard3.replica_n8:QUERY./select.requests} status=0 QTime=7
> INFO  - 2018-11-02 19:32:55.192; [   ] org.apache.solr.servlet.HttpSolrCall; [admin] webapp=null path=/admin/metrics params={wt=javabin&version=2&key=solr.jvm:os.processCpuLoad&key=solr.node:CONTAINER.fs.coreRoot.usableSpace&key=solr.jvm:os.systemLoadAverage&key=solr.jvm:memory.heap.used} status=0 QTime=1

That is not a query.  It is a call to the Metrics API.  When I've made
this call on a production Solr machine, it seemed to be very
resource-intensive, taking a long time.  I don't think it should be made
frequently; probably no more than once a minute.  If you are seeing that
kind of entry in your logs a lot, then that might be contributing to
your performance issues.
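
One way to see how often those calls are arriving, in case it's useful
(a sketch based on the log format shown above):

    # count /admin/metrics requests per minute in solr.log
    grep 'path=/admin/metrics' solr.log | awk '{print $3, substr($4,1,5)}' | sort | uniq -c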

Thanks,
Shawn


Re: SolrCloud performance

Deepak Goel
In reply to this post by Chuming Chen
Please see inline for my thoughts



On Sat, Nov 3, 2018 at 1:08 AM Chuming Chen <[hidden email]> wrote:

> Hi All,
>
> I am running SolrCloud 7.4 with 4 shards and 4 nodes (JVM "-Xms20g
> -Xmx40g"); each shard has 32 million documents and is about 32 GB in size.
>
> For a given query (I use the complexphrase query parser), the first request
> typically takes a couple of seconds to return the first 20 docs. However,
> fetching the following page, sorting by a field, or even re-running the same
> query takes a lot longer to return results. I can see my 4 Solr nodes
> running hot at more than 100% CPU.
>
I think the first time around the query is returned by Lucene (the results
are already ordered thanks to the inverted index format). The second time
around the query is satisfied by Solr, which is what is taking longer.



Re: SolrCloud performance

Chuming Chen
In reply to this post by Shawn Heisey-2
Hi Shawn,

I have shared a tar ball with you ([hidden email]) via Google Drive. The tar ball includes the logs directories of all 4 nodes, solrconfig.xml, solr.in.sh, and a screenshot of the top command. The log files cover about one day, although I restarted the SolrCloud cluster several times during that period.

I want to make it clear: I don't have 4 physical machines. I have one 48-core server, and all 4 Solr nodes are running on that same physical machine. Each node hosts one shard with a single replica. I also have a ZooKeeper ensemble running on the same machine on 3 different ports.

I am curious to know what Solr is doing when the CPU usage is at 100% or more, because for some queries I suspect that even looping through all the documents without using any index would be faster.

If you have any problems accessing the tar ball, please let me know.

Thanks a lot!

Chuming




Re: SolrCloud performance

Shawn Heisey-2
On 11/4/2018 8:38 AM, Chuming Chen wrote:
> I have shared a tar ball with you ([hidden email]) via Google Drive. The tar ball includes the logs directories of all 4 nodes, solrconfig.xml, solr.in.sh, and a screenshot of the top command. The log files cover about one day, although I restarted the SolrCloud cluster several times during that period.

Runtime represented in the GC log for node1 is 23 minutes. Not anywhere
near a full day.

Runtime represented in the GC log for node2 is just under 16 minutes.

Runtime represented in the GC log for node3 is 434 milliseconds.

Runtime represented in the GC log for node4 is 501 milliseconds.

This is not enough to even make a guess, much less a reasoned
recommendation about the heap size you will actually need.  There must
be enough runtime that there have been significant garbage collections
so we can get a sense about how much memory the application actually needs.

> I want to make it clear: I don't have 4 physical machines. I have one 48-core server, and all 4 Solr nodes are running on that same physical machine. Each node hosts one shard with a single replica. I also have a ZooKeeper ensemble running on the same machine on 3 different ports.

Why?  You get absolutely no redundancy that way.  One Solr instance and
one ZK instance would be more efficient on a single server.  The
increase in efficiency probably wouldn't be significant, but it WOULD be
more efficient.  You really can't get a sense about how separate servers
will behave if all the software is running on a single server.
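
If you do go that route, something along these lines is all it takes
(a sketch; the heap size and ZK port are placeholders, not recommendations):

    # one standalone ZooKeeper, started from the ZooKeeper install
    bin/zkServer.sh start
    # one Solr node in cloud mode, pointed at it
    bin/solr start -c -z localhost:2181 -m 8g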

> I am curious to know what Solr is doing when the CPU usage is at 100% or more, because for some queries I suspect that even looping through all the documents without using any index would be faster.

I have no way to answer this question.  Solr will be doing whatever you
asked it to do.
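
If you want to see for yourself where the CPU time is going, lining up a
thread dump against per-thread CPU usage is one way to do it (a sketch; it
assumes jstack from the same JDK is available and <pid> is the busy Solr
process):

    top -H -p <pid>              # per-thread view; note the hottest thread IDs
    jstack <pid> > threads.txt   # stacks, with thread IDs shown in hex as nid=0x...
    printf '%x\n' <tid>          # convert a decimal thread ID from top to hex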

The screenshot of the top output shows that all four of the nodes there
are using about 3GB of memory each (RES minus SHR), which is consistent
with the very short runtimes noted in the GC logs.  The VIRT column
reveals that each node has about 100GB of index data, so about 400GB of
index data in total.  Not much can be determined when the runtime is
so small.

Thanks,
Shawn


Re: SolrCloud performance

Chuming Chen
Hi Shawn,

Thank you very much for your analysis. I currently don't have multiple machines to play with. I will try the "one Solr instance and one ZK instance on a single server" setup you suggested.

Thanks again,

Chuming


