Solr Memory Usage


Solr Memory Usage

Vijay Kokatnur
I am observing some weird behavior in how Solr is using memory. We are
running both Solr and ZooKeeper on the same node. We tested memory settings
on a SolrCloud setup of 1 shard with a 146GB index, and a 2-shard Solr setup
with a 44GB index. Both are running on similarly beefy machines.

 After running the setup for 3-4 days, I see that a lot of memory is
inactive in all the nodes -

 99052952  total memory
 98606256  used memory
 19143796  active memory
 75063504  inactive memory

And inactive memory is never reclaimed by the OS. When the total memory
size is reached, latency and disk IO shoot up. We observed this behavior in
both the SolrCloud setup with 1 shard and the Solr setup with 2 shards.

For the SolrCloud setup, we are running a cron job with the following
command to clear out the inactive memory. It is working as expected. Even
though the index size of the Cloud setup is 146GB, used memory always stays
below 55GB. Our response times are better and no errors/exceptions are
thrown. (This command causes issues in the 2-shard setup.)

echo 3 > /proc/sys/vm/drop_caches
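
(For what it's worth, a sketch of what such a cron entry could look like;
the hourly schedule below is just illustrative, not necessarily what we
actually run, and the job needs root.)

# hypothetical root crontab entry that drops the caches once an hour
0 * * * * echo 3 > /proc/sys/vm/drop_caches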

We have disabled the query, doc, and Solr caches in our setup. ZooKeeper is
using around 10GB of memory and we are not running any other processes on
this system.

Has anyone faced this issue before?

Re: Solr Memory Usage

Shawn Heisey-2
On 10/29/2014 11:43 AM, Vijay Kokatnur wrote:

> I am observing some weird behavior in how Solr is using memory. We are
> running both Solr and ZooKeeper on the same node. We tested memory settings
> on a SolrCloud setup of 1 shard with a 146GB index, and a 2-shard Solr setup
> with a 44GB index. Both are running on similarly beefy machines.
>
>  After running the setup for 3-4 days, I see that a lot of memory is
> inactive in all the nodes -
>
>  99052952  total memory
>  98606256  used memory
>  19143796  active memory
>  75063504  inactive memory
>
> And inactive memory is never reclaimed by the OS. When the total memory
> size is reached, latency and disk IO shoot up. We observed this behavior in
> both the SolrCloud setup with 1 shard and the Solr setup with 2 shards.

Where are these numbers coming from?  If they are coming from the
operating system and not Java, then you have nothing to worry about.

> For the SolrCloud setup, we are running a cron job with the following
> command to clear out the inactive memory. It is working as expected. Even
> though the index size of the Cloud setup is 146GB, used memory always stays
> below 55GB. Our response times are better and no errors/exceptions are
> thrown. (This command causes issues in the 2-shard setup.)
>
> echo 3 > /proc/sys/vm/drop_caches

Don't do that.  You're throwing away almost every performance advantage
the operating system has to offer.  If this changes the numbers so they
look better to you, then I can almost guarantee you that you are not
having any actual problem, and that dropping the caches like this is
*hurting* performance, not helping it.

It's completely normal for a correctly functioning system to report an
extremely low amount of memory as free.  The operating system is using
the spare memory in your system as a filesystem cache, which makes
everything run a lot faster.  If a program needs more memory, the
operating system will instantly give up some of its disk cache in order
to satisfy the memory allocation.

The "virtual memory" part of this blog post (which has direct relevance
for Solr) hopefully can explain it better than I can.  The entire blog
post is worth reading.

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
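
If you want to see this for yourself, look at free rather than raw "used
memory" numbers. A rough sketch (column names vary between procps versions):

# memory in megabytes; most of what looks "used" is really the page cache
free -m
# on newer procps, the "available" column is the number that matters;
# on older versions, look at the "-/+ buffers/cache" line instead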

Thanks,
Shawn


RE: Solr Memory Usage

Toke Eskildsen
In reply to this post by Vijay Kokatnur
Vijay Kokatnur [[hidden email]] wrote:
> For the SolrCloud setup, we are running a cron job with the following
> command to clear out the inactive memory. It is working as expected. Even
> though the index size of the Cloud setup is 146GB, used memory always stays
> below 55GB. Our response times are better and no errors/exceptions are
> thrown. (This command causes issues in the 2-shard setup.)

> echo 3 > /proc/sys/vm/drop_caches

As Shawn points out, this is under normal circumstances a very bad idea, but...

> Has anyone faced this issue before?

We did have some problems on a 256GB machine churning terabytes of data through 40 concurrent Tika processes and into Solr. After some days, performance got really bad. When we did a top, we noticed that most of the time was used in the kernel (the 'sy' on the '%Cpu(s):'-line). The drop_caches trick worked for us too. Our systems guys explained that it was because of virtual memory space fragmentation, so the OS had to spend a lot of resources just bookkeeping memory.

Try keeping an eye on the fraction of processing power spent in the kernel from when you clear the cache until performance gets bad again. If it rises drastically, you might have the same problem.
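
A simple way to watch that, assuming the stock procps tools are installed
(just a sketch):

# report CPU usage every 5 seconds; the 'sy' column is kernel time in percent
vmstat 5
# or pull the same line top shows, in batch mode
top -b -n 1 | grep 'Cpu(s)'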

- Toke Eskildsen

RE: Solr Memory Usage

wmartinusa
This command only touches OS-level caches that hold pages destined (or not)
for the swap cache. Its use means that disk will be hit on future requests,
but in many instances the pages were headed for eviction anyway.

It does not have anything whatsoever to do with Solr caches. It also is not
fragmentation-related; it is a result of the kernel managing virtual pages
in an "as designed" manner. The proper command is

sync; echo 3 > /proc/sys/vm/drop_caches

http://linux.die.net/man/5/proc
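
For reference, the values that man page documents for drop_caches (a sketch;
all of these must be run as root):

sync                                # write dirty pages out to disk first
echo 1 > /proc/sys/vm/drop_caches   # free the page cache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # free page cache, dentries and inodes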

I have encountered resistance to the use of this on long-running processes
for years ... from people who don't even research the matter.



RE: Solr Memory Usage

wmartinusa
In reply to this post by Toke Eskildsen
Oops. My wording was poor. My reference to those who don't research the
matter was aimed at the large number of engineers I have worked with, not at
this list.


Re: Solr Memory Usage

Shawn Heisey-2
In reply to this post by Toke Eskildsen
On 10/29/2014 1:05 PM, Toke Eskildsen wrote:
> We did have some problems on a 256GB machine churning terabytes of data through 40 concurrent Tika processes and into Solr. After some days, performance got really bad. When we did a top, we noticed that most of the time was used in the kernel (the 'sy' on the '%Cpu(s):'-line). The drop_caches trick worked for us too. Our systems guys explained that it was because of virtual memory space fragmentation, so the OS had to spend a lot of resources just bookkeeping memory.

There's always at least one exception to any general advice, including
whatever I come up with!  It's really too bad that it didn't Just Work
(tm) for you.  Weird things can happen when you start down the path of
extreme scaling, though.

Thank you for exploring the bleeding edge for us!

Shawn


Re: Solr Memory Usage

Toke Eskildsen
In reply to this post by wmartinusa
On Wed, 2014-10-29 at 23:37 +0100, Will Martin wrote:
> This command only touches OS-level caches that hold pages destined (or not)
> for the swap cache. Its use means that disk will be hit on future requests,
> but in many instances the pages were headed for eviction anyway.
>
> It does not have anything whatsoever to do with Solr caches.

If you re-read my post, you will see "the OS had to spend a lot of
resources just bookkeeping memory". OS, not JVM.

> It also is not fragmentation-related; it is a result of the kernel
> managing virtual pages in an "as designed" manner. The proper command
> is
>
> sync; echo 3 > /proc/sys/vm/drop_caches

I just talked with a Systems guy to verify what happened when we had
the problem:

- The machine spawned -Xmx1g JVMs with Tika, each instance processing a
  single 100M ARC file, sending the result to a shared Solr instance
  and shutting down. 40 instances were running at all times, each
  instance living for a little less than 3 minutes.
  Besides taking ~40GB of RAM in total, this also meant that about 10GB
  of RAM was released and re-requested from the system each minute.
  I don't know how the memory mapping in Solr works with regard to
  re-use of existing allocations, so I can't say whether Solr added to
  that number or not.

- The indexing speed deteriorated after some days, grinding down to
  (a loose guess) something like 1/4 of the initial speed.

- Running top showed that the majority of the time was spent in the kernel.

- Running "echo 3 >/proc/sys/vm/drop_caches" (I asked Systems explicitly
  about the integer and it was '3') brought the speed back to the
  initial level. The temporary patch was to run it once every hour.

- Running top with the patch showed that the vast majority of the time was
  spent in user space.

- Systems investigated and determined that "huge pages" were
  automatically requested by processes on the machine, leading to
  (virtual) memory fragmentation on the OS level. They used a tool in
  'sysfsutils' (just relaying what they said here) to change the default
  from huge pages to small pages (or whatever the default is named).

- The disabling of huge pages made the problem go away and we no longer
  use the drop_caches trick.
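
If you want to check the same thing on your own machine, this is one common
way to inspect and disable transparent huge pages (a sketch only; not
necessarily the sysfsutils-based method our Systems people used, and the
sysfs path can differ between distributions):

# the value in [brackets] is the active setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# disable THP until the next reboot (run as root)
echo never > /sys/kernel/mm/transparent_hugepage/enabled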

> http://linux.die.net/man/5/proc
>
> I have encountered resistance on the use of this on long-running processes
> for years ... from people who don't even research the matter.

The resistance is natural: although dropping the caches might work, as it
did for us, it is still treating the symptom. Until the cause has been
isolated and determined to be practically unresolvable, drop_caches is a
red flag.

Your undetermined core problem might not be the same as ours, but it is
simple to check: Watch kernel time percentage. If it rises over time,
try disabling huge pages.

- Toke Eskildsen, State and University Library, Denmark