Solr caching clarifications

Solr caching clarifications

Manuel Le Normand
Hello,
As a result of frequent Java OOM exceptions, I am trying to dig deeper into
Solr's JVM heap usage.
Please correct me if I am mistaken; this is my understanding of what uses
the heap (per replica on a Solr instance):
1. Buffers for indexing - bounded by ramBufferSize
2. Solr caches
3. Segment merge
4. Miscellaneous- buffers for Tlogs, servlet overhead etc.

Particularly I'm concerned by Solr caches and segment merges.
1. How much memory (bytes per doc) do the filterCache (BitDocSet) and the
queryResultCache (DocList) consume? I understand it is related to the gaps
between the matching doc ids (so it's not stored as a bitmap). But
basically, is every id stored as a Java int?
2. queryResultMaxDocsCached - does a value of, for example, 100 mean that any
query matching more than 100 docs will not be cached (at all) in the
queryResultCache? Or does it relate to the documentCache?
3. documentCache - the wiki says it should be greater than
max_results * concurrent_queries. Max results is just the number of rows
displayed (rows - start), right? Not queryResultWindowSize.
4. enableLazyFieldLoading=true - when querying for ids only (fl=id), will this
cache be used (at the expense of evicting docs that were already loaded
with their stored fields)?
5. How much heap do merges use? Assuming we have a merge of 10 segments of
500 MB each (half inverted files - *.pos, *.doc, etc.; half non-inverted
files - *.fdt, *.tvd), how much heap should be left free for this merge?

Thanks in advance,
Manu

Re: Solr caching clarifications

Erick Erickson
Inline

On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand
<[hidden email]> wrote:

> Hello,
> As a result of frequent Java OOM exceptions, I am trying to dig deeper into
> Solr's JVM heap usage.
> Please correct me if I am mistaken; this is my understanding of what uses
> the heap (per replica on a Solr instance):
> 1. Buffers for indexing - bounded by ramBufferSize
> 2. Solr caches
> 3. Segment merge
> 4. Miscellaneous- buffers for Tlogs, servlet overhead etc.
>
> Particularly I'm concerned by Solr caches and segment merges.
> 1. How much memory (bytes per doc) do the filterCache (BitDocSet) and the
> queryResultCache (DocList) consume? I understand it is related to the gaps
> between the matching doc ids (so it's not stored as a bitmap). But
> basically, is every id stored as a Java int?

Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you
can get the maxDoc number from your Solr admin page). Plus some overhead
for storing the fq text, but that's usually not much. This is for each
entry up to "Size".

queryResultCache is usually trivial unless you've configured it extravagantly.
It's the query string length + queryResultWindowSize integers per entry
(queryResultWindowSize is from solrconfig.xml).
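A back-of-the-envelope sketch of that arithmetic (the maxDoc and cache-size
values below are invented for illustration, not measured from any real index):

```java
// Rough filterCache sizing, assuming ~maxDoc/8 bytes per cached entry
// (one bit per document in the index), as described above.
public class FilterCacheEstimate {
    public static long filterCacheBytes(long maxDoc, int cacheSize) {
        long bytesPerEntry = maxDoc / 8;     // one bit per doc in the bitset
        return bytesPerEntry * cacheSize;    // worst case: every slot filled
    }

    public static void main(String[] args) {
        long maxDoc = 100_000_000L;  // hypothetical 100M-doc replica
        int cacheSize = 512;         // hypothetical filterCache size setting
        long bytes = filterCacheBytes(maxDoc, cacheSize);
        System.out.println(bytes / (1024 * 1024) + " MB");  // prints: 6103 MB
    }
}
```

The point of the exercise: on a big index the filterCache, fully populated,
can dwarf everything else on this list.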

> 2. queryResultMaxDocsCached - does a value of, for example, 100 mean that any
> query matching more than 100 docs will not be cached (at all) in the
> queryResultCache? Or does it relate to the documentCache?
It's just a limit on the queryResultCache entry size as far as I can tell.
But again, this cache is relatively small; I'd be surprised if it used
significant resources.

> 3. documentCache - the wiki says it should be greater than
> max_results * concurrent_queries. Max results is just the number of rows
> displayed (rows - start), right? Not queryResultWindowSize.

Yes. This is a cache (I think) for the _contents_ of the documents you'll
be returning, to be manipulated by various components during the life
of the query.

> 4. enableLazyFieldLoading=true - when querying for ids only (fl=id), will this
> cache be used (at the expense of evicting docs that were already loaded
> with their stored fields)?

Not sure, but I don't think this will contribute much to memory pressure.
This is about how many fields are loaded to get a single value from a doc
in the results list, and since one is usually working with 20 or so docs,
this is usually a small amount of memory.

> 5. How much heap do merges use? Assuming we have a merge of 10 segments of
> 500 MB each (half inverted files - *.pos, *.doc, etc.; half non-inverted
> files - *.fdt, *.tvd), how much heap should be left free for this merge?

Again, I don't think this is much of a memory consumer, although I confess
I don't know the internals. Merging is mostly about I/O.

>
> Thanks in advance,
> Manu

But take a look at the admin page, you can see how much memory various
caches are using by looking at the plugins/stats section.
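For reference, all of the knobs discussed in this thread live in the
<query> section of solrconfig.xml. A minimal sketch - the sizes and classes
here are placeholders for illustration, not recommendations:

```xml
<!-- Sketch of the cache settings discussed above; values are illustrative. -->
<query>
  <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>
  <documentCache    class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="0"/>

  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
</query>
```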

Best
Erick

Re: Solr caching clarifications

Manuel Le Normand
Alright, thanks Erick. Regarding the memory usage of merges, taken from
Mike McCandless' blog:

The big thing that stays in RAM is a logical int[] mapping old docIDs to
new docIDs, but in more recent versions of Lucene (4.x) we use a much more
efficient structure than a simple int[] ... see
https://issues.apache.org/jira/browse/LUCENE-2357

How much RAM is required is mostly a function of how many documents (lots
of tiny docs use more RAM than fewer huge docs).
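Taking the older, pre-LUCENE-2357 representation literally (a plain int[]
with one 4-byte slot per document), the worst case is easy to sketch; the
document count below is invented for illustration:

```java
// Worst-case estimate of the old-docID -> new-docID map kept in RAM during
// a merge: a plain int[] with one 4-byte slot per document. LUCENE-2357
// replaces this with a more compact packed structure, so real usage is lower.
public class MergeMapEstimate {
    public static long intArrayBytes(long numDocs) {
        return numDocs * 4L;  // 4 bytes per Java int
    }

    public static void main(String[] args) {
        long numDocs = 15_000_000L;  // hypothetical docs across the merging segments
        System.out.println(intArrayBytes(numDocs) / (1024 * 1024) + " MB");  // prints: 57 MB
    }
}
```

Which matches the observation below that this buffer is small compared to,
say, a handful of filterCache entries on the same index.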


A related clarification:
As my users are not aware of the fq option, I was wondering how to make the
best use of the filter cache. Would it be efficient to implicitly transform
their queries into filter queries on fields that are boolean searches (date
ranges, etc., that do not affect the score of a document)? Is this good
practice? Is there a query parser plugin that does this?




Re: Solr caching clarifications

Erick Erickson
Manuel:

First off, anything that Mike McCandless says about low-level
details should override anything I say. The memory savings
he's talking about there are actually something he tutored me
in once on a chat.

The savings there, as I understand it, aren't huge. For large
sets I think it's a 25% savings (if I calculated right). But consider
that even without those savings, 8 filter cache entries will be
more than the entire structure that JIRA talks about....

As to your fq question, absolutely! Any yes/no clause that,
as you say, does not contribute to the score is a candidate to be
moved to an fq clause. There are a couple of things to
be aware of, though.
1> be a little careful of using NOW. If you don't use it correctly,
     fq clauses will not be re-used. See:
     http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
2> You usually do this through the UI, not by having the users enter
     it in the query. For instance, if you have a date-range picker, your app
     constructs the fq clause from it. Or you append fq clauses to the
     links you create when you display facets, and so on.
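To sketch 2> concretely - the collection name, field names, and date range
below are invented for illustration - the application layer keeps the scoring
part of the user's input in q and moves the non-scoring constraint into fq:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: the UI builds the request itself, putting the date-picker's
// non-scoring range constraint into fq so its bitset can be cached in the
// filterCache and reused across queries, instead of letting users type the
// range into q where it would be scored.
public class FqRequestBuilder {
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static String buildSelectUrl(String userQuery, String rangeFq) {
        return "/solr/collection1/select?q=" + enc(userQuery)
             + "&fq=" + enc(rangeFq);
    }

    public static void main(String[] args) {
        // NOW is rounded (NOW/DAY) so the fq string repeats verbatim and the
        // cache entry is actually reused, per the searchhub link above.
        String url = buildSelectUrl("title:solr",
                "last_modified:[NOW/DAY-1YEAR TO NOW/DAY]");
        System.out.println(url);
    }
}
```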

No, there's no automatic tool for this, and there's not likely to be one,
since there's no way to infer the intent. Say you put in a clause like
q=a AND b
That scores things. It would give the same result set as
q=*:*&fq=a&fq=b
which would compute no scores. How could a tool infer when this
was or wasn't OK?

Best
Erick

On Sun, Jul 14, 2013 at 6:10 PM, Manuel Le Normand
<[hidden email]> wrote:


Re: Solr caching clarifications

Manuel Le Normand
Great explanation and article.

Yes, this buffer for merges seems very small, and still well optimized. That's
impressive.