Costs/benefits of DocValues

Demian Katz
Hello,

I have a legacy Solr schema that I would like to update to take advantage of DocValues. I understand that by adding "docValues=true" to some of my fields, I can improve sorting/faceting performance. However, I have a couple of questions:


1.)    Will Solr always take proper advantage of docValues when it is turned on, or will I gain greater performance by turning off stored/indexed in situations where only docValues are necessary (e.g. a sort-only field)?

2.)    Will adding docValues to a field introduce significant performance penalties for non-docValues uses of that field, beyond the obvious fact that the additional data will consume more disk and memory?

I'm asking this question because the existing schema has some multi-purpose fields, and I'm trying to determine whether I should just add "docValues=true" wherever it might help, or if I need to take a more thoughtful approach and potentially split some fields with copyFields, etc. This is particularly significant because my schema makes use of some dynamic field suffixes, and I'm not sure if I need to add new suffixes to differentiate docValues/non-docValues fields, or if it's okay to turn on docValues across the board "just in case."

Apologies if these questions have already been answered - I couldn't find a totally clear answer in the places I searched.

Thanks!

- Demian

Re: Costs/benefits of DocValues

Yonik Seeley
On Mon, Nov 9, 2015 at 10:55 AM, Demian Katz <[hidden email]> wrote:
> I understand that by adding "docValues=true" to some of my fields, I can improve sorting/faceting performance.

I don't think this is true in the general sense.
docValues are built at index-time, so what you will save is initial
un-inversion time (i.e. the first time a field is used after a new
searcher is opened).
After that point, docValues may be slightly slower.

The other advantage of docValues is memory use... much/most of it is
essentially "off-heap", being memory-mapped from disk.  This cuts down
on memory issues and helps reduce longer GC pauses.

docValues are good in general, and I think we should default to them
more for Solr 6, but they are not better in all ways.

> However, I have a couple of questions:
>
>
> 1.)    Will Solr always take proper advantage of docValues when it is turned on

Yes.

> , or will I gain greater performance by turning off stored/indexed in situations where only docValues are necessary (e.g. a sort-only field)?
>
> 2.)    Will adding docValues to a field introduce significant performance penalties for non-docValues uses of that field, beyond the obvious fact that the additional data will consume more disk and memory?

No, it's a separate part of the index.

-Yonik



Re: Costs/benefits of DocValues

Alexandre Rafalovitch
I thought docValues were per segment, so the price of un-inversion was
effectively paid on each commit for all the segments, as opposed to
just the updated one.

I admit I also find the story around docValues to be very confusing at
the moment. Especially on the interplay with "indexed=false". It would
make a VERY good article to have this clarified somehow by people in
the know.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



Re: Costs/benefits of DocValues

Yonik Seeley
On Mon, Nov 9, 2015 at 11:19 AM, Alexandre Rafalovitch
<[hidden email]> wrote:
> I thought docValues were per segment, so the price of un-inversion was
> effectively paid on each commit for all the segments, as opposed to
> just the updated one.

Both the field cache (i.e. uninverting indexed values) and docValues
are mostly per-segment (I say mostly because some uses still require
building a global ord map).

But even when things are mostly per-segment, you hit major segment
merges and the cost of un-inversion (when you aren't using docValues)
is non-trivial.
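
To make the "mostly per-segment" point concrete, here is a minimal Python sketch of a toy model, not Lucene's actual data structures. It assumes a single-valued string field, and the helper names (build_segment, build_global_ord_map, facet_counts) are hypothetical. Each segment numbers its own terms, so counting across segments needs a map from segment-local ordinals to global ones.

```python
# Toy model of per-segment ordinals and a global ordinal map.
# Illustrative only: these structures are not Lucene's implementation.

def build_segment(values):
    """Return (sorted unique terms, per-doc ordinals) for one segment."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return terms, [ord_of[v] for v in values]


def build_global_ord_map(segments):
    """Map each segment-local ordinal to an ordinal in the merged term list."""
    global_terms = sorted({t for terms, _ in segments for t in terms})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    mapping = [[global_ord[t] for t in terms] for terms, _ in segments]
    return global_terms, mapping


def facet_counts(segments):
    """Count documents per term across all segments using the global map."""
    global_terms, mapping = build_global_ord_map(segments)
    counts = [0] * len(global_terms)
    for seg_index, (_, doc_ords) in enumerate(segments):
        for local_ord in doc_ords:
            counts[mapping[seg_index][local_ord]] += 1
    return dict(zip(global_terms, counts))


segments = [
    build_segment(["book", "dvd", "book"]),  # segment 0
    build_segment(["cd", "book"]),           # segment 1
]
print(facet_counts(segments))
```

In this model the global map has to be rebuilt whenever the set of segments changes, which is one way to picture why some uses are not purely per-segment.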

> I admit I also find the story around docValues to be very confusing at
> the moment. Especially on the interplay with "indexed=false".

You still need "indexed=true" for efficient filters on the field.
Hence if you're faceting on a field and want to use docValues, you
probably want to keep the "indexed=true" on the field as well.
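
As an illustration of that combination, the following Python sketch builds a Schema API "add-field" command for a facet field that keeps indexed=true and adds docValues=true. The field name "format_facet" and the "string" field type are hypothetical, and the sketch assumes a managed schema. It only builds and prints the JSON; posting it to /solr/<collection>/schema is left out.

```python
import json


def add_field_command(name, field_type, indexed, stored, doc_values):
    """Build a Schema API "add-field" command for a single field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": stored,
            "docValues": doc_values,
        }
    }


# Hypothetical facet field: indexed=true keeps filter queries efficient,
# docValues=true provides the column used for faceting.
facet_field = add_field_command(
    name="format_facet",
    field_type="string",
    indexed=True,
    stored=False,
    doc_values=True,
)
payload = json.dumps(facet_field, indent=2)
print(payload)
```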

-Yonik



Re: Costs/benefits of DocValues

Alexandre Rafalovitch
Thank you Yonik.

So I would probably advise then to "keep your indexed=true" and think
about _adding_ docValues when there is memory pressure or when there
is a clear performance issue for the ...specific... uses.

But if we are keeping the indexed=true, then docValues=true will STILL
use at least as much memory however efficient docValues are
themselves, right? Or will something that is normally loaded and uses
memory stay unloaded in this combination scenario?

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



Re: Costs/benefits of DocValues

Erick Erickson
bq: But if we are keeping the indexed=true, then docValues=true will STILL
use at least as much memory however efficient docValues are
themselves, right?

AFAIK, kinda. The big difference is that with docValues="false", you're
building these structures in the JVM whereas with docValues="true",
the structures are at least partially in the OS memory thus relieving
the pressure on Java's heap, GC and the rest.
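
A rough analogy in Python, assuming a made-up fixed-width file layout (one 64-bit integer per document) that is not Lucene's real docValues format: the column is written once, then memory-mapped and read in place, so the bytes live in the OS page cache rather than in a structure the program allocates itself.

```python
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<q")  # one little-endian 64-bit integer per document


def write_column(path, values):
    """Write one fixed-width value per docid, as would happen at index time."""
    with open(path, "wb") as f:
        for v in values:
            f.write(VALUE.pack(v))


def read_value(column, docid):
    """Look up the value for a docid directly from the mapped bytes."""
    return VALUE.unpack_from(column, docid * VALUE.size)[0]


# Hypothetical column file for a numeric field with three documents.
path = os.path.join(tempfile.mkdtemp(), "price.col")
write_column(path, [300, 100, 200])

with open(path, "rb") as f:
    column = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Sort docids by value without first copying the column into a list.
        sorted_docids = sorted(range(3), key=lambda d: read_value(column, d))
    finally:
        column.close()

print(sorted_docids)
```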


Re: Costs/benefits of DocValues

Yonik Seeley
In reply to this post by Alexandre Rafalovitch
On Mon, Nov 9, 2015 at 12:06 PM, Alexandre Rafalovitch
<[hidden email]> wrote:

> Thank you Yonik.
>
> So I would probably advise then to "keep your indexed=true" and think
> about _adding_ docValues when there is memory pressure or when there
> is a clear performance issue for the ...specific... uses.
>
> But if we are keeping the indexed=true, then docValues=true will STILL
> use at least as much memory however efficient docValues are
> themselves, right? Or will something that is normally loaded and uses
> memory stay unloaded in this combination scenario?

Think about it this way: for something like sorting, we need a column
for fast docid->value lookup.
Enabling docValues means building this column at index time.  At
search time, it gets memory mapped, just like most other parts of the
index.  The required memory is off-heap... the OS needs to keep the
file in its buffer cache for good performance.
If docValues aren't enabled, this means that we need to build the
column on-the-fly on-heap (i.e. FieldCache entry is built from
un-inverting the indexed values).

An indexed field by itself only takes up disk space, just like
docValues.  Of course for searches to be fast, off-heap RAM (in the
form of OS buffer cache / disk cache) is still needed.
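
The un-inversion step described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: a single-valued field, an inverted index held as a plain dict of term -> docids, and a hypothetical uninvert helper. It shows the work done on the fly when docValues are not enabled, and how the resulting docid -> value column serves a sort.

```python
def uninvert(inverted_index, num_docs):
    """Build a docid -> value column from a term -> docids inverted index."""
    column = [None] * num_docs
    for term, docids in inverted_index.items():
        for docid in docids:
            column[docid] = term  # single-valued field assumed
    return column


# Toy inverted index: each term lists the documents that contain it.
inverted_index = {
    "apple": [2],
    "kiwi": [0, 3],
    "pear": [1],
}

column = uninvert(inverted_index, num_docs=4)

# Sorting needs fast docid -> value lookups, which the column provides.
by_value = sorted(range(4), key=lambda docid: column[docid])
print(column, by_value)
```

With docValues enabled, the equivalent of this column is written at index time, so this loop never runs at search time.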

-Yonik

Re: Costs/benefits of DocValues

Mikhail Khludnev
In reply to this post by Demian Katz
On Mon, Nov 9, 2015 at 6:55 PM, Demian Katz <[hidden email]>
wrote:

> I have a legacy Solr schema that I would like to update to take advantage
> of DocValues. I understand that by adding "docValues=true" to some of my
> fields, I can improve sorting/faceting performance.


Demian,
If an index has many segments (say, more than 5 or 10), docValues
faceting performance is prohibitive with the old facet.field=... parameter.
You either need to wait for Solr 5.4 (see
https://issues.apache.org/jira/browse/SOLR-7730) or switch to JSON Facets.
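
For the JSON Facets route, a request body might look like the following Python sketch. The field name "format_facet" and the facet label are hypothetical. The sketch only builds the JSON, which would be sent to the collection's /query endpoint.

```python
import json


def terms_facet_request(field, limit=10):
    """Build a JSON request body containing a single terms facet."""
    return {
        "query": "*:*",
        "limit": 0,  # return facet counts only, no documents
        "facet": {
            "by_" + field: {"type": "terms", "field": field, "limit": limit},
        },
    }


# Hypothetical field name; any faceted field would take its place.
body = terms_facet_request("format_facet", limit=5)
print(json.dumps(body, indent=2))
```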


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<[hidden email]>