Cost of enabling doc values


Cost of enabling doc values

root23
Hi all,
Does anyone know how much the index size typically grows when we enable
docValues on a field?
Our business side wants to enable sorting on most of our fields. I am
trying to push back, saying that it will increase the index size, since
enabling docValues will create the uninverted index.

I know the size probably depends on what values are in the fields, but I
need a general idea so that I can convince them that enabling it on these
fields is costly, and show roughly how much it will cost.

If anyone knows how to find this out by looking at an existing Solr index
that has docValues enabled, that would also be a great help.

Thanks!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Cost of enabling doc values

Erick Erickson
I pretty much agree with your business side.

The rough size of a docValues field is one value of its type for each
doc. So say you have an int field. Size is near maxDoc * 4 bytes. This
is not totally accurate, there is some int packing done for instance,
but it'll do. If you really want an accurate count, look at the
before/after size of the *.dvd and *.dvm segment files in your index.
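
A minimal sketch of that measurement, assuming the index lives in an ordinary directory on local disk. The function name doc_values_bytes and the path in the usage note are made up for illustration. Note the assumption that segments are not stored as compound files: if they are, the docValues data is packed inside *.cfs files and this walk will not see it.

```python
import os
from collections import defaultdict

# Lucene writes docValues data to *.dvd and docValues metadata to *.dvm.
DOC_VALUES_EXTENSIONS = (".dvd", ".dvm")


def doc_values_bytes(index_dir):
    """Sum the on-disk size of docValues files under index_dir, per extension.

    Assumes non-compound segments; data packed into *.cfs is not counted.
    """
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1]
            if ext in DOC_VALUES_EXTENSIONS:
                totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)
```

Run it once before and once after enabling docValues and reindexing, for example doc_values_bytes("/path/to/core/data/index") (a hypothetical path), and compare the totals.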

However, it's "pay me now or pay me later". The critical operations
are faceting, grouping and sorting. If you do any of those operations
on a field that is _not_ docValues=true, it will be uninverted on the
_java heap_, where it will consume GC cycles, put pressure on all your
other operations, etc. This process will be done _every_ time you open
a new searcher and use these fields.

If the field _does_ have docValues=true, that will be held in the OS's
memory space, _not_ the JVM's heap due to using MMapDirectory (see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
Among other virtues, it can be swapped out (although you don't want it
to be, it's still better than OOMing). Plus loading it is just reading
it off disk rather than the expensive uninversion process.

And if you don't do any of those operations (grouping, sorting and
faceting), then the bits just sit there on disk doing nothing.

So say you carefully define what fields will be used for any of the
three operations and enable docValues. Then 3 months later the
business side comes back with "oh, we need to facet on another field".
Your choices are:
1> live with the increased heap usage and other resource contention.
Perhaps along the way panicking because your processes OOM and prod
goes down.
or
2> reindex from scratch, starting with a totally new collection.

And note the fragility here. Your application can be humming along
just fine for months. Then one fine day someone innocently submits a
query that sorts on a new field that has docValues=false and B-OOM.

If (and only if) you can _guarantee_ that fieldX will never be used
for any of the three operations, then turning off docValues for that
field will save you some disk space. But that's the only advantage.
Well, alright. If you have to do a full index replication that'll
happen a bit faster too.

So I prefer to err on the side of caution. I recommend making fields
docValues=true unless I can absolutely guarantee (and business _also_
agrees)
1> that fieldX will never be used for sorting, grouping or faceting,
or
2> if they can't promise that, they guarantee to give me time to
completely reindex.

Best,
Erick


On Wed, Jun 13, 2018 at 4:30 PM, root23 <[hidden email]> wrote:

> Hi all,
> Does anyone know how much typically index size increments when we enable doc
> value on a field.

Re: Cost of enabling doc values

Jan Høydahl / Cominvent
Depending on what your documents look like, it could be that enabling docValues would allow you to save space by switching to stored="false" since Solr can fetch the stored value from docValues. I say it depends on your documents and use case since sometimes it may be slower to access a docValue just to read one field if all the other fields come from stored values. If you do not do matches/lookups/range-queries on some fields you may even be able to set indexed="false" and save space in the inverted index.

A benefit of having docValues enabled is that it then lets you do atomic updates to your docs, to re-index from an existing index (not from source) and to use streaming expressions on all fields.
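
As a rough illustration of the stored="false" plus docValues combination described above, the sketch below builds a request body for Solr's Schema API "replace-field" command. The field name "price" and type "pint" are hypothetical, the helper name is made up, and whether useDocValuesAsStored is needed explicitly depends on your schema version. The body would be POSTed to the collection's /schema endpoint, and existing documents still have to be reindexed before the change takes effect for them.

```python
import json


def replace_field_command(name, field_type, indexed=True):
    """Build a Schema API 'replace-field' body for a docValues-only field.

    The field keeps docValues but drops the separate stored copy; the value
    is then returned from docValues when useDocValuesAsStored is true.
    """
    return {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,
            "stored": False,
            "docValues": True,
            "useDocValuesAsStored": True,
        }
    }


if __name__ == "__main__":
    # "price" and "pint" are placeholders, not taken from the thread.
    print(json.dumps(replace_field_command("price", "pint"), indent=2))
```

Passing indexed=False corresponds to the case where the field is never matched, looked up or range-queried, which also saves space in the inverted index.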

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Jun 2018, at 04:13, Erick Erickson <[hidden email]> wrote:
>
> I pretty much agree with your business side.

Re: Cost of enabling doc values

root23
In reply to this post by Erick Erickson
Thanks for the detailed explanation, Erick.
I did a little math as you suggested and just wanted to see if I am doing
it right.
We have around 4 billion docs in production and around 70 nodes.

To support the business use case, we have around 18 fields on which we have
to enable docValues for sorting.

FieldType       Number of fields   Size per value
TrieIntField    2                  4 bytes
StrField        7                  20 bytes
IntField        1                  4 bytes
Bool            1                  1 byte
TrieDateField   2                  10 bytes
TextField       5                  10 bytes


For some of them I approximated the bytes, e.g. for StrField and TextField,
based on the number of characters we usually have in those fields. I am not
sure how much a TrieDateField will take. Please feel free to correct me if
I am way off.

So according to the above, the total size per doc is
2*4 + 7*20 + 1*4 + 1*1 + 2*10 + 5*10 = 223 bytes.

So for 4 billion docs it comes to approximately 892,000,000,000 bytes, or
892 GB.

Does that math sound right, or am I way off?
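
The estimate above can be reproduced with a short script. This is only a sketch of the arithmetic in this post: the per-value byte sizes are the poster's own approximations, not measured values, and the per-node figure assumes the data is spread evenly across the 70 nodes, which the thread does not state.

```python
# (field type, number of fields, assumed bytes per value), as listed in the post.
FIELDS = [
    ("TrieIntField", 2, 4),
    ("StrField", 7, 20),
    ("IntField", 1, 4),
    ("Bool", 1, 1),
    ("TrieDateField", 2, 10),
    ("TextField", 5, 10),
]
NUM_DOCS = 4_000_000_000
NUM_NODES = 70


def bytes_per_doc(fields):
    """Sum count * size over all field types: a naive, uncompressed estimate."""
    return sum(count * size for _name, count, size in fields)


def total_bytes(fields, num_docs):
    """Naive total docValues size for num_docs documents."""
    return bytes_per_doc(fields) * num_docs


if __name__ == "__main__":
    total = total_bytes(FIELDS, NUM_DOCS)
    print("bytes per doc:", bytes_per_doc(FIELDS))
    print("total GB:", total / 1e9)
    # Assumes an even spread over the nodes, ignoring replicas.
    print("GB per node:", total / NUM_NODES / 1e9)
```

Because it ignores any packing or compression, the result is best read as a ceiling rather than a prediction.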




Re: Cost of enabling doc values

Erick Erickson
My claim is it simply doesn't matter. You either have those bytes lying
around on disk and using OS memory in the DV case, or sitting in the
cumulative Java heap in the non-DV case.

If you're doing one of the three operations, I know of no situation
where I would _not_ enable docValues.

The Lucene people put a lot of effort into making things compact, so
what you're coming up with is probably an upper bound. Frankly I'd just
enable the DV fields, index a bunch of docs and look at the cumulative
sizes of your dvd and dvm files.

I'd probably index, say, 10M docs and measure the two extensions, then
index 10M more and use the delta between 10M and 20M to extrapolate.
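
A sketch of that extrapolation, assuming growth stays roughly linear beyond the second measurement (segment merges and changing value distributions can break that assumption). The function and parameter names are made up for illustration; the sizes would come from summing the dvd and dvm files at each checkpoint.

```python
def extrapolate_doc_values_bytes(size_first, docs_first,
                                 size_second, docs_second, target_docs):
    """Project docValues size at target_docs from two measured checkpoints.

    Uses the delta between the checkpoints as the marginal cost per doc,
    so fixed overhead in the first batch does not inflate the estimate.
    """
    if docs_second <= docs_first:
        raise ValueError("second checkpoint must contain more docs than the first")
    bytes_per_extra_doc = (size_second - size_first) / (docs_second - docs_first)
    return size_second + bytes_per_extra_doc * (target_docs - docs_second)


if __name__ == "__main__":
    # Made-up measurements: 1.2 GB at 10M docs, 2.3 GB at 20M docs.
    projected = extrapolate_doc_values_bytes(
        1_200_000_000, 10_000_000,
        2_300_000_000, 20_000_000,
        4_000_000_000,
    )
    print("projected GB at 4 billion docs:", projected / 1e9)
```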

I also use the size of those files to get something of a sense of how
much OS memory I need for those operations (searching not included
yet). Gives me a sense of whether what I want to do is possible or
not.

Long blog on the topic of sizing, but it sums up as "try it and see":

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Thu, Jun 14, 2018 at 8:34 AM, root23 <[hidden email]> wrote:

> Thanks for the detailed explanation erick.