when to use docvalue

when to use docvalue

matthew sporleder
I have quite a few numeric / metadata-type fields in my schema and
pretty much only use them in fq=, sort=, and friends.  Should I always
use docValues on these if I never plan to search on them with q=?  Are there
any drawbacks?

Thanks,
Matt

Re: when to use docvalue

Erick Erickson
Yes. You should also index them.

Here’s the way I think of it.

Questions of the form “For term X, which docs contain that value?” mean indexed=true. This is a search.

Questions of the form “Does doc X have value Y in field Z?” mean docValues=true.

What’s the difference? Well, the first one is to get the result set. The second is for, given a result set,
counting/sorting/whatever.

fq clauses are searches, so indexed=true.

Sorting, faceting, grouping and function queries are “for each doc in the result set, what values does field Y contain?”, so docValues=true.

Maybe that made things clear as mud, but it’s the way I think of it ;)
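
For anyone who prefers code to prose, here is a minimal toy sketch of the two questions in Python. It is only a mental model, not how Solr or Lucene store anything internally; the `price` field, the doc ids and the helper names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy documents: doc id -> value of a numeric "price" field.
docs = {0: 30, 1: 10, 2: 30, 3: 20}

# indexed=true, roughly: an inverted structure mapping value -> doc ids.
inverted = defaultdict(set)
for doc_id, value in docs.items():
    inverted[value].add(doc_id)

# docValues=true, roughly: a column mapping doc id -> value.
doc_values = dict(docs)


def filter_docs(value):
    """Search-style question: which docs contain this value? (like fq=price:30)"""
    return inverted.get(value, set())


def sort_docs(result_set):
    """Per-doc question: for each doc in the result set, what is its value?
    (like sort=price asc; ties are broken by doc id)"""
    return sorted(result_set, key=lambda doc_id: (doc_values[doc_id], doc_id))


matches = filter_docs(30)       # uses the inverted structure
ordered = sort_docs(docs.keys())  # uses the per-document column
```

The filter never touches the column and the sort never touches the inverted structure, which is why a field used for both fq= and sort= wants both.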

Best,
Erick



> On May 19, 2020, at 4:00 PM, matthew sporleder <[hidden email]> wrote:

Re: when to use docvalue

matthew sporleder
You can index AND use docValues?  For some reason I thought they were mutually exclusive.

On Tue, May 19, 2020 at 5:36 PM Erick Erickson <[hidden email]> wrote:


Re: when to use docvalue

Erick Erickson
They are _absolutely_ able to be used together. Background:

“In the bad old days”, there was no docValues. So whenever you needed
to facet/sort/group/use function queries, Solr (well, Lucene) had to take
the inverted structure resulting from “indexed=true” and “uninvert” it on the
Java heap.

docValues essentially does the “uninverting” at index time and puts
that structure in a separate file for each segment. So rather than uninvert
the index on the heap, Lucene can just read it in from disk in MMapDirectory
(i.e. OS) memory space.
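
As a rough illustration of what “uninverting” means, here is a toy Python sketch. It assumes a single-valued field and uses a plain dict and list; the real Lucene structures and file formats are far more involved, and the `uninvert` name is hypothetical.

```python
def uninvert(postings, num_docs):
    """Turn an inverted structure (term -> doc ids) into a per-document
    column (index = doc id, entry = term). Assumes one value per doc."""
    column = [None] * num_docs
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


# Toy inverted structure for a hypothetical "color" field over 4 docs.
postings = {"blue": [1, 3], "green": [2], "red": [0]}

# Without docValues this work happens at query time, on the heap.
# With docValues the equivalent column is written once, at index time.
column = uninvert(postings, num_docs=4)
```

Every posting has to be visited to fill the column, which is the cost that docValues moves from query time to index time.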

The downside is that your index will be bigger when you do both; that is, the
size on disk will be bigger. But it’ll be much faster to load, much faster to
autowarm, and will move the structures necessary to do faceting/sorting/etc.
into OS memory, where the garbage collection is vastly more efficient than
Java’s.

And frankly I don’t think the increased size on disk is a downside. You’ll have
to have the memory anyway, and having it used on the OS memory space is
so much more efficient than on Java’s heap that it’s a win-win IMO.

Oh, and if you never sort/facet/group/use function queries, then the
docValues structures are never even read into MMapDirectory space.

So yes, freely do both.
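
A sketch of what “doing both” could look like as a Schema API `add-field` command, built in Python. The field name `popularity` is hypothetical, and the `pint` type assumes a schema that defines it (as the default configset does). Nothing is sent anywhere here; the JSON would be POSTed to the collection's schema endpoint.

```python
import json


def add_field_command(name, field_type="pint"):
    """Build a Schema API add-field command for a field that is both
    searchable (indexed) and sortable/facetable (docValues)."""
    return {
        "add-field": {
            "name": name,            # hypothetical field name
            "type": field_type,      # assumes this type exists in the schema
            "indexed": True,         # for fq= and other searches
            "docValues": True,       # for sort, facet, group, function queries
        }
    }


payload = json.dumps(add_field_command("popularity"))
```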

Best,
Erick


> On May 19, 2020, at 5:41 PM, matthew sporleder <[hidden email]> wrote:

Re: when to use docvalue

Revas-2
Erick, can you also explain how to optimize facet queries and range facets, as
they don’t use docValues and contribute to higher response times?

On Tue, May 19, 2020 at 5:55 PM Erick Erickson <[hidden email]>
wrote:


Re: when to use docvalue

Rahul Goswami
In reply to this post by Erick Erickson
Erick,
Thanks for that explanation. I have a follow-up question on that. I find
the scenario of stored=true and docValues=true to be tricky at times...
I would like to know when each of these scenarios is preferred over the other
two for primitive datatypes:

1) stored=true and docValues=false
2) stored=false and docValues=true
3) stored=true and docValues=true

Thanks,
Rahul

On Tue, May 19, 2020 at 5:55 PM Erick Erickson <[hidden email]>
wrote:


Re: when to use docvalue

Erick Erickson
Revas:

Facet queries are just queries that are constrained by the total result set of your
primary query, so the answer to that would be the same as speeding up regular
queries. As far as range facets are concerned, I believe they _do_ use docValues,
after all they have to answer the exact same question: For doc X in the result set,
what is the value of field Y? The only difference is it has to bucket a bunch of them.
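
To make the “bucket a bunch of them” part concrete, here is a toy Python sketch of range faceting over a per-document column. It is a simplified model, not Solr's implementation; the `start`, `end` and `gap` arguments only mirror the `facet.range.start`, `facet.range.end` and `facet.range.gap` request parameters, and the data is made up.

```python
from collections import Counter


def range_facet(result_set, doc_values, start, end, gap):
    """For each doc in the result set, look up its value and count it
    in the bucket [bucket_start, bucket_start + gap)."""
    counts = Counter()
    for doc_id in result_set:
        value = doc_values[doc_id]
        if start <= value < end:
            bucket_start = start + ((value - start) // gap) * gap
            counts[bucket_start] += 1
    return dict(counts)


# Hypothetical per-document column: doc id -> numeric value.
doc_values = {0: 5, 1: 12, 2: 18, 3: 25, 4: 40}

# Doc 4 is not in the result set, so it is never looked up.
counts = range_facet([0, 1, 2, 3], doc_values, start=0, end=30, gap=10)
```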

Rahul: Please don’t hijack threads, it makes it difficult to find things later. Start
a separate e-mail thread.

The answer to your question is, of course, “it depends” on a number of things and
changes with the query. First of all, multivalued fields don’t qualify because
docValues are a sorted set, meaning the return is sorted and deduplicated. So if
the input has five values in it, b c d c d, what you’d get back from DV is b c d.
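
The sorted-set behaviour can be mimicked in a couple of lines of Python. This only models the effect described here (order and duplicates of the original input are lost); the helper name is made up.

```python
def as_sorted_set(values):
    """Model of reading a multivalued field back from docValues:
    duplicates are dropped and the values come back sorted."""
    return sorted(set(values))


indexed_input = ["b", "c", "d", "c", "d"]
read_back = as_sorted_set(indexed_input)
```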

So let’s go with primitive, single-valued types. It still depends, but Solr does
the right thing, or tries. Here’s the scoop. Stored fields for any single doc are
stored as a contiguous, compressed block. So if any _one_ field needs
to be read from the stored data, the entire block is decompressed and Solr will
preferentially fetch the value from the decompressed data as it’s pretty certain
to be at least as cheap as fetching from DV. However, the reverse is true if _all_
the returned values are single-valued DV fields. Then it’s more efficient to fetch
the DV values as they’re MMapped, and won’t cost the seek-and-decompress cycle.
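
Here is the rule of thumb from the paragraph above as a toy Python function. It is a simplified model of the described behaviour, not Solr's actual retrieval code; the schema dict, the field names and `choose_sources` are all hypothetical, and it assumes every requested field is either stored or has docValues.

```python
def choose_sources(requested, schema):
    """Return field -> 'docValues' or 'stored' for the requested fields."""

    def dv_ok(field):
        props = schema[field]
        return props["docValues"] and not props["multiValued"]

    # If every requested field is a single-valued docValues field, skip
    # the stored data entirely and avoid the seek-and-decompress cycle.
    all_dv = all(dv_ok(field) for field in requested)

    sources = {}
    for field in requested:
        if all_dv or not schema[field]["stored"]:
            sources[field] = "docValues"
        else:
            # The stored block is being decompressed anyway, so read from it.
            sources[field] = "stored"
    return sources


schema = {
    "price": {"stored": True, "docValues": True, "multiValued": False},
    "rank": {"stored": True, "docValues": True, "multiValued": False},
    "title": {"stored": True, "docValues": False, "multiValued": False},
}

only_dv = choose_sources(["price", "rank"], schema)
mixed = choose_sources(["price", "title"], schema)
```

In the mixed case `price` comes from the stored data even though it has docValues, because `title` forces the stored block to be decompressed anyway.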

Unless space is a real consideration for you, I’d set both indexed and docValues to
true.

Best,
Erick

> On May 20, 2020, at 10:45 AM, Rahul Goswami <[hidden email]> wrote:

Re: when to use docvalue

Revas-2
Thanks, Erick. It’s just that when we enable both indexed=true and docValues=true,
indexing time increases by at least 2x for a full re-index.

On Wed, May 20, 2020 at 2:30 PM Erick Erickson <[hidden email]>
wrote:
