questions regarding stored fields role in query time

Saurabh Sharma
Hi all,


I am new here on this channel.
A few days ago we upgraded our SolrCloud cluster to version 7.3. We post
documents in real time, with a 15-second soft commit and a 2-minute hard
commit. At present we post full documents to Solr, each of which
accumulates data from various sources.

Now we want to do partial updates. I went through the documentation and
found that all fields must be stored or have docValues for partial
updates. I have a few questions about this:

1) If a query fetches only one field, what is the performance impact of
all fields being stored? Say I have an "id" field with docValues enabled:
will Solr use the stored fields in this case? Will it load the whole
document into RAM?

2) What is the impact of large stored fields (.fdt) on query-time
performance? Does query time depend on the stored fields at all, or only
on the index?


Thanks and regards,
Saurabh
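
For readers unfamiliar with the partial updates mentioned above: Solr's atomic update syntax sends only the document id plus a modifier such as "set" for each changed field, and Solr rebuilds the rest of the document from its stored or docValues data. The sketch below only builds such a request body; the field names "price" and "in_stock" and the helper name are hypothetical, and nothing is sent to a server.

```python
import json


def atomic_update(doc_id, changes):
    """Build one atomic-update document: {"id": ..., field: {"set": value}}."""
    doc = {"id": doc_id}
    for field, value in changes.items():
        # "set" replaces the field's current value; other fields are untouched.
        doc[field] = {"set": value}
    return doc


# Hypothetical document id and fields, for illustration only.
payload = json.dumps([atomic_update("doc-1", {"price": 99, "in_stock": True})])

# This body would be POSTed to the collection's /update handler with
# Content-Type: application/json (not done here).
print(payload)
```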

Re: questions regarding stored fields role in query time

Emir Arnautović
Hi Saurabh,
Welcome to the channel!
Storing fields should not affect query performance directly as long as lazy field loading is enabled, which is the default. It should have no effect at all if you have enough RAM relative to the index size. Otherwise, the OS cache may be affected by the stored fields. The best way to tell is to test with your expected indexing and partial-update load and see whether, and by how much, performance is affected.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: questions regarding stored fields role in query time

Saurabh Sharma
Hi Emir,

My question is this: if the only field I return is a docValues field held
in RAM, will the stored documents still be read while constructing the
response after the query? Ideally, since the field requested via fl is
already in RAM, the documents on disk should not be consulted for this
field.

Any insight into how docValues fields and stored fields are used, and
which one takes precedence, would help me understand the situation
better.

Thanks,
Saurabh


Re: questions regarding stored fields role in query time

Emir Arnautović
Hi Saurabh,
DocValues can be used to retrieve field values (note that the original order is not preserved for multivalued fields), but they are also stored in files, just in a different structure. DocValues load some structures into memory but also use memory-mapped files to access values (I am not familiar with this code and am only assuming), so in either case they rely on the “shared” OS cache. That cache is affected when stored fields are loaded to perform a partial update. Indexing the documents also takes some memory. That is why storing fields and doing partial updates could indirectly affect query performance. The effect may be insignificant, and only a test can tell for sure, unless you have a small index and enough RAM, in which case I can say so with certainty.

Emir

Re: questions regarding stored fields role in query time

Shawn Heisey-2
In reply to this post by Saurabh Sharma
On 2/26/2019 1:34 AM, Saurabh Sharma wrote:
> Now we want to do partial updates. I went through the documentation and
> found that all fields must be stored or have docValues for partial
> updates. I have a few questions about this:
>
> 1) If a query fetches only one field, what is the performance impact of
> all fields being stored? Say I have an "id" field with docValues enabled:
> will Solr use the stored fields in this case? Will it load the whole
> document into RAM?

I am not aware of any option to keep docValues in RAM.  If you have
enough memory in your system (memory that has NOT been assigned to any
program), then the OS *might* keep some or all of your index data in
memory.  That functionality, present in all modern operating systems, is
the secret to good performance.

The stored data is compressed.  The docValues data is not compressed.
Uncompressing stored data uses CPU cycles.  Generally if data must be
read off of disk, compressed will be faster.  But if the data has been
cached by the OS and comes from memory, which you definitely want to
happen if possible, uncompressed will likely be faster ... and it will
definitely require less CPU.

If you have many fields but you're only fetching one, then docValues
will almost certainly be faster than stored.  All of the stored fields
for one document are compressed together, so Solr will be reading data
that it won't actually be using, in order to achieve decompression.
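
A toy illustration of that last point, not Lucene's actual file format: stored fields are modelled as one compressed blob per document, and docValues as a plain per-field array. Here zlib stands in for the real codec, and the documents and field names are made up.

```python
import json
import zlib

# Toy documents: a small "id" plus a large field we do not want to read.
docs = [{"id": f"doc-{i}", "body": "lorem ipsum " * 200} for i in range(4)]

# Row-oriented, like stored fields: all fields of a document are serialized
# and compressed together (zlib is only a stand-in for the real codec).
stored_blocks = [zlib.compress(json.dumps(d).encode("utf-8")) for d in docs]

# Column-oriented, like docValues: one uncompressed array per field.
doc_values = {"id": [d["id"] for d in docs]}


def id_from_stored(doc_num):
    """Return (id, bytes inflated); the whole document must be decompressed."""
    raw = zlib.decompress(stored_blocks[doc_num])
    return json.loads(raw)["id"], len(raw)


def id_from_doc_values(doc_num):
    """Return (id, bytes touched); only the requested value is read."""
    value = doc_values["id"][doc_num]
    return value, len(value)


print(id_from_stored(2))      # same id, but thousands of bytes inflated
print(id_from_doc_values(2))  # same id, only a few bytes touched
```

Both functions return the same id; the second element of each tuple shows how much data had to be materialized to get it.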

I believe that if you have both stored data and docValues for a field,
Solr will use stored data for search results.  I am not positive that
this is the case, but I think it's what happens.

> 2) What is the impact of large stored fields (.fdt) on query-time
> performance? Does query time depend on the stored fields at all, or only
> on the index?

The size of your stored data will have no *DIRECT* impact on query
performance.  Stored data is not consulted for the query part.  It is
consulted when document data is retrieved to return with the response.

A large amount of stored data can have an indirect impact on query
performance.  If there is insufficient memory available to the OS disk
cache, then reading the stored data to return results to the client will
push information out of the disk cache that is needed for queries.  If
that happens, then Solr will need to re-read that data off the disk to
do a query.  Because disks are glacially slow compared to memory,
performance will be impacted.
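
The eviction effect described above can be sketched with a tiny LRU cache standing in for the OS disk cache. This is a deliberately simplified model with made-up page counts; real operating systems use more elaborate replacement policies.

```python
from collections import OrderedDict


class PageCache:
    """Tiny LRU cache standing in for the OS disk cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.disk_reads = 0

    def read(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)    # cache hit: mark as recently used
            return
        self.disk_reads += 1                # cache miss: go to disk
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page


def disk_reads_for_second_query(capacity):
    cache = PageCache(capacity)
    index_pages = [("index", i) for i in range(4)]
    stored_pages = [("stored", i) for i in range(4)]
    for page in index_pages:    # first query warms the index pages
        cache.read(page)
    for page in stored_pages:   # returning results reads stored data
        cache.read(page)
    before = cache.disk_reads
    for page in index_pages:    # second query needs the same index pages
        cache.read(page)
    return cache.disk_reads - before


print(disk_reads_for_second_query(capacity=8))  # 0: everything stayed cached
print(disk_reads_for_second_query(capacity=4))  # 4: index pages were pushed out
```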

Here's a page about performance problems.  Most of it is about memory,
since that is usually the resource that has the biggest effect on
performance:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn

Re: questions regarding stored fields role in query time

Erick Erickson
In reply to this post by Emir Arnautović
It Depends (tm).

See: SOLR-12598 for details. The short form is that as of Solr 7.5, Solr attempts to do the most efficient thing possible when fetching fields to return to the client.

1> if all requested fields are docValues, return from docValues.
2> if _any_ field is stored, return from the stored (fdt) values.
3> if some are DV=true, but stored=false, get from both places
4> if some are DV=false but stored=true, get from both places.

To return a single stored=true field that is _not_ docValues, a minimum 16K block must be read from disk and decompressed. Much of the time, that will contain all of the fields and the uncompressed doc will be in the JVM’s heap so it’s more efficient to do that than pull it from MMapDirectory space.

If all values are dv=true, then not having to seek to disk/uncompress is probably more efficient so do it that way.

3 and 4 are really the same thing, you _can’t_ get all the fields from the same place, so you have to read/decompress _and_ pull from DV.
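
The four cases above can be sketched as a small function. This is one reading of the list, not Solr's actual code: case 2 is interpreted as "every requested field is available as stored and at least one is not docValues", and every requested field is assumed to be stored, docValues, or both. The schema and field names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical schema: field name -> (stored, docValues)
Schema = Dict[str, Tuple[bool, bool]]


def sources_for(fl: Iterable[str], schema: Schema) -> Set[str]:
    """Return which structures must be read to serve the requested fields."""
    fields = list(fl)
    if all(schema[f][1] for f in fields):
        return {"docValues"}            # case 1: everything is in docValues
    if all(schema[f][0] for f in fields):
        return {"stored"}               # case 2: one decompressed block has it all
    return {"stored", "docValues"}      # cases 3 and 4: no single place has it all


schema: Schema = {
    "id": (True, True),      # stored and docValues
    "title": (True, False),  # stored only
    "price": (False, True),  # docValues only
}

print(sources_for(["id", "price"], schema))
print(sources_for(["id", "title"], schema))
print(sources_for(["title", "price"], schema))
```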

But the overarching point is that, compared to searching, you do either of these for only a tiny fraction of the docs. Say numFound is 1,000,000 but you return 10 docs: you only have to decompress 10 blocks at worst.

And, as Emir says, accessing the fdt files is only done for the 10 docs returned, so that really doesn’t impact the search times much…

Best,
Erick
