Version field as DV

Version field as DV

Ishan Chattopadhyaya
Hi all,
I am looking to try out _version_ as a docvalue (SOLR-6337) as a precursor to SOLR-5944. Towards that, I want the _version_ field to be stored=indexed=false, docValues=true.

Does someone know about the performance implications of retrieving the _version_ as a docvalue, e.g. accessing docvalue vs. a stored field? Is there any known inefficiency when using a docvalue (as opposed to a stored field) due to random disk seeks, for example?
Regards,
Ishan

Re: Version field as DV

Joel Bernstein
In general, DocValues were built to support large-scale random-access use cases such as faceting and sorting. They have performance characteristics similar to the FieldCache. But unlike the FieldCache, you can trade off memory and performance by selecting different DocValues formats.



RE: Version field as DV

Reitzel, Charles

I think where Ishan is going with his question is this:

1. _version_ never needs to be searchable, thus indexed=false makes sense.

2. _version_ typically needs to be evaluated when performing an update and, possibly, a delete, thus stored=true makes sense.

3. _version_ would never be used for either sorting or faceting.

4. Given the above, is using docValues=true for _version_ a good idea?

Looking at the documentation:

https://cwiki.apache.org/confluence/display/solr/DocValues

And a bit more background:

http://lucidworks.com/blog/fun-with-docvalues-in-solr-4-2/

My take is a simple "no". Since docValues is, in essence, column-oriented storage (and can be seen, I think, as an alternate index format), what benefit is to be gained for the _version_ field? The primary benefits of docValues are in the sorting and faceting operations (maybe grouping?). These operations are never performed on the _version_ field, are they?

I guess my remaining question is: does it make sense to set indexed="false" on _version_? The example schemas set indexed=true. Does Solr itself perform searches internally on _version_? If so, then indexed=true is required. But otherwise, it seems like useless overhead.

Note, I have been using optimistic concurrency control in one application and so am interested in this possible optimization. Any changes in this space between 4.x and 5.x?

Thanks,

Charlie

 

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA-CREF
*************************************************************************


RE: Version field as DV

Shai Erera

There is one advantage to setting docValues=true if the application frequently updates documents: pulling this value from a DV will be faster than from stored fields. The latter need to be decompressed and then filtered down to just the _version_ field, etc.

indexed=true is only needed if Solr needs to know which document is associated with a certain version value. I don't know if it does though...

Shai
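
As a rough illustration of the difference described above, here is a toy Python model, not Lucene code. The block size, the use of zlib, and the JSON layout are all assumptions made for the sketch; it only shows why reading one field from row-oriented, block-compressed storage does more work than a positional read from a per-field column.

```python
import json
import zlib

# Toy data: 8 documents, each with an id, a title and a _version_.
docs = [{"id": i, "title": f"book {i}", "_version_": 1000 + i} for i in range(8)]

BLOCK_SIZE = 4  # assumption: documents are compressed together in small blocks

# "Stored fields": all fields of a document live together, compressed per block.
stored_blocks = [
    zlib.compress(json.dumps(docs[i:i + BLOCK_SIZE]).encode("utf-8"))
    for i in range(0, len(docs), BLOCK_SIZE)
]

# "Doc values": one column per field, addressable directly by document number.
version_column = [d["_version_"] for d in docs]


def version_from_stored(doc_id):
    """Decompress the whole block, then discard every field except _version_."""
    block = json.loads(zlib.decompress(stored_blocks[doc_id // BLOCK_SIZE]))
    return block[doc_id % BLOCK_SIZE]["_version_"]


def version_from_docvalues(doc_id):
    """Single positional read from the column."""
    return version_column[doc_id]
```

Both functions return the same value for any document number; the sketch says nothing about real disk behaviour, which is what the rest of the thread is asking about.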


Re: Version field as DV

Ishan Chattopadhyaya
In reply to this post by Reitzel, Charles
On Mon, Jun 22, 2015 at 9:53 PM, Reitzel, Charles <[hidden email]> wrote:
> 2. _version_ typically needs to be evaluated when performing an update and, possibly, a delete, thus stored=true makes sense.

I was looking at also making the field stored=false along with docValues=true, since internally looking up a docvalue can be achieved using a value source (equivalent to field(_version_) as a function query), so the field doesn't need to be stored.
My main concern is the feasibility of doing that performance-wise, and hence whether it would make sense for _version_ to go this way.

Also, are there any benchmarks around accessing a stored value vs. accessing a docvalue?



RE: Version field as DV

Chris Hostetter-3
In reply to this post by Reitzel, Charles

This thread kind of got off into a tangent about solr specifics -- if you
skip down it's really a question about underlying performance concerns of
using docvalues vs using stored fields.

: 1.      _version_ never needs to be searchable, thus, indexed=false makes sense.

Unless i'm wrong, the version field is involved in "search" contexts
because of optimistic concurrency -- for an "update doc=1 if
version=42" to work, a search is done under the covers against the version
field -- but since this is a fairly constrained filter, indexed=false
might still be fine as long as docValues=true because the search can be
done via a DocValues based filter.
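
To make the "update doc=1 if version=42" check concrete, here is a minimal Python sketch of optimistic concurrency over a per-document version lookup. `ToyIndex`, `VersionConflict`, and the counter-based version numbers are hypothetical names and simplifications for illustration; this is not how Solr assigns or stores versions.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version does not match the current one."""


class ToyIndex:
    def __init__(self):
        self.docs = {}      # id -> document fields
        self.versions = {}  # id -> current _version_ (stands in for a docValues column)
        self.clock = 0      # assumption: a simple counter, not Solr's real version scheme

    def _next_version(self):
        self.clock += 1
        return self.clock

    def add(self, doc_id, fields):
        self.docs[doc_id] = dict(fields)
        self.versions[doc_id] = self._next_version()
        return self.versions[doc_id]

    def update_if_version(self, doc_id, expected_version, fields):
        # The "constrained filter": look up one document's version and compare.
        current = self.versions.get(doc_id)
        if current != expected_version:
            raise VersionConflict(
                f"doc={doc_id}: expected {expected_version}, found {current}"
            )
        self.docs[doc_id].update(fields)
        self.versions[doc_id] = self._next_version()
        return self.versions[doc_id]
```

A caller that holds a stale version gets a `VersionConflict` and has to re-read the document before retrying, which is why the current version must be returnable to the client.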

: 4.      Given the above, is using docValues=true for _version_ a good idea?

: My take is a simple “no”.  Since docValues is, in essence, column
: oriented storage (and can be seen, I think, as an alternate index
: format), what benefit is to be gained for the _version_ field.  The

To be clear -- Solr already has code that depends on having "Doc Values"
on the version field to deal with the max version value in segments (see
VersionInfo.getVersionFromIndex and VersionInfo.getMaxVersionFromIndex) --
but as with any field, that doesn't mean you must have 'docValues="true"'
in your schema; instead, the UninvertingReader can be used as long as the
field is indexed.

But that's really not what Ishan is asking about.  

We know it's possible to use docValues=true && indexed=false on the
version field -- SOLR-6337 is open to decide if that makes sense in the
sample configs.  Ishan's question is really about stored=false.

The key bit of context of Ishan's question is updateable docValues
(SOLR-5944) and if/how it might be usable in Solr for the version field --
but one key aspect of doing that would be in ensuring that we can *return*
the correct version value to user (for optimistic concurrency).  Currently
that's done with stored fields, but that wouldn't be feasible if we go
down the route of updateable docValues, which means we would have to
"return" the version field from the docValues.

That's where Ishan's question about docvalues, performance, and disk
seeks comes from...

What are the downsides of saying "instead of using docvalues and stored
fields for this single-valued int per doc, we're only going to use
docvalues, and when doing pagination we will return the current value of the
field to the user from the docvalues"? What kind of performance impact
comes up in that case when you have 100 docs per page?


-Hoss
http://www.lucidworks.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Version field as DV

david.w.smiley@gmail.com
I don't know if it's worth it in terms of the trade-offs, but there's something to be said for having *both* indexed=true & docValues=true on the _version_ field in particular. docValues is not an "index"; any operation other than looking up the value for a specific document is O(docs) with docValues. VersionInfo.getMaxVersionFromIndex() has a slow O(docs) algorithm when it has to use docValues, versus the field being indexed=true, which uses an O(log(versionCount)) algorithm where versionCount <= docs. It's actually sometimes constant-time if the index postings format supports ordinals (the default BlockTree one does not). Maybe we should use an ord-supporting postings format. What I don't know is how frequent some of these operations are on a version field; knowing that would let us better judge the trade-offs.

~ David
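
The complexity argument above can be sketched with a toy Python model. The column and the sorted term list below are stand-ins invented for illustration, not Lucene data structures. In particular, the Python list makes "largest term" a trivial read, whereas the message above puts the real cost at O(log(versionCount)) for the default postings format.

```python
import bisect

# Per-document _version_ values, in document order (the docValues "column").
version_column = [1007, 1002, 1010, 1004, 1001]

# An indexed field keeps its distinct terms sorted (stand-in for a terms dictionary).
sorted_terms = sorted(set(version_column))


def max_version_docvalues(column):
    """O(docs): every document's value has to be visited."""
    best = None
    for value in column:
        if best is None or value > best:
            best = value
    return best


def max_version_indexed(terms):
    """With sorted terms, the largest one is simply the last entry."""
    return terms[-1] if terms else None


def has_version_indexed(terms, version):
    """O(log(versionCount)) membership test via binary search."""
    pos = bisect.bisect_left(terms, version)
    return pos < len(terms) and terms[pos] == version
```

The same asymmetry applies to "which document has version X": with only a column that is a full scan, while sorted terms allow a binary search.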


Re: Version field as DV

Adrien Grand
In reply to this post by Ishan Chattopadhyaya
For the record, there is an experimental postings format in
lucene/sandbox called IDVersionPostingsFormat that stores both the ID
and version in a postings format. This way you don't have to perform
additional seeks to look up the version, and it's even optimized for
ID lookups with a minimum version for faster optimistic concurrency.

--
Adrien


replacing stored fields with docValues to leverage updatable doc values -- was: Re: Version field as DV

Chris Hostetter-3
In reply to this post by david.w.smiley@gmail.com

: I don’t know if it’s worth it in terms of the trade-offs, but there’s
: something to be said about having *both* indexed=true & docValues=true on

Yes that's true -- and: again ... not at all what Ishan is asking
about.

(The tradeoffs between DV and indexed are known, and the question of how
they might apply to _version_ field best practices is something already
being discussed/tracked in the issue i mentioned before.)

The question Ishan was trying to ask, and what this thread keeps diverging
from (so i just changed the subject to try and make it more clear), is
about eliminating *stored* values from use with a particular field and using
docValues in their place.

Whether or not *indexed* values should/could also be used for performance
of searches is a completely orthogonal question -- the current discussion
is about the possibility of using the "updatable" feature of DocValues to
change some field values (in Solr's case, one of those fields would *have*
to be the version field, hence the original poor subject of this thread)
and then relying *only* on the docValues to "return" the current field
values to the client.

So for a concrete example...

   id: indexed + stored + DV
   title: indexed/tokenized + stored
   _version_: DV
   price: DV

...so if i want to change the "title" of a book, i have to completely
re-index it, but if i only want to change the *price* of a book, I use
updatable doc values to change the price field (and in Solr's case, for
correct optimistic concurrency, i also update the _version_ field).

But if/when users do paginated searches of books, and get ~100 results, we
use stored fields to get the id & title of each result, but we use DV to
return the current "price" (and version)

Make sense?
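
The example above can be written out as a small Python toy. `ToyBookIndex` and its methods are hypothetical names invented for this sketch, the version counter is an assumption, and nothing here reflects the actual SOLR-5944 design. It only shows which fields are read from where when a page of results is built.

```python
class ToyBookIndex:
    def __init__(self):
        self.stored = {}      # doc id -> {"id", "title"}; changing these means re-indexing
        self.price_dv = {}    # doc id -> price, updatable in place
        self.version_dv = {}  # doc id -> _version_, updatable in place
        self.clock = 0        # assumption: counter standing in for real version numbers

    def _bump(self, doc_id):
        self.clock += 1
        self.version_dv[doc_id] = self.clock

    def index(self, doc_id, title, price):
        """Full (re-)index: rewrites stored fields and both doc-values columns."""
        self.stored[doc_id] = {"id": doc_id, "title": title}
        self.price_dv[doc_id] = price
        self._bump(doc_id)

    def update_price(self, doc_id, price):
        """In-place update: price and _version_ change together, stored fields untouched."""
        self.price_dv[doc_id] = price
        self._bump(doc_id)

    def page(self, doc_ids):
        """One page of results: id/title from stored fields, price/_version_ from doc values."""
        return [
            {**self.stored[d], "price": self.price_dv[d], "_version_": self.version_dv[d]}
            for d in doc_ids
        ]
```

In this model `page()` does one stored-field read plus two column reads per hit; whether those extra column reads are expensive for ~100 hits is exactly the open performance question.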

Which brings us back to the question: are there any serious performance
downsides to "abusing" doc values in this way instead of using stored
fields?  My recollection is that back in the early days of doc values
someone did some fairly serious performance testing and decided that trying
to use docvalues for this purpose was in fact a lot slower than stored
fields because of the random disk seeks (as opposed to all stored fields
for a single doc being co-located).



-Hoss
http://www.lucidworks.com/



Re: Version field as DV

Chris Hostetter-3
In reply to this post by Adrien Grand

: For the record, there is an experimental postings format in
: lucene/sandbox called IDVersionPostingsFormat that stores both the ID
: and version in a postings format. This way you don't have to perform
: additional seeks to look up the version, and it's even optimized for
: id look ups with a minimum version for faster optimistic concurrency.

yes, and i've got a post-it note on my desk to spend time one of these
days thinking about if/how it might work in solr -- but what Ishan is
asking about is really a much broader question about updatable doc values
vs re-indexing and stored fields.  See my recent message in the thread i
forked off of this one for a more concrete example.




-Hoss
http://www.lucidworks.com/


Re: Version field as DV

Shalin Shekhar Mangar
In reply to this post by Adrien Grand
On Tue, Jun 23, 2015 at 6:41 PM, Adrien Grand <[hidden email]> wrote:
> For the record, there is an experimental postings format in
> lucene/sandbox called IDVersionPostingsFormat that stores both the ID
> and version in a postings format. This way you don't have to perform
> additional seeks to look up the version, and it's even optimized for
> id look ups with a minimum version for faster optimistic concurrency.

Yeah, I have looked at it in the past but in the context of updateable
DocValues, I feel that there is no way to support updateable doc
values if we use the IDVersionPostingsFormat. This is because we must
update a DocValue field together with the version field atomically or
else we run into consistency issues.

--
Regards,
Shalin Shekhar Mangar.


RE: Version field as DV

Reitzel, Charles
Shalin, that makes sense. But it also seems like the details of _version_ can and should be handled internally and not be subjected to the vagaries of deployment. Put another way, whenever _version_ is used, shouldn't its storage be determined by the code, not schema.xml?

SOLR-5944 is a super important issue with endless applications. Pricing is a huge use case: price field values fluctuate by the minute, hour, day, etc., but docs remain otherwise very stable. But there are many other cases with similar semantics (e.g. share counts, purchase order quantities, assigned resources).

So, I guess I'm encouraging you to do whatever it takes to _version_ to make SOLR-5944 work. :-)

P.S. Many thanks to Chris Hostetter for his corrections and clarifications. I'm learning a lot from this thread.
