BinaryDocValues compression with 8.5.1

BinaryDocValues compression with 8.5.1

Viral Gandhi
Hi,
I tried upgrading from Lucene 8.4 to 8.5.1 and ran our internal benchmarks. With this upgrade, our QPS dropped by more than 40% and latencies were also affected. After doing some profiling and reverting the LUCENE-9211 commit related to BinaryDocValues compression, we recovered ~30% of the loss. Has anyone encountered a similar situation?

We rely on BinaryDocValues very heavily. Should this newly introduced compression be opt-in?

Also, are there any other pointers for recovering the remaining 10% loss? When I run the benchmark on an 8.4 index with 8.5.1 code, performance is very similar to 8.4.

Thanks,
Viral Gandhi

Re: BinaryDocValues compression with 8.5.1

david.w.smiley@gmail.com
I don't have a direct answer for you, but your message causes me to reflect on how Lucene does *not* give users a choice of format on a per-type basis (e.g. BinaryDocValues vs. NumericDocValues, etc.), which is annoying. Ideally the previous simple format would be available for you to choose, but it is not. Lucene lets you mix & match PostingsFormats, stored fields formats, term vectors formats, and points formats. But when it comes to DocValues, it's an all-encompassing format for five different structures. So you take it or leave it; all or nothing. My colleague filed https://issues.apache.org/jira/browse/LUCENE-9236 on this matter; feel free to comment there with your opinion if you have one.

~ David


In reply to Viral Gandhi's message of Mon, May 18, 2020 at 7:52 PM.

Re: BinaryDocValues compression with 8.5.1

Michael Sokolov-4
I guess the compression we added to binary doc values, and for
postings, seems to have hurt performance in a way that wasn't detected
in testing when those changes were made, or if it was detected, I
don't recall any discussion about the tradeoff being made. Now that we
do see there is a tradeoff, I think we need to have that discussion
though. I can see that having compression can be a nice win for
indexes that are huge and may be memory bound, since it can help avoid
I/O, but for a low-latency case where the index is already memory
resident, we are willing to pay the price of a larger index to avoid
the cost of decompression. I think we need to find some way of
handling both cases. I think our design principle should be to expose
as few knobs as we can, but in this case I don't see how the code can
make the decision whether to compress or not, since it really depends
on external design considerations (how big will the index grow? how
much RAM will the servers have? what query latency is tolerable?)
Given that, I think we should find a way to expose some kind of
configurability. Maybe as a first step, rather than making this
configurable for each DocValuesType, we could offer a global
configuration in IndexWriterConfig (compressFields=true/false)?

In reply to David Smiley's message of Tue, May 19, 2020 at 1:05 AM.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: BinaryDocValues compression with 8.5.1

Michael McCandless-2
I think we could do this at the Codec level?

For example, for stored fields, the current default format (Lucene50StoredFieldsFormat) has two modes, Mode.BEST_SPEED and Mode.BEST_COMPRESSION, that are easy for the user to pick.  Both modes use compression, just at varying levels.

I think for the (new) Lucene84DocValuesFormat, which looks like it will always compress binary DVs, we could similarly add a Mode, maybe with two options, COMPRESSED and UNCOMPRESSED?

This way it is fairly simple for users to create a custom Codec subclassing the default Codec and pick the format they want. And we can try to figure out which way it should default. Our (Amazon's customer-facing product search) usage is admittedly unusual, relying heavily on BINARY doc values performance per hit collected during matching. Other search applications might not see a 40% hit to their red-line throughput :)
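
To make the COMPRESSED / UNCOMPRESSED idea concrete, here is a minimal, self-contained sketch of a store whose encoding is selected by a mode switch. It is an illustration only and is not Lucene code: the class name BinaryValueStore, the Mode enum, and the encode/decode methods are all hypothetical, and java.util.zip's Deflater stands in for whatever compression the real doc values format uses. It shows the tradeoff discussed in this thread: the uncompressed mode stores more bytes but does no work on read, while the compressed mode stores fewer bytes and pays for decompression on every read.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration of a mode switch; not part of Lucene. */
public class BinaryValueStore {

    /** Hypothetical modes, named after the options floated in this thread. */
    public enum Mode { COMPRESSED, UNCOMPRESSED }

    /** Encodes one binary value according to the chosen mode. */
    public static byte[] encode(byte[] value, Mode mode) {
        if (mode == Mode.UNCOMPRESSED) {
            return value.clone(); // stored verbatim: larger, but nothing to undo on read
        }
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            deflater.setInput(value);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Decodes a stored value; the mode must match the one used to encode it. */
    public static byte[] decode(byte[] stored, Mode mode) {
        if (mode == Mode.UNCOMPRESSED) {
            return stored.clone();
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(stored);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and not finished means the input is truncated or unusable.
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or corrupt compressed value");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed value", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] value = "category=books;category=books;category=books;category=books"
                .getBytes(StandardCharsets.UTF_8);
        for (Mode mode : Mode.values()) {
            byte[] stored = encode(value, mode);
            byte[] back = decode(stored, mode);
            System.out.println(mode + ": stored " + stored.length
                    + " bytes, round trip ok = " + Arrays.equals(value, back));
        }
    }
}
```

In a real Codec the mode would presumably be fixed when the format is constructed and recorded in the index, so that readers know how each segment was written; the sketch leaves that bookkeeping out.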

Viral, could you please open a Jira issue to find a way to make this configurable? We can hash out the details on the issue ...

In reply to Michael Sokolov's message of Wed, May 20, 2020 at 5:38 PM.

Re: BinaryDocValues compression with 8.5.1

Viral Gandhi
Thank you! Opened https://issues.apache.org/jira/browse/LUCENE-9378 to address this.

Viral Gandhi

In reply to Michael McCandless's message of Wed, 20 May 2020 at 15:27.

Re: BinaryDocValues compression with 8.5.1

Michael McCandless-2
Thanks Viral!

In reply to Viral Gandhi's message of Thu, May 21, 2020 at 2:21 PM.