Improving DocValue sorting performance


Improving DocValue sorting performance

Daniel Penning-2
Hi

I think sorting and searching on docvalues could benefit from optimizations similar to what impacts provide when scoring is used. By storing the minimum and maximum value for each block of docvalues, it would be possible to skip over sets of documents that contain no values in the relevant range.

When sorting a large result set, this could be used to skip over parts of the index once the desired number of documents has been found, narrowing the bounds further step by step. It could also improve the performance of docvalue-based range queries used in conjunctions, by only looking at blocks of documents that actually contain values in the correct range.

Currently this is just an idea I had while looking at the impacts implementation, and I wanted to get your opinion on it before spending time on a proof-of-concept implementation.
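For concreteness, the min/max skipping idea can be sketched over a plain in-memory array. The class, method names, and layout below are made up purely for illustration; this is not a Lucene API:

```java
import java.util.Arrays;

// Illustrative sketch only: per-block min/max metadata for a numeric
// docvalues field, used to rule out whole blocks during a range scan.
// Class and names are hypothetical, not Lucene APIs.
final class BlockMinMax {
    private final long[] mins; // minimum value per block of docs
    private final long[] maxs; // maximum value per block of docs

    BlockMinMax(long[] valuePerDoc, int blockSize) {
        int numBlocks = (valuePerDoc.length + blockSize - 1) / blockSize;
        mins = new long[numBlocks];
        maxs = new long[numBlocks];
        Arrays.fill(mins, Long.MAX_VALUE);
        Arrays.fill(maxs, Long.MIN_VALUE);
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            int block = doc / blockSize;
            mins[block] = Math.min(mins[block], valuePerDoc[doc]);
            maxs[block] = Math.max(maxs[block], valuePerDoc[doc]);
        }
    }

    /** True if no document in the block can have a value in [lower, upper]. */
    boolean canSkip(int block, long lower, long upper) {
        return maxs[block] < lower || mins[block] > upper;
    }
}
```

A range query over [lower, upper] would consult `canSkip` once per block and never decode the per-document values of skippable blocks.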

--

Daniel Penning / Senior Product Developer
T.: +49 (0)30 5900113-83 / F.: +49 (0)30 5900113-99
E-Mail: [hidden email]
Web: www.stroeermediabrands.de

Ströer Media Brands AG, Torstraße 49, 10119 Berlin-Mitte
Vorstand: Marc Schmitz 
Handelsregister: Amtsgericht Berlin-Charlottenburg HRB 126603 B


Re: Improving DocValue sorting performance

Alan Woodward-3
+1 This would be a very nice addition, and Toke’s recent work adding jump tables to docvalues would provide a natural place to store the information.

On 11 Feb 2019, at 12:42, Daniel Penning <[hidden email]> wrote:




Re: Improving DocValue sorting performance

Toke Eskildsen-2
In reply to this post by Daniel Penning-2
On Mon, 2019-02-11 at 13:42 +0100, Daniel Penning wrote:
> I think sorting and searching on docvalues can benefit from similar
> optimizations like impacts do when scoring is used. By storing the
> minimum and maximum value for blocks of docvalues it would be
> possible to skip over sets of documents that don't contain any values
> in the relevant range.

Nice idea: The docID-indexed blocks are handled by IndexedDISI, where each block represents 65536 docs. It would be simple enough to enrich these with min & max, although it would mean 16 bytes extra for each block, where the current minimum is 0 bytes + 8 bytes for the jump table entry. It could be made into an option, of course... As Alan suggests, the jump tables would hold the values for maximum skippiness.

I'm unsure how much API change it would require for the values to be used. Adding a method such as nextDocIDWhereNumericMightBeHigherThan()
seems awfully specialized and might require knowledge of the block size used inside IndexedDISI, which is implementation-specific. I do like the general idea of "can we somehow skip ahead?".
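To make the shape of such a method concrete, here is a toy sketch over a plain array rather than a real codec reader. The class and method names are invented for illustration and none of them exist in Lucene; the point is that the block size stays hidden behind the iterator:

```java
import java.util.Arrays;

// Hypothetical sketch: an advance-style skipping method that hides the
// block size inside the iterator. The array-backed class is a toy
// stand-in for a codec-level docvalues reader; no name here is a real
// Lucene API.
final class SkippingValues {
    private final long[] values;   // value per docID (toy representation)
    private final long[] blockMax; // max value recorded per block
    private final int blockSize;

    SkippingValues(long[] values, int blockSize) {
        this.values = values;
        this.blockSize = blockSize;
        int numBlocks = (values.length + blockSize - 1) / blockSize;
        blockMax = new long[numBlocks];
        Arrays.fill(blockMax, Long.MIN_VALUE);
        for (int doc = 0; doc < values.length; doc++) {
            int block = doc / blockSize;
            blockMax[block] = Math.max(blockMax[block], values[doc]);
        }
    }

    /**
     * First docID >= target whose value is >= minValue, skipping whole
     * blocks whose recorded max rules them out. Returns values.length
     * (a NO_MORE_DOCS stand-in) when exhausted.
     */
    int advanceWithValueAtLeast(int target, long minValue) {
        int doc = target;
        while (doc < values.length) {
            int block = doc / blockSize;
            if (blockMax[block] < minValue) {
                doc = (block + 1) * blockSize; // skip the whole block
                continue;
            }
            if (values[doc] >= minValue) {
                return doc;
            }
            doc++;
        }
        return values.length;
    }
}
```

The caller still verifies exact values; the metadata only rules out whole blocks, which keeps the contract loose enough to survive codec changes.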

Another sore point is how much this would help statistically. What are the chances that a block can be skipped at all? It depends on the distribution of the values themselves, of course: sorting descending on a long-tail distribution might work very well, whereas a more uniform distribution would yield no speed-up. A date field in a time-series corpus, indexed in chronological order, would work extremely well sorted ascending and extremely poorly descending (possibly it's the other way around; it's getting late here).
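The distribution argument can be eyeballed with a toy experiment on synthetic data (names and setup are illustrative; nothing here touches Lucene): the same multiset of values yields very different skip rates depending on how it is laid out across blocks.

```java
import java.util.Random;

// Toy experiment: what fraction of fixed-size blocks can be skipped for
// the query "value >= threshold", comparing index-ordered vs shuffled
// layouts of the same values? Purely synthetic, not a Lucene benchmark.
final class SkipFraction {
    /** Fraction of blocks whose max is below the threshold. */
    static double skippable(long[] values, int blockSize, long threshold) {
        int numBlocks = (values.length + blockSize - 1) / blockSize;
        int skipped = 0;
        for (int b = 0; b < numBlocks; b++) {
            long max = Long.MIN_VALUE;
            int end = Math.min(values.length, (b + 1) * blockSize);
            for (int doc = b * blockSize; doc < end; doc++) {
                max = Math.max(max, values[doc]);
            }
            if (max < threshold) {
                skipped++;
            }
        }
        return (double) skipped / numBlocks;
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        long[] chronological = new long[n];
        for (int i = 0; i < n; i++) {
            chronological[i] = i; // e.g. a timestamp field, indexed in order
        }
        long[] shuffled = chronological.clone();
        Random r = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = r.nextInt(i + 1);
            long tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
        }
        long threshold = n - 1000; // "most recent docs" style query
        // With values in index order almost every block is skippable;
        // after shuffling essentially none are, since nearly every block
        // contains at least one high value.
        System.out.println("sorted layout:   " + skippable(chronological, 1024, threshold));
        System.out.println("shuffled layout: " + skippable(shuffled, 1024, threshold));
    }
}
```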


So yeah, it is definitely technically possible and a very interesting
idea. And maybe very specialized?

Cc'ing Daniel, as the thread is a month old.

- Toke Eskildsen, Royal Danish Library


