Question abount combining InvertedIndex and SortField

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Question abount combining InvertedIndex and SortField

小鱼儿-2
Assume i first use keyword search to get a DocIDSet from inverted index,
then i want to sort these docIds by some numeric field, like a
`updateTime`, does Lucene do this without need of loading the Document
objects but only with an sorted index on `updateTime`? Which i call it
"Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)

And since Lucene has a `SortField` API, what does it do the sort? I thought
SortField is just a post-processing...
Reply | Threaded
Open this post in threaded view
|

Re: Question abount combining InvertedIndex and SortField

Mikhail Khludnev-2
Hello, 小鱼儿.

On Tue, Dec 31, 2019 at 6:32 AM 小鱼儿 <[hidden email]> wrote:

> Assume i first use keyword search to get a DocIDSet from inverted index,
> then i want to sort these docIds by some numeric field, like a
> `updateTime`, does Lucene do this without need of loading the Document
> objects but only with an sorted index on `updateTime`?

1. Lucene doesn't load Document objects from stored fields files while
sorting for sure.
2. Lucene uses dedicated columnar data structure (DocVaues made index time,
or in the worst case lazily loaded FieldCache) to obtain field values while
collecting search results from inverted index.
3. One deviation from this generic algorithm is sorted index and early
termination, that's probably what you meant in "Index-Only Search
Optimization".


> Which i call it
> "Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)
>
> And since Lucene has a `SortField` API, what does it do the sort? I thought
>
It brings up TopFieldCollector instead of the default TopScoreDocCollector.

> SortField is just a post-processing...
>
Not really. Scoring/sorting should be done along side with searching to
reduce memory footprint by storing only top candidate results in a binary
heap.
IIRC it's described in this classic paper
http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf


--
Sincerely yours
Mikhail Khludnev
Reply | Threaded
Open this post in threaded view
|

Re: Question abount combining InvertedIndex and SortField

小鱼儿-2
Hi,  Mikhail

Your words is very encouraging. I was thinking i might need to do another
Lucene custom query to apply my business-specific "index-only sort" and
"early-termination"... SortField API says can use any numeric/String field,
which is very perfect. (In this way, Lucene should be able to a high-perf
Top-N query if SortField can support dynamically-generated ranking
scores... not only the native indexed numeric/String fields)

Mikhail Khludnev <[hidden email]> 于2019年12月31日周二 下午4:41写道:

> Hello, 小鱼儿.
>
> On Tue, Dec 31, 2019 at 6:32 AM 小鱼儿 <[hidden email]> wrote:
>
> > Assume i first use keyword search to get a DocIDSet from inverted index,
> > then i want to sort these docIds by some numeric field, like a
> > `updateTime`, does Lucene do this without need of loading the Document
> > objects but only with an sorted index on `updateTime`?
>
> 1. Lucene doesn't load Document objects from stored fields files while
> sorting for sure.
> 2. Lucene uses dedicated columnar data structure (DocVaues made index time,
> or in the worst case lazily loaded FieldCache) to obtain field values while
> collecting search results from inverted index.
> 3. One deviation from this generic algorithm is sorted index and early
> termination, that's probably what you meant in "Index-Only Search
> Optimization".
>
>
> > Which i call it
> > "Index-Only Sort Optimization" (MUST be some equal concepts in RDBMS?)
> >
> > And since Lucene has a `SortField` API, what does it do the sort? I
> thought
> >
> It brings up TopFieldCollector instead of the default TopScoreDocCollector.
>
> > SortField is just a post-processing...
> >
> Not really. Scoring/sorting should be done along side with searching to
> reduce memory footprint by storing only top candidate results in a binary
> heap.
> IIRC it's described in this classic paper
> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>