Is there some sensible way to do giant BooleanQuery or similar lazily?

Is there some sensible way to do giant BooleanQuery or similar lazily?

Trejkaz
Hi all.

We have this one kind of query where you essentially specify a text
file which contains the actual query to search for. The catch is that
the text file can be large.

Our custom query currently computes the set of matching docs up-front,
across the whole index; then, when the search reaches each LeafReader,
the larger doc ID set is sliced so that only the sub-slice for that
leaf is returned. This is confusing, and seems backwards.
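
For concreteness, here is a minimal sketch of that up-front pattern,
assuming the Lucene 6.x-era Weight API that was current for this thread
(exact signatures have shifted across versions). The class name, and
how the global FixedBitSet gets computed, are hypothetical:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.ConstantScoreScorer;
import org.apache.lucene.search.ConstantScoreWeight;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

public class PrecomputedDocIdSetQuery extends Query {
  // One bit per document over the *entire* index, computed up-front.
  private final FixedBitSet globalMatches;

  public PrecomputedDocIdSetQuery(FixedBitSet globalMatches) {
    this.globalMatches = globalMatches;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, boolean needsScores)
      throws IOException {
    return new ConstantScoreWeight(this) {
      @Override
      public Scorer scorer(LeafReaderContext context) throws IOException {
        // Slice the index-wide bit set down to this leaf's doc ID range:
        // leaf-local doc = global doc - context.docBase.
        int base = context.docBase;
        int leafMaxDoc = context.reader().maxDoc();
        FixedBitSet slice = new FixedBitSet(leafMaxDoc);
        int upTo = Math.min(base + leafMaxDoc, globalMatches.length());
        int doc = base < upTo ? globalMatches.nextSetBit(base)
                              : DocIdSetIterator.NO_MORE_DOCS;
        while (doc < upTo) {
          slice.set(doc - base);
          if (doc + 1 >= upTo) break;
          doc = globalMatches.nextSetBit(doc + 1);
        }
        return new ConstantScoreScorer(this, score(),
            new BitSetIterator(slice, slice.cardinality()));
      }
    };
  }

  @Override
  public String toString(String field) {
    return "PrecomputedDocIdSetQuery";
  }

  @Override
  public boolean equals(Object other) {
    // Identity comparison is enough for a sketch.
    return sameClassAs(other)
        && globalMatches == ((PrecomputedDocIdSetQuery) other).globalMatches;
  }

  @Override
  public int hashCode() {
    return classHash() + System.identityHashCode(globalMatches);
  }
}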

As an alternative, we could override rewrite(IndexReader) and return a
gigantic boolean query (a sketch follows this list). The problems with
that:

  1) A gigantic BooleanQuery takes up a lot more memory than a list of
query strings.

  2) Lucene devs often say that gigantic boolean queries are bad, maybe
for reason #1, or maybe for another reason which nobody understands.
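
To make problem #1 concrete, a hedged sketch of what that rewrite could
look like, again against the 6.x-era rewrite(IndexReader) signature.
The field name "text" and the one-query-string-per-line file format are
made up for illustration:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FileBackedDisjunctionQuery extends Query {
  private final Path file;

  public FileBackedDisjunctionQuery(Path file) {
    this.file = file;
  }

  @Override
  public Query rewrite(IndexReader reader) throws IOException {
    // Every line becomes a SHOULD clause, so the whole disjunction is
    // materialised in memory at once: exactly problem #1 above. It also
    // runs into BooleanQuery's maximum clause count (1024 by default).
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      builder.add(new TermQuery(new Term("text", line)),
                  BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }

  @Override
  public String toString(String field) {
    return "FileBackedDisjunctionQuery(" + file + ")";
  }

  @Override
  public boolean equals(Object other) {
    return sameClassAs(other)
        && file.equals(((FileBackedDisjunctionQuery) other).file);
  }

  @Override
  public int hashCode() {
    return classHash() + file.hashCode();
  }
}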

So in place of this, is there some kind of alternative?

For instance, is there some query type where I can provide an iterator
of sub-queries, so that they don't all have to be in memory at once?
The code to get each sub-query is always relatively straightforward
and easy to understand.

I guess the snag is that sometimes the line of text is natural
language which gets run through an analyser, so we'd potentially be
re-analysing the text once per leaf reader? :/

This would replace about 1/3 of the remaining places where we have to
compute the doc ID set up-front.

TX

Re: Is there some sensible way to do giant BooleanQuery or similar lazily?

Adrien Grand
Large boolean queries can cause a lot of random access, as each sub-clause
is advanced one after the other. Even when everything fits in the
filesystem cache, the fact that the heap of clauses needs to be rebalanced
after each document makes queries with many clauses slow. This is why we
have TermInSetQuery (TermsQuery on 6.x): it has a more disk-friendly
access pattern (one seek per term per segment) and scales better with the
number of terms. Unfortunately it does not come only with benefits: its
main drawback is that it is always evaluated against the entire index. So
if you intersect a very selective query (on an id field, for instance)
with a large TermInSetQuery, the TermInSetQuery will dominate the
execution time for sure.
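
To make that concrete, a minimal sketch of building one TermInSetQuery
from a word-list file, using the 7.x+ name (on 6.x, substitute
TermsQuery from the queries module). The field name "text" and the
one-term-per-line format are assumptions for illustration:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

public final class WordListQueries {
  // Builds a single set-based query over all the terms, instead of one
  // BooleanQuery clause per term.
  public static Query fromFile(Path file) throws IOException {
    List<BytesRef> terms = new ArrayList<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      terms.add(new BytesRef(line.trim()));
    }
    // The terms are matched with roughly one seek per term per segment,
    // rather than N independently advanced clauses.
    return new TermInSetQuery("text", terms);
  }
}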

Re: Is there some sensible way to do giant BooleanQuery or similar lazily?

Trejkaz
On Mon, Apr 3, 2017 at 6:25 PM, Adrien Grand <[hidden email]> wrote:

> [...] Unfortunately it does not come only with benefits: its main
> drawback is that it is always evaluated against the entire index. So if
> you intersect a very selective query (on an id field, for instance) with
> a large TermInSetQuery, the TermInSetQuery will dominate the execution
> time for sure.

One such case which we do have is searching on file digests, where all
the values are spread across the entire index and the common prefixes
don't allow much of a win from things like automata. For those, though,
TermsQuery might still work.

The problem is more with things like word lists, where one "word" might
analyse to multiple terms, turning it into a phrase query, which prevents
using TermsQuery. Collapsing it all into some kind of conditional
multi-phrase query... yeah, I have no idea whether there is any sensible
way to do that.
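
A hedged sketch of that term/phrase split, just to make the shape
concrete: lines that analyse to a single term get folded into one
TermInSetQuery, while multi-term lines fall back to PhraseQuery clauses.
The field name and analyzer are assumptions, position gaps from analysis
are ignored for brevity, and this does nothing for the "conditional
multi-phrase" collapsing (the phrase side still hits the clause limit):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

public final class WordListQueryBuilder {
  public static Query build(Analyzer analyzer, String field,
                            Iterable<String> lines) throws IOException {
    List<BytesRef> singleTerms = new ArrayList<>();
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String line : lines) {
      List<String> tokens = analyze(analyzer, field, line);
      if (tokens.size() == 1) {
        // Single-term lines all go into one set query.
        singleTerms.add(new BytesRef(tokens.get(0)));
      } else if (tokens.size() > 1) {
        // Multi-term lines need positions, so each is its own clause.
        builder.add(new PhraseQuery(field, tokens.toArray(new String[0])),
                    BooleanClause.Occur.SHOULD);
      }
    }
    builder.add(new TermInSetQuery(field, singleTerms),
                BooleanClause.Occur.SHOULD);
    return builder.build();
  }

  // Runs one line through the analyzer and collects the terms, ignoring
  // position increments.
  private static List<String> analyze(Analyzer analyzer, String field,
                                      String text) throws IOException {
    List<String> tokens = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream(field, text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        tokens.add(term.toString());
      }
      ts.end();
    }
    return tokens;
  }
}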

TX
