Efficient way to define large Boolean Occur.FILTER clause in Lucene 6

Hasenberger, Josef
Hi,

I want to filter the results of a query by Long values (applied to a specific field, actually a DocValues field) in Lucene 6, as a replacement for Filters, which were removed in Lucene 6.

The number of allowed Long values can range from just a few up to several hundred thousand.
What I do now is create a TermsQuery from generated Terms and add it to a BooleanQuery as a filter, like this:

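    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsQuery;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.BytesRef;

    public Query getFilteredQuery(Query query) {
        // Pre-size the list so it never has to grow while terms are added.
        List<Term> terms = new ArrayList<>(getValueSize());
        String keyFieldName = getFieldName();
        for (Long value : getValues()) {
            // Build the UTF-8 bytes directly, skipping the UTF-16 String step.
            BytesRef valueAsBytesRef = LongToUTF8Converter.toBytesRef(value);
            Term term = new Term(keyFieldName, valueAsBytesRef);
            terms.add(term);
        }
        TermsQuery termsQuery = new TermsQuery(terms);

        return new BooleanQuery.Builder()
                .add(query, Occur.MUST)        // original query, contributes to scoring
                .add(termsQuery, Occur.FILTER) // filter, must match but does not score
                .build();
    }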

However, I have a feeling that the conversion from Long values to Terms is rather inefficient for large collections and also uses a lot of memory.
To ease the conversion overhead somewhat, I created a class that converts a Long value directly to a BytesRef instance (in order to avoid converting to UTF-16 and then to UTF-8 again), and I pass that instance to the Term constructor.
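The converter class itself is not shown in this message. As a rough sketch of the idea, assuming the terms hold the decimal text of each value, the digits can be written straight into a byte array, since ASCII digits are already valid UTF-8. The class and method names below are hypothetical, and the result is a plain byte[] rather than a Lucene BytesRef so the sketch runs without Lucene on the classpath.

```java
import java.nio.charset.StandardCharsets;

public class LongToUtf8Sketch {

    /** Writes the decimal form of value directly as ASCII (and therefore UTF-8) bytes. */
    public static byte[] toUtf8Bytes(long value) {
        if (value == Long.MIN_VALUE) {
            // Long.MIN_VALUE cannot be negated without overflow, so handle it separately.
            return "-9223372036854775808".getBytes(StandardCharsets.US_ASCII);
        }
        boolean negative = value < 0;
        long v = negative ? -value : value;

        // Count the digits first so the array is allocated once at its final size.
        int digits = 1;
        for (long t = v; t >= 10; t /= 10) {
            digits++;
        }
        byte[] out = new byte[digits + (negative ? 1 : 0)];

        // Fill from the right, least significant digit first.
        int pos = out.length;
        do {
            out[--pos] = (byte) ('0' + (v % 10));
            v /= 10;
        } while (v != 0);
        if (negative) {
            out[0] = (byte) '-';
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes(-1234567890123L);
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints -1234567890123
    }
}
```

This removes the intermediate String, but it still produces one byte array per value, which is the object creation the question is about.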

I just wonder whether there is a better way to pass a large number of filter criteria to a BooleanQuery Occur.FILTER clause, one that avoids excessive object creation.
Or maybe there is a better approach than using BooleanQuery in this case?

Would be glad if you could share your thoughts on this.

Thanks a lot,
Josef

Re: Efficient way to define large Boolean Occur.FILTER clause in Lucene 6

Trejkaz
On Tue, Jun 26, 2018 at 7:02 PM, Hasenberger, Josef
<[hidden email]> wrote:
> However, I have a feeling that the conversion from Long values to Terms is
> rather inefficient for large collections and also uses a lot of memory.
> To ease conversion overhead somewhat, I created a class that converts a
> Long value directly to BytesRef instance (in order to avoid conversion to
> UTF16 and then UTF8 again) and pass that instance to the Term constructor.

First thought is, why are you using TermsQuery if they're in DocValues?
Is DocValuesTermsQuery any better? It does depend on how many terms
you're searching for.

Second thought is that there is also DocValuesNumbersQuery, which
avoids having to convert all the values.

> I just wonder if there is a better method for passing large amount of filter criteria
> to a BooleanQuery Occur.FILTER clause, that avoids excessive object creation.

If you can get your long values into something which implements Bits,
you could make a query using RandomAccessWeight to directly point at
the existing set you already have in memory.

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: Efficient way to define large Boolean Occur.FILTER clause in Lucene 6

Hasenberger, Josef
Hi,

Great, thanks a lot.

Your pointer to RandomAccessWeight and the approach used in DocValuesNumbersQuery was exactly what I needed for my use case.
I created my own query type that takes advantage of values already loaded into a LongBitSet. This lets me efficiently implement the Bits that decide whether a document matches, inside my own RandomAccessWeight implementation.
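The custom query is not shown in this thread. The following is only a simplified model of the per-document check it describes, written against the JDK so it runs without Lucene. The DocMatcher interface, the docValues array and the class name are hypothetical stand-ins for Lucene's Bits, the numeric doc values of a segment, and the real query class. java.util.BitSet is used in place of LongBitSet, so only non-negative int-sized values can match here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitSetFilterSketch {

    /** Minimal stand-in for a Bits-like view: answers "does document docId match?". */
    public interface DocMatcher {
        boolean matches(int docId);
    }

    /**
     * Builds a matcher that reads each document's long value and tests it against
     * an already populated bit set, so no per-value Term objects are created.
     * docValues[docId] is a hypothetical stand-in for a per-document numeric doc value.
     */
    public static DocMatcher matcherFor(long[] docValues, BitSet allowedValues) {
        return docId -> {
            long value = docValues[docId];
            // java.util.BitSet only accepts non-negative int indexes.
            return value >= 0 && value <= Integer.MAX_VALUE && allowedValues.get((int) value);
        };
    }

    /** Collects the ids of all documents in [0, maxDoc) accepted by the matcher. */
    public static List<Integer> filter(int maxDoc, DocMatcher matcher) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (matcher.matches(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet allowed = new BitSet();
        allowed.set(42);
        allowed.set(100_000);
        long[] docValues = {5L, 42L, 7L, 42L, 100_000L};
        System.out.println(filter(docValues.length, matcherFor(docValues, allowed))); // prints [1, 3, 4]
    }
}
```

The point of the design is that the cost per document is one value read and one bit test, independent of how many values are allowed.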

This approach is efficient when the number of values exceeds a certain threshold. Below that threshold, using TermsQuery is more efficient.
My code decides which approach to use by applying a heuristic specific to my data.

Overall, for larger value maps (above 20,000 entries), I reduced search time to about 10-30% of what it was before. For smaller value maps, search time stays efficient because TermsQuery is used.
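The heuristic itself is not given in this message. A minimal sketch of such a cut-over could look like the following, assuming the decision depends only on the number of allowed values. The names are hypothetical, and 20,000 is simply the size mentioned above for "larger" value maps, not a measured break-even point.

```java
public class FilterStrategySketch {

    /** The two approaches discussed in this thread. */
    public enum Strategy { TERMS_QUERY, BITSET_DOC_VALUES }

    /** Assumed cut-over point; it should be tuned by measuring on real data. */
    public static final int ASSUMED_THRESHOLD = 20_000;

    /** Picks the bit set approach only when the value count exceeds the threshold. */
    public static Strategy choose(int valueCount, int threshold) {
        return valueCount > threshold ? Strategy.BITSET_DOC_VALUES : Strategy.TERMS_QUERY;
    }

    public static void main(String[] args) {
        System.out.println(choose(500, ASSUMED_THRESHOLD));     // prints TERMS_QUERY
        System.out.println(choose(250_000, ASSUMED_THRESHOLD)); // prints BITSET_DOC_VALUES
    }
}
```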

Thanks again!

Josef

-----Original Message-----
From: Trejkaz [mailto:[hidden email]]
Sent: Wednesday, June 27, 2018 4:51 AM
To: Lucene Users Mailing List
Subject: Re: Efficient way to define large Boolean Occur.FILTER clause in Lucene 6