Optimizing fq query performance

Optimizing fq query performance

John Davis
Hi there,

We noticed a sizable performance degradation when we add certain fq filters
to the query even though the result set does not change between the two
queries. I would've expected Solr to optimize internally by picking the
most constrained fq filter first, but maybe my understanding is wrong.
Here's an example:

query1: fq = 'field1:* AND field2:value'
query2: fq = 'field2:value'

If we assume that the result set is identical between the two queries and
field1 is in general more frequent in the index, we noticed query1 takes
100x longer than query2. In case it matters, field1 is of type tlongs while
field2 is a string.

Any tips for optimizing this?

John

Re: Optimizing fq query performance

Yonik Seeley
A more constrained filter that matches the same set of documents just
guarantees that there is more information to evaluate per document matched.
For your specific case, you can optimize fq = 'field1:* AND field2:value'
to &fq=field1:*&fq=field2:value
This will at least cause field1:* to be cached and reused if it's a common
pattern.
field1:* is slow in general for indexed fields because all terms for the
field need to be iterated (e.g. does term1 match doc1, does term2 match
doc1, etc.).
One can optimize this by indexing a term in a different field to turn the
existence check into a single-term query (i.e. exists:field1).
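
For illustration, a rough sketch of both suggestions (the endpoint, collection
name and the exists field are assumptions, not something from your setup):

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint

# Separate fq params: each filter gets its own filterCache entry, so the
# expensive field1:* DocSet is computed once and reused afterwards.
params = [("q", "*:*"), ("fq", "field1:*"), ("fq", "field2:value"), ("rows", 0)]
resp = requests.get(SOLR, params=params)
print(resp.json()["response"]["numFound"])

# Existence field: if you index a term like exists:field1 whenever field1 is
# populated, the presence check becomes a cheap single-term query instead.
params = [("q", "*:*"), ("fq", "exists:field1"), ("fq", "field2:value"), ("rows", 0)]
resp = requests.get(SOLR, params=params)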

-Yonik

Re: Optimizing fq query performance

Erick Erickson
Also note that field1:* does not necessarily match all documents. A document without that field will not match. So it really can’t be optimized the way you might expect since, as Yonik says, all the terms have to be enumerated…

Best,
Erick

Re: Optimizing fq query performance

John Davis
In reply to this post by Yonik Seeley
> field1:* is slow in general for indexed fields because all terms for the
> field need to be iterated (e.g. does term1 match doc1, does term2 match
> doc1, etc)

This feels like something that could be optimized internally by tracking
the existence of the field in a doc instead of making users index yet
another field to track existence.

BTW, does this same behavior apply to tlong fields too, where the values
might be more continuous vs discrete strings?

Re: Optimizing fq query performance

Erick Erickson
Patches welcome, but how would that be done? There’s no fixed schema at the Lucene level. It’s even possible that no two documents in the index have any fields in common. Given the structure of an inverted index, answering the question “for document X, does it have any value for this field?” is rather “interesting”. You might be able to do something with docValues and function queries, but that’s overkill.

In some sense, fq=field:* does this dynamically by putting the results in the filterCache, where it requires no calculations the next time, so a built-in optimization seems like more effort than it’s worth.
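
To see the filterCache effect, a rough sketch like this would do (endpoint and
field names are assumptions); the second request should report a much smaller
QTime because the field1:* DocSet comes straight out of the filterCache:

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint
params = {"q": "*:*", "fq": "field1:*", "rows": 0}

for attempt in (1, 2):
    qtime = requests.get(SOLR, params=params).json()["responseHeader"]["QTime"]
    print(f"request {attempt}: QTime={qtime} ms")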

Best,
Erick

Re: Optimizing fq query performance

Shawn Heisey-2
In reply to this post by John Davis
On 4/13/2019 12:58 PM, John Davis wrote:
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.

All filters cover the entire index, unless the query parser that you're
using implements the PostFilter interface, the filter cost is set high
enough, and caching is disabled.  All three of those conditions must be
met in order for a filter to only run on results instead of the entire
index.

http://yonik.com/advanced-filter-caching-in-solr/
https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/

Most query parsers don't implement the PostFilter interface.  The lucene
and edismax parsers do not implement PostFilter.  Unless you've
specified the query parser in the fq parameter, it will use the lucene
query parser, and it cannot be a PostFilter.
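
As a sketch of what a filter that meets all three conditions could look like
(frange is one parser that does implement PostFilter; the popularity field and
collection name are placeholders, not part of your setup):

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint
params = {
    "q": "field2:value",
    # cache=false plus a cost of 100 or more turns this frange filter into a
    # post filter, so it only runs over documents that already matched q and
    # the cheaper filters.
    "fq": "{!frange l=10 cache=false cost=200}log(popularity)",
    "rows": 0,
}
resp = requests.get(SOLR, params=params)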

> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'

If the point of the "field1:*" query clause is "make sure field1 exists
in the document" then you would be a lot better off with this query clause:

field1:[* TO *]

This is an all-inclusive range query.  It works with all field types
where I have tried it, and that includes TextField types.   It will be a
lot more efficient than the wildcard query.

Here's what happens with "field1:*".  If the cardinality of field1 is
ten million different values, then the query that gets constructed for
Lucene will literally contain ten million values.  And every single one
of them will need to be compared to every document.  That's a LOT of
comparisons.  Wildcard queries are normally very slow.
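
Concretely, the change is just the fq string; comparing QTime between the two
forms (collection name assumed, and ideally against a cold filterCache so the
second form isn't just getting a cache hit) makes the difference easy to see:

import requests

SOLR = "http://localhost:8983/solr/mycollection/select"  # assumed endpoint
variants = {
    "wildcard": ["field1:*", "field2:value"],         # expands to every term in field1
    "range":    ["field1:[* TO *]", "field2:value"],  # one existence check per document
}
for name, fq in variants.items():
    params = {"q": "*:*", "fq": fq, "rows": 0}
    qtime = requests.get(SOLR, params=params).json()["responseHeader"]["QTime"]
    print(f"{name}: QTime={qtime} ms")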

Thanks,
Shawn

Re: Optimizing fq query performance

John Davis
Can you clarify why field:[* TO *] is a lot more efficient than field:*?

Re: Optimizing fq query performance

Shawn Heisey-2
On 4/17/2019 10:51 AM, John Davis wrote:
> Can you clarify why field:[* TO *] is a lot more efficient than field:*?

It's a range query.  For every document, Lucene just has to answer two
questions -- is the value greater than or equal to the lowest possible
value, and is it less than or equal to the highest possible value.  The
answer will be yes if the field exists, and no if it doesn't.  With one
million documents, there are two million questions that Lucene has to
answer.  Which probably seems like a lot ... but keep reading.  (Side
note:  It wouldn't surprise me if Lucene has an optimization specifically
for the all-inclusive range such that it actually only asks one question,
not two.)

With a wildcard query, there are as many questions as there are values
in the field.  Every question is asked for every single document.  So if
you have a million documents and there are three hundred thousand
different values contained in the field across the whole index, that's
300 billion questions.
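
Spelling that arithmetic out:

docs = 1_000_000    # documents in the index
terms = 300_000     # distinct values of the field across the whole index

range_checks = docs * 2          # two bound checks per document = 2 million
wildcard_checks = docs * terms   # every term against every document = 300 billion
print(range_checks, wildcard_checks)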

Thanks,
Shawn

Re: Optimizing fq query performance

John Davis
If what you describe is the case for the range query [* TO *], why would
Lucene not optimize field:* in a similar way?

Re: Optimizing fq query performance

Shawn Heisey-2
On 4/17/2019 1:21 PM, John Davis wrote:
> If what you describe is the case for the range query [* TO *], why would
> Lucene not optimize field:* in a similar way?

I don't know.  Low-level Lucene operation is a mystery to me.

I have seen first-hand that the range query is MUCH faster than the
wildcard query.

Thanks,
Shawn

Re: Optimizing fq query performance

John Davis
I did a few tests with our solr-7.4.0 instance, and field:* vs field:[* TO
*] doesn't seem materially different compared to has_field:1. If no one
knows why Lucene would optimize one but not the other, it's not clear that
it actually optimizes either one.

Re: Optimizing fq query performance

Shawn Heisey-2
On 4/17/2019 11:49 PM, John Davis wrote:
> I did a few tests with our solr-7.4.0 instance, and field:* vs field:[* TO
> *] doesn't seem materially different compared to has_field:1. If no one
> knows why Lucene would optimize one but not the other, it's not clear that
> it actually optimizes either one.

Queries using a boolean field will be even faster than the all-inclusive
range query ... but they require work at index time to function
properly.  If you can do it this way, that's definitely preferred.  I
was providing you with something that would work even without the
separate boolean field.

If the cardinality of the field you're searching is very low (only a few
possible values for that field across the whole index) then a wildcard
query can be fast.  It is only when the cardinality is high that the
wildcard query is slow.  Still, it is better to use the range query for
determining whether the field exists, unless you have a separate boolean
field for that purpose, in which case the boolean query will be a little
bit faster.
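
A minimal sketch of the boolean-field approach, assuming the schema defines a
has_field1 boolean (names and endpoint are placeholders):

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # assumed endpoint

# Index time: set the boolean alongside the real field.
doc = {"id": "1", "field1": 42, "field2": "value", "has_field1": True}
requests.post(f"{SOLR}/update?commit=true", json=[doc])

# Query time: the existence check is now a single cheap term query.
params = {"q": "*:*", "fq": ["has_field1:true", "field2:value"], "rows": 0}
resp = requests.get(f"{SOLR}/select", params=params)
print(resp.json()["response"]["numFound"])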

Thanks,
Shawn

Re: Optimizing fq query performance

John Davis
FYI
https://issues.apache.org/jira/browse/SOLR-11437
https://issues.apache.org/jira/browse/SOLR-12488
