fq versus q

fq versus q

Esther Goldbraich
Hi,

We are comparing the performance of fq versus q for queries that are
actually filters and should not be cached.
For some queries we see strange behavior where q performs 5-10x better
than fq. Why is that?

Example 1:
q=maildate:{DATE1 TO DATE2}
compared to
fq={!cache=false}maildate:{DATE1 TO DATE2}
with the following parameters in both cases:
sort=maildate_sort* desc
rows=50
start=0
group=true
group.query=some query (without dates)
group.query=*:*
group.sort=maildate_sort desc
additional fqs

Schema:
<field name="maildate" stored="true" indexed="true" type="tdate"/>
<field name="maildate_sort" stored="false" indexed="false" type="tdate"
docValues="true"/>
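
To make the comparison concrete, the two request variants can be written out as parameter lists. This is only a sketch: the host, the collection name (`mail`), the date bounds and the `subject:report` group query are placeholders that do not come from the post, the unspecified "additional fqs" are left out, and the asterisk after `maildate_sort` is dropped on the assumption that it is a stray character. The `q=*:*` in the fq variant is taken from a later message in this thread.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; host and collection name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/mail/select"

# Placeholder range standing in for DATE1/DATE2 (curly braces = exclusive bounds).
DATE_RANGE = "maildate:{2015-01-01T00:00:00Z TO 2015-02-01T00:00:00Z}"

# Parameters shared by both variants, as listed in the post.
COMMON = [
    ("sort", "maildate_sort desc"),
    ("rows", "50"),
    ("start", "0"),
    ("group", "true"),
    ("group.query", "subject:report"),  # stands in for "some query (without dates)"
    ("group.query", "*:*"),
    ("group.sort", "maildate_sort desc"),
]


def q_variant():
    """Date range as the main query."""
    return [("q", DATE_RANGE)] + COMMON


def fq_variant():
    """Date range as an uncached filter; the main query matches all documents."""
    return [("q", "*:*"), ("fq", "{!cache=false}" + DATE_RANGE)] + COMMON


def to_url(params):
    """Encode a parameter list into a request URL."""
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(to_url(q_variant()))
    print(to_url(fq_variant()))
```

Keeping the shared parameters in one list makes it harder for the two variants to drift apart between benchmark runs.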

Thank you,
Esther
-------------------------------------------------
Esther Goldbraich
Social Technologies & Analytics - IBM Haifa Research Lab
Phone: +972-4-8281059

Re: fq versus q

Esther Goldbraich
Some clarification:
I would like to understand how Solr processes fq (without cache) versus q
when sort and group are required.





Re: fq versus q

Shawn Heisey-2
In reply to this post by Esther Goldbraich
On 6/24/2015 5:28 AM, Esther Goldbraich wrote:
> We are comparing the performance of fq versus q for queries that are
> actually filters and should not be cached.
> In part of queries we see strange behavior where q performs 5-10x better
> than fq. The question is why?
>
> An example1:
> q=maildate:{DATE1 to DATE2} COMPARED TO fq={!cache=false}maildate:{DATE1
> to DATE2}
> sort=maildate_sort* desc

<snip>

> <field name="maildate" stored="true" indexed="true" type="tdate"/>
> <field name="maildate_sort" stored="false" indexed="false" type="tdate"
> docValues="true"/>

For simplicity, I would probably just use one field for that, rather
than a separate sort field.  The disk space required would probably be
the same either way, but your interaction with the index will not be as
complex.  There's nothing wrong with doing it the way you have, though.

I'm not at all an expert, but I've been a member of this community for a
long time.  Here's my guess about why your query is faster in the q
parameter than a non-cached filter:

The result of a standard query is the stored fields from the top N
documents, where N is the value in the rows parameter.  The default for
N is typically set to 10, and for most people will normally be 200 or less.

The result of a filter is very different -- it is a bitset of all the
documents in your entire index, with binary 0 for documents that don't
match the filter and binary 1 for documents that do match.

If your index has 100 million documents, every single one of those 100
million documents must be checked against the filter query to produce a
filter bitset, but when it's in the q parameter, shortcuts can be taken
which will get the top N results quickly.
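
A toy model of the two result shapes described above, not Solr's or Lucene's actual code. The function names and the `matches` predicate are made up for illustration. Note that in this toy both paths still scan every document; it only shows that a filter yields one bit per document in the index while a query keeps N document ids, and it says nothing about which shortcuts a real engine takes.

```python
import heapq


def build_filter_bitset(max_doc, matches):
    """Filter-style result: one bit per document in the whole index."""
    bits = bytearray((max_doc + 7) // 8)
    for doc_id in range(max_doc):
        if matches(doc_id):
            bits[doc_id // 8] |= 1 << (doc_id % 8)
    return bits


def is_set(bits, doc_id):
    """Check whether a document's bit is set in the bitset."""
    return bool(bits[doc_id // 8] & (1 << (doc_id % 8)))


def top_n(max_doc, matches, sort_key, n):
    """Query-style result: only the best n matching doc ids are kept."""
    candidates = (d for d in range(max_doc) if matches(d))
    return heapq.nlargest(n, candidates, key=sort_key)


if __name__ == "__main__":
    max_doc = 1000
    in_range = lambda d: d % 3 == 0  # stand-in for the date range
    bits = build_filter_bitset(max_doc, in_range)
    print(len(bits), "bytes in the bitset")
    print(top_n(max_doc, in_range, sort_key=lambda d: d, n=5))
```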

The filterCache levels the playing field when filters are re-used.  If a
requested filter is already in the cache, it can be retrieved and
applied to a result VERY quickly.
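
The effect of re-use can be sketched with a small stand-in cache. The class below is hypothetical and only mimics the idea of mapping a filter string to its document set; it is not Solr's filterCache implementation, and the `cache` flag merely imitates what `{!cache=false}` asks for.

```python
class FilterCache:
    """Toy stand-in for a filter cache: maps a filter string to its doc set."""

    def __init__(self, compute):
        self._compute = compute   # expensive function: filter string -> set of doc ids
        self._entries = {}
        self.computations = 0     # how many times the filter had to be built

    def get(self, fq, cache=True):
        if cache and fq in self._entries:
            return self._entries[fq]      # cheap path: reuse the stored result
        self.computations += 1
        result = self._compute(fq)
        if cache:
            self._entries[fq] = result
        return result


if __name__ == "__main__":
    filters = FilterCache(lambda fq: frozenset(range(0, 1000, 7)))  # pretend doc set
    filters.get("type:mail")
    filters.get("type:mail")                           # served from the cache
    filters.get("maildate:[A TO B]", cache=False)
    filters.get("maildate:[A TO B]", cache=False)      # rebuilt every time
    print(filters.computations)
```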

You have turned off the caching for your filter.  I'm not sure why you
did this, but you know your use case a lot better than I do.  If it were
me, I would use filter queries and do everything possible to re-use the
same filters, and I would cache them.

Thanks,
Shawn


Re: fq versus q

Shai Erera
Thanks Shawn,

What's Solr's equivalent of ConstantScoreQuery? I.e., what if you want to
run a query that does not score, but only filters? The rationale behind
using a non-cached 'fq' was just that.

Shai


Re: fq versus q

jim ferenczi
In reply to this post by Esther Goldbraich
> In part of queries we see strange behavior where q performs 5-10x better
> than fq. The question is why?
Are you sure that the query result cache is disabled?


Re: fq versus q

Jack Krupansky-3
In reply to this post by Shai Erera
Yonik added syntax to request a constant score query in Solr with the ^=
operator.

For example: +color:blue^=1 text:shoes

See:
https://issues.apache.org/jira/browse/SOLR-7218
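
A small helper for assembling such a query string might look like the sketch below. The function names are hypothetical; only the `^=` syntax itself comes from the example above, and it requires a Solr version that includes SOLR-7218.

```python
def constant_score(clause, score=1):
    """Append the ^= constant-score operator to a single query clause."""
    return f"{clause}^={score}"


def build_query(required_filters, scored_clauses):
    """Combine required constant-score clauses with normally scored clauses."""
    parts = ["+" + constant_score(f) for f in required_filters]
    parts += scored_clauses
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["color:blue"], ["text:shoes"]))
```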

-- Jack Krupansky


Re: fq versus q

Shai Erera
Ah thanks. I see it was added in 5.1. Is there any other way prior to that
(like 4.7)?

If not, I guess the only option is to not use fq if we don't intend to
cache it, and on 5.1 use the ^= syntax.

Shai


Re: fq versus q

Malcolm Upayavira Holmes
Are you wanting to do no scoring at all, or just have a portion of the
query not contribute to the score?

If you don't want scoring at all, just sort by another field. If you
don't have a field, I just tried "&sort=1 desc", and it worked! This
should, if I'm right, pull documents out of the index in index order.

Upayavira


Re: fq versus q

Erick Erickson
Tell us a bit more about your test setup. 1 or 2 tests
don't mean much. For instance, if the fq query has to
load the low-level caches from disk, and then the q-only
query is run and doesn't, that could skew the results.
Or if somehow you're hitting the queryResultCache. Or....

Frankly I'd disable all my caches for running tests like
this, and be sure to mix-n-match the tests so I wasn't
getting bitten by caches.

And please tell us what the actual numbers are. 5-10X
doesn't mean much at all if it's 25 ms vs. 5 ms. It means
a lot (and something's very wrong) if it means
200 ms vs. 1,000 ms.
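
One way to follow this advice is a small harness that interleaves the variants in random order and reports absolute medians rather than a ratio. This is a sketch under assumptions: `benchmark` and the fake query callables are invented for illustration, and real runs would replace them with calls to your own Solr client.

```python
import random
import statistics
import time


def benchmark(variants, rounds=20, seed=0):
    """Time each variant `rounds` times in shuffled order.

    `variants` maps a label to a zero-argument callable that runs one query.
    Returns the median runtime in milliseconds per label.
    """
    rng = random.Random(seed)
    timings = {name: [] for name in variants}
    for _ in range(rounds):
        order = list(variants)
        rng.shuffle(order)  # mix the order so no variant always runs first
        for name in order:
            start = time.perf_counter()
            variants[name]()
            timings[name].append((time.perf_counter() - start) * 1000.0)
    return {name: statistics.median(ms) for name, ms in timings.items()}


if __name__ == "__main__":
    # Stand-ins for real HTTP requests to Solr.
    fake = {
        "q": lambda: time.sleep(0.001),
        "fq": lambda: time.sleep(0.003),
    }
    for name, median_ms in benchmark(fake, rounds=5).items():
        print(f"{name}: median {median_ms:.1f} ms")
```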

Best,
Erick


Re: fq versus q

Yonik Seeley
In reply to this post by Esther Goldbraich
Why is cache=false set for the filter?
Grouping uses a 2 pass algorithm by default, so that means that the
filter will need to be generated twice (I think) if caching is turned
off.
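
A toy model of that suspicion, assuming (as the message itself only tentatively suggests) that each of the two grouping passes asks for the filter's document set. All names are hypothetical and nothing here reflects Solr's real grouping code.

```python
def two_pass_grouping(get_filter, fq):
    """Toy model: each grouping pass requests the filter's doc set."""
    first_pass_docs = get_filter(fq)    # pass 1: find the top groups
    second_pass_docs = get_filter(fq)   # pass 2: collect docs for those groups
    return first_pass_docs == second_pass_docs


def count_filter_builds(use_cache):
    """Return how many times the filter is built during one grouped request."""
    builds = 0
    cache = {}

    def get_filter(fq):
        nonlocal builds
        if use_cache and fq in cache:
            return cache[fq]
        builds += 1
        docs = frozenset(range(0, 100, 2))  # pretend doc set
        if use_cache:
            cache[fq] = docs
        return docs

    two_pass_grouping(get_filter, "maildate:{DATE1 TO DATE2}")
    return builds


if __name__ == "__main__":
    print("builds without cache:", count_filter_builds(use_cache=False))
    print("builds with cache:", count_filter_builds(use_cache=True))
```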

Also, when you try to use the "fq" version, what are you using for the
main query?

-Yonik



Re: fq versus q

Esther Goldbraich
cache=false is set because the use case requires distinct time ranges, so
there is no reuse.
When using fq, q is set to *:*.
Are there any alternatives to the default grouping algorithm?
If not, is there a way to reuse the filter results between the 2 passes?

Thank you,
Esther




Re: fq versus q

Esther Goldbraich
In reply to this post by Erick Erickson
Thank you all for the collaborative thinking!

Ran additional benchmarks as proposed. Some results:

All Solr caches are enabled (queryResultCache hit ratio = 0.02):

 
                    q      fq {!cache=false}   delta
original query      28     295                 267
w/o grouping        58     325                 267
w/o sort on date    28     293                 265

All Solr caches are disabled (except the built-in Lucene field cache):

 
                    q      fq {!cache=false}   delta
original query      4113   4381                268
w/o grouping        131    407                 276
w/o sort on date    4217   4400                183

*median runtime in ms

As you can see, disabling grouping and/or sorting does not change the gap
much. That is, the difference between running with 'fq={!cache=false}' and
with 'q' stays in the same range (183-276 ms), and 'fq' is slower in
all cases.
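
The arithmetic behind this observation can be checked directly from the reported medians. The numbers below are copied from the two result tables in this message; the layout of the dictionary and the function name are just one way to organize them. The absolute gap stays between 183 and 276 ms, while the ratio swings from roughly 1.04x to roughly 10.5x depending on how large the baseline is.

```python
# Median runtimes in ms from the two tables: (q, fq with cache=false).
RESULTS = {
    "caches enabled": {
        "original query": (28, 295),
        "w/o grouping": (58, 325),
        "w/o sort on date": (28, 293),
    },
    "caches disabled": {
        "original query": (4113, 4381),
        "w/o grouping": (131, 407),
        "w/o sort on date": (4217, 4400),
    },
}


def summarize(results):
    """Return (setup, case, delta in ms, fq/q ratio) for every measurement."""
    rows = []
    for setup, cases in results.items():
        for case, (q_ms, fq_ms) in cases.items():
            rows.append((setup, case, fq_ms - q_ms, fq_ms / q_ms))
    return rows


if __name__ == "__main__":
    for setup, case, delta, ratio in summarize(RESULTS):
        print(f"{setup:16} {case:18} delta={delta:4d} ms  ratio={ratio:5.2f}x")
```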

Is it correct to assume then that the performance difference comes from
computing the filter (traversing the posting lists and building the
bitset)?
Does it also mean that not caching the filter does not affect grouping?
I.e. perhaps the second pass of grouping uses the already computed filter,
and does not attempt to fetch it from the cache?

As a general rule of thumb, at least in our case, would you please comment
on the following assumptions/conclusions (note, all assuming that we don't
want to cache filters, and the 'fq' part is only used to avoid scoring):

1) If the query sorts by any field other than score (e.g. date), we can
put the 'fq' part in 'q'. Scoring won't be done, and we won't pay the cost
of building the filter and then discarding it when the query completes.

2) In fact, if we don't intend to cache the filter, we might as well use
only 'q'. At least, that holds on our dataset (this may well *not* be a
general statement).

3) If we sort by relevance but want to avoid scoring the 'filter'
clauses, is there anything we can do on 4.7?
3.1) The ^= operator seems to be exactly what we need, but it is only
available from 5.1.
3.2) Adding the filter clauses to the query with boost 0 will still compute
their score, only they won't affect the overall document score, correct?

4) A more general question -- with the addition of ^= to query clauses in
5.1 (resolved to ConstantScoreQuery downstream), what is the use case for
using fq with {!cache=false}? As we understand it, users who use this want
to compute a filter but not cache it. As we see, there is some added cost to
building a filter, so if you pay this cost over and over, would it not be
better to just use ^=?

Best regards,
Esther




From:
Erick Erickson <[hidden email]>
To:
[hidden email]
Date:
25/06/2015 02:38 AM
Subject:
Re: fq versus q



Tell us a bit more about your test setup. 1 or 2 tests
don't mean much. For instance, if the fq query has to
load the low-level caches from disk then the q-only
query is run and doesn't that could skew the results.
Or if somehow you're hitting the queryResultCache. Or....

Frankly I'd disable all my caches for running tests like
this, and be sure to mix-n-match the tests so I wasn't
getting bitten by caches.

And please tell us what the actual numbers are. 5-10X
doesn't mean much at all if it's 25ms .vs. 5 ms. It means
a lot (and something's very wrong) if it means
200ms .vs. 1,000ms.

Best,
Erick

On Wed, Jun 24, 2015 at 5:30 PM, Upayavira <[hidden email]> wrote:

> Are you wanting to do no scoring at all, or just have a portion of the
> query not contribute to the score?
>
> If you don't want scoring at all, just sort by another field. If you
> don't have a field, I just tried "&sort=1 desc", and it worked! This
> should, if I'm right, pull documents out of the index in index order.
>
> Upayavira
>




Re: fq versus q

Shai Erera
The tables came across corrupted; here they are (times in ms):

Caches enabled:

                  q     fq     delta
original query    28    295    267
w/o grouping      58    325    267
w/o sort on date  28    293    265

Caches disabled:

                  q     fq     delta
original query    4113  4381   268
w/o grouping      131   407    276
w/o sort on date  4217  4400   183
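
To make the pattern in these numbers easier to see, here is a small sketch that recomputes the absolute delta and the fq/q ratio for each row. The timings are copied from the tables above; the function and variable names are illustrative only. The ratio swings widely between the two setups, while the absolute delta stays in a much narrower band.

```python
# Median runtimes in ms, copied from the tables above: (q, fq with cache=false).
timings = {
    "caches enabled": {
        "original query": (28, 295),
        "w/o grouping": (58, 325),
        "w/o sort on date": (28, 293),
    },
    "caches disabled": {
        "original query": (4113, 4381),
        "w/o grouping": (131, 407),
        "w/o sort on date": (4217, 4400),
    },
}


def summarize(q_ms, fq_ms):
    """Return (absolute delta in ms, fq/q ratio rounded to two places)."""
    return fq_ms - q_ms, round(fq_ms / q_ms, 2)


for setup, rows in timings.items():
    print(setup)
    for name, (q_ms, fq_ms) in rows.items():
        delta, ratio = summarize(q_ms, fq_ms)
        print(f"  {name:<18} delta={delta:>4} ms  fq/q={ratio}x")
```

A "5-10x" figure therefore only describes the cached case; with caches disabled the same delta is a few percent of the total runtime.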

Shai

On Thu, Jun 25, 2015 at 2:04 PM, Esther Goldbraich <[hidden email]>
wrote:

> Thank you all for collaborative thinking!
>
> Ran additional benchmarks as proposed. Some results:
>
> All solr caches are enabled (queryResultCache hit ratio = 0.02):
>
>                   q     fq {!cache=false}   delta
> original query    28    295                 267
> w/o grouping      58    325                 267
> w/o sort on date  28    293                 265
>
> All solr caches are disabled (except built in lucene field cache):
>
>                   q     fq {!cache=false}   delta
> original query    4113  4381                268
> w/o grouping      131   407                 276
> w/o sort on date  4217  4400                183
>
> *median runtime in ms
>
> As you can see, disabling grouping and/or sorting does not affect the
> results much. That is, the difference between running with
> 'fq{!cache=false}' or with 'q' is the same, while 'fq' performs slower in
> all cases.
>
> Is it correct to assume then that the performance difference comes from
> computing the filter (traversing the posting lists and building the
> bitset)?
> Does it also mean that not caching the filter does not affect grouping?
> I.e. perhaps the second pass of grouping uses the already computed filter,
> and does not attempt to fetch it from the cache?
>
> As a general rule of thumb, at least in our case, would you please comment
> on the following assumptions/conclusions (note, all assuming that we don't
> want to cache filters, and the 'fq' part is only used to avoid scoring):
>
> 1) If the query sorts by any other field than score (e.g. date), we can
> put the 'fq' part in 'q'. Scoring won't be done, and we won't pay the cost
> of building the filter, and then discarding it when the query completes.
>
> 2) In fact, if we don't intend to cache the filter, we might as well just
> use only 'q'. At least, on our dataset (this may definitely *not* be a
> general statement).
>
> 3) If we sort by relevance, but want to avoid scoring of the 'filter'
> clauses, is there anything we can do on 4.7?
> 3.1) The ^= operator is only available in 5.1, which seems exactly what we
> need.
> 3.2) Adding the filter clauses to the query w/ boost 0 will still compute
> their score, only they won't affect the overall document score correct?
>
> 4) A more general question -- with the addition of ^= to query clauses in
> 5.1 (resolved to ConstantScoreQuery down stream), what is the use case for
> using fq w/ !cache=false? As we understand it, users who use this want to
> compute a filter but not cache it. As we see, there is some added cost to
> building a filter, so if you pay this cost over and over, would it not be
> better to just use ^=?
>
> Best regards,
> Esther

Re: fq versus q

Erick Erickson
Side note on dates and fqs. If you're using NOW in your date
expressions you may be able to re-use fqs by using "date math",
see:
https://lucidworks.com/blog/date-math-now-and-filter-queries/
Of course this may not be applicable in your situation...
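
As a rough illustration of why rounding helps, the sketch below emulates in plain Python how a range ending at NOW resolves to concrete instants. It assumes a 7-day window on the maildate field; the window size, the function name, and the two example filter strings in the docstring are assumptions for illustration, not taken from this thread. Two requests one second apart resolve to different ranges without rounding, so a cached filter could not be reused, but to the same range when NOW is rounded to the day.

```python
from datetime import datetime, timedelta, timezone


def resolve_range(now, round_to_day):
    """Emulate how a 7-day range ending at NOW resolves to concrete instants.

    round_to_day=False models  maildate:[NOW-7DAYS TO NOW]
    round_to_day=True  models  maildate:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]
    """
    if round_to_day:
        day_start = now.replace(hour=0, minute=0, second=0, microsecond=0)
        return day_start - timedelta(days=7), day_start + timedelta(days=1)
    return now - timedelta(days=7), now


first = datetime(2015, 6, 25, 12, 27, 0, tzinfo=timezone.utc)
second = first + timedelta(seconds=1)

# Unrounded: the resolved range differs from one request to the next.
print(resolve_range(first, False) == resolve_range(second, False))
# Rounded to the day: both requests resolve to the same range.
print(resolve_range(first, True) == resolve_range(second, True))
```

The rounded form matches slightly more than the last 7 days exactly, which is the usual trade-off for a reusable filter.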

FWIW,
Erick

On Thu, Jun 25, 2015 at 8:03 AM, Shai Erera <[hidden email]> wrote:

> The tables came across corrupt, here they are (times in ms):
>
> Caches enabled:
>
>                   q     fq     delta
> original query    28    295    267
> w/o grouping      58    325    267
> w/o sort on date  28    293    265
>
> Caches disabled:
>
>                   q     fq     delta
> original query    4113  4381   268
> w/o grouping      131   407    276
> w/o sort on date  4217  4400   183
>
> Shai

Re: fq versus q

Esther Goldbraich
Thank you Erick.
This solution fits some of our queries, and we will adopt it for those. Yet we
have use cases in which the results cannot be cached.

Everyone,
What do you think about our assumptions and conclusions?

 As a general rule of thumb, at least in our case, would you please comment
 on the following assumptions/conclusions (note, all assuming that we don't
 want to cache filters, and the 'fq' part is only used to avoid scoring):

 1) If the query sorts by any other field than score (e.g. date), we can
 put the 'fq' part in 'q'. Scoring won't be done, and we won't pay the cost
 of building the filter, and then discarding it when the query completes.

 2) In fact, if we don't intend to cache the filter, we might as well just
 use only 'q'. At least, on our dataset (this may definitely *not* be a
 general statement).

 3) If we sort by relevance, but want to avoid scoring of the 'filter'
 clauses, is there anything we can do on 4.7?
 3.1) The ^= operator is only available in 5.1, which seems exactly what we
 need.
 3.2) Adding the filter clauses to the query w/ boost 0 will still compute
 their score, only they won't affect the overall document score correct?

 4) A more general question -- with the addition of ^= to query clauses in
 5.1 (resolved to ConstantScoreQuery down stream), what is the use case for
 using fq w/ !cache=false? As we understand it, users who use this want to
 compute a filter but not cache it. As we see, there is some added cost to
 building a filter, so if you pay this cost over and over, would it not be
 better to just use ^=?
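
For concreteness, the three request shapes being compared can be sketched as below. This only builds parameter strings and makes no claim about how Solr executes them. DATE1 and DATE2 are left as placeholders, as in the original example; the scoring clause text:shoes is borrowed from the SOLR-7218 example and is hypothetical here, as are the function and variant names. The ^= form assumes Solr 5.1 or later.

```python
from urllib.parse import urlencode

# Placeholders kept as in the thread; real requests need concrete dates.
DATE_FILTER = "maildate:{DATE1 TO DATE2}"
# Hypothetical scoring clause, borrowed from the SOLR-7218 example.
USER_QUERY = "text:shoes"


def build_params(variant):
    """Return Solr request parameters for one way of applying the date filter."""
    common = {"rows": 50, "start": 0}
    if variant == "fq_uncached":
        # Date range applied as a separate, non-cached filter query.
        return {**common, "q": USER_QUERY, "fq": "{!cache=false}" + DATE_FILTER}
    if variant == "q_sorted_by_field":
        # Assumption 1: range moved into q, results sorted by a field, not score.
        return {**common, "q": f"+{USER_QUERY} +{DATE_FILTER}",
                "sort": "maildate_sort desc"}
    if variant == "q_constant_score":
        # Assumption 3.1: ^= gives the range clause a constant score.
        return {**common, "q": f"+{USER_QUERY} +{DATE_FILTER}^=1"}
    raise ValueError(f"unknown variant: {variant}")


for name in ("fq_uncached", "q_sorted_by_field", "q_constant_score"):
    print(name, "->", urlencode(build_params(name)))
```

Because every matching document gets the same constant from the ^= clause, the relative order by relevance should depend only on the scoring clause.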

Have a good day,
Esther



From:
Erick Erickson <[hidden email]>
To:
[hidden email]
Date:
25/06/2015 03:27 PM
Subject:
Re: fq versus q



Side note on dates and fqs. If you're using NOW in your date
expressions you may be able to re-use fqs by using "date math",
see:
https://lucidworks.com/blog/date-math-now-and-filter-queries/
Of course this may not be applicable in your situation...

FWIW,
Erick
