Efficient filtering advice

Efficient filtering advice

Eran S.
Hi,

I have a need to filter my queries using a rather large set of terms (it can
be 10K or even 50K).
All these terms are sure to exist in the index, so the number of results can
be about the same as the number of terms in the filter.
The terms are numbers, but they are not consecutive and are drawn from a large
set of possible values (so range queries are probably not an option for me).
The index itself is about 1M docs, and running even a simple query with such
a large filter takes a lot of time, even if the number of results is only a
few hundred docs.
The speed seems to be affected by the length of the filter even when the
number of results stays more or less the same. That is logical, but not with
as large a performance loss as I'm seeing: running the query with a 10K-term
filter takes an average of 1s 187ms for 600 results, while running it with a
50K-term filter takes an average of 5s 207ms for 1000 results.

Currently I'm using a QueryFilter around a BooleanQuery in which I "OR" the
different terms together.
I also can't use a cached filter effectively, since the terms to filter on
change with almost every query.
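
A minimal sketch of that setup (Lucene 2.4-era API; the field name "docKey"
and the helper shape are illustrative, not from the original post):

    // Uses org.apache.lucene.search.* and org.apache.lucene.index.Term.
    Filter buildOrFilter(Collection<String> ids) {
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE); // default cap is 1024
        BooleanQuery orQuery = new BooleanQuery();
        for (String id : ids) {
            orQuery.add(new TermQuery(new Term("docKey", id)),
                        BooleanClause.Occur.SHOULD);
        }
        return new QueryFilter(orQuery);
    }
    // Then: searcher.search(query, buildOrFilter(ids), collector);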

I was wondering if there's a better way to filter my queries so they won't
take a few seconds to run.

Thanks in advance for any advice,
Eran.

Re: Efficient filtering advice

Paul Elschot
Try a MultiTermQueryWrapperFilter instead of the QueryFilter.
I'd expect a modest gain in performance.

In case it is possible to form a few groups of terms that are reused,
it could even be more efficient to also use a CachingWrapperFilter
for each of these groups.
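
For illustration, caching one such group could look like this (a sketch;
groupOrQuery stands for the BooleanQuery of a recurring group of terms):

    // The wrapped filter's bit set is computed once per reader and reused.
    Filter groupFilter = new CachingWrapperFilter(new QueryFilter(groupOrQuery));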

Regards,
Paul Elschot

Re: Efficient filtering advice

Erick Erickson
In reply to this post by Eran S.
Hmmm, I'm not very clear here. Are you saying that you effectively
form 10-50K filter clauses and OR them all together? That would be
consistent with the 50K case taking approx. 5X as long as the 10K
case.....

Do you know where in your code the time is being spent? That'd
be a big help in suggesting alternatives. If I'm on the right track,
I'd expect the time to be spent assembling the filters.....

Not much help here, but I'm having trouble wrapping my head
around this...

Best
Erick

RE: Efficient filtering advice

Uwe Schindler
In reply to this post by Eran S.
Maybe this helps you, but read the docs; it works only with
single-valued fields:
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheTermsFilter.html
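
A short usage sketch of that class (2.9 API; the field name and values are
illustrative):

    // "docKey" must be indexed as a single, untokenized value per document.
    Filter filter = new FieldCacheTermsFilter("docKey", "17", "42", "90125");
    searcher.search(query, filter, collector);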

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


Re: Efficient filtering advice

Eran S.
Thanks for the tips.

I'm still using version 2.4, so I can't use MultiTermQueryWrapperFilter, but
I'll definitely try to group the terms that don't change in order to cache
them.
How can I join several such filters together?

Using FieldCacheTermsFilter sounds promising. Fortunately it is a
single-valued field (our unique doc id).

I'll seriously consider moving to 2.9.1 to try it out and see if I can get
some real gain from it, or maybe from using TermsFilter from contrib.


Re: Efficient filtering advice

Eran S.
In reply to this post by Erick Erickson
I don't think it should take 5X longer, since the number of results is only
about 2X larger (and much smaller than the number of terms in the filter),
but maybe I'm wrong here since I'm not familiar with the filter internals.

Unfortunately, the time to construct the filter is mere milliseconds; almost
all of the time (~5 secs) is spent in the search method.
I'm using a collector to retrieve all the results (and fetch a value for
some fields), but without the filter this also takes less than a second for
the same number of results.

Re: Efficient filtering advice

Paul Elschot
In reply to this post by Eran S.
On Sunday 22 November 2009 17:23:53, Eran Sevi wrote:

> How can I join several such filters together?

There are various ways. OpenBitSet and OpenBitSetDISI can do this,
and there's also BooleanFilter and ChainedFilter in contrib.
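
For example, OR-combining the doc id sets of several filters with
OpenBitSetDISI might look like this (a sketch; the reader and the group
filters are illustrative):

    // OR several filters' DocIdSets into one OpenBitSet; the result is
    // itself a DocIdSet and can back a custom Filter.
    OpenBitSetDISI bits = new OpenBitSetDISI(reader.maxDoc());
    bits.inPlaceOr(groupFilterA.getDocIdSet(reader).iterator());
    bits.inPlaceOr(groupFilterB.getDocIdSet(reader).iterator());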



Regards,
Paul Elschot
 

Re: Efficient filtering advice

Erick Erickson
In reply to this post by Eran S.
Hmmm, could you show us what you do in your collector? Because
one of the gotchas about a collector is loading the documents in
the inner loop. Quick test: comment out whatever you're doing in
the underlying collector loop, and see if there's *any* noticeable
difference in speed. That'll tell you whether your problems
arise from the filter construction/search or what you're doing
in the collector....
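
As a sketch of that quick test (2.4-era HitCollector API):

    // A collector that does no per-hit work, to separate the
    // filter/search cost from the collection cost.
    searcher.search(query, filter, new HitCollector() {
        public void collect(int doc, float score) {
            // intentionally empty
        }
    });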

Best
Erick

Re: Efficient filtering advice

Eran S.
After commenting out the collector logic, the time is still more or less the
same.
Anyway, since collecting the documents without the filter is very fast, it's
probably something in the filter itself.

I don't know how the filter (or boolean query) works internally, but with 10K
or 50K clauses it apparently does something that takes a lot of time. It
might be the inner data structures that are used, or maybe just iterating
over so many terms takes time.

I'll keep trying to pinpoint the exact bottleneck, or maybe the new filters
in 2.9.1 will help.

Re: Efficient filtering advice

Erick Erickson
Now I'm really confused, which usually means I'm making some
assumptions that aren't true. So here they are...

1> You're talking about Filters that contain BitSets, right? Not some other
     kind of filter.
2> When you create your 10-50K filters, you wind up with a single filter
     by combining them all with one of the BitSet operators, right? And
     that *single* filter is the one you send to your query...

If I'm off base here, could you post a reasonable extract of your filter
construction code, and how you use them to search? Because I don't
think we're all talking about the same thing here.....

HTH
Erick@ThisMakesNoSenseToMe<G>...

Re: Efficient filtering advice

Eran S.
Erick,

Maybe I didn't make myself clear enough.
I'm talking about high-level filters used when searching.

I construct a very big BooleanQuery and add 50K clauses to it (I removed the
limit on max clauses).
Each clause is a TermQuery on the same field.
I don't know the internal doc ids that I want; I only know the value stored
in the field. This value is an external key of the document and is guaranteed
to be unique.

I pass this query to QueryFilter and run a search using
search(query, filter, collector) (reminder: I'm using version 2.4).

If I missed some other magic construction method that achieves the same goal,
it would be great to hear about it.

Thanks for your help so far,
Eran.

Re: Efficient filtering advice

Erick Erickson
Oh my goodness yes. No wonder nothing I suggested made any
difference <G>. Ignore everything I've written....

OK, here's something to try, and it goes back to a Filter. Rather than
make this enormous bunch of ORs, try creating a Filter. Use TermDocs
to run through your list of IDs assembling a Filter (the one with the
 BitSet) and send *that* through to your query rather than
all your clauses. Running through the terms is much faster than I thought.

The only issue is that TermDocs is ordered by doc IDs, so you'll
have to use, say, a Map to see if the unique IDs you're after
correspond to your desired IDs.... Or who knows? Just using
TermDocs.seek(Term) for your list might do the trick.... Term.createTerm()
will help a bit here....

You can get a feel for this by just spinning through your list of
unique IDs using TermDocs.seek(Term) to see how long assembling
the filter would take. Using the Filter in a query doesn't cost
much at all....
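
A sketch of that construction (2.4-era API; "docKey" and the helper shape are
illustrative):

    // Build the filter's bit set straight from the postings.
    OpenBitSet buildBits(IndexReader reader, Collection<String> ids)
            throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        Term proto = new Term("docKey", "");
        for (String id : ids) {
            termDocs.seek(proto.createTerm(id)); // createTerm() reuses the field
            while (termDocs.next()) {
                bits.set(termDocs.doc());
            }
        }
        termDocs.close();
        return bits; // wrap in a Filter whose getDocIdSet(reader) returns it
    }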

Best
Erick


Re: Efficient filtering advice

Eran S.
I've taken TermsFilter from contrib, which does exactly that, and indeed the
time was cut in half, which starts to be reasonable for my needs.
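
For reference, a usage sketch of the contrib TermsFilter (field name
illustrative):

    // org.apache.lucene.search.TermsFilter, from contrib/queries.
    TermsFilter filter = new TermsFilter();
    for (String id : ids) {
        filter.addTerm(new Term("docKey", id));
    }
    searcher.search(query, filter, collector);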

I've looked into the regular QueryFilter, and what I write here might not be
the complete picture:
I found out that most of the time is spent on scoring the documents in the
filter!
It looks like QueryFilter (which caches a QueryWrapperFilter) actually runs
the query passed to it using a searcher and a regular hit collector that just
discards the score and sets the doc id bit in the filter. The score is still
being calculated (by BooleanScorer in my case), which takes a lot of time.

TermsFilter solves only part of my problem, since I actually need a complex
boolean query (there are other terms which are ANDed to the long OR list).
Maybe I can split the filter into several single-term filters (perhaps using
BooleanFilter, also from contrib).
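
A sketch of that combination with contrib's BooleanFilter (the two filter
variables are illustrative):

    // Each clause contributes only a doc id set; no scoring happens here.
    BooleanFilter combined = new BooleanFilter();
    combined.add(new FilterClause(bigTermsFilter, BooleanClause.Occur.MUST));
    combined.add(new FilterClause(otherTermFilter, BooleanClause.Occur.MUST));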

The fact is that seeking on the term docs is much faster than performing the
equivalent search with a boolean query, although one might think the boolean
query would do the same term-docs seeking internally, so the two should
perform more or less the same when the query is on only one field.

I know scoring and constructing the search results are closely related to
each other so the real question is:
Is there any way to run a complex query filter WITHOUT actually scoring the
hits?

Eran

Re: Efficient filtering advice

Erick Erickson
See: http://issues.apache.org/jira/browse/LUCENE-1427

Short form: this is fixed, but not until 2.9.

If you don't want to upgrade, you could always leave the Filter off
your initial query and have your Collector ensure that each collected
doc is in the Filter......
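
A sketch of that workaround (2.4-era API; buildBits is a hypothetical helper
that fills an OpenBitSet, e.g. via TermDocs as sketched earlier in the
thread):

    final OpenBitSet allowed = buildBits(reader, ids); // hypothetical helper
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            if (allowed.get(doc)) {
                // record the hit
            }
        }
    });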

FWIW
Erick



On Mon, Nov 23, 2009 at 11:01 AM, Eran Sevi <[hidden email]> wrote:

> I've taken TermsFilter from contrib which does exactly that and indeed the
> speed was reduced to half, which starts to be reasonable for my needs.
>
> I've researched the regular QueryFilter and what I write here might not be
> the complete picture:
> I found out that most of the time is spent on scoring the documents in the
> filter !
> It looks like the QueryFilter (which caches a QueryWrapperFilter) is
> actually running the query passed to it using a searcher and a regular hit
> collector that justs discards the score and set the doc id bit in the
> filter. The score is still being calculated (by BooleanScorer in my case)
> which takes a lot of time.
>
> TermsFilter solves part of my problem since I actually need a complex
> boolean query (there are other terms which are being AND to the long OR
> list). Maybe I can separate the filter to several single Term filters
> (maybe
> using BooleanFilter also from contrib).
>
> The fact is that seeking on the term docs is much faster than performing
> the
> equivalent search using a boolean query althought one might think that
> seeking on the term docs is what should be done by the boolean query as
> well
> and the results should be more or less the same when the query is on only
> one field.
>
> I know scoring and constructing the search results are closely related to
> each other so the real question is:
> Is there any way to run a complex query filter WITHOUT actually scoring the
> hits?
>
> Eran
Reply | Threaded
Open this post in threaded view
|

Re: Efficient filtering advise

Erick Erickson
This was a really silly idea I had <G>. If your time is being spent in the
scoring in the first place, keeping the Filter out of the query and checking
against it later in your Collector won't change the timing, because you'll
have done all the scoring anyway. But I only thought about it on the way
home this afternoon....
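
For the record, the retracted idea would have looked something like this sketch (hypothetical names, 2.4 HitCollector API); the comments mark exactly where the point above bites:

    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.util.OpenBitSet;

    // The retracted idea: run the query unfiltered and drop unwanted docs in
    // the collector. By the time collect() is called, BooleanScorer has
    // already computed the score, so nothing is saved on the scoring side.
    public class PostFilterCollector extends HitCollector {
        private final OpenBitSet allowed;     // bits for the wanted doc ids
        private final HitCollector delegate;  // the real result collector

        public PostFilterCollector(OpenBitSet allowed, HitCollector delegate) {
            this.allowed = allowed;
            this.delegate = delegate;
        }

        public void collect(int doc, float score) {
            if (allowed.get(doc)) {            // cheap check, but too late:
                delegate.collect(doc, score);  // 'score' was computed regardless
            }
        }
    }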

Erick

On Mon, Nov 23, 2009 at 12:12 PM, Erick Erickson <[hidden email]> wrote:

> See: http://issues.apache.org/jira/browse/LUCENE-1427
>
> Short form: this is
> fixed, but not until 2.9.
>
> If you don't want to upgrade, you could always leave the Filter off
> your initial query and have your Collector ensure that any collected docs
> are in the Filter......
>
> FWIW
> Erick
Reply | Threaded
Open this post in threaded view
|

Re: Efficient filtering advise

Eran S.
Erick,
Thanks for all your help so far.

I'll try to see whether upgrading to 2.9.1 is stable enough and won't require
too many changes on our side.
If upgrading doesn't work out, I can fall back to using TermsFilter and
BooleanFilter from contrib, which should cover all my needs and might even
be faster than the fixed QueryWrapperFilter.
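
A sketch of that contrib-based fallback, assuming the TermsFilter, BooleanFilter, and FilterClause classes from contrib/queries; the field and variable names are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanFilter;  // contrib/queries
    import org.apache.lucene.search.FilterClause;   // contrib/queries
    import org.apache.lucene.search.TermsFilter;    // contrib/queries

    // Build the long OR list as one TermsFilter (no scoring), then AND the
    // remaining constraint(s) onto it with a BooleanFilter.
    public class KeyFilterBuilder {
        public static BooleanFilter build(String[] keys, Term mandatoryTerm) {
            TermsFilter keyFilter = new TermsFilter();
            for (int i = 0; i < keys.length; i++) {
                keyFilter.addTerm(new Term("externalKey", keys[i])); // illustrative field
            }
            TermsFilter mustFilter = new TermsFilter();
            mustFilter.addTerm(mandatoryTerm);

            BooleanFilter combined = new BooleanFilter();
            combined.add(new FilterClause(keyFilter, BooleanClause.Occur.MUST));
            combined.add(new FilterClause(mustFilter, BooleanClause.Occur.MUST));
            return combined;
        }
    }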

Eran.

On Tue, Nov 24, 2009 at 2:14 AM, Erick Erickson <[hidden email]> wrote:

> This was a really silly idea I had <G>. If your time is being spent in the
> scoring in the first place, keeping the Filter out of the query and checking
> against it later in your Collector won't change the timing, because you'll
> have done all the scoring anyway. But I only thought about it on the way
> home this afternoon....
>
> Erick