Using filters to speed up queries


Khash Sajadi
My index contains documents for different users. Each document has the user id as a field on it.

There are about 500 different users with 3 million documents.

Currently I'm calling Search with the query (parsed from user) and FieldCacheTermsFilter for the user id.

It works but the performance is not great.

Ideally, I would like to perform the search only on the documents that are relevant, this should make it much faster. However, it seems Search(Query, Filter) runs the query first and then applies the filter.

Is there a way to improve this? (i.e. run the query only on a subset of documents)

Thanks
Re: Using filters to speed up queries

mark harwood
Look at BooleanQuery with two "must" clauses: one for the query, one for a ConstantScoreQuery wrapping the filter.
BooleanQuery should then automatically use skips when reading matching docs from the main query, skipping to the next docs identified by the filter.
Give it a try; otherwise you may be looking at using separate indexes.
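A sketch of this suggestion, assuming the Lucene 3.x-era Java API; the "userId" field name and the already-parsed user query are placeholders, not names from this thread:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class FilterAsMustClause {
    // Combine the user's query and the per-user filter as two MUST
    // clauses of one BooleanQuery, so the conjunction scorer can
    // leapfrog between them instead of applying the filter afterwards.
    public static Query build(Query userQuery, String userId) {
        Filter userFilter = new QueryWrapperFilter(
                new TermQuery(new Term("userId", userId)));
        // ConstantScoreQuery makes the filter usable as a clause
        // without letting it contribute to the score.
        Query userClause = new ConstantScoreQuery(userFilter);

        BooleanQuery bq = new BooleanQuery();
        bq.add(userQuery, BooleanClause.Occur.MUST);
        bq.add(userClause, BooleanClause.Occur.MUST);
        return bq;
    }
}
```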




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Using filters to speed up queries

Khash Sajadi
Thanks, I'll try it. I've been thinking about separate indexes but have one worry: memory and file-handle usage.

I'm worried that in some scenarios I might end up with thousands of IndexReaders/IndexWriters open in the process (it is Windows). How is that going to play out with memory?




Re: Using filters to speed up queries

Khash Sajadi
On the topic of BooleanQuery: does the order in which the queries are added matter? Is it clever enough to skip the second query when the first one returns nothing and is a MUST?





RE: Using filters to speed up queries

Uwe Schindler

Yes, it has some heuristics for deciding which query "drives" the execution: the query whose first hit has the larger docID is the driving one; the other one only gets seeked to.

With filters this is not the case (this may change in the future, when filters also use ConjunctionScorer). In general, queries are now mostly faster than filters; you only get an improvement when you cache the filters.

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: [hidden email]

 



Re: Using filters to speed up queries

Michael McCandless
In reply to this post by Khash Sajadi
Unfortunately, Lucene's performance with filters isn't great.

This is because we now always apply filters "up high", using a
leapfrog approach, where we alternate asking the filter and then the
scorer to skip to each other's docID.

But if the filter accepts "enough" (~1% in my testing) of the
documents in the index, it's often better to apply the filter "down
low", the way we handle deleted docs (which really are their own
filter), i.e. quickly eliminating docs as we enumerate them from the
postings.
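The "down low" idea can be sketched in plain Java (a simplified model, not Lucene's actual flex-API code): treat the filter as a bit set consulted while walking the query's postings list, the same way deleted docs are dropped during enumeration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DownLowFilter {
    // "Down low" filtering: enumerate the query's postings and cheaply
    // drop the docs the filter rejects, instead of leapfrogging between
    // two separate iterators "up high".
    public static List<Integer> filterDownLow(int[] postings, BitSet accepted) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : postings) {
            if (accepted.get(docId)) {  // filter check per enumerated doc
                hits.add(docId);
            }
        }
        return hits;
    }
}
```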

I did a blog post about this too:

  http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html

That post shows some of the perf gains we could get by switching
filters to apply down low, though this was for a filter that randomly
accepts 50% of the index.  And this is using the flex APIs (for 4.0);
you may be able to do something similar using FilterIndexReader
pre-4.0.

Of course you shouldn't have to do such tricks --
https://issues.apache.org/jira/browse/LUCENE-1536 is open for Lucene
to do this itself when you pass a filter.

You should test, but I suspect a MUST clause on an AND query may not
perform much better in general for filters that accept a biggish part
of the index, since it's still using skipping, especially if your
query wasn't already a BooleanQuery. For restrictive filters it should
be a decent gain, but those queries are already fast to begin with.

Do you have some perf numbers to share? What kind of queries are you
running with the filters? Are there certain users that hold a high
percentage of the documents, with a long tail of other users? If so,
you could consider making dedicated indices for those high-doc-count
users...

Also note that static index partitioning like this does not result in
the same scoring as you'd get if each user had their own index, since
the term stats (IDF) are aggregated across all users. So for queries
with more than one term, users can see docs sorted differently, and
this is actually a known security risk: users can glean some details
about the documents they aren't allowed to see through the shared term
stats... there is a paper somewhere (Robert?) that delves into it.

Mike



Re: Using filters to speed up queries

Paul Elschot
In reply to this post by Khash Sajadi

When running the query with the filter, the query is run at the same time
as the filter. Initially, and after each matching document, the filter is
assumed to be cheaper to execute, so its first or next matching document is
determined. Then the query and the filter are repeatedly advanced to each
other's next matching document until they land on the same document (i.e.
there is a match), similar to a boolean query with two required clauses.
The Java code doing this is in the private method IndexSearcher.searchWithFilter().
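The two-iterator dance described above can be sketched in plain Java over two sorted docID arrays; this is a simplified model using next()-style stepping, not Lucene's actual advance()-based skip lists.

```java
import java.util.ArrayList;
import java.util.List;

public class Leapfrog {
    // Repeatedly advance whichever stream is behind to catch up with
    // the other; when both sit on the same docID, that doc matches
    // the conjunction of query and filter.
    public static List<Integer> intersect(int[] query, int[] filter) {
        List<Integer> matches = new ArrayList<>();
        int q = 0, f = 0;
        while (q < query.length && f < filter.length) {
            if (query[q] == filter[f]) {       // both on same doc: a match
                matches.add(query[q]);
                q++;
                f++;
            } else if (query[q] < filter[f]) { // query is behind: advance it
                q++;
            } else {                           // filter is behind: advance it
                f++;
            }
        }
        return matches;
    }
}
```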

It could be that filling the field cache is the performance problem.
How is the performance when this search call with the FieldCacheTermsFilter
is repeated?

Also, for a single indexed term used as a filter (the user id in this case)
there may be no need for a cache; a QueryWrapperFilter around the TermQuery
might suffice.

Regards,
Paul Elschot



Re: Using filters to speed up queries

Khash Sajadi
Here is what I've found so far.

I have three main clauses to use in a query:
- Account MUST be xxx
- The user's query
- Date range MUST be in (a, b); it is a NumericField

I tried the following combinations (all using a BooleanQuery with the user query added to it):

1.
- Add ACCOUNT as a TermQuery
- Add DATE RANGE as a Filter

2.
- Add ACCOUNT as a Filter
- Add DATE RANGE as a NumericRangeQuery

I tried caching the filters in both scenarios. I also tried both scenarios passing the query as a ConstantScoreQuery.

I got the best result (about 4x faster) by using a cached filter for the DATE RANGE and leaving the ACCOUNT as a TermQuery.

I think I'm happy with this approach. However, the security risk Uwe mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?

As for document distribution, the ACCOUNTS have a similar distribution of documents.

Also, I would still like to try the multi-index approach, but I'm not sure about the memory and file-handle burden (having potentially thousands of readers/writers/searchers open at the same time). I use two processes, one as indexer and one for search, with the same underlying FSDirectory. For search, I use writer.getReader().reopen within a SearcherManager as suggested by Lucene in Action.







RE: Using filters to speed up queries

Uwe Schindler

Security risk? I did not say anything about that!

 


 


Re: Using filters to speed up queries

Khash Sajadi
Terribly sorry. I meant Mike:

> Also note that static index partitioning like this does not result in
> the same scoring as you'd get if each user had their own index, since
> the term stats (IDF) are aggregated across all users. So for queries
> with more than one term, users can see docs sorted differently, and
> this is actually a known security risk: users can glean some details
> about the documents they aren't allowed to see through the shared term
> stats... there is a paper somewhere (Robert?) that delves into it.





RE: Using filters to speed up queries

Uwe Schindler

The trick is to wrap the TermQuery as a ConstantScoreQuery(new QueryWrapperFilter(new TermQuery(…))), because when used for filtering, the TermQuery standing in for a filter should not contribute to the score. This pattern is used quite often inside Lucene (e.g. in MultiTermQuery), so don't worry about the strange-looking code.
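A sketch of that wrapping, assuming the Lucene 3.x-era API; the field name and value are placeholders for the ACCOUNT term:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class NonScoringTermClause {
    // A TermQuery used purely as a restriction should not affect
    // ranking: wrapping it in a QueryWrapperFilter and then a
    // ConstantScoreQuery gives every matching doc the same score.
    public static Query build(String field, String value) {
        return new ConstantScoreQuery(
                new QueryWrapperFilter(
                        new TermQuery(new Term(field, value))));
    }
}
```

The resulting query can then be added to a BooleanQuery as a MUST clause alongside the user's query.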

 



Re: Using filters to speed up queries

Paul Elschot
In reply to this post by Khash Sajadi
Some more speedup may be possible when the same combination of
filters (user account and date range here) is reused for another query.
The combined filter can then be built as an OpenBitSetDISI
(in the util package) and kept around for reuse.
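The reuse idea can be sketched in plain Java with java.util.BitSet standing in for OpenBitSetDISI (a sketch only; the real class is filled from a DocIdSetIterator), caching the intersected account and date-range sets under a key:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class CachedCombinedFilter {
    private final Map<String, BitSet> cache = new HashMap<>();

    // Intersect the account filter with the date-range filter once,
    // then hand back the same combined bit set for later queries
    // that use the same (account, range) key.
    public BitSet combined(String key, BitSet account, BitSet dateRange) {
        return cache.computeIfAbsent(key, k -> {
            BitSet both = (BitSet) account.clone();
            both.and(dateRange); // keep only docs accepted by both filters
            return both;
        });
    }
}
```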

Regards,
Paul Elschot

Op zondag 24 oktober 2010 12:34:07 schreef Khash Sajadi:

> Here is what I've found so far:
>
> I have three main sets to use in a query:
> Account MUST be xxx
> User query
> DateRange on the query MUST be in (a,b) it is a NumericField
>
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it)
>
> 1. One:
> - Add ACCOUNT as a TermQuery
> - Add DATE RANGE as Filter
>
> 2. Two
> - Add ACCOUNT as Filer
> - Add DATE RANGE as NumericRangeQuery
>
> I tried caching the filters on both scenarios.
> I also tried both scenarios by passing the query as a ConstantScoreQuery as
> well.
>
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
>
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
>
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
>
> Also, I still would like to try the multi index approach, but not sure about
> the memory, file handle burden of it (having potentially thousands of
> reades/writers/searchers) open at the same time. I use two processes one as
> indexer and one for search with the same underlying FSDirectory. As for
> search, I use writer.getReader().reopen within a SearchManager as suggested
> by Lucene in Action.
>
>
>
>
> On 24 October 2010 10:27, Paul Elschot <[hidden email]> wrote:
>
> > Op zondag 24 oktober 2010 00:18:48 schreef Khash Sajadi:
> > > My index contains documents for different users. Each document has the
> > user
> > > id as a field on it.
> > >
> > > There are about 500 different users with 3 million documents.
> > >
> > > Currently I'm calling Search with the query (parsed from user)
> > > and FieldCacheTermsFilter for the user id.
> > >
> > > It works but the performance is not great.
> > >
> > > Ideally, I would like to perform the search only on the documents that
> > are
> > > relevant, this should make it much faster. However, it seems
> > Search(Query,
> > > Filter) runs the query first and then applies the filter.
> > >
> > > Is there a way to improve this? (i.e. run the query only on a subset of
> > > documents)
> > >
> > > Thanks
> > >
> >
> > When running the query with the filter, the query is run at the same time
> > as the filter. Initially and after each matching document, the filter is
> > assumed to
> > be cheaper to execute and its first or next matching document is
> > determined.
> > Then the query and the filter are repeatedly advanced to each other's next
> > matching
> > document until they are at the same document (ie. there is a match),
> > similar to
> > a boolean query with two required clauses.
> > The java code doing this is in the private method
> > IndexSearcher.searchWithFilter().
> >
> > It could be that filling the field cache is the performance problem.
> > How is the performance when this search call with the FieldCacheTermsFilter
> > is repeated?
> >
> > Also, for a single indexed term to be used as a filter (the user id in this
> > case)
> > there may be no need for a cache, a QueryWrapperFilter around the TermQuery
> > might suffice.
> >
> > Regards,
> > Paul Elschot
> >
> >
> >
>
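Paul's last suggestion above (a QueryWrapperFilter around a TermQuery, instead of a FieldCacheTermsFilter) might look like this in Lucene 3.x terms; treat it as a sketch, since the field name "user_id" and the top-N count are assumptions for illustration:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Restrict the user's query to one user's documents without filling
// the field cache: wrap a TermQuery in a QueryWrapperFilter.
public class PerUserFilterExample {
    public static TopDocs searchForUser(IndexSearcher searcher,
                                        Query userQuery,
                                        String userId) throws Exception {
        // Matches only documents whose "user_id" field equals userId.
        Filter userFilter =
            new QueryWrapperFilter(new TermQuery(new Term("user_id", userId)));
        // As Paul describes, the searcher leapfrogs query and filter
        // against each other rather than running the query first.
        return searcher.search(userQuery, userFilter, 10);
    }
}
```

Since the filter is a single indexed term, there is no cache to warm up, which sidesteps the cold-cache cost Paul mentions.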


Reply | Threaded
Open this post in threaded view
|

Re: Using filters to speed up queries

Michael McCandless-2
In reply to this post by Khash Sajadi
Here's the paper I was thinking of (Robert found this):
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.9682 ...
eg note this sentence from the abstract:

    We show that the first implementation, based on a postprocessing
approach, allows an arbitrary user to obtain information about the
content of files for which he does not have read permission.

Note that one simple way to "gauge" the performance of filtering down
low would be to open an IndexReader, delete all documents except those
matching your filter (e.g. the ACCOUNT filter), then run your searches
against that IndexReader without the ACCOUNT clause.  If you don't
close that reader then these deletes are never committed.  This is a
simple way to compile a filter into an open IndexReader, but you'd
still then have one reader open per user class, so the risk of too
many open files, etc. still stands.

Hmm though you could open an initial reader, then clone it, then do
all your deletes on that clone for user class 1, then clone it again,
do all deletes on that clone for user class 2.  This way you only have
one set of open files, but you've "compiled" your filter into the
delete docs for each reader.

But, in order to do this, you'd have to disable locking (use
NoLockFactory) in your Directory impl, just for these readers, since
you know you'll never commit the readers with pending deletions.  Just
be sure you never close those readers!

This should give sizable speedups if the filter is non-sparse.

Mike
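A rough sketch of the clone-and-delete trick Mike describes, against the Lucene 3.x API; the directory path and the "user_class" field are made-up illustrations, and this is an experiment for gauging performance, not a production recipe:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NoLockFactory;

// "Compile" a per-class filter into a reader by deleting, in a clone,
// every document outside that class.  The deletes are never committed,
// which is why locking is disabled and the readers are never closed.
public class CompiledFilterReaders {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        // Safe only because we never commit the pending deletions.
        dir.setLockFactory(NoLockFactory.getNoLockFactory());

        // Open non-read-only so that clones can carry deletions.
        IndexReader base = IndexReader.open(dir, false);

        // One writable clone per user class; all clones share the same
        // set of open segment files as the base reader.
        IndexReader class1 = base.clone(false);
        class1.deleteDocuments(new Term("user_class", "2")); // hide class 2
        IndexReader class2 = base.clone(false);
        class2.deleteDocuments(new Term("user_class", "1")); // hide class 1

        // Search class1/class2 without the ACCOUNT clause -- and never
        // close these readers, or the pending deletions would be flushed.
    }
}
```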

On Sun, Oct 24, 2010 at 6:34 AM, Khash Sajadi <[hidden email]> wrote:

> Here is what I've found so far:
>
> I have three main clauses to use in a query:
> - ACCOUNT MUST be xxx
> - the user query
> - DATE RANGE MUST be in (a, b); it is a NumericField
> I tried the following combinations (all using a BooleanQuery with the user
> query added to it):
> 1. Add ACCOUNT as a TermQuery, and DATE RANGE as a Filter
> 2. Add ACCOUNT as a Filter, and DATE RANGE as a NumericRangeQuery
> I tried caching the filters in both scenarios, and also tried both
> scenarios with the query wrapped in a ConstantScoreQuery.
> I got the best result (about 4x faster) by using a cached filter for the
> DATE RANGE and leaving the ACCOUNT as a TermQuery.
> I think I'm happy with this approach. However, the security risk Uwe
> mentioned when using ACCOUNT as a Query makes me nervous. Any suggestions?
> As for document distribution, the ACCOUNTS have a similar distribution of
> documents.
> Also, I would still like to try the multi-index approach, but I'm not sure
> about the memory and file-handle burden of it (having potentially thousands
> of readers/writers/searchers open at the same time). I use two processes,
> one as indexer and one for search, with the same underlying FSDirectory.
> For search, I use writer.getReader().reopen within a SearcherManager, as
> suggested by Lucene in Action.
>
>
>
> On 24 October 2010 10:27, Paul Elschot <[hidden email]> wrote:
> [...]


Reply | Threaded
Open this post in threaded view
|

Re: Using filters to speed up queries

Khash Sajadi
Thanks everyone for your help.

In the end, I settled on a ConstantScoreQuery for the ACCOUNT and a cached filter for the date range. The performance on a 20-million-document index with 500 accounts is awesome!
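For the record, the winning combination (ACCOUNT as a constant-scoring required clause, plus a cached date-range filter) might be assembled like this in Lucene 3.x; the field names and the date bounds are illustrative assumptions:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeFilter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class AccountDateSearchExample {
    // Cache the date-range filter once so its bit set is reused
    // across searches instead of being rebuilt every time.
    private static final Filter DATE_FILTER = new CachingWrapperFilter(
        NumericRangeFilter.newLongRange("date", 20101001L, 20101031L,
                                        true, true));

    public static TopDocs search(IndexSearcher searcher, Query userQuery,
                                 String accountId) throws Exception {
        BooleanQuery bq = new BooleanQuery();
        // ACCOUNT as a required, constant-scoring clause: it restricts
        // the result set without letting the account term skew ranking.
        Filter accountFilter = new QueryWrapperFilter(
            new TermQuery(new Term("account", accountId)));
        bq.add(new ConstantScoreQuery(accountFilter), BooleanClause.Occur.MUST);
        bq.add(userQuery, BooleanClause.Occur.MUST);
        // The cached date filter is applied alongside the query.
        return searcher.search(bq, DATE_FILTER, 10);
    }
}
```

Keeping ACCOUNT inside the BooleanQuery as a required clause (rather than a post-hoc filter) also avoids the leakage risk discussed earlier, since non-matching accounts never contribute to scoring.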



On 25 October 2010 11:28, Michael McCandless <[hidden email]> wrote:
> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]