Slow response times using *:*

Slow response times using *:*

Andy Blower
I'm evaluating Solr/Lucene for our needs and am currently looking at performance, since 99% of the functionality we're looking for is provided. The index contains 18.4 million records and is 58GB in size. Most queries are acceptably quick once the filters are cached. The filters select one or more of three subsets of the data, and then intersect with around 15 other subsets of data depending on the user's subscription.

We're returning facets on several fields, and sometimes a blank (q=*:*) query is run purely to get the facets for all of the data that the user can access. This information is turned into browse information and can be different for each user.
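For context, such a blank browse request looks something like this (the field and filter names here are hypothetical, not taken from the actual setup):

```text
/solr/select?q=*:*
  &fq=collection:(news OR journals)      # one or more of the three subsets
  &fq=subscription:acme                  # per-user subscription filter
  &facet=true&facet.field=subject&facet.field=language
  &rows=100
```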

Running performance tests with JMeter sequentially with a single user, these blank queries are slower than the normal queries, but still in the 1-4 second range. Unfortunately, if I increase the number of test threads so that more than one blank query is submitted while another is already being processed, everything grinds to a halt and responses to these blank queries can take up to 125 seconds to be returned!

This surprises me because the filter query submitted has usually already been submitted along with a normal query, and so should be cached in the filter cache. Surely all Solr needs to do is return a handful of fields for the first 100 records in the list from the cache - or so I thought.

Can anyone tell me what might be causing this dramatic slowdown? Any suggestions for solutions would be gratefully received.


Thanks
Andy.

Re: Slow response times using *:*

Yonik Seeley-2
On Jan 31, 2008 10:43 AM, Andy Blower <[hidden email]> wrote:

>
> I'm evaluating SOLR/Lucene for our needs and currently looking at performance
> since 99% of the functionality we're looking for is provided. The index
> contains 18.4 Million records and is 58Gb in size. Most queries are
> acceptably quick, once the filters are cached. The filters select one or
> more of three subsets of the data and then intersect from around 15 other
> subsets of data depending on a user subscription.
>
> We're returning facets on several fields, and sometimes a blank (q=*:*)
> query is run purely to get the facets for all of the data that the user can
> access. This information is turned into browse information and can be
> different for each user.
>
> Running performance tests using jMeter sequentially with a single user,
> these blank queries are slower than the normal queries, but still in the
> 1-4sec range. Unfortunately if I increase the number of test threads so that
> more than one of the blank queries is submitted while one is already being
> processed, everything grinds to a halt and the responses to these blank
> queries can take up to 125secs to be returned!

*:* maps to MatchAllDocsQuery, which for each document needs to check
if it's deleted (that's a synchronized call, and can be a bottleneck).
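The hot loop Yonik describes can be modelled roughly like this. To be clear, this is a hypothetical simplified sketch, not actual Lucene source: a match-all scorer advances document by document, and each candidate requires a synchronized deleted-doc check, so concurrent *:* queries contend on that lock.

```java
// Simplified model of a match-all scorer (hypothetical, not Lucene source).
class MatchAllScorerSketch {
    private final boolean[] deleted; // stand-in for the reader's deletion flags
    private int doc = -1;

    MatchAllScorerSketch(boolean[] deleted) { this.deleted = deleted; }

    // synchronized stand-in for the IndexReader.isDeleted() call of that era;
    // this lock is what serializes concurrent match-all scans
    synchronized boolean isDeleted(int d) { return deleted[d]; }

    // return the next non-deleted doc id, or Integer.MAX_VALUE when exhausted
    int nextDoc() {
        while (++doc < deleted.length) {
            if (!isDeleted(doc)) return doc; // one synchronized call per doc
        }
        return Integer.MAX_VALUE;
    }
}
```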

A cheap workaround: if you know of a term that is in every
document (or a field present in every document that has very few terms),
substitute a query on that term for *:*.
Substituting one of your filters as the base query might also work.
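Both substitutions sketched out, assuming a hypothetical single-valued field `status` that holds the same value in every document (all names here are illustrative):

```text
# instead of the match-all query:
q=*:*&fq=collection:news&facet=true&facet.field=subject

# (1) query a term known to match every document:
q=status:active&fq=collection:news&facet=true&facet.field=subject

# (2) or promote one of the filters to the base query:
q=collection:news&facet=true&facet.field=subject
```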

> This surprises me because the filter query submitted has usually already
> been submitted along with a normal query, and so should be cached in the
> filter cache. Surely all solr needs to do is return a handful of fields for
> the first 100 records in the list from the cache - or so I thought.

To calculate the DocSet (the set of all documents matching *:* and
your filters), Solr can just use its caches, as long as *:* and the
filters have been used before.

*But*, to retrieve the top 10 documents matching *:* and your filters,
the query must be re-run.  That is probably where the time is being
spent.  Since you aren't looking for relevancy scores at all, but just
faceting, it seems like we could potentially optimize this in Solr.

In the future, we could also do some query optimization by sometimes
combining filters with the base query.

-Yonik

Re: Slow response times using *:*

Shalin Shekhar Mangar
In reply to this post by Andy Blower
I can't give you a definitive answer based on the data you've provided.
However, do you really need to get *all* facets? Can't you limit them with
the facet.limit parameter? Are you planning to run multiple *:* queries with
all facets turned on against a 58GB index in a live system? I don't think
that's a good idea.

As for the 125 seconds, I think it is probably because of paging issues. Are
you faceting on multivalued or tokenized fields? In that case, Solr uses
filter queries, which consume a lot of memory if the number of unique terms
is large.
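For reference, facet.limit is a request parameter and can also be set per field (the field name below is hypothetical):

```text
q=*:*&facet=true&facet.field=subject&facet.limit=50

# per-field override form:
f.subject.facet.limit=50
```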



--
Regards,
Shalin Shekhar Mangar.

Re: Slow response times using *:*

Andy Blower
Actually, I do need all facets for a field, although I've just realised that the tests are limited to only 100. Oops. So it should be worse in reality... erk.

Since that's what we do with our current search engine, Solr has to be able to compete with this. The fields are a mix: some are non-multivalued and non-tokenized, and others are multivalued or tokenized. I've yet to experiment with this.

Thanks.

shalinmangar wrote
I can't give you a definitive answer based on the data you've provided.
However, do you really need to get *all* facets? Can't you limit them with
the facet.limit parameter? Are you planning to run multiple *:* queries with
all facets turned on against a 58GB index in a live system? I don't think
that's a good idea.

As for the 125 seconds, I think it is probably because of paging issues. Are
you faceting on multivalued or tokenized fields? In that case, Solr uses
filter queries, which consume a lot of memory if the number of unique terms
is large.


Re: Slow response times using *:*

Walter Underwood, Netflix
How often does the index change? Can you use an HTTP cache and do this
once for each new index?
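One way to realize this suggestion, sketched with a hypothetical reverse-proxy setup (cache product and URL token scheme are assumptions, not from the thread): cache GET responses keyed on the full query URL, and let a per-index version token make each new index miss the cache naturally.

```text
client -> HTTP cache (e.g. Squid or Varnish) -> Solr

# include an index-version token in the URL so responses for an old
# index are never served against a new one:
GET /solr/select?q=*:*&facet=true&facet.field=subject&ver=2008-01-31
```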

wunder

On 1/31/08 9:09 AM, "Andy Blower" <[hidden email]> wrote:

>
> Actually I do need all facets for a field, although I've just realised that
> the tests are limited to only 100. Ooops. So it should be worse in
> reality... erk.
>
> Since that's what we do with our current search engine, Solr has to be able
> to compete with this. The fields are a mix of non-multi, non-tokenized and
> others which are. I've yet to experiment with this.
>
> Thanks.
>

Re: Slow response times using *:*

Andy Blower
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote
*:* maps to MatchAllDocsQuery, which for each document needs to check
if it's deleted (that's a synchronized call, and can be a bottleneck).
Why does this need to check whether documents are deleted when normal queries don't? Is there any way of disabling it, since I can be sure there are no deleted documents after indexing and optimising?

Yonik Seeley wrote
A cheap workaround is that if you know of a term that is in every
document, (or a field in every document that has very few terms), then
substitute a query on that for *:*
Substituting one of your filters as the base query might also work.
Would duplicating one of my filters cause any issues? That would be easy. Otherwise I'll try the substitution and see if it helps much.

Yonik Seeley wrote
> This surprises me because the filter query submitted has usually already
> been submitted along with a normal query, and so should be cached in the
> filter cache. Surely all solr needs to do is return a handful of fields for
> the first 100 records in the list from the cache - or so I thought.

To calculate the DocSet (the set of all documents matching *:* and
your filters), Solr can just use its caches as long as *:* and the
filters have been used before.

*But*, to retrieve the top 10 documents matching *:* and your filters,
the query must be re-run.  That is probably where the time is being
spent.  Since you aren't looking for relevancy scores at all, but just
faceting, it seems like we could potentially optimize this in Solr.
I'm actually retrieving the first 100 in my tests, which will be necessary in one of the two scenarios we use blank queries for. The other scenario doesn't require any docs at all - just the facets, and I've not put that in my tests. What would the situation be if I specified a sort order for the facets and/or retrieved no docs at all? I'd be sorting the facets alphabetically, which is currently done by my app rather than the search engine. (since I sometimes have to merge facets from more than one field)
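The client-side merge-and-sort step described above could look roughly like this sketch (the class name and example data are hypothetical, not from the actual application):

```java
import java.util.*;

class FacetMerge {
    // Merge facet counts from several fields and sort values alphabetically,
    // summing counts when the same value appears under more than one field.
    static SortedMap<String, Integer> merge(List<Map<String, Integer>> perField) {
        SortedMap<String, Integer> merged = new TreeMap<>(); // TreeMap keeps keys sorted
        for (Map<String, Integer> counts : perField) {
            counts.forEach((value, n) -> merged.merge(value, n, Integer::sum));
        }
        return merged;
    }
}
```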

I had assumed that no doc would be considered more relevant than any other without any query terms - i.e. filter query terms wouldn't affect relevance. This seems sensible to me, but maybe that's only because our current search engine works that way.

Regarding optimization, I certainly think that being able to access all facets for subsets of the indexed data (defined by the filter query) is an incredibly useful feature. My search engine usage may not be very common though. What it means to us is that we can drive all aspects of our sites from the search engine, not just the obvious search forms.

Yonik Seeley wrote
In the future, we could also do some query optimization by sometimes
combining filters with the base query.

-Yonik
Sorry, that flew over my head...

Thanks very much for your help. I wish I had more time during this evaluation to delve into the code. I don't suppose there's a document with a guided tour of the codebase anywhere, is there? ;-)


P.S. I re-ran my tests without returning facets whilst writing this, and didn't get the slowdowns with 4 or 10 threads - does this help?


Re: Slow response times using *:*

Mike Klaas
On 31-Jan-08, at 9:41 AM, Andy Blower wrote:

> I'm actually retrieving the first 100 in my tests, which will be necessary
> in one of the two scenarios we use blank queries for. The other scenario
> doesn't require any docs at all - just the facets, and I've not put that in
> my tests. What would the situation be if I specified a sort order for the
> facets and/or retrieved no docs at all? I'd be sorting the facets
> alphabetically, which is currently done by my app rather than the search
> engine. (since I sometimes have to merge facets from more than one field)

First question: what is the use of retrieving 100 documents if there is no
defined sort order?

The situation could be optimized in Solr, but there is a related case that
_is_ optimized and should be almost as fast. If you

a) don't ask for document score in field list (fl)
b) enable <useFilterForSortedQuery> in solrconfig.xml
c) specify _some_ sort order other than score

Then Solr will do cached bitset intersections only. It will also do sorting,
but that may not be terribly expensive. If it is close to the desired
performance, it would be relatively easy to patch Solr to not do that step.

(Note: this is query sort, not facet sort.)
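The three conditions above map to a config flag plus request parameters roughly like this (the fl and sort field names are hypothetical and depend on your schema):

```text
<!-- solrconfig.xml: reuse cached filter bitsets for sorted, score-free queries -->
<useFilterForSortedQuery>true</useFilterForSortedQuery>

# request: no 'score' in fl, and an explicit non-score sort
q=*:*&fq=subscription:acme&fl=id,title&sort=pubdate asc&rows=100
```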

> I had assumed that no doc would be considered more relevant than any other
> without any query terms - i.e. filter query terms wouldn't affect relevance.
> This seems sensible to me, but maybe that's only because our current search
> engine works that way.

It won't, but it will still try to calculate the score if you ask it to
(all docs will score the same, though).

> Regarding optimization, I certainly think that being able to access all
> facets for subsets of the indexed data (defined by the filter query) is an
> incredibly useful feature. My search engine usage may not be very common
> though. What it means to us is that we can drive all aspects of our sites
> from the search engine, not just the obvious search forms.

I also use this feature. It would be useful to optimize the case where
rows=0.

-Mike