Solr query performance issue

Solr query performance issue

Larry He
Hi All,

We have indexed about 1 million documents with roughly 100 different fields
in Solr.  Many of the fields are multi-valued, and some are numeric (for range
search).  We expect to run Solr queries containing over 30 terms, and the
response time is often well over a second.  I found that Solr's caches, such
as QueryResultCache and FilterCache, do not help us much in this case because
most of the queries have combinations of terms that are unlikely to repeat.
An example of our query would look like:

field1:(02 04 05) field2:(01 02 03) field3:(02 03 04 06) ...

My question is: how can we improve the performance of these queries?  Does
Lucene have to read the index files again if we first run a query containing
the term field1:01 and then a second query containing field1:02?  If we have
sufficient memory, is it possible to cache certain fields so that Lucene does
not need to read from the index files at all?  I hope someone can offer some
suggestions.

Thanks,
Larry He

Re: Solr query performance issue

Yonik Seeley-2-2
On Tue, May 26, 2009 at 3:42 PM, Larry He <[hidden email]> wrote:

> We have about 100 different fields and 1 million documents we indexed with
> Solr.  Many of the fields are multi-valued, and some are numbers (for range
> search).  We are expecting to perform solr queries contains over 30 terms
> and often the response time is well over a second.  I found that the caches
> in Solr such as QueryResultCache and FilterCache does not help us much in
> this case as most of the queries have combinations of terms that are
> unlikely to repeat.  An example of our query would look like:
>
> field1:(02 04 05) field2:(01 02 03) field2:(01 02 03) ...
>
> My question is how can we improve performance of these queries?

Filters are cached independently, but they are currently only "AND"
filters, so you could only split it up like so:

fq=field1:(02 04 05)&fq=field2:(01 02 03)&fq=field2:(01 02 03)
But that won't help unless some of the individual fq params are
repeated across different queries.
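
As a rough illustration of the split described above, the following sketch builds a request query string with one fq parameter per field clause, so that each clause could be cached as its own filter. The main query `*:*` and the helper name `build_params` are assumptions made for the example, not something taken from this thread. Keep in mind that separate fq parameters are ANDed together and do not contribute to scoring.

```python
from urllib.parse import urlencode


def build_params(main_query, filters):
    """Return a URL-encoded query string with one fq parameter per filter.

    build_params is a hypothetical helper; q and fq are standard Solr
    request parameters.
    """
    pairs = [("q", main_query)] + [("fq", f) for f in filters]
    return urlencode(pairs)


# Each clause becomes its own filter, so a clause repeated in a later
# request can be served from the filter cache independently of the others.
query_string = build_params(
    "*:*",  # assumed main query, for illustration only
    ["field1:(02 04 05)", "field2:(01 02 03)"],
)
print(query_string)
```

The resulting string would be appended to the select handler URL of whatever Solr instance is in use.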

Range search can also be sped up a lot via the new TrieRange fields,
or via the frange (function range query) capabilities in Solr 1.4.
It's not clear whether the range queries or the term queries are your
current bottleneck.
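
For reference, a function range query is written with local-params syntax, for example `{!frange l=10 u=100}price`. The sketch below only assembles such a filter string; the field name `price` and the helper `frange_filter` are hypothetical, and whether frange suits a given field (function queries are generally meant for single-valued fields) would need to be checked against the schema.

```python
def frange_filter(field, lower=None, upper=None):
    """Build a Solr frange filter string for a field.

    frange_filter is a hypothetical helper. l and u are the lower and
    upper bound local params of the frange query parser.
    """
    if lower is None and upper is None:
        raise ValueError("at least one bound is required")
    parts = []
    if lower is not None:
        parts.append(f"l={lower}")
    if upper is not None:
        parts.append(f"u={upper}")
    return "{!frange " + " ".join(parts) + "}" + field


# "price" is an assumed field name used only for illustration.
range_fq = frange_filter("price", 10, 100)
print(range_fq)
```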

If the range queries aren't your bottleneck and separate filters don't
work, then a query type could be developed that would help your
situation by caching matches on term queries. Are relevancy scores
important for the clauses like field1:(02 04 05), or do you sort by
some other criteria?
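
To make the idea of caching matches on term queries concrete, here is a toy in-memory sketch. It is not Solr or Lucene code and not an existing query type: the `INDEX` dictionary stands in for the real index, and `term_matches` and `any_term` are hypothetical names. It caches the matching document ids per (field, term) pair and ORs them together, which also shows why the approach only fits clauses that do not need relevancy scores.

```python
from functools import lru_cache

# Hypothetical toy inverted index: (field, term) -> matching doc ids.
INDEX = {
    ("field1", "02"): frozenset({1, 4}),
    ("field1", "04"): frozenset({2}),
    ("field1", "05"): frozenset({4, 7}),
    ("field2", "01"): frozenset({2, 9}),
}


@lru_cache(maxsize=None)
def term_matches(field, term):
    """Look up the doc ids for one term; results are cached per term."""
    return INDEX.get((field, term), frozenset())


def any_term(field, terms):
    """OR together the cached match sets of several terms on one field."""
    result = set()
    for term in terms:
        result |= term_matches(field, term)
    return result


# Different term combinations reuse the same per-term cache entries.
hits = any_term("field1", ["02", "04", "05"]) | any_term("field2", ["01"])
print(sorted(hits))
```

Because the cache key is a single term rather than a whole clause, a later query such as field1:(02 05) would reuse entries already populated by field1:(02 04 05).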

-Yonik
http://www.lucidimagination.com

Re: Solr query performance issue

Development Team
Yes, those terms are important in calculating the relevancy scores, so they
are not in the filter queries.  I was hoping that if I could cache everything
about a field, any combination of the field's values would be read from cache.
Then it would not matter whether I query for field1:(02 04 05), field1:(01 02),
or field1:03; the response time would be equally quick.  Is there any way to
achieve that?
Yes, the range queries are a bottleneck too, so I will give the TrieRange
fields a try.  Thanks for your advice.

Best Regards,
Shi Quan He

Re: Solr query performance issue

Otis Gospodnetic-2

What about field1:01 ..... field:100 being used as separate filters (that would then get ANDed) -- doable?

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Solr query performance issue

Larry He
We actually want the OR operator on those values.  Filters can only do AND,
right?

Does the query perform better written as field1:01 field1:02 field1:03
instead of field1:(01 02 03)?

BR,
Larry

Re: Solr query performance issue

Yonik Seeley-2-2
In reply to this post by Yonik Seeley-2-2
Another little optimization would be to flatten the query.
Instead of "field1:(02 04 05) field2:(01 02 03)"
use "field1:02 field1:04 field1:05 field2:01 field2:02 field2:03"
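
A minimal sketch of that flattening step, assuming the grouped clauses contain only whitespace-separated terms (no nested parentheses, quoted phrases, or explicit operators). The function name `flatten` is made up for this example.

```python
import re

# Matches clauses of the form field:(term term ...), without nesting.
GROUP = re.compile(r"(\w+):\(([^()]*)\)")


def flatten(query):
    """Rewrite field:(a b c) clauses as field:a field:b field:c."""
    def expand(match):
        field = match.group(1)
        terms = match.group(2).split()
        return " ".join(f"{field}:{term}" for term in terms)

    return GROUP.sub(expand, query)


flat = flatten("field1:(02 04 05) field2:(01 02 03)")
print(flat)
```

Clauses that are already flat pass through unchanged, so the function can be applied to mixed queries.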

But I'd try and narrow down what queries are taking a long time, and
see if there is a common element that could be optimized.

-Yonik
http://www.lucidimagination.com



Re: Solr query performance issue

Otis Gospodnetic-2
In reply to this post by Larry He

Aha, then no fq for you. :)
Those two queries you wrote should be equivalent under the hood.
 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


