Querying multiple pages for same keyword at same time

Querying multiple pages for same keyword at same time

Gael Jourdan-Weil
Hello,

We are experiencing some performance issues on Solr that seem related to requests querying multiple pages of results for the same keyword at the same time.
For instance, 10 pages of results (with 50 or 100 results per page) are queried within the same second for a given keyword, and this also happens for several different keywords at the same time.

The performance issues we observe are high CPU usage and response times that increase a lot.

This doesn't seem related to the number of requests itself, because we can handle many more requests per second when there are no such requests.

Do you think this makes sense and can be explained by the way Solr works?

Environment: SolrCloud 7.6.0

Gaël

Re: Querying multiple pages for same keyword at same time

Erick Erickson
To return stored values, Lucene must
1> read the stored values from disk
2> decompress a minimum 16K block
3> assemble the return packet.

So if you’re returning 500-1,000 documents per request, it may just be the above set of steps. Solr was never designed to _return_ large result sets. Search them, yes, but not return them. So if this never happens when you only return a few docs, this is probably your problem.

There are two ways of making this less work for Solr; both depend on returning only docValues="true" fields.
1> return only docValues fields. See useDocValuesAsStored.
2> use the /export handler.
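
As a rough sketch of what both options can look like from the client side (the collection name, host, field names, and the idea that id and price are docValues="true" fields are all assumptions here):

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

# 1> request only docValues fields; with useDocValuesAsStored, Solr can build
#    the response from docValues instead of decompressing stored-field blocks
select_params = {
    "q": "title:keyword",   # hypothetical field and keyword
    "fl": "id,price",       # assumed docValues="true" fields
    "rows": 100,
}
select = requests.get(f"{SOLR}/select", params=select_params).json()
print(select["response"]["numFound"])

# 2> the /export handler streams the whole sorted result set; it requires the
#    sort field and every field in fl to have docValues
export_params = {
    "q": "title:keyword",
    "fl": "id,price",
    "sort": "id asc",
}
export = requests.get(f"{SOLR}/export", params=export_params).json()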

Best,
Erick



RE: Querying multiple pages for same keyword at same time

Gael Jourdan-Weil
Thanks for your answer Erick.

Just to clarify something, we are not returning 1000 docs per request, we are only returning 100.
We get 10 requests to Solr querying for docs 1 to 100, then 101 to 200, ... until 901 to 1000.
But all that in the exact same second.

But I understand that to retrieve docs 901 to 1000, Solr needs to first get and sort the first 900 docs, so the request to get 901 to 1000 is as costly as asking for 1 to 1000 directly?
If the sort applies on an indexed field (isn't it mandatory?), why does Solr need to read the first 900 docs?
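
To make the traffic pattern concrete, it looks roughly like this (the collection name, field, and keyword below are placeholders; in reality the 10 requests come from independent clients rather than one script):

import requests
from concurrent.futures import ThreadPoolExecutor

SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

def fetch_page(page):
    # page 0 asks for docs 1-100 (start=0), page 9 for docs 901-1000 (start=900)
    params = {
        "q": "text:somekeyword",   # hypothetical field and keyword
        "start": page * 100,
        "rows": 100,
    }
    return requests.get(f"{SOLR}/select", params=params).json()

# all 10 pages for the same keyword requested within the same second
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch_page, range(10)))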

Regards,
Gaël



Re: Querying multiple pages for same keyword at same time

Shawn Heisey-2
On 1/13/2020 11:53 AM, Gael Jourdan-Weil wrote:
> Just to clarify something, we are not returning 1000 docs per request, we are only returning 100.
> We get 10 requests to Solr querying for docs 1 to 100, then 101 to 200, ... until 901 to 1000.
> But all that in the exact same second.
>
> But I understand that to retrieve docs 901 to 1000, Solr needs to first get and sort the first 900 docs, so the request to get 901 to 1000 is as costly as asking for 1 to 1000 directly?
> If the sort applies on an indexed field (isn't it mandatory?), why does Solr need to read the first 900 docs?

In order to get the 10th page, it must sort to determine the IDs for the
top 1000, skip 900 of them, and then retrieve the last 100.  So the
query portion (not counting document retrieval) for page 10 has nearly
the same cost as asking for all 1000 in the same request.

Asking for the first 100 involves only the top 100 documents.  Then
because the request for the next 100 must obtain the top 200, it is a
little bit slower.  The third request must obtain the top 300, so it's
slower again.  And so on.

Are those 10 requests happening simultaneously, or consecutively?  If
it's simultaneous, then they won't benefit from Solr caching.  Because
Solr can cache certain things, it would probably be faster to make 10
consecutive requests than 10 simultaneous.
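
As a sketch (same made-up collection and field names as above), the consecutive variant would look like this; whether a later page is actually served from the queryResultCache depends on how queryResultWindowSize and queryResultMaxDocsCached are configured in solrconfig.xml:

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

# same 10 pages, issued one after another: each later page repeats the same
# q/sort, so it can be answered from the queryResultCache if the cached window
# is large enough to cover its start + rows
for page in range(10):
    requests.get(f"{SOLR}/select", params={
        "q": "text:somekeyword",   # hypothetical field and keyword
        "start": page * 100,
        "rows": 100,
    })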

What are you trying to accomplish when you make these queries?  If we
understand that, perhaps we can come up with something better.

Thanks,
Shawn

RE: Querying multiple pages for same keyword at same time

Gael Jourdan-Weil
Ok I understand better.
Solr does not "read" docs 1 to 900 to retrieve docs 901 to 1000, but it still needs to compute some things (docset intersection or something like that, right?) and sort, which is costly, and only then "read" the docs.

> Are those 10 requests happening simultaneously, or consecutively?  If
> it's simultaneous, then they won't benefit from Solr caching.  Because
> Solr can cache certain things, it would probably be faster to make 10
> consecutive requests than 10 simultaneous.

The 10 requests are simultaneous, which I think explains the issues we encounter. If they were consecutive, I'd indeed expect them to benefit from the cache.

> What are you trying to accomplish when you make these queries?  If we
> understand that, perhaps we can come up with something better.

Actually we are exposing a search engine, and this is the behavior of some of our clients.
It's not a behavior we are deliberately causing or encouraging.
But before discussing it with them, we wanted to understand a bit better what in Solr explains those response times.

Regards,
Gaël


Re: Querying multiple pages for same keyword at same time

Erick Erickson
Conceptually, asking for docs 900-1000 works something like this. Solr (well, Lucene actually) has to keep a sorted list 1,000 items long of scores and doc IDs, because you can’t know whether doc N+1 will be in the list, or where. So the list manipulation is what takes the extra time. For even 1,000 docs, that shouldn’t be very much overhead; when it gets up into the tens of thousands (or, I’ve seen, millions) it’s _very_ noticeable.
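
Purely as an illustration of that bookkeeping (this is not Solr code, just the shape of the work): to answer start=900&rows=100, a priority queue of size 1000 is maintained over every matching doc, and the first 900 entries are then thrown away.

import heapq
import random

# toy scores for 1,000,000 matching docs: (doc_id, score)
matches = ((doc_id, random.random()) for doc_id in range(1_000_000))

start, rows = 900, 100
top_n = start + rows

# keep only the best `top_n` scores seen so far, like Lucene's priority queue
heap = []
for doc_id, score in matches:
    if len(heap) < top_n:
        heapq.heappush(heap, (score, doc_id))
    elif score > heap[0][0]:
        heapq.heapreplace(heap, (score, doc_id))

ranked = sorted(heap, reverse=True)    # best score first
page = ranked[start:start + rows]      # docs 901-1000; the first 900 are discarded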

With the example you’ve talked about, I doubt this is really a problem.

FWIW,
Erick



Re: Querying multiple pages for same keyword at same time

Vincenzo D'Amore
In reply to this post by Gael Jourdan-Weil

Have you already seen Solr deep paging?

https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
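
A rough sketch of cursor-based paging, assuming id is the uniqueKey field and a made-up collection name; each response returns a nextCursorMark that is passed into the next request instead of an ever-growing start:

import requests

SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

params = {
    "q": "text:somekeyword",        # hypothetical field and keyword
    "rows": 100,
    "sort": "score desc, id asc",   # cursor paging needs a tie-break on the uniqueKey
    "cursorMark": "*",              # start of the result set
}

while True:
    data = requests.get(f"{SOLR}/select", params=params).json()
    docs = data["response"]["docs"]
    next_cursor = data["nextCursorMark"]
    # ... process docs ...
    if next_cursor == params["cursorMark"]:   # cursor did not advance: done
        break
    params["cursorMark"] = next_cursor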


RE: Querying multiple pages for same keyword at same time

Gael Jourdan-Weil
In reply to this post by Erick Erickson
Indeed, with a max of 1K docs to be manipulated, I don't expect issues.
We are looking at other avenues to understand our issues.

Regards,
Gaël

RE: Querying multiple pages for same keyword at same time

Gael Jourdan-Weil
In reply to this post by Vincenzo D'Amore
Yes, I had already read this, but I wasn't sure whether 1000 docs is considered "deep" or not.
