SolrCloud & Paging on large indexes

Bram Van Dam
Hi folks,

If I understand things correctly, you can use paging & sorting in a
SolrCloud environment. However, if I request the first 10 documents, a
distributed query will be launched to all shards requesting the top 10,
and (Shards * 10) documents will then be sorted so that only the
top 10 are returned.

This is fine.

But I'm a little worried when going beyond the first page ... This
becomes (Page * shards * 10). I'm worried that in a 50 billion document
setup paging will just explode.
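
As a rough illustration of the arithmetic above, here is a minimal sketch. The function name `entries_merged` and the cluster size of 100 shards are made-up values for illustration, not figures from this thread.

```python
def entries_merged(page: int, shards: int, rows: int = 10) -> int:
    """Entries the coordinating node has to merge to serve a 1-based page,
    following the (Page * shards * rows) estimate from the post."""
    return page * shards * rows


# Hypothetical cluster: 100 shards, 10 rows per page.
for page in (1, 100, 10_000):
    print(page, entries_merged(page, shards=100))
```

The count grows linearly with the page number, so the cost depends on how deep users page rather than on the total index size alone.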

Does anyone have any experience with paging on large cloud setups?
Positive or negative? Or can anyone offer some reassurances or words of
caution with this approach?

Or should I tell my users that they can never go beyond Page X (which is
fine if the alternative is hellfire and brimstone)?

Thanks,

  - Bram

Re: SolrCloud & Paging on large indexes

Mikhail Khludnev
Hello Bram,

Make sure you have checked the documentation:
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>

Re: SolrCloud & Paging on large indexes

heaven
In reply to this post by Bram Van Dam
I have had a very bad experience with pagination on collections larger than a few million documents. Pagination becomes very, very slow. I just tried to switch to page 76662 and it took almost 30 seconds.

Solr now supports cursors which work fast and are useful for exports and some data processing, but I don't see how I can use those to draw page numbers and allow users to paginate through large data sets.

Re: SolrCloud & Paging on large indexes

Bram Van Dam
On 12/22/2014 12:47 PM, heaven wrote:
> I have had a very bad experience with pagination on collections larger than a few
> million documents. Pagination becomes very, very slow. I just tried to
> switch to page 76662 and it took almost 30 seconds.

Yeah that's pretty much my experience, and I think SolrCloud would only
exacerbate the problem (due to increased complexity of sorting). If
there's no silver bullet to be found, I guess I'll just have to disable
paging on large data sets -- which is fine, really, who the hell browses
through 50 billion documents anyway? That's what search is for, right?

Thx,

  - Bram


Re: SolrCloud & Paging on large indexes

Erick Erickson
Have you read Hossman's blog here?
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#referrer=solr.pl

And how to use it here?
http://wiki.apache.org/solr/CommonQueryParameters#Deep_paging_with_cursorMark

Because if you're trying this and _still_ getting bad performance we
need to know.
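
For readers who have not used it, the cursor loop can be sketched roughly as follows. The `cursorMark` and `nextCursorMark` parameters are Solr's own; the base URL, the collection name `collection1`, the uniqueKey field name `id`, and the helper names are assumptions made for this sketch. The sort must include the uniqueKey field as a tie-breaker, and `start` must be left at 0.

```python
import json
import urllib.parse
import urllib.request


def solr_select(base_url, params):
    """Run one /select request and return the parsed JSON response."""
    url = base_url + "/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iterate_with_cursor(query, sort, rows=100, fetch=None,
                        base_url="http://localhost:8983/solr/collection1"):
    """Yield every matching document, one cursor page at a time.

    `fetch` is a hypothetical hook (params -> response dict) so the loop
    can be exercised without a running Solr instance.
    """
    fetch = fetch or (lambda params: solr_select(base_url, params))
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        response = fetch({"q": query, "sort": sort, "rows": rows,
                          "wt": "json", "cursorMark": cursor})
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor: no more results
            break
        cursor = next_cursor
```

A call such as `iterate_with_cursor("*:*", "score desc, id asc")` would then walk the whole result set without an ever-growing `start` offset.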

Bram:
One minor pedantic clarification: the first round-trip only returns
the id and sort criteria (score by default), not the whole document,
although the effect is the same. With N rows per page, the default
implementation returns N * (pageNum + 1) entries as you page deeper
into the corpus. Even worse, each node itself has to _sort_ that many
entries. Then a second call is made to get the page-worth of docs.
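
A toy model of the two round-trips described above, assuming each document is reduced to a `(sort value, id)` pair and lower values sort first. The function names are hypothetical and this is only a sketch of the idea, not Solr's actual code.

```python
import heapq
import random


def shard_top(docs, start, rows):
    """First round-trip on one shard: return only (sort value, id) pairs
    for its best start + rows documents, in sorted order."""
    return heapq.nsmallest(start + rows, docs)


def distributed_page(shards, start, rows):
    """Merge the per-shard lists and cut out the requested page.

    Returns the ids of the page (which a second call would then fetch in
    full) and the number of entries the shards had to send."""
    per_shard = [shard_top(docs, start, rows) for docs in shards]
    merged = list(heapq.merge(*per_shard))
    page_ids = [doc_id for _, doc_id in merged[start:start + rows]]
    return page_ids, sum(len(part) for part in per_shard)


random.seed(0)
shards = [[(random.random(), f"s{s}-d{i}") for i in range(1000)]
          for s in range(4)]
ids, transferred = distributed_page(shards, start=500, rows=10)
print(len(ids), transferred)  # 10 2040
```

With 10 rows per page and a zero-based page number of 50, each of the 4 shards sends 10 * (50 + 1) = 510 entries, which is where the 2040 comes from.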

About telling your users not to page past N... up to you, especially
if the deep paging stuff works as advertised (and I have no reason to
believe it doesn't).

That said, though, it's pretty easy to argue that the 500th page is
pretty useless; nobody will ever hit the "next page" button 499 times.

The different use-case, though, is when people want to return the
entire corpus for whatever reason and _must_ page through to the
end....

Best,
Erick


Re: SolrCloud & Paging on large indexes

Bram Van Dam
On 12/22/2014 04:27 PM, Erick Erickson wrote:
> Have you read Hossman's blog here?
> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#referrer=solr.pl

Oh thanks, that's a pretty interesting read. The scale we're
investigating is several orders of magnitude larger than what was tested
there, so I'm still a bit worried.

> Because if you're trying this and _still_ getting bad performance we
> need to know.

I'll definitely keep you posted when our test results on larger indexes
(~50 billion documents) come in, but this sadly won't be any time soon
(infrastructure sucks). The largest index I currently have access to is
about a billion documents in size. Paging there is a nightmare, but the
Solr version is too old to support cursors so I'm afraid I can't offer
any useful data.

Does anyone have any performance data on multi-billion-document indexes?
With or without SolrCloud?

> Bram:
> One minor pedantic clarification: the first round-trip only returns
> the id and sort criteria (score by default), not the whole document,
> although the effect is the same. With N rows per page, the default
> implementation returns N * (pageNum + 1) entries as you page deeper
> into the corpus. Even worse, each node itself has to _sort_ that many
> entries. Then a second call is made to get the page-worth of docs.

I was trying to keep it short and sweet, but yes, that's the way I think
it works ;-)

> That said, though, it's pretty easy to argue that the 500th page is
> pretty useless; nobody will ever hit the "next page" button 499 times.

Nobody will hit next 499 times, but a lot of our users skip to the last
page quite often. Maybe I should make *that* as hard as possible. Hmm.

Thanks for the tips!

  - Bram

RE: SolrCloud & Paging on large indexes

Toke Eskildsen
Bram Van Dam [[hidden email]] wrote:

[Solr cursors]

> Oh thanks, that's a pretty interesting read. The scale we're
> investigating is several orders of magnitude larger than what was tested
> there, so I'm still a bit worried.

The beauty of the cursor is that it has little to no overhead relative to a standard top-X sorted search. A standard search uses a sliding window over the full result set, as does a cursor search. Same amount of work. It is just a question of limits for the window.

> The largest index I currently have access to is
> about a billion documents in size. Paging there is a nightmare, but the
> Solr version is too old to support cursors so I'm afraid I can't offer
> any useful data.

Non-cursor paging in Solr uses a sliding window sort with a heap that contains all documents up to the paging number. A heap is a very fine thing for sliding window sort, as long as it is small. But performance drops to horrible levels when it gets large as it is extremely RAM-cache unfriendly.
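
The difference in heap size can be sketched with a simplified model. This is not Lucene's collector code: the function names are hypothetical, the sort values are assumed to be distinct numbers (in Solr the uniqueKey tie-breaker guarantees a total order), and lower values sort first.

```python
import heapq


def page_with_offset(values, start, rows):
    """Classic paging: a bounded heap keeps the best start + rows values,
    so the heap grows with the page number."""
    size = start + rows
    heap = []  # max-heap via negation; holds the `size` smallest values seen
    for v in values:
        if len(heap) < size:
            heapq.heappush(heap, -v)
        elif -heap[0] > v:  # v beats the worst value currently kept
            heapq.heapreplace(heap, -v)
    ordered = sorted(-x for x in heap)
    return ordered[start:start + rows], len(heap)


def page_after_cursor(values, after, rows):
    """Cursor-style paging: values at or before the cursor are skipped,
    so the heap never holds more than `rows` entries."""
    heap = []
    for v in values:
        if after is not None and v <= after:
            continue
        if len(heap) < rows:
            heapq.heappush(heap, -v)
        elif -heap[0] > v:
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap), len(heap)
```

Both functions scan every value once; what changes is how many entries the heap has to hold, which is the cache-unfriendly part for deep pages.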

> Does anyone have any performance data on multi-billion-document indexes?

Sorry, no. I could do a test on our 7-billion-document index, but it would have to wait until the end of January.

> Nobody will hit next 499 times, but a lot of our users skip to the last
> page quite often. Maybe I should make *that* as hard as possible. Hmm.

Issue a search with sort in reverse order, then reverse the returned list of documents?
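
A minimal sketch of that trick, assuming a single sort field and that the total hit count (`numFound` in a Solr response) is already known from an earlier request. The helper names are hypothetical.

```python
def last_page_params(query, sort_field, ascending, rows, num_found):
    """Build request parameters that fetch the last page from the cheap
    end of the result set by flipping the sort direction."""
    # The last page may be shorter than a full page.
    last_rows = num_found % rows or rows
    flipped = "desc" if ascending else "asc"
    return {"q": query, "sort": f"{sort_field} {flipped}",
            "start": 0, "rows": last_rows}


def restore_order(docs):
    """Reverse the returned documents so they read in the original order."""
    return list(reversed(docs))
```

With several sort clauses, every clause would need its direction flipped, not just the first one.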

- Toke Eskildsen

Re: SolrCloud & Paging on large indexes

Erick Erickson
> Nobody will hit next 499 times, but a lot of our users skip to the last page quite often. Maybe I should make *that* as hard as possible. Hmm

Right. I'd actually argue that providing a "last page" link in this situation is

1) useless to the user, I mean what's the point? Curiosity? If it really _must_
be supported, Toke's approach is sneaky and elegant. Sort in reverse order and
give them the first page ;).

2) dangerous as you well know...

> several orders of magnitude larger than what was tested
> there, so I'm still a bit worried.

I sympathize, but somebody has to be first ;). Besides, the
current situation is untenable from what you're saying...

Good luck!
Erick


Re: SolrCloud & Paging on large indexes

heaven
It would be cool to be able to get not only the next page's cursor but cursors for several following pages, or a set of cursors for a given window, so we can draw page numbers. Not sure about the last page, though.

Re: SolrCloud & Paging on large indexes

Bram Van Dam
In reply to this post by Toke Eskildsen
On 12/23/2014 04:07 PM, Toke Eskildsen wrote:
> The beauty of the cursor is that it has little to no overhead relative to a standard top-X sorted search. A standard search uses a sliding window over the full result set, as does a cursor search. Same amount of work. It is just a question of limits for the window.

That is very good to hear. Thanks.

>> Nobody will hit next 499 times, but a lot of our users skip to the last
>> page quite often. Maybe I should make *that* as hard as possible. Hmm.
>
> Issue a search with sort in reverse order, then reverse the returned list of documents?

Sneaky. I like it. But in the end we're simply getting rid of the
"last" button. Solves a lot of issues. If you have a billion search
results, you might as well refine your criteria!

  - Bram