Clarification on Solr Stability Wiki: Collections and Shards

Jon Drews
In the Solr Wiki, Shawn Heisey writes the following:

"Regardless of the number of nodes or available resources, SolrCloud begins
to have stability problems when the number of collections reaches the low
hundreds. With thousands of collections, any little problem or change to
the cluster can cause a stability death spiral that may not recover for
tens of minutes. Try to keep the number of collections as low as possible.
These problems are due to how SolrCloud updates cluster state in zookeeper
in response to cluster changes. Work is underway to try and improve this
situation."
https://wiki.apache.org/solr/SolrPerformanceProblems?action=diff&rev1=45&rev2=46

I'd like to know whether this applies to a single-node SolrCloud system
(embedded ZooKeeper) with one collection and shards in the low hundreds
(e.g. 1 node, 1 collection and 200 shards).
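
For concreteness, a collection like that would be created through the Collections API roughly as sketched below (the collection and config names are just placeholders; maxShardsPerNode has to be raised so that 200 shards fit on a single node):

    import requests  # assumes the 'requests' package is available

    SOLR = "http://localhost:8983/solr"  # placeholder single-node SolrCloud address

    # Create one collection with 200 shards, one replica each, on a single node.
    # The default maxShardsPerNode of 1 would reject this, so it is raised to 200.
    params = {
        "action": "CREATE",
        "name": "my_collection",                # placeholder collection name
        "numShards": 200,
        "replicationFactor": 1,
        "maxShardsPerNode": 200,
        "collection.configName": "my_config",   # config set already uploaded to ZooKeeper
        "wt": "json",
    }
    resp = requests.get(SOLR + "/admin/collections", params=params)
    resp.raise_for_status()
    print(resp.json())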

If there are any JIRA tickets we should track regarding the work that's
underway to resolve this situation, please provide them. Thanks!

We are currently using Solr 5.3.1.

Re: Clarification on Solr Stability Wiki: Collections and Shards

Erick Erickson
Hmmm, that page is quite a bit out of date. I think Shawn is talking
about the "old style" Solr (4.x) that put the state information for
all collections in a single znode, "clusterstate.json". Newer Solr
versions put each collection's state in
/collections/my_collection/state.json, which has significantly
reduced this issue.
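
For anyone who wants to see the difference, both layouts are visible directly in ZooKeeper. A rough sketch using the kazoo client (kazoo and the ZooKeeper address are assumptions on my part, not anything this thread established):

    import json
    from kazoo.client import KazooClient  # assumes the 'kazoo' ZooKeeper client is installed

    zk = KazooClient(hosts="localhost:2181")  # placeholder ZooKeeper address
    zk.start()

    # Old style (4.x): one shared znode holds the state of every collection,
    # so any change anywhere forces everyone to re-read the whole thing.
    if zk.exists("/clusterstate.json"):
        data, _ = zk.get("/clusterstate.json")
        print("legacy clusterstate.json collections:", list(json.loads(data or b"{}").keys()))

    # New style: each collection keeps its own state.json, so a change to one
    # collection only touches that collection's znode.
    for coll in zk.get_children("/collections"):
        path = "/collections/" + coll + "/state.json"
        if zk.exists(path):
            data, _ = zk.get(path)
            print(coll, "->", len(data), "bytes of state")

    zk.stop()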

There are still some issues in the 5.x code line where the "Overseer"
can have a ton of messages to process at massive scales...

However, I know of installations out there with several hundreds of
thousands (yes, hundreds of thousands) of replicas, split up amongst a
_lot_ of collections. That takes quite a bit of care and feeding, mind you.

So your setup shouldn't be a problem, although I'd bring up my Solr
instances one at a time.
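
One way to do the "one at a time" part is to poll CLUSTERSTATUS after starting each node and only start the next one once every replica reports active. A rough sketch (the address and collection name are placeholders):

    import time
    import requests

    SOLR = "http://localhost:8983/solr"  # placeholder address of an already-running node

    def all_replicas_active(collection):
        """Return True once every replica of the collection reports state 'active'."""
        resp = requests.get(SOLR + "/admin/collections",
                            params={"action": "CLUSTERSTATUS",
                                    "collection": collection,
                                    "wt": "json"})
        resp.raise_for_status()
        shards = resp.json()["cluster"]["collections"][collection]["shards"]
        return all(replica.get("state") == "active"
                   for shard in shards.values()
                   for replica in shard["replicas"].values())

    # Start node N, run this, then start node N+1.
    while not all_replicas_active("my_collection"):  # placeholder collection name
        time.sleep(5)
    print("all replicas active, safe to start the next node")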

Whether ZK is embedded or not isn't really a problem, but I would very
seriously consider moving it to an external ensemble. It's not so much
a functional issue as an administrative one: you have to bring your
Solr nodes up and down carefully or you lose quorum.
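
If you do move to an external ensemble, a quick sanity check for quorum is ZooKeeper's "mntr" four-letter command; a healthy three-node ensemble shows one leader and two followers. A small sketch (the hostnames are placeholders, and newer ZooKeeper releases may require "mntr" to be whitelisted):

    import socket

    def zk_server_state(host, port=2181):
        """Ask one ZooKeeper server for its role via the 'mntr' four-letter command."""
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(b"mntr")
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        stats = dict(line.split("\t", 1)
                     for line in b"".join(chunks).decode().splitlines() if "\t" in line)
        return stats.get("zk_server_state", "unknown")  # leader, follower, or standalone

    # Placeholder three-node external ensemble.
    for host in ["zk1.example.com", "zk2.example.com", "zk3.example.com"]:
        print(host, "->", zk_server_state(host))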

Best,
Erick


Re: Clarification on Solr Stability Wiki: Collections and Shards

Shawn Heisey
The testing I did on SOLR-7191, which is where that statement came from,
was mostly on 5.x with the per-collection clusterstate that was new at
the time, and I still found that it would not scale well.

Some later poking around with 6.x (long after SOLR-7191 was resolved
without any code changes) indicates that current versions scale even
worse than early 5.x did.  I believe the biggest source of the
scalability problems is that the overseer queue gets spammed with a
very large number of operations that cannot be handled quickly.
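
If you want to watch for that, the queue depth is visible both through OVERSEERSTATUS and directly in ZooKeeper under /overseer/queue. A rough sketch (the addresses are placeholders, and the exact OVERSEERSTATUS field names vary somewhat between versions):

    import requests
    from kazoo.client import KazooClient  # assumes the 'kazoo' ZooKeeper client is installed

    SOLR = "http://localhost:8983/solr"  # placeholder node address

    # Collections API view: OVERSEERSTATUS reports who the overseer is and its queue sizes.
    status = requests.get(SOLR + "/admin/collections",
                          params={"action": "OVERSEERSTATUS", "wt": "json"}).json()
    print("overseer leader:", status.get("leader"))
    print("state-update queue size:", status.get("overseer_queue_size"))

    # Raw ZooKeeper view: pending state-update messages are children of /overseer/queue.
    zk = KazooClient(hosts="localhost:2181")  # placeholder ZooKeeper address
    zk.start()
    print("pending overseer messages:", len(zk.get_children("/overseer/queue")))
    zk.stop()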

One collection with 200 shards probably would not present much of a
scalability problem where ZK is concerned, but because a query on that
collection will consist of between 201 and 401 smaller queries, I would
not expect the single-query performance to be very good.
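
(That range counts the top-level request, one first-phase sub-request per shard, and up to one second-phase request per shard to fetch the stored fields.)  You can see the fan-out for any given query with shards.info=true; a minimal sketch (the address and collection name are placeholders):

    import requests

    SOLR = "http://localhost:8983/solr"  # placeholder node address
    COLLECTION = "my_collection"         # placeholder 200-shard collection

    # shards.info=true adds a section to the response listing every shard sub-request
    # Solr made for this query, including how long each one took.
    resp = requests.get(SOLR + "/" + COLLECTION + "/select",
                        params={"q": "*:*", "rows": 10, "shards.info": "true", "wt": "json"})
    resp.raise_for_status()
    info = resp.json()["shards.info"]
    print("shard sub-requests reported:", len(info))
    for shard_url, details in list(info.items())[:5]:
        print(shard_url, "->", details.get("numFound"), "hits in", details.get("time"), "ms")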

Thanks,
Shawn


Re: Clarification on Solr Stability Wiki: Collections and Shards

Erick Erickson
What about SOLR-10619 and SOLR-10983? Of the two, 10619 is probably
the more important in this respect. The way the Overseer consumed
requests from the queue was very inefficient, which bears directly on
this problem. There are a couple of other JIRAs that center around not
creating unnecessary messages in the first place, but 10619 is a major
improvement.

Erick
