[jira] [Commented] (SOLR-10265) Overseer can become the bottleneck in very large clusters


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/SOLR-10265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931544#comment-15931544 ]

Erick Erickson commented on SOLR-10265:

bq: My suspicion is that it's problematic because of the number of messages in a typical queue

Well, that's certainly true, and in this case it's the overseer work queue, for at least two reasons: 1> the current method of dealing with the queue doesn't batch, and 2> the state changes get written to the logs, which get _huge_. Being smarter about that is part of what we're exploring too.
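To illustrate the batching idea (a sketch only; BatchedQueueDrain and MAX_BATCH are made-up names, not anything in Solr): drain many pending messages in one pass, apply them all to the in-memory cluster state, and write state.json once per batch instead of once per message:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: batched processing of an Overseer-style work queue.
public class BatchedQueueDrain {
    static final int MAX_BATCH = 100; // assumed batch cap, not a real Solr setting

    // Drain up to MAX_BATCH pending messages in one pass.
    static List<String> drainBatch(Queue<String> workQueue) {
        List<String> batch = new ArrayList<>();
        String msg;
        while (batch.size() < MAX_BATCH && (msg = workQueue.poll()) != null) {
            batch.add(msg);
        }
        return batch;
    }

    public static void main(String[] args) {
        Queue<String> q = new ArrayDeque<>();
        for (int i = 0; i < 250; i++) q.add("state-change-" + i);
        int writes = 0;
        while (!q.isEmpty()) {
            List<String> batch = drainBatch(q);
            // apply every message in 'batch' to the in-memory cluster state,
            // then write state.json to ZK once for the whole batch
            writes++;
        }
        System.out.println("writes=" + writes); // 250 messages -> 3 writes
    }
}
```

The point of the sketch: 250 individual messages cost 250 ZK writes today, but only 3 with a batch cap of 100.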

bq: I can envision types that indicate "node X just came up" and "node X just went down"

"Just went down" is being considered, but "just came up" is a different problem. Consider a single JVM hosting 400 replicas from 100 different collections each of which has 4 shards. "Node just came up" could (I think) move them all from "down" to "recovering" in one go, but all the leader election stuff still needs to be hashed out. Also consider if a leader and follower are on the same node. How do they coordinate the leader coming up and informing the follower of that fact without sending individual messages?

bq: When a collection is created, is there REALLY a need to send state changes for EVERY collection in the cloud?

IIUC this just shouldn't be happening. If it is, we need to fix it. I've certainly seen allegations that it does.

bq: If ZK isn't used for the overseer queue, ....

Well, that's part of the proposal. The in-memory Overseer queue is OK (according to Noble's latest comment) as long as each client retries submitting any pending operations that aren't ack'd, so no alternate queueing mechanism is necessary.
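A minimal sketch of the retry-until-ack'd idea (hypothetical names, not a Solr API): the client keeps resubmitting an operation until the Overseer acknowledges it, so losing the contents of an in-memory queue (say, on Overseer restart) is harmless:

```java
// Hypothetical sketch: client-side resubmission of un-ack'd operations.
public class ResubmitOnLostAck {
    interface Submitter { boolean submit(String op); } // true when ack'd

    static int submitWithRetry(Submitter s, String op, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (s.submit(op)) return attempt; // ack received
            // no ack: the queued operation may have been lost, resubmit
        }
        throw new IllegalStateException("no ack after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        // simulate an Overseer that drops the first two submissions
        int[] calls = {0};
        Submitter flaky = op -> ++calls[0] >= 3;
        int attempts = submitWithRetry(flaky, "set-replica-state", 5);
        System.out.println("acked after attempt " + attempts); // attempt 3
    }
}
```

The design trade: operations must then be idempotent (applying the same state change twice must be safe), which is the price of not needing ZK as a durable queue.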

What's really happening here IMO is the evolution of SolrCloud. So far we've gotten away with inefficiencies because we haven't really had huge clusters to deal with; now we're getting them. So we need to squeeze what efficiencies we can out of the current architecture while considering whether we really need ZK as the intermediary. Of course cutting down the number of messages would help. The fear is that a great deal of work has gone into hardening all the wonky conditions, so I do worry a bit about wholesale restructuring.... Ya gotta bite the bullet sometime though.

> Overseer can become the bottleneck in very large clusters
> ---------------------------------------------------------
>                 Key: SOLR-10265
>                 URL: https://issues.apache.org/jira/browse/SOLR-10265
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
> Let's say we have a large cluster. Some numbers:
> - To ingest data at the volume we want, we need roughly a 600 shard collection.
> - Index into the collection for 1 hour and then create a new collection
> - For a 30 day retention window with these numbers we would end up with ~400k cores in the cluster
> - Just a rolling restart of this cluster can take hours because the overseer queue gets backed up. If a few nodes lose connectivity to ZooKeeper, we can also end up with lots of messages in the Overseer queue
> With some tests here are the two high level problems we have identified:
> 1> How fast can the overseer process operations:
> The rate at which the overseer processes events is too slow at this scale.
> I ran {{OverseerTest#testPerformance}}, which creates 10 collections (1 shard, 1 replica) and generates 20k state change events. The test took 119 seconds to run on my machine, which means ~170 events a second. Let's say a server can process 5x of my machine, so 1k events a second.
> Total events generated by a 400k replica cluster = 400k * 4 (state changes until a replica becomes active) = 1.6M events; at 1k events a second that is 1600 seconds, roughly 27 minutes, and in practice much worse given the slowdown described below.
> A second observation was that the rate at which the overseer processes events slows down as the number of items in the queue grows.
> I ran the same {{OverseerTest#testPerformance}} but changed the number of events generated to 2000 instead. The test took only 5 seconds to run, so it was a lot faster per event than the run that generated 20k events
> 2> State changes overwhelming ZK:
> For every state change Solr writes out a big state.json to ZooKeeper. This can lead to the ZooKeeper transaction logs growing out of control even with auto purging etc. configured.
> I haven't debugged why the transaction logs ran into terabytes without snapshots being taken, but this was my assumption based on the other problems we observed

This message was sent by Atlassian JIRA
