Bootstrapping a Collection on SolrCloud

Frank Greguska
Hello,

I am trying to bootstrap a SolrCloud installation and I ran into an issue
that seems rather odd. I see it is possible to bootstrap a configuration
set from an existing SOLR_HOME using

./server/scripts/cloud-scripts/zkcli.sh -zkhost ${ZK_HOST} -cmd bootstrap
-solrhome ${SOLR_HOME}

but this does not create a collection, it just uploads a configuration set.
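
For comparison, uploading a single named configuration set looks something
like this (the configset directory and name here are just examples):

./server/scripts/cloud-scripts/zkcli.sh -zkhost ${ZK_HOST} -cmd upconfig \
  -confdir ./server/solr/configsets/_default/conf -confname myconf

Either way, only configuration ends up in ZooKeeper.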

Furthermore, I cannot use

bin/solr create

to create a collection and link it to my bootstrapped configuration set
because it requires Solr to already be running.
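
For example, something like the following (the collection name, config name,
and counts are placeholders) fails with a connection error unless a Solr
node is already up:

bin/solr create -c mycollection -n myconf -shards 2 -replicationFactor 2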

I'm hoping someone can shed some light on why this is the case. It seems
like a collection is just some znodes stored in ZooKeeper that contain
configuration settings and such. Why should I not be able to create those
nodes before Solr is running?

I'd like to open a feature request for this if one does not already exist
and if I am not missing something obvious.

Thank you,

Frank Greguska

Re: Bootstrapping a Collection on SolrCloud

Erick Erickson
How would you envision that working? When would the
replicas actually be created and under what heuristics?

Imagine this is possible, and there are a bunch of
placeholders in ZK for a 10-shard collection with
a replication factor of 10 (100 replicas all told). Now
I bring up a single Solr instance. Should all 100 replicas
be created immediately? Wait for N Solr nodes to be
brought online? On some command?

My gut feel is that this would be fraught with problems
and not very valuable to many people. If you could create
the "template" in ZK without any replicas actually being created,
then at some other point say "make it so", I don't see the advantage
over just the current setup. And I do think it would take
considerable effort.

Net-net, I'd like to see a much stronger justification
before anyone embarks on something like this. First, as
I mentioned above, I think it'd be a lot of effort; second, I
virtually guarantee it'd introduce significant bugs. How
would it interact with autoscaling for instance?

Best,
Erick

Re: Bootstrapping a Collection on SolrCloud

Frank Greguska
Thanks for the response. You do raise good points.

Say I reverse your example and I have a 10-node cluster with a 10-shard
collection and a replication factor of 10. Now I kill 9 of my nodes; do all
100 replicas move to the one remaining node? I believe the answer is, well,
that depends on the configuration.

I'm thinking about it from the initial cluster planning side of things. The
decisions about auto-scaling, how many replicas, and even how many shards
are at least partially dependent on the available hardware. So at
deployment time I would think there would be a way of defining what the
collection *should* look like based on the hardware I am deploying to.
Obviously this could change during runtime and I may need to add nodes,
split shards, etc...

As it is now, it seems like I need to deploy my cluster, then write a custom
script to ensure each node I expect to be there is running, and only then
create my collection with the desired shards and replication.

- Frank

Re: Bootstrapping a Collection on SolrCloud

Erick Erickson
bq. do all 100 replicas move to the one remaining node?

No. The replicas are in a "down" state until the Solr instances
are brought back up (I'm skipping autoscaling here, but
even that wouldn't move all the replicas to the one remaining
node).

bq. what the collection *should* look like based on the
hardware I am deploying to.

With the caveat that the Solr instances have to be up, this
is entirely possible. First of all, you can provide a "createNodeSet"
to the create command to specify exactly which Solr nodes you
want used for your collection. There's a special "EMPTY"
value that _almost_ does what you want; that is, it creates
no replicas, just the configuration in ZooKeeper. Thereafter,
though, you have to ADDREPLICA (which you can do with the
"node" parameter to place each replica exactly where you want).

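Roughly, via the Collections API, the sequence would be something like this
(the collection, config, and node names are placeholders; adjust host and
port to taste):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&collection.configName=myconf&createNodeSet=EMPTY'

curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=host1:8983_solr'
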
bq. how many shards are at least partially dependent on the
available hardware

Not if you're using compositeID routing. The number of shards
is fixed at creation time, although you can split them later.
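
Splitting later is its own Collections API call, something like this (names
again placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycoll&shard=shard1'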

I don't think you can use bin/solr create_collection with the
EMPTY createNodeSet, so you need at least one
Solr node running to create your skeleton collection.

I think the thing I'm getting stuck on is how in the world the
Solr code could know enough to "do the right thing". How many
docs do you have? How big are they? How much do you expect
to grow? What kinds of searches do you want to support?

But more power to you if you can figure out how to support the kind
of thing you want. Personally I think it's harder than you might
think and not broadly useful. I've been wrong more times than I like
to recall, so maybe you have an approach that would get around
the tigers I think are hiding in the grass out there...

Best,
Erick

Re: Bootstrapping a Collection on SolrCloud

Frank Greguska
Thanks, I am no Solr expert, so I may be over-simplifying things a bit in
my ignorance.

"No. The replicas are in a "down" state until the Solr instances are brought
back up." Why can't I dictate (at least initially) the "up" state somehow? It
seems Solr keeps track of where replicas were deployed so that the cluster
'heals' itself when all nodes are back. At deployment, I know which nodes
should be available, so the collection could be unavailable until all
expected nodes are up.

Thank you for the pointer to the createNodeSet parameter; that might prove
useful.

"I think the thing I'm getting stuck on is how in the world the
Solr code could know enough to "do the right thing". How many
docs do you have? How big are they? How much do you expect
to grow? What kinds of searches do you want to support?"

Solr can't know these things. But I, as the deployer/developer, might.
For example, say I know my initial data size and can say the index will be
10 TB. If I have 2 nodes with 5 TB disks, then I have to have 2 shards
because the index won't fit on one node. If instead I have 4 nodes with 5 TB
disks, I could still have 2 shards but with replicas, or I could
choose no replicas but more shards. This is what I mean by the
shard/replica decision being partially dependent on available hardware:
there are some decisions I could make knowing my planned deployment so that
when I start the cluster it is immediately functional, rather than
first starting the cluster, then creating the collection, then making it
available.

You may be right that it is a small and complicated concern because I
really only need to care about it once when I am first deploying my
cluster. But everyone who needs to stand up a SolrCloud cluster needs to do
it. My guess is most people either do it manually as a one-time operations
thing or they write a custom script to do it for them automatically as I am
attempting. Seems like a good candidate for a new feature.

- Frank

Re: Bootstrapping a Collection on SolrCloud

Erick Erickson
First, for a given data set, I can easily double or halve
the size of the index on disk depending on what options
I choose for my fields; things like how many times I may
need to copy fields to support various use-cases,
whether I need to store the input for some, all or no
fields, whether I enable docValues, whether I need to
support phrase queries and on and on....

Even assuming you can estimate the eventual size,
it doesn't help much. As one example, if you choose
stored="true", the index size will grow by roughly 50% of
the raw data size. But that data doesn't really affect
searching that much in that it doesn't need to be
RAM resident in the same way your terms data needs
to be. So in order to be performant I may need anywhere
from a fraction of the raw index size on disk to multiples
of the index size on disk in terms of RAM.

So you see where this is going. I'm not against your
suggestion, but I have strong doubts as to its
feasibility given all the variables I've seen. We can revisit
this after you've had a chance to kick the tires; I suspect
we'll have more shared context on which to base
the discussion.

Best,
Erick

Re: Bootstrapping a Collection on SolrCloud

Frank Greguska
I've decided to take the approach of waiting for the expected number of
nodes to become available before creating the collection. Here is the script
I am using:

https://github.com/apache/incubator-sdap-nexus/blob/91b15ce0b123d652eaa1f5eb589a835ae3e77ceb/docker/solr/cloud-init/create-collection.py

This script will be deployed (using Kubernetes) alongside every Solr node
and started at the same time as Solr. I use a lock in ZooKeeper to
ensure that only one node ever attempts to create the collection.

I still think this could be done without any actual nodes running, so that
when the cluster starts the collection is immediately ready, but this
approach fits my purpose for now.
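
The core of the approach is just polling the live-node count and creating
the collection once enough nodes have registered. A minimal shell sketch of
that part alone (it omits the ZooKeeper lock, and the names and counts are
placeholders) might be:

# How many live nodes to wait for before creating the collection.
EXPECTED_NODES=4
# Poll CLUSTERSTATUS until enough nodes register; errors while Solr is
# still starting simply cause another retry.
until [ "$(curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS' \
    | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["cluster"]["live_nodes"]))')" \
    -ge "$EXPECTED_NODES" ]; do
  sleep 5
done
# In the real script, a ZooKeeper lock guards this step so that only one
# node ever runs it.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&collection.configName=myconf'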

- Frank