Setting up MiniSolrCloudCluster to use pre-built index


kkrugler
Hi all,

Wondering if anyone has experience (this is with Solr 6.6) in setting up MiniSolrCloudCluster for unit testing, where we want to use an existing index.

Note that this index wasn’t built with SolrCloud, as it’s generated by a distributed (Hadoop) workflow.

So there’s no “restore from backup” option, or swapping collection aliases, etc.

We can push our configset to Zookeeper and create the collection as per other unit tests in Solr, but what’s the right way to set up data dirs for the cores such that Solr is running with this existing index (or indexes, for our sharded test case)?

Thanks!

— Ken

PS - yes, we’re aware of the routing issue with generating our own shards….

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra


Re: Setting up MiniSolrCloudCluster to use pre-built index

Mark Miller-3
You create MiniSolrCloudCluster with a base directory, and each Jetty
instance it creates gets a SolrHome in a subfolder called node{i}. So if
legacyCloud=true, you can just preconfigure a core and index under the right
node{i} subfolder. legacyCloud=true should not even exist anymore, though,
so the long-term way to do this would be to create a collection and then
use the merge API or something similar to merge your index into the empty
collection.
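
The node{i} layout described above can be sketched with plain java.nio calls. This is only an illustration of the directory structure, not something taken from the Solr test framework: the class and method names, the node name "node1", and the core/collection/shard names are all hypothetical, and the choice of core.properties keys (name, collection, shard) is an assumption about what core discovery needs in legacyCloud mode. Whether MiniSolrCloudCluster accepts a pre-populated node directory in your Solr version, and how to get the layout in place before that node's Jetty starts, would need to be verified.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Lays out a pre-built index as a core directory under baseDir/nodeName (hypothetical helper). */
class PrebuiltCoreLayout {

    /**
     * Creates baseDir/nodeName/coreName, writes a core.properties file, and copies
     * the files found in prebuiltIndex into data/index. Returns the core directory.
     */
    static Path layOutCore(Path baseDir, String nodeName, String coreName,
                           String collection, String shard, Path prebuiltIndex) {
        try {
            Path coreDir = baseDir.resolve(nodeName).resolve(coreName);
            Path indexDir = coreDir.resolve("data").resolve("index");
            Files.createDirectories(indexDir);

            // Assumption: these keys are enough for the core to be discovered and
            // registered under the given collection and shard in legacyCloud mode.
            String props = "name=" + coreName + "\n"
                    + "collection=" + collection + "\n"
                    + "shard=" + shard + "\n";
            Files.write(coreDir.resolve("core.properties"),
                    props.getBytes(StandardCharsets.UTF_8));

            // A Lucene index is a flat directory, so copy regular files only.
            try (Stream<Path> files = Files.list(prebuiltIndex)) {
                for (Path src : (Iterable<Path>) files::iterator) {
                    if (Files.isRegularFile(src)) {
                        Files.copy(src, indexDir.resolve(src.getFileName().toString()));
                    }
                }
            }
            return coreDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the layout against throwaway temp directories and a fake one-file index. */
    static Path demo() {
        try {
            Path fakeIndex = Files.createTempDirectory("prebuilt-index");
            Files.write(fakeIndex.resolve("segments_1"), new byte[] {0});
            Path baseDir = Files.createTempDirectory("minicluster-base");
            return layOutCore(baseDir, "node1", "mycoll_shard1_replica1",
                    "mycoll", "shard1", fakeIndex);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Core laid out at " + demo());
    }
}
```

For a sharded test case, the same call would be repeated once per shard, each with its own core name and shard value.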

 - Mark

On Sat, May 19, 2018 at 5:25 PM Ken Krugler <[hidden email]>
wrote:

> [...]

--
- Mark
about.me/markrmiller

Re: Setting up MiniSolrCloudCluster to use pre-built index

kkrugler
Hi Mark,

I’ll have a completely new, rebuilt index that’s (a) large, and (b) already sharded appropriately.

In that case, using the merge API isn’t great, in that it would take significant time and temporarily use double (or more) disk space.

E.g. I’ve got an index with 250M+ records that’s about 200GB. There are other indexes, still big but not quite as large as this one.

So I’m still wondering if there’s any robust way to swap in a fresh set of shards, especially without relying on legacy cloud mode.

I think I can figure out where the data is being stored for an existing (empty) collection, shut that down, swap in the new files, and reload.

But I’m wondering if that’s really the best (or even sane) approach.
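
For reference, the file-swap step on its own might look like the sketch below. It assumes the core is not loaded while the swap happens, that its index lives in a plain data/index directory, and that the new index sits on the same filesystem, so both moves are renames and no second copy of the data is made. The class name, method name, and the ".old" suffix are hypothetical, and nothing here says whether Solr considers this approach supported.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Swaps a freshly built index directory into the place of a live one (hypothetical helper). */
class IndexSwap {

    /**
     * Renames liveIndexDir to "<name>.old", then renames newIndexDir to the live
     * location. Returns the path of the set-aside old index so the caller can
     * delete it or roll back. Fails if the ".old" directory already exists.
     */
    static Path swapIndex(Path liveIndexDir, Path newIndexDir) {
        try {
            Path backup = liveIndexDir.resolveSibling(liveIndexDir.getFileName() + ".old");
            // Moving a non-empty directory only works as a rename, i.e. when
            // source and target are on the same file store.
            Files.move(liveIndexDir, backup);
            Files.move(newIndexDir, liveIndexDir);
            return backup;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Exercises the swap on temp directories holding fake segment files. */
    static Path demo() {
        try {
            Path dataDir = Files.createTempDirectory("core-data");
            Path live = Files.createDirectory(dataDir.resolve("index"));
            Files.write(live.resolve("segments_1"), new byte[] {0});
            Path fresh = Files.createDirectory(dataDir.resolve("index.new"));
            Files.write(fresh.resolve("segments_2"), new byte[] {0});
            swapIndex(live, fresh);
            return live;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Swapped index at " + demo());
    }
}
```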

Thanks,

— Ken

> On May 19, 2018, at 6:24 PM, Mark Miller <[hidden email]> wrote:
>
> [...]

Re: Setting up MiniSolrCloudCluster to use pre-built index

Mark Miller-3
The merge can be really fast - it can just dump in the new segments and
rewrite the segments file basically.
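
As a rough sketch of what calling the merge API involves: the CoreAdmin MERGEINDEXES action takes a target core and one or more indexDir parameters. The helper below only builds that request URL. The class name, base URL, core name, and index path in the usage are made up for illustration, and a commit on the target core is typically needed afterwards for the merged documents to become visible.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a CoreAdmin MERGEINDEXES request URL (hypothetical helper). */
class MergeIndexesUrl {

    /**
     * solrBaseUrl is the node's Solr root, e.g. "http://localhost:8983/solr".
     * Each entry in indexDirs becomes one indexDir parameter.
     */
    static String build(String solrBaseUrl, String targetCore, String... indexDirs) {
        StringBuilder url = new StringBuilder(solrBaseUrl)
                .append("/admin/cores?action=MERGEINDEXES&core=")
                .append(encode(targetCore));
        for (String dir : indexDirs) {
            url.append("&indexDir=").append(encode(dir));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr",
                "mycoll_shard1_replica1", "/data/shard1/index"));
    }
}
```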

I guess for what you want, that's perhaps not the ideal route though. You could
maybe try using collection aliases.
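
With aliases, clients query a fixed alias name and each rebuild goes into a new collection that the alias is then pointed at. A minimal sketch of the Collections API CREATEALIAS request follows. The class name, alias name, and collection name are hypothetical, and it assumes the new collection already exists with the rebuilt shards in place.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a Collections API CREATEALIAS request URL (hypothetical helper). */
class CreateAliasUrl {

    /** Points the alias at the given collection; an existing alias of that name is replaced. */
    static String build(String solrBaseUrl, String alias, String collection) {
        return solrBaseUrl + "/admin/collections?action=CREATEALIAS"
                + "&name=" + encode(alias)
                + "&collections=" + encode(collection);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8983/solr",
                "products", "products_20181023"));
    }
}
```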

I thought about adding shard aliases way back, but never got to it.

On Tue, Oct 23, 2018 at 7:10 PM Ken Krugler <[hidden email]>
wrote:

> [...]

--
- Mark

http://about.me/markrmiller