What's the deal with dataimporthandler overwriting indexes?

Joakim Hansson
Hi!
We are currently upgrading from a Solr 6.2 master/slave setup to Solr 7.6
running SolrCloud.
I don't know if I've missed something really trivial, but every time I start
a full import (dataimport?command=full-import&clean=true&optimize=true) the
old index gets overwritten by the new import.

In 6.2 this wasn't really a problem, since I could disable replication in
the API on the master and enable it once the import was completed.
With 7.6 and SolrCloud we use NRT shards and replicas, since those are the
only ones that support rule-based replica placement, and whenever I start a
new import the old index is overwritten all over the SolrCloud cluster.

I have tried changing to clean=false, but that makes the import finish
without adding any docs.
It doesn't matter whether I use soft or hard commits.

I don't get the logic in this. Why would you ever want to delete an
existing index before there is a new one in place? What is it I'm missing
here?

Please enlighten me.

Re: What's the deal with dataimporthandler overwriting indexes?

Emir Arnautović
Hi Joakim,
This might not be what you expect, but it is the expected behaviour. When you use clean=true, DIH first deletes all records. That is how it works in both master/slave and Cloud. The difference might be that you disabled replication or auto commits in your old setup, so the deletion was not visible. You can disable auto commits in Cloud and you will keep your old index until the next commit, but that is not the recommended way.

What is usually done when you want to control what becomes the active index is to use aliases and do the full import into a new collection. After you verify that everything is OK, you point the alias at the new collection and it becomes the active one. You can keep the old collection so you can roll back if you notice issues, or simply drop it once the alias is updated.
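
A minimal sketch of the alias workflow described above, assuming a Solr node at http://localhost:8983/solr and hypothetical names (products, products_v1, products_v2, products_conf). It only builds the request URLs in the order they would be sent. CREATE, CREATEALIAS and DELETE are Collections API actions; the shard and replica counts are placeholders, not a recommendation.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed base URL, adjust to your cluster


def collections_api(action, params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{SOLR}/admin/collections?{query}"


def reindex_plan(alias, new_collection, old_collection, config_name):
    """Return the ordered request URLs for an alias-based full reindex."""
    return [
        # 1. Create an empty collection next to the live one.
        collections_api("CREATE", {
            "name": new_collection,
            "numShards": 1,            # placeholder value
            "replicationFactor": 2,    # placeholder value
            "collection.configName": config_name,
        }),
        # 2. Full import into the new collection. clean=true only affects
        #    the new collection; the alias still points at the old one.
        f"{SOLR}/{new_collection}/dataimport?"
        + urlencode({"command": "full-import", "clean": "true", "commit": "true"}),
        # 3. After verifying the new index, repoint the alias.
        collections_api("CREATEALIAS", {"name": alias, "collections": new_collection}),
        # 4. Optional: drop the old collection once rollback is no longer needed.
        collections_api("DELETE", {"name": old_collection}),
    ]


if __name__ == "__main__":
    for url in reindex_plan("products", "products_v2", "products_v1", "products_conf"):
        print(url)
```

Clients keep querying the alias name throughout, so they never see the half-built collection. Step 3 should only be sent once the import in step 2 has finished and been checked.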

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Feb 2019, at 10:46, Joakim Hansson <[hidden email]> wrote:

RE: What's the deal with dataimporthandler overwriting indexes?

Vadim Ivanov
In reply to this post by Joakim Hansson
Hi!
If clean=true, the index will be replaced completely by the new import. That is how it is supposed to work.
If you don't want to delete your index preemptively, set &clean=false, and set &commit=true instead of &optimize=true.
Are you sure about optimize? Do you really need it? It is usually very costly.
So, I'd try:
dataimport?command=full-import&clean=false&commit=true

If nothing is imported nevertheless, please check the log.
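
As an illustration of the parameters above, here is a small sketch that builds the suggested request with the flags spelled out. The function name, the host and the collection name (mycollection) are made up for the example; only command, clean, commit and optimize come from the thread.

```python
from urllib.parse import urlencode


def dih_full_import_url(core_url, clean=False, commit=True, optimize=False):
    """Build a DIH full-import URL with explicit clean/commit flags."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),    # false keeps the existing documents
        "commit": str(commit).lower(),  # commit once the import finishes
    }
    if optimize:
        # Only sent when explicitly requested, because it is costly.
        params["optimize"] = "true"
    return f"{core_url}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    print(dih_full_import_url("http://localhost:8983/solr/mycollection"))
```

With clean=false, documents whose unique key already exists are replaced, but documents that have disappeared from the source stay in the index, so this is not a drop-in substitute for a clean rebuild.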
--
Vadim



> -----Original Message-----
> From: Joakim Hansson [mailto:[hidden email]]
> Sent: Tuesday, February 12, 2019 12:47 PM
> To: [hidden email]
> Subject: What's the deal with dataimporthandler overwriting indexes?

Re: What's the deal with dataimporthandler overwriting indexes?

eaph
I've run into this also; it is a key difference between a master/slave
setup and a SolrCloud setup.

clean=true has always deleted the index on the first commit, but in older
versions of Solr, the workaround was to disable replication until the full
reindex had completed.

That is a convenient practice for a number of reasons, especially for small
indices. It really isn't supported in SolrCloud, because of the difference
in how writes are processed for master/slave vs. SolrCloud. With a
master/slave setup, all writes go to the same location, so disabling
replication lets you buffer them up all in one go. With a SolrCloud
setup, the data is distributed across the nodes in the cluster. To support
the 'clean', it would need to know to blow away the index at the 'master'
node for each shard, serve traffic only from the slaves for each shard
until the re-index completes, do the replications, and then resume normal
operation.

Note that in Solr 7.x if you revert to the master/slave setup, you need to
disable polling at the slaves.  Disabling replication at the master will
also cause index deletion at the slaves (SOLR-11938).
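
A rough sketch of the slave-side variant mentioned above, assuming hypothetical core URLs. It uses the replication handler's disablepoll and enablepoll commands on the slaves instead of disablereplication on the master, and only builds the URLs in the order they would be sent.

```python
from urllib.parse import urlencode

# Replication handler commands: the first pair is sent to slaves,
# the second pair to the master.
SLAVE_COMMANDS = ("disablepoll", "enablepoll")
MASTER_COMMANDS = ("disablereplication", "enablereplication")


def replication_url(core_url, command):
    """Build a replication handler URL for one of the known commands."""
    if command not in SLAVE_COMMANDS + MASTER_COMMANDS:
        raise ValueError(f"unknown replication command: {command}")
    return f"{core_url}/replication?{urlencode({'command': command})}"


def slave_reindex_plan(master_core_url, slave_core_urls):
    """Pause polling on every slave, reindex on the master, resume polling."""
    pause = [replication_url(url, "disablepoll") for url in slave_core_urls]
    reindex = [
        f"{master_core_url}/dataimport?"
        + urlencode({"command": "full-import", "clean": "true", "commit": "true"})
    ]
    # Resume only after the import has finished and been verified.
    resume = [replication_url(url, "enablepoll") for url in slave_core_urls]
    return pause + reindex + resume


if __name__ == "__main__":
    for url in slave_reindex_plan(
        "http://master:8983/solr/products",
        ["http://slave1:8983/solr/products", "http://slave2:8983/solr/products"],
    ):
        print(url)
```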

Elizabeth

On Tue, Feb 12, 2019 at 11:42 AM Vadim Ivanov <
[hidden email]> wrote:


Re: What's the deal with dataimporthandler overwriting indexes?

Joakim Hansson
Thank you all for helping me with this.
I have started implementing aliases and that seems like the proper way to
go.
Thanks again and all the best!



On Tue, 12 Feb 2019 at 18:16, Elizabeth Haubert <
[hidden email]> wrote:
