Replication issue with version 0 index in SOLR 7.5

Patrick Bordelon
Hi,

We recently upgraded to SOLR 7.5 in AWS; we had previously been running SOLR
6.5. In our current configuration, each application is split into a
single-instance primary environment and a multi-instance replica environment,
and each environment sits behind its own load balancer.

Until recently, we've been able to reload the primary without the replicas
updating until it had a full index again. However, after upgrading to 7.5 we
noticed that after terminating and rebuilding a primary instance, all of the
associated replicas started showing 0 documents in every index. After some
research, we believe we've tracked the issue down to SOLR-11293.

SOLR-11293 changes
<https://issues.apache.org/jira/browse/SOLR-11293?focusedCommentId=16182379&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16182379>  

This fix changed the check the replication handler performs before updating a
replica when the primary has an empty index, whether the index is empty because
the old one was deleted or because the instance was terminated.

This is the code as it was in the 6.5 replication handler:

      if (latestVersion == 0L) {
        if (forceReplication && commit.getGeneration() != 0) {
          // since we won't get the files for an empty index,
          // we just clear ours and commit
          RefCounted<IndexWriter> iw = solrCore.getUpdateHandler().getSolrCoreState().getIndexWriter(solrCore);
          try {
            iw.get().deleteAll();
          } finally {
            iw.decref();
          }
          SolrQueryRequest req = new LocalSolrQueryRequest(solrCore, new ModifiableSolrParams());
          solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
        }

Without forced replication, the replica doesn't perform the deleteAll
operation and keeps its old index until a new index version is created.

However, in 7.5 the code was changed to this:

      if (latestVersion == 0L) {
        if (commit.getGeneration() != 0) {
          // since we won't get the files for an empty index,
          // we just clear ours and commit
          log.info("New index in Master. Deleting mine...");
          RefCounted<IndexWriter> iw = solrCore.getUpdateHandler().getSolrCoreState().getIndexWriter(solrCore);
          try {
            iw.get().deleteAll();
          } finally {
            iw.decref();
          }
          assert TestInjection.injectDelayBeforeSlaveCommitRefresh();
          if (skipCommitOnMasterVersionZero) {
            openNewSearcherAndUpdateCommitPoint();
          } else {
            SolrQueryRequest req = new LocalSolrQueryRequest(solrCore, new ModifiableSolrParams());
            solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
          }
        }

With the forceReplication check removed, we believe the replica always
deletes its index when it detects that a new version 0 index has been created.
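The difference between the two excerpts can be reduced to a small model of the
condition that guards deleteAll. This is a sketch, not Solr code: the class and
method names are hypothetical, `latestVersion` stands for the version the
primary reports, and `localGeneration` stands for the value checked via
`commit.getGeneration()`, which this model assumes is the replica's own latest
commit.

```java
/**
 * Simplified, hypothetical model of the empty-primary check quoted above
 * from the 6.5 and 7.5 replication handlers. Not Solr's actual API.
 */
public class EmptyPrimaryCheck {

    /** 6.5: the replica wipes its index only when replication is forced. */
    static boolean wipesIndexIn65(long latestVersion, long localGeneration,
                                  boolean forceReplication) {
        return latestVersion == 0L && forceReplication && localGeneration != 0;
    }

    /** 7.5: the forceReplication check is gone, so any poll can wipe. */
    static boolean wipesIndexIn75(long latestVersion, long localGeneration) {
        return latestVersion == 0L && localGeneration != 0;
    }

    public static void main(String[] args) {
        // A routine (non-forced) poll against a freshly rebuilt, empty primary,
        // made by a replica that still holds a populated index.
        long primaryVersion = 0L;
        long replicaGeneration = 42L;
        boolean forced = false;

        System.out.println("6.5 wipes replica: "
                + wipesIndexIn65(primaryVersion, replicaGeneration, forced));
        System.out.println("7.5 wipes replica: "
                + wipesIndexIn75(primaryVersion, replicaGeneration));
    }
}
```

Run as is, it prints `6.5 wipes replica: false` followed by
`7.5 wipes replica: true`: only the 7.5 condition clears a populated replica
on a routine poll of an empty primary.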

This is a problem because we can't afford to have active replicas with 0
documents if the primary fails. Since we can't control the termination of AWS
instances, any primary outage risks jeopardizing the replicas' viability.

Is there a way to restore the previous behavior in the current or a future
release? We are willing to upgrade to a later version, including the latest,
if it will help resolve this problem.

If you were going to suggest a load balancer health check to prevent this, we
already use one. However, the load balancer type we are using (an Application
Load Balancer) has a feature that still routes traffic through it when all of
the instances behind it are failing. This bypasses our health check and still
allows the replicas to poll the primary even when it isn't fully loaded. We
can't change load balancer types because we rely on other features of this one
and can't change that currently.
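The fail-open behavior described above can be modeled roughly like this. It is
a simplified sketch of the routing rule, not AWS code; the class, method, and
instance names are hypothetical. It shows why a health check on the primary
cannot stop replicas from polling it when the primary environment's only
instance is unhealthy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a load balancer that fails open. */
public class FailOpenRouting {

    /**
     * Returns the targets a request may be routed to: the healthy ones,
     * or every registered target when none are healthy (fail open).
     */
    static List<String> routableTargets(Map<String, Boolean> healthByTarget) {
        List<String> healthy = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : healthByTarget.entrySet()) {
            if (e.getValue()) {
                healthy.add(e.getKey());
            }
        }
        // Fail open: with no healthy targets, route to all of them anyway.
        return healthy.isEmpty() ? new ArrayList<>(healthByTarget.keySet()) : healthy;
    }

    public static void main(String[] args) {
        // Single-instance primary environment, still rebuilding its index
        // and therefore failing its health check.
        Map<String, Boolean> primaryEnv = new LinkedHashMap<>();
        primaryEnv.put("solr-primary-1", false);

        // Prints [solr-primary-1]: replica polls still reach the empty primary.
        System.out.println(routableTargets(primaryEnv));
    }
}
```

With a second, healthy instance in the map, only that instance would be
returned; the fail-open branch applies only when nothing passes the health
check, which is exactly the state of a single-instance primary that is still
rebuilding.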
               



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Replication issue with version 0 index in SOLR 7.5

Mikhail Khludnev
Hello, Patrick.
Can <str name="replicateAfter">commit</str> help you?

On Tue, Jun 25, 2019 at 12:55 AM Patrick Bordelon wrote:

--
Sincerely yours
Mikhail Khludnev

Re: Replication issue with version 0 index in SOLR 7.5

Mikhail Khludnev
Note that the current Solr logic seems to rely on the master's disks being
persistent; see:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/handler/TestReplicationHandler.java#L615


On Tue, Jun 25, 2019 at 3:16 PM Mikhail Khludnev wrote:

Re: Replication issue with version 0 index in SOLR 7.5

Patrick Bordelon
We are currently using replicateAfter with both commit and startup:

        <lst name="master">
            <str name="enable">${replication.enable.master:false}</str>
            <str name="replicateAfter">commit</str>
            <str name="replicateAfter">startup</str>
            <str name="confFiles">schema.xml,stopwords.txt</str>
        </lst>




Re: Replication issue with version 0 index in SOLR 7.5

Mikhail Khludnev
OK. Dropping startup will probably help. Another idea: set
replication.enable.master=false and enable it once the master index has been
rebuilt after a restart.

On Tue, Jun 25, 2019 at 6:18 PM Patrick Bordelon wrote:

Re: Replication issue with version 0 index in SOLR 7.5

Patrick Bordelon
I removed replicateAfter startup from our solrconfig.xml file. However, that
didn't solve the issue: when I rebuilt the primary, all of the associated
replicas went to 0 documents.

Re: Replication issue with version 0 index in SOLR 7.5

Patrick Bordelon
One other question related to this.

I know the change was made to fix a specific problem, but has it caused a
problem similar to mine for anyone else?

We're looking at changing the second 'if' statement to add an extra condition
that prevents it from performing the deleteAll operation unless explicitly
requested.

The idea is to use skipCommitOnMasterVersionZero and set it so that the if
statement is never true when the primary has a new-generation index.

We're going to try some changes to our polling strategy as a temporary
solution while we test changing that section of the index fetcher.

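For the temporary polling change, one option is a small watchdog next to each
replica that stops the replica from polling while the primary reports index
version 0. This is a sketch under stated assumptions: Java 11+, the
ReplicationHandler's `indexversion` command on the primary and its
`disablepoll`/`enablepoll` commands on the replica, and a `wt=json` response
containing an `"indexversion"` field. The class name, method names, and how
often `pollOnce` is scheduled are hypothetical. It only narrows the window: if
the replica's own scheduled poll runs before the watchdog notices the empty
primary, the index is still wiped.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical replica-side watchdog that pauses polling of an empty primary. */
public class PollGuard {

    private static final Pattern INDEX_VERSION =
            Pattern.compile("\"indexversion\"\\s*:\\s*(\\d+)");

    /** Extracts "indexversion" from a wt=json response to command=indexversion. */
    static long parseIndexVersion(String json) {
        Matcher m = INDEX_VERSION.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no indexversion in response");
        }
        return Long.parseLong(m.group(1));
    }

    /** Replica command: stop polling while the primary reports version 0. */
    static String commandFor(long primaryIndexVersion) {
        return primaryIndexVersion == 0L ? "disablepoll" : "enablepoll";
    }

    /** Builds a ReplicationHandler URL from a core base URL. */
    static String replicationUrl(String coreBaseUrl, String command) {
        return coreBaseUrl + "/replication?command=" + command + "&wt=json";
    }

    /** One watchdog pass; returns the command sent to the replica. */
    static String pollOnce(HttpClient http, String primaryCore, String replicaCore)
            throws IOException, InterruptedException {
        String body = get(http, replicationUrl(primaryCore, "indexversion"));
        String command = commandFor(parseIndexVersion(body));
        get(http, replicationUrl(replicaCore, command));
        return command;
    }

    private static String get(HttpClient http, String url)
            throws IOException, InterruptedException {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PollGuard <primaryCoreUrl> <replicaCoreUrl>");
            return;
        }
        HttpClient http = HttpClient.newHttpClient();
        System.out.println("sent " + pollOnce(http, args[0], args[1]));
    }
}
```

The primary URL passed in should be the same one the replicas poll (for
example, through the primary's load balancer) so the watchdog sees the same
instance the replica would fetch from. If the primary is unreachable,
`pollOnce` throws and leaves the replica's polling state unchanged.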