Solr Cloud Intermittent Backup Failure

jwrenn
Hello Everyone,

I have a Solr Cloud cluster running 6.4.2 that supports a large Sitecore
install. As part of our CI/CD pipeline we deploy many times a day to a
blue/green setup in AWS. To keep our Solr collections in sync across these
deployments, we create a temporary snapshot at the beginning of each
deployment. Each pod refers to its indexes through collection aliases,
e.g. <collection>_green vs. <collection>_blue. During the deployment we
switch where these aliases point, so that we essentially "pause" an index
by pointing its alias at a snapshot copy of the collection. We then
"resume" indexing by pointing the alias back at the live index. This
exposes a bug (SOLR-11616
<https://issues.apache.org/jira/browse/SOLR-11616>) that causes our
deployments to break and stay stuck until someone manually recreates the
affected collections. According to that ticket, it was fixed in Solr 7.2
and later. Unfortunately we cannot upgrade, as Solr 7.x is not compatible
with our application (Sitecore CMS).
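
To make the alias switching concrete, here is a minimal sketch of the
"pause" and "resume" steps as Collections API CREATEALIAS requests
(calling CREATEALIAS with an existing alias name re-points that alias).
The base URL, the "<collection>_snapshot" naming, and the helper names
are assumptions for illustration, not our actual deployment scripts. The
sketch only builds the request URLs and does not send them.

```python
from urllib.parse import urlencode

# Assumption: base URL of any node in the Solr Cloud cluster.
SOLR_BASE = "http://localhost:8983/solr"


def create_alias_url(alias, collection, base=SOLR_BASE):
    """Build a Collections API CREATEALIAS request URL.

    Issuing CREATEALIAS for an alias that already exists re-points it
    at the given collection.
    """
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base}/admin/collections?{urlencode(params)}"


def pause_url(collection, color):
    # Hypothetical naming: the snapshot copy is "<collection>_snapshot".
    return create_alias_url(f"{collection}_{color}", f"{collection}_snapshot")


def resume_url(collection, color):
    # Point the alias back at the live collection.
    return create_alias_url(f"{collection}_{color}", collection)


if __name__ == "__main__":
    print(pause_url("sitecore_web_index", "green"))
    print(resume_url("sitecore_web_index", "green"))
```

Sending either URL with any HTTP client performs the switch. The
snapshot collection has to exist before the "pause" call, which is where
the backup and restore steps, and the failures below, come in.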

The failure seems to occur randomly, and I haven't been able to get a solid
repro case yet. I did notice that forcing a leader election in the cloud (by
restarting the current leader host) can heal the problem. I wasn't able to
figure out how to change the leader in a satisfactory way via the Solr API,
so to automate a fix I would have to cycle all of my Solr Cloud nodes, which
isn't ideal.

Has anyone else wrestled with this bug, or does anyone have a suggestion
for how to minimize the impact?

Thanks,
Joe

*Here are some of the error messages:*

org.apache.solr.common.SolrException: Exception while restoring the backup index
        at org.apache.solr.handler.RestoreCore.doRestore(RestoreCore.java:130)
        at org.apache.solr.handler.admin.RestoreCoreOp.execute(RestoreCoreOp.java:65)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:377)
        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:379)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:165)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:445)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /var/solr/data/itembuckets_commerce_products_web_index_snapshot_shard1_replica0/data/restore.20180627124334892/_10n.si
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:335)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
        at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
        at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
        at org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:355)
        at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:286)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
        at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:125)
        at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:100)
        at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:240)
        at org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:203)
        at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:212)
        at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:686)
        at org.apache.solr.handler.RestoreCore.doRestore(RestoreCore.java:108)



org.apache.solr.common.SolrException: Failed to backup core=sitecore_web_index_shard1_replica1 because java.nio.file.NoSuchFileException: /var/solr/data/sitecore_web_index_shard1_replica1/data/index/segments_268
        at org.apache.solr.handler.admin.BackupCoreOp.execute(BackupCoreOp.java:80)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:377)
        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:379)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:165)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:445)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
        at org.eclipse.jetty.server.Server.handle(Server.java:534)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
        at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /var/solr/data/sitecore_web_index_shard1_replica1/data/index/segments_268
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:335)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
        at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
        at org.apache.lucene.store.Directory.copyFrom(Directory.java:177)
        at org.apache.solr.core.backup.repository.LocalFileSystemRepository.copyFileFrom(LocalFileSystemRepository.java:145)
        at org.apache.solr.handler.SnapShooter.createSnapshot(SnapShooter.java:218)
        at org.apache.solr.handler.SnapShooter.createSnapshot(SnapShooter.java:170)
        at org.apache.solr.handler.admin.BackupCoreOp.execute(BackupCoreOp.java:78)
        ... 33 more

*I also saw these but I'm not sure if they are related:*

        it is unusual to create a collection without cores

        solrindexwriter was not closed prior to finalize() indicates a bug -- possible resource leak



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html