checksum failed (hardware problem?)


checksum failed (hardware problem?)

Susheel Kumar-3
Hello,

I am still trying to understand the corrupt index exception we saw in our
logs. What does the "hardware problem" comment indicate here? Does that
mean the corruption was most likely caused by a hardware issue?

We never had this problem in the last couple of months. We are running
Solr 6.6.2 and ZooKeeper 3.4.10.

Please share your thoughts.

Thanks,
Susheel

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=db243d1a actual=7a00d3d2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs") [slice=_i27s_Lucene50_0.tim]))

It suddenly started appearing in the logs; before that there was no such
error. Searches and ingestion all seemed to be working prior to that.

----

2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.u.p.StatelessScriptUpdateProcessorFactory update-script#processAdd: newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US
2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US to the index; possible analysis error.
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:206)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
    at org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory$ScriptUpdateProcessor.processAdd(StatelessScriptUpdateProcessorFactory.java:380)
    at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:306)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
    at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:271)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:173)
    at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:187)
    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:108)
    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1567)
    at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:924)
    at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
    ... 54 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=db243d1a actual=7a00d3d2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs") [slice=_i27s_Lucene50_0.tim]))
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:419)
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:526)
    at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:336)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
    at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
    at org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)

2018-09-03 17:16:49.116 INFO  (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request [COLL_shard1_replica1]  webapp=/solr path=/update params={wt=javabin&version=2} status=400 QTime=69

Re: checksum failed (hardware problem?)

Erick Erickson
There are several reasons this could "suddenly" start appearing:
1> Your disk went bad and some sector is no longer faithfully
recording the bits. In this case the checksum will be wrong.
2> You ran out of disk space at some point and the index was corrupted.
This isn't really a hardware problem.
3> Your disk controller is going wonky and not reading reliably.

The "possible hardware issue" message is there to alert you that this
is highly unusual and that you should at least consider running
integrity checks on your disk before assuming it's a Solr/Lucene problem.

Best,
Erick
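[Editor's note] The failure the thread is discussing is a CRC32 comparison: at write time Lucene stores a checksum of each segment file in the file's footer, and at read time it recomputes the checksum over the bytes it just read and compares the two (the `CodecUtil.checkFooter` frame in the trace). A minimal sketch of that idea using only the JDK's `CRC32`; class and variable names here are illustrative, not Lucene's actual internals:

```java
import java.util.zip.CRC32;

public class ChecksumCheck {

    // Compute a CRC32 over a byte array, as a stand-in for the checksum
    // Lucene stores in a segment file's footer.
    public static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] asWritten = "terms dictionary bytes".getBytes();
        long expected = crc32(asWritten);   // stored in the footer at write time

        byte[] asRead = asWritten.clone();
        asRead[5] ^= 0x01;                  // simulate one bit flipped on disk

        long actual = crc32(asRead);        // recomputed when the file is read
        if (expected != actual) {
            // This mismatch is the condition that surfaces as
            // CorruptIndexException: "checksum failed ... expected=... actual=..."
            System.out.printf("checksum failed: expected=%08x actual=%08x%n",
                              expected, actual);
        }
    }
}
```

Because Lucene never rewrites a committed segment file, a footer mismatch means the bytes changed after they were written, which is why the message points at the storage layer rather than at Lucene's own code.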
On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <[hidden email]> wrote:


Re: checksum failed (hardware problem?)

Susheel Kumar-3
Hi Erick,

Thanks so much for your reply. I'll now look mostly into possible
hardware issues rather than Solr/Lucene.

Thanks again.

On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <[hidden email]>
wrote:


Re: checksum failed (hardware problem?)

Erick Erickson
Mind you, it could _still_ be Solr/Lucene, but let's check the hardware first ;)
On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <[hidden email]> wrote:

>
> Hi Erick,
>
> Thanks so much for your reply.  I'll now look mostly into any possible
> hardware issues than Solr/Lucene.
>
> Thanks again.
>
> On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <[hidden email]>
> wrote:
>
> > There are several of reasons this would "suddenly" start appearing.
> > 1> Your disk went bad and some sector is no longer faithfully
> > recording the bits. In this case the checksum will be wrong
> > 2> You ran out of disk space sometime and the index was corrupted.
> > This isn't really a hardware problem.
> > 3> Your disk controller is going wonky and not reading reliably.
> >
> > The "possible hardware issue" message is to alert you that this is
> > highly unusual and you should at leasts consider doing integrity
> > checks on your disk before assuming it's a Solr/Lucene problem
> >
> > Best,
> > Erick
> > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <[hidden email]>
> > wrote:
> > >
> > > Hello,
> > >
> > > I am still trying to understand the corrupt index exception we saw in our
> > > logs. What does the hardware problem comment indicates here?  Does that
> > > mean it caused most likely due to hardware issue?
> > >
> > > We never had this problem in last couple of months. The Solr is 6.6.2 and
> > > ZK: 3.4.10.
> > >
> > > Please share your thoughts.
> > >
> > > Thanks,
> > > Susheel
> > >
> > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum
> > > failed *(hardware
> > > problem?)* : expected=db243d1a actual=7a00d3d2
> > >
> > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > [slice=_i27s_Lucene50_0.tim])
> > >
> > > It suddenly started in the logs and before which there was no such error.
> > > Searches & ingestions all seems to be working prior to that.
> > >
> > > ----
> > >
> > > 2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL s:shard1
> > > r:core_node1 x:COLL_shard1_replica1]
> > > o.a.s.u.p.StatelessScriptUpdateProcessorFactory update-script#processAdd:
> > > newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US
> > > 2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL s:shard1
> > > r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase
> > > org.apache.solr.common.SolrException: Exception writing document id
> > > G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US to the
> > > index; possible analysis error.
> > > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:206)
> > > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> > > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
> > > at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
> > > at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
> > > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > at org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory$ScriptUpdateProcessor.processAdd(StatelessScriptUpdateProcessorFactory.java:380)
> > > at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
> > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
> > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
> > > at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:306)
> > > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
> > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
> > > at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:271)
> > > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
> > > at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:173)
> > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:187)
> > > at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:108)
> > > at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> > > at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> > > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> > > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> > > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> > > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> > > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> > > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> > > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> > > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> > > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> > > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> > > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> > > at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> > > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> > > at org.eclipse.jetty.server.Server.handle(Server.java:534)
> > > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> > > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> > > at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> > > at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> > > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> > > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> > > at java.lang.Thread.run(Thread.java:748)
> > > Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter
> > > is closed
> > > at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> > > at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> > > at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1567)
> > > at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:924)
> > > at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
> > > at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
> > > at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> > > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
> > > ... 54 more
> > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed
> > > (hardware problem?) : expected=db243d1a actual=7a00d3d2
> > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > [slice=_i27s_Lucene50_0.tim]))
> > > at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:419)
> > > at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:526)
> > > at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:336)
> > > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
> > > at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
> > > at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
> > > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
> > > at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> > > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> > > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
> > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
> > > at org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
> > > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
> > > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> > >
> > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872) [c:COLL s:shard1
> > > r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > > [COLL_shard1_replica1]  webapp=/solr path=/update
> > > params={wt=javabin&version=2} status=400 QTime=69
> >
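For context on what is being compared in that message: each Lucene index file ends with a footer containing a CRC32 of the file's contents, written at flush time and recomputed whenever the file is fully read (the `CodecUtil.checkFooter` frame in the trace above). The sketch below is not Lucene's actual implementation, just a minimal stdlib-only illustration of why a single flipped bit on disk produces an expected/actual mismatch like the one in the log:

```java
import java.util.zip.CRC32;

// Minimal illustration (not Lucene's actual code) of a footer-style
// checksum: compute a CRC32 when the data is written, recompute it when
// the data is read back, and fail on any mismatch.
public class ChecksumDemo {

    // CRC32 of the bytes, rendered as lowercase hex like Lucene's log output.
    static String crc32Hex(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return Long.toHexString(crc.getValue());
    }

    public static void main(String[] args) {
        byte[] written = "term dictionary bytes of some segment".getBytes();
        String expected = crc32Hex(written); // stored in the footer at write time

        // Simulate one bit silently flipped by a failing disk or controller.
        byte[] readBack = written.clone();
        readBack[3] ^= 0x01;
        String actual = crc32Hex(readBack); // recomputed at read time

        if (!expected.equals(actual)) {
            System.out.println("checksum failed (hardware problem?) : expected="
                    + expected + " actual=" + actual);
        } else {
            System.out.println("checksum ok");
        }
    }
}
```

CRC32 detects every single-bit error, which is why this kind of mismatch almost always means the bytes on disk are no longer what Lucene wrote, rather than a logic bug.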

Re: checksum failed (hardware problem?)

Susheel Kumar-3
Got it. I'll have the hardware folks check first, and if they don't find
anything suspicious then I'll return here.

Wondering if anybody has seen a similar error and was able to confirm
whether it was a hardware fault.

Thnx

On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <[hidden email]>
wrote:

> Mind you it could _still_ be Solr/Lucene, but let's check the hardware
> first ;)
> On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <[hidden email]>
> wrote:
> >
> > Hi Erick,
> >
> > Thanks so much for your reply.  I'll now look mostly into any possible
> > hardware issues than Solr/Lucene.
> >
> > Thanks again.
> >
> > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <[hidden email]
> >
> > wrote:
> >
> > > There are several reasons this would "suddenly" start appearing.
> > > 1> Your disk went bad and some sector is no longer faithfully
> > > recording the bits. In this case the checksum will be wrong
> > > 2> You ran out of disk space sometime and the index was corrupted.
> > > This isn't really a hardware problem.
> > > 3> Your disk controller is going wonky and not reading reliably.
> > >
> > > The "possible hardware issue" message is to alert you that this is
> > > highly unusual and you should at least consider doing integrity
> > > checks on your disk before assuming it's a Solr/Lucene problem.
> > >
> > > Best,
> > > Erick
> > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <[hidden email]>
> > > wrote:
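Before swapping hardware, a cheap sanity test for the "disk controller is going wonky and not reading reliably" case is simply reading the same index file twice and comparing checksums; the authoritative check remains Lucene's bundled `org.apache.lucene.index.CheckIndex` tool, which validates every segment's stored footer checksums. The sketch below is a hypothetical standalone check, not a Solr/Lucene tool, and the file it creates is only a stand-in for a real segment file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical double-read sanity check (not a Solr/Lucene tool): read the
// same file twice and compare CRC32s. A healthy disk returns identical
// bytes both times; unstable reads point at the disk or controller.
public class DoubleReadCheck {

    static long crc32Of(Path p) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(p));
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an index file such as a .cfs compound file; point
        // this at a real segment file when actually diagnosing.
        Path file = Files.createTempFile("segment", ".cfs");
        Files.write(file, "compound segment file bytes".getBytes());

        long first = crc32Of(file);
        long second = crc32Of(file);
        System.out.println(first == second
                ? "reads consistent"
                : "reads DIFFER: suspect disk/controller");
        Files.deleteIfExists(file);
    }
}
```

One caveat: the second read will often be served from the OS page cache, so agreement here does not rule out a bad disk; it only catches the most blatant read instability.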

Re: checksum failed (hardware problem?)

simon-2
I saw something like this a year ago, which I reported as a possible bug (
https://issues.apache.org/jira/browse/SOLR-10840, which has a full
description and stack traces).

This occurred very randomly on an AWS instance; moving the index directory
to a different file system did not fix the problem. Eventually I cloned our
environment to a new AWS instance, which proved to be the solution. Why, I
have no idea...

-Simon

On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <[hidden email]>
wrote:

> > > > > at
> > > > >
> > > >
> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread
> .run(ConcurrentMergeScheduler.java:661)
> > > > >
> > > > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872) [c:COLL
> s:shard1
> > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > > > > [COLL_shard1_replica1]  webapp=/solr path=/update
> > > > > params={wt=javabin&version=2} status=400 QTime=69
> > > >
> >
>

Re: checksum failed (hardware problem?)

Susheel Kumar-3
Thank you, Simon. That basically points to something environment-related causing the checksum failures rather than any Lucene/Solr issue.

Erick - I did check with the hardware folks, and they are aware of a VMware issue where a VM hosted in an HCI environment comes to a halt for a minute or so and may be losing its connections to disk/network. That may well be the reason for the index corruption, though they have not been able to find anything specific in the logs from the time Solr ran into the issue.

I also had another issue where Solr is losing its connection to ZooKeeper (Client session timed out, have not heard from server in 8367ms for sessionid 0x0). Does that point to a similar hardware issue? Any suggestions?

Thanks,
Susheel

2018-09-29 17:30:44.070 INFO
(searcherExecutor-7-thread-1-processing-n:server54:8080_solr
x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
[COLL_shard4_replica2] Registered new searcher
Searcher@7a4465b1[COLL_shard4_replica2]
main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
Uninverting(_hapw(6.6.2):C5/2:delGen=1)
Uninverting(_hapx(6.6.2):C2/1:delGen=1)
Uninverting(_hapy(6.6.2):C2/1:delGen=1)
Uninverting(_hapz(6.6.2):C3/1:delGen=1)
Uninverting(_haq0(6.6.2):C6/3:delGen=1)
Uninverting(_haq1(6.6.2):C1)))}
2018-09-29 17:30:52.390 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8367ms for sessionid 0x0
2018-09-29 17:31:01.302 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8812ms for sessionid 0x0
2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper reestablished.
2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core states after session expiration.
2018-09-29 17:31:14.051 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16) -> (15)
2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4 r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request [COLL_shard4_replica2]  webapp=/solr path=/admin/ping params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin} hits=4989979 status=0 QTime=0
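If brief VM stalls are the cause, one mitigation worth trying (a sketch; the timeout value below is illustrative, not taken from this thread) is raising Solr's ZooKeeper session timeout in solr.xml, so that a halt of several seconds does not expire the session. Note that the ZooKeeper server caps the negotiated timeout at its maxSessionTimeout, which defaults to 20 x tickTime.

```xml
<!-- solr.xml fragment; the 30000 ms value is a hypothetical example -->
<solr>
  <solrcloud>
    <!-- raise this if VM stalls approach the configured timeout -->
    <int name="zkClientTimeout">30000</int>
  </solrcloud>
</solr>
```

This only papers over stalls shorter than the timeout; halts of a minute, as described above, will still expire the session.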




On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:

> I saw something like this a year ago, which I reported as a possible bug
> (https://issues.apache.org/jira/browse/SOLR-10840, which has a full
> description and stack traces).
>
> This occurred very randomly on an AWS instance; moving the index directory
> to a different file system did not fix the problem. Eventually I cloned our
> environment to a new AWS instance, which proved to be the solution. Why, I
> have no idea...
>
> -Simon
>
> On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <[hidden email]> wrote:
>
> > Got it. I'll first have the hardware folks check, and if they don't
> > see/find anything suspicious then I'll return here.
> >
> > Wondering if anybody has seen a similar error and was able to confirm
> > whether it was a hardware fault.
> >
> > Thnx
> >
> > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <[hidden email]> wrote:
> >
> > > Mind you, it could _still_ be Solr/Lucene, but let's check the hardware
> > > first ;)
> > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <[hidden email]> wrote:
> > > >
> > > > Hi Erick,
> > > >
> > > > Thanks so much for your reply. I'll now look mostly into possible
> > > > hardware issues rather than Solr/Lucene.
> > > >
> > > > Thanks again.
> > > >
> > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <[hidden email]> wrote:
> > > >
> > > > > There are several reasons this could "suddenly" start appearing:
> > > > > 1> Your disk went bad and some sector is no longer faithfully
> > > > > recording the bits. In this case the checksum will be wrong.
> > > > > 2> You ran out of disk space at some point and the index was
> > > > > corrupted. This isn't really a hardware problem.
> > > > > 3> Your disk controller is going wonky and not reading reliably.
> > > > >
> > > > > The "possible hardware issue" message is there to alert you that
> > > > > this is highly unusual and you should at least consider doing
> > > > > integrity checks on your disk before assuming it's a Solr/Lucene
> > > > > problem.
> > > > >
> > > > > Best,
> > > > > Erick
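Erick's first and third causes both come down to bits reading back differently from how they were written. Lucene catches this because each segment file ends with a CRC32 footer, which `CodecUtil.checkFooter` recomputes on read and compares against the stored value. A minimal sketch of that mechanism in plain Python (not Lucene code; the payload bytes are made up):

```python
import zlib

# Simulate a segment file: a payload plus a stored CRC32 "footer",
# the way Lucene records a checksum at write time and verifies it on read.
payload = bytearray(b"term dictionary bytes for segment _i27s")
stored_checksum = zlib.crc32(bytes(payload))  # recorded when the file is written

# Healthy read: recomputing the checksum over the same bytes matches.
assert zlib.crc32(bytes(payload)) == stored_checksum

# A single bit silently flipped by a failing disk, controller, or RAM:
payload[7] ^= 0x01
actual = zlib.crc32(bytes(payload))
if actual != stored_checksum:
    # Analogous to: CorruptIndexException: checksum failed (hardware problem?)
    print(f"checksum failed: expected={stored_checksum:08x} actual={actual:08x}")
```

The checksum verifies the whole read path (disk, controller, memory), which is why a mismatch is flagged as a probable hardware problem rather than a Lucene bug.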

Re: checksum failed (hardware problem?)

Stephen Bianamara
Hello All --

As it would happen, we've seen this error on version 6.6.2 very recently. This was also on an AWS instance, like in Simon's report. The drive doesn't show any sign of being unhealthy, at least from a cursory investigation. FWIW, this occurred during a collection backup.

Erick, is there some diagnostic data we can find to help pin this down?

Thanks!
Stephen
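One piece of diagnostic data that can help pin this down is the output of Lucene's offline CheckIndex tool, which re-reads every segment and verifies the same checksums the merge tripped over. A sketch of invoking it (jar locations and names are assumptions, not taken from this thread; run it only while nothing is writing to the index):

```shell
# Stop Solr (or otherwise ensure the core is not being written to) first.
JARS=/opt/solr/server/solr-webapp/webapp/WEB-INF/lib
java -cp "$JARS/lucene-core-6.6.2.jar:$JARS/lucene-backward-codecs-6.6.2.jar" \
  org.apache.lucene.index.CheckIndex \
  /app/solr/data/COLL_shard1_replica1/data/index
```

If CheckIndex reports the same corrupt segment repeatedly on re-runs, the bad bits are on disk; if results vary between runs, that points more strongly at the controller or RAM.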

On Sun, Sep 30, 2018 at 12:48 PM Susheel Kumar <[hidden email]>
wrote:

> Thank you, Simon. Which basically points that something related to env and
> was causing the checksum failures than any lucene/solr issue.
>
> Eric - I did check with hardware folks and they are aware of some VMware
> issue where the VM hosted in HCI environment is coming into some halt state
> for minute or so and may be loosing connections to disk/network.  So that
> probably may be the reason of index corruption though they have not been
> able to find anything specific from logs during the time Solr run into
> issue
>
> Also I had again issue where Solr is loosing the connection with zookeeper
> (Client session timed out, have not heard from server in 8367ms for
> sessionid 0x0)  Does that points to similar hardware issue, Any
> suggestions?
>
> Thanks,
> Susheel
>
> 2018-09-29 17:30:44.070 INFO
> (searcherExecutor-7-thread-1-processing-n:server54:8080_solr
> x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
> [COLL_shard4_replica2] Registered new searcher
> Searcher@7a4465b1[COLL_shard4_replica2]
>
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
> Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
> Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
> Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
> Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
> Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
> Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
> Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
> Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
> Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
> Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
> Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
> Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
> Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
> Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
> Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
> Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
> Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
> Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
> Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
> Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
> Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
> Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
> Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
> Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
> Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
> Uninverting(_hapw(6.6.2):C5/2:delGen=1)
> Uninverting(_hapx(6.6.2):C2/1:delGen=1)
> Uninverting(_hapy(6.6.2):C2/1:delGen=1)
> Uninverting(_hapz(6.6.2):C3/1:delGen=1)
> Uninverting(_haq0(6.6.2):C6/3:delGen=1)
> Uninverting(_haq1(6.6.2):C1)))}
> 2018-09-29 17:30:52.390 WARN
>
> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182))
> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
> server in 8367ms for sessionid 0x0
> 2018-09-29 17:31:01.302 WARN
>
> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182))
> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
> server in 8812ms for sessionid 0x0
> 2018-09-29 17:31:14.049 INFO
> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper
> reestablished.
> 2018-09-29 17:31:14.049 INFO
> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing
> core states after session expiration.
> 2018-09-29 17:31:14.051 INFO
> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16)
> -> (15)
> 2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4
> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request
> [COLL_shard4_replica2]  webapp=/solr path=/admin/ping
>
> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
> http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin
> }
> webapp=/solr path=/admin/ping
>
> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
> http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin
> }
> hits=4989979 status=0 QTime=0
>
>
>
>
> On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:
>
> > I saw something like this a year ago which i reported as a possible bug
> (
> > https://issues.apache.org/jira/browse/SOLR-10840, which has  a full
> > description and stack traces)
> >
> > This occurred very randomly on an AWS instance; moving the index
> directory
> > to a different file system did not fix the problem Eventually I cloned
> our
> > environment to a new AWS instance, which proved to be the solution. Why,
> I
> > have no idea...
> >
> > -Simon
> >
> > On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <[hidden email]>
> > wrote:
> >
> > > Got it. I'll have first hardware folks check and if they don't see/find
> > > anything suspicious then i'll return here.
> > >
> > > Wondering if any body has seen similar error and if they were able to
> > > confirm if it was hardware fault or so.
> > >
> > > Thnx
> > >
> > > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <
> [hidden email]>
> > > wrote:
> > >
> > > > Mind you it could _still_ be Solr/Lucene, but let's check the
> hardware
> > > > first ;)
> > > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <[hidden email]
> >
> > > > wrote:
> > > > >
> > > > > Hi Erick,
> > > > >
> > > > > Thanks so much for your reply.  I'll now look mostly into any
> > possible
> > > > > hardware issues than Solr/Lucene.
> > > > >
> > > > > Thanks again.
> > > > >
> > > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> > > [hidden email]
> > > > >
> > > > > wrote:
> > > > >
> > > > > > There are several of reasons this would "suddenly" start
> appearing.
> > > > > > 1> Your disk went bad and some sector is no longer faithfully
> > > > > > recording the bits. In this case the checksum will be wrong
> > > > > > 2> You ran out of disk space sometime and the index was
> corrupted.
> > > > > > This isn't really a hardware problem.
> > > > > > 3> Your disk controller is going wonky and not reading reliably.
> > > > > >
> > > > > > The "possible hardware issue" message is to alert you that this
> is
> > > > > > highly unusual and you should at leasts consider doing integrity
> > > > > > checks on your disk before assuming it's a Solr/Lucene problem
> > > > > >
> > > > > > Best,
> > > > > > Erick
> > > > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <
> > [hidden email]
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am still trying to understand the corrupt index exception we
> > saw
> > > > in our
> > > > > > > logs. What does the hardware problem comment indicates here?
> > Does
> > > > that
> > > > > > > mean it caused most likely due to hardware issue?
> > > > > > >
> > > > > > > We never had this problem in last couple of months. The Solr is
> > > > 6.6.2 and
> > > > > > > ZK: 3.4.10.
> > > > > > >
> > > > > > > Please share your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Susheel
> > > > > > >
> > > > > > > Caused by: org.apache.lucene.index.CorruptIndexException:
> > checksum
> > > > > > > failed *(hardware
> > > > > > > problem?)* : expected=db243d1a actual=7a00d3d2
> > > > > > >
> > > > > >
> > > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> > > app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > > > > > [slice=_i27s_Lucene50_0.tim])

Re: checksum failed (hardware problem?)

Stephen Bianamara
To be more concrete: Is the definitive test of whether or not a core's
index is corrupt to copy it onto a new set of hardware and attempt to write
to it? If this is a definitive test, we can run the experiment and update
the report so you have a sense of how often this happens.

Since this is a SolrCloud node, which has already been removed but whose data
dir was preserved, I believe I can just copy the data directory to a fresh
machine and start a regular non-cloud Solr node hosting this core. Can you
please confirm that this is a definitive test, or whether some additional
step is needed to make it definitive?

Thanks!

On Wed, Oct 3, 2018 at 2:10 AM Stephen Bianamara <[hidden email]>
wrote:

> Hello All --
>
> As it would happen, we've seen this error on version 6.6.2 very recently.
> This is also on an AWS instance, like Simon's report. The drive doesn't
> show any sign of being unhealthy, either from cursory investigation. FWIW,
> this occurred during a collection backup.
>
> Erick, is there some diagnostic data we can find to help pin this down?
>
> Thanks!
> Stephen
>
> On Sun, Sep 30, 2018 at 12:48 PM Susheel Kumar <[hidden email]>
> wrote:
>
>> Thank you, Simon. That suggests something environment-related was causing
>> the checksum failures rather than any Lucene/Solr issue.
>>
>> Erick - I did check with the hardware folks and they are aware of a VMware
>> issue where a VM hosted in an HCI environment comes to a halt for a minute
>> or so and may lose its disk/network connections. That is probably the
>> cause of the index corruption, though they have not been able to find
>> anything specific in the logs from the time Solr ran into the issue.
>>
>> I also had another issue where Solr loses its connection to ZooKeeper
>> (Client session timed out, have not heard from server in 8367ms for
>> sessionid 0x0). Does that point to a similar hardware issue? Any
>> suggestions?
>>
>> Thanks,
>> Susheel
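
For reference, the client session timeout governing warnings like the ones in the log excerpt that follows is Solr's zkClientTimeout, set in solr.xml. A sketch only; the 30000 ms shown is a commonly used value, not necessarily this cluster's setting:

```xml
<solr>
  <solrcloud>
    <!-- How long Solr waits before treating the ZooKeeper session as lost.
         Raising this can ride out short VM pauses, at the cost of slower
         failure detection. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```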
>>
>> 2018-09-29 17:30:44.070 INFO
>> (searcherExecutor-7-thread-1-processing-n:server54:8080_solr
>> x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
>> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
>> [COLL_shard4_replica2] Registered new searcher
>> Searcher@7a4465b1[COLL_shard4_replica2]
>>
>> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
>> Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
>> Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
>> Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
>> Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
>> Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
>> Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
>> Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
>> Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
>> Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
>> Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
>> Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
>> Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
>> Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
>> Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
>> Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
>> Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
>> Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
>> Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
>> Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
>> Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
>> Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
>> Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
>> Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
>> Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
>> Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
>> Uninverting(_hapw(6.6.2):C5/2:delGen=1)
>> Uninverting(_hapx(6.6.2):C2/1:delGen=1)
>> Uninverting(_hapy(6.6.2):C2/1:delGen=1)
>> Uninverting(_hapz(6.6.2):C3/1:delGen=1)
>> Uninverting(_haq0(6.6.2):C6/3:delGen=1)
>> Uninverting(_haq1(6.6.2):C1)))}
>> 2018-09-29 17:30:52.390 WARN
>>
>> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182))
>> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
>> server in 8367ms for sessionid 0x0
>> 2018-09-29 17:31:01.302 WARN
>>
>> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182))
>> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
>> server in 8812ms for sessionid 0x0
>> 2018-09-29 17:31:14.049 INFO
>> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>>   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper
>> reestablished.
>> 2018-09-29 17:31:14.049 INFO
>> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>>   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing
>> core states after session expiration.
>> 2018-09-29 17:31:14.051 INFO
>> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
>>   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16)
>> -> (15)
>> 2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4
>> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request
>> [COLL_shard4_replica2]  webapp=/solr path=/admin/ping
>>
>> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
>> http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin
>> }
>> webapp=/solr path=/admin/ping
>>
>> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=
>> http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin
>> }
>> hits=4989979 status=0 QTime=0
>>
>>
>>
>>
>> On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:
>>
>> > I saw something like this a year ago, which I reported as a possible bug
>> > (https://issues.apache.org/jira/browse/SOLR-10840, which has a full
>> > description and stack traces).
>> >
>> > This occurred very randomly on an AWS instance; moving the index
>> > directory to a different file system did not fix the problem. Eventually
>> > I cloned our environment to a new AWS instance, which proved to be the
>> > solution. Why, I have no idea...
>> >
>> > -Simon
>> >
>> > On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <[hidden email]>
>> > wrote:
>> >
>> > > Got it. I'll first have the hardware folks check, and if they don't
>> > > find anything suspicious I'll return here.
>> > >
>> > > Wondering if anybody has seen a similar error and was able to confirm
>> > > whether it was a hardware fault.
>> > >
>> > > Thnx
>> > >
>> > > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <
>> [hidden email]>
>> > > wrote:
>> > >
>> > > > Mind you it could _still_ be Solr/Lucene, but let's check the
>> hardware
>> > > > first ;)
>> > > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <
>> [hidden email]>
>> > > > wrote:
>> > > > >
>> > > > > Hi Erick,
>> > > > >
>> > > > > Thanks so much for your reply.  I'll now look mostly into any
>> > possible
>> > > > > hardware issues than Solr/Lucene.
>> > > > >
>> > > > > Thanks again.
>> > > > >
>> > > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
>> > > [hidden email]
>> > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > There are several reasons this would "suddenly" start appearing:
>> > > > > > 1> Your disk went bad and some sector is no longer faithfully
>> > > > > > recording the bits. In this case the checksum will be wrong.
>> > > > > > 2> You ran out of disk space at some point and the index was
>> > > > > > corrupted. This isn't really a hardware problem.
>> > > > > > 3> Your disk controller is going wonky and not reading reliably.
>> > > > > >
>> > > > > > The "possible hardware issue" message is to alert you that this
>> > > > > > is highly unusual and you should at least consider doing
>> > > > > > integrity checks on your disk before assuming it's a Solr/Lucene
>> > > > > > problem.
>> > > > > > Best,
>> > > > > > Erick
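
Causes 1 and 3 above amount to bit-level damage: Lucene stores a checksum in each index file's footer and re-verifies it on read, so even a single flipped byte is caught. A generic sketch of that idea, using POSIX cksum rather than Lucene's actual CRC32 footer format (the file name is just borrowed from the trace above):

```shell
# Write a small "segment" file, record its checksum, flip one byte, re-check.
tmp=$(mktemp -d)
printf 'fake segment data' > "$tmp/_i27s.cfs"
before=$(cksum "$tmp/_i27s.cfs")

# Simulate a single corrupted byte at offset 3 (what a bad sector might do).
printf 'X' | dd of="$tmp/_i27s.cfs" bs=1 seek=3 conv=notrunc 2>/dev/null
after=$(cksum "$tmp/_i27s.cfs")

if [ "$before" != "$after" ]; then
  echo "checksum mismatch: corruption detected"
fi
rm -rf "$tmp"
```

The file is the same length before and after; only the checksum reveals the damage, which is exactly why the exception above can point at hardware even when the file "looks" intact.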
>> > > > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <
>> > [hidden email]
>> > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > Hello,
>> > > > > > >
>> > > > > > > I am still trying to understand the corrupt index exception we
>> > saw
>> > > > in our
>> > > > > > > logs. What does the hardware problem comment indicates here?
>> > Does
>> > > > that
>> > > > > > > mean it caused most likely due to hardware issue?
>> > > > > > >
>> > > > > > > We never had this problem in last couple of months. The Solr
>> is
>> > > > 6.6.2 and
>> > > > > > > ZK: 3.4.10.
>> > > > > > >
>> > > > > > > Please share your thoughts.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > Susheel
>> > > > > > >

Re: checksum failed (hardware problem?)

Susheel Kumar-3
My understanding is that once the index is corrupt, the only way to fix it is
the CheckIndex utility, which removes the bad segments; only then can the
index be used again.
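
For reference, once a bad replica's data directory has been copied somewhere safe, running Lucene's CheckIndex over it looks roughly like this. This is only a sketch: INDEX_DIR comes from the path in the stack trace, the jar location is an assumed one, and `-exorcise` permanently drops unreadable segments (losing the documents in them), so never point it at your only copy.

```shell
# Sketch only -- LUCENE_JAR is an assumed location; INDEX_DIR is the path
# from the CorruptIndexException in this thread.
INDEX_DIR=/app/solr/data/COLL_shard1_replica1/data/index
LUCENE_JAR=server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.2.jar

# First a read-only pass, to see which segments fail their checksums:
CHECK_CMD="java -cp $LUCENE_JAR org.apache.lucene.index.CheckIndex $INDEX_DIR"
# Only on a backup copy: let CheckIndex remove the unreadable segments.
FIX_CMD="$CHECK_CMD -exorcise"

echo "$CHECK_CMD"
echo "$FIX_CMD"
```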

It is a bit scary that you are seeing a similar error on 6.6.2, though in our
case we know we are going through some hardware problems which likely caused
the corruption; there is no concrete evidence to confirm whether it is
hardware or Solr/Lucene. Are you able to move to another AWS instance,
similar to Simon's case?
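
The kind of concrete evidence the hardware folks would look for comes from the standard disk tooling. A hypothetical sketch (the device name is a placeholder, not from either of our environments):

```shell
# Hypothetical first checks before blaming Lucene; /dev/sda is a placeholder.
DEVICE=/dev/sda
SMART_CMD="smartctl -a $DEVICE"                   # SMART attributes, reallocated sectors
KERNEL_CMD="dmesg -T | grep -iE 'i/o error|ata'"  # kernel-logged I/O errors
echo "would run: $SMART_CMD"
echo "would run: $KERNEL_CMD"
```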

Thanks,
Susheel

On Thu, Oct 4, 2018 at 7:11 PM Stephen Bianamara <[hidden email]> wrote:

> To be more concrete: is the definitive test of whether or not a core's
> index is corrupt to copy it onto a new set of hardware and attempt to
> write to it? If this is a definitive test, we can run the experiment and
> update the report so you have a sense of how often this happens.
>
> Since this is a SOLR cloud node, which is already removed but whose data
> dir was preserved, I believe I can just copy the data directory to a
> fresh machine and start a regular non-cloud solr node hosting this core.
> Can you please confirm that this will be a definitive test, or whether
> there is some aspect needed to make it definitive?
>
> Thanks!

Re: checksum failed (hardware problem?)

Stephen Bianamara
Hi Susheel,

Yes, I believe you are correct about fixing a node in place. My org actually
just cycles instances rather than repairing broken ones.

It's too bad that there's nothing conclusive we can look for to help
investigate the scope. We'd love to pin this down so that, if it's a hardware
failure, we could take something concrete to AWS (e.g., we found a log
indicating ....). I haven't been able to find anything outside of Solr which
might clarify the matter either. Perhaps it's just not realistic at this time.

I'm also curious about another aspect: the nodes don't report as unhealthy.
Currently a node with a bad checksum will just stay in the collection
forever. Shouldn't the node go to "down" if it has an irreparable checksum?
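
That matches what we see from the Collections API: CLUSTERSTATUS keeps reporting the replica as "active". A toy illustration (the JSON fragment below is fabricated for the sketch, not from a real cluster):

```shell
# Fabricated CLUSTERSTATUS fragment: the corrupt replica still reads "active".
# Against a real cluster the state would come from:
#   curl 'http://host:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'
STATE_JSON='{"replicas":{"core_node1":{"core":"COLL_shard1_replica1","state":"active"}}}'
echo "$STATE_JSON" | grep -o '"state":"[a-z]*"'    # prints "state":"active"
```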

On Fri, Oct 5, 2018 at 5:25 AM Susheel Kumar <[hidden email]> wrote:

> My understanding is once the index is corrupt, the only way to fix is using
> checkindex utility which will remove some bad segments and then only we can
> use it.
>
> This is bit scary that you see similar error on 6.6.2 though in our case we
> know we are going thru some hardware problem which likely would have caused
> the corruption but there is no concrete evidence which can be used to
> confirm if it is hardware or Solr/Lucene.  Are you able to use another AWS
> instance similar to Simon's case.
>
> Thanks,
> Susheel
>
> On Thu, Oct 4, 2018 at 7:11 PM Stephen Bianamara <[hidden email]>
> wrote:
>
> > To be more concrete: Is the definitive test of whether or not a core's
> > index is corrupt to copy it onto a new set of hardware and attempt to
> write
> > to it? If this is a definitive test, we can run the experiment and update
> > the report so you have a sense of how often this happens.
> >
> > Since this is a SOLR cloud node, which is already removed but whose data
> > dir was preserved, I believe I can just copy the data directory to a
> fresh
> > machine and start a regular non-cloud solr node hosting this core. Can
> you
> > please confirm that this will be a definitive test, or whether there is
> > some aspect needed to make it definitive?
> >
> > Thanks!
> >
> > On Wed, Oct 3, 2018 at 2:10 AM Stephen Bianamara <[hidden email]
> >
> > wrote:
> >
> > > Hello All --
> > >
> > > As it would happen, we've seen this error on version 6.6.2 very
> recently.
> > > This is also on an AWS instance, like Simon's report. The drive doesn't
> > > show any sign of being unhealthy, either from cursory investigation.
> > FWIW,
> > > this occurred during a collection backup.
> > >
> > > Erick, is there some diagnostic data we can find to help pin this down?
> > >
> > > Thanks!
> > > Stephen
> > >
> > > On Sun, Sep 30, 2018 at 12:48 PM Susheel Kumar <[hidden email]>
> > > wrote:
> > >
> > >> Thank you, Simon. Which basically points that something related to env
> > and
> > >> was causing the checksum failures than any lucene/solr issue.
> > >>
> > >> Eric - I did check with hardware folks and they are aware of some
> VMware
> > >> issue where the VM hosted in HCI environment is coming into some halt
> > >> state
> > >> for minute or so and may be loosing connections to disk/network.  So
> > that
> > >> probably may be the reason of index corruption though they have not
> been
> > >> able to find anything specific from logs during the time Solr run into
> > >> issue
> > >>
> > >> Also I had again issue where Solr is loosing the connection with
> > zookeeper
> > >> (Client session timed out, have not heard from server in 8367ms for
> > >> sessionid 0x0)  Does that points to similar hardware issue, Any
> > >> suggestions?
> > >>
> > >> Thanks,
> > >> Susheel
> > >>
> > >> 2018-09-29 17:30:44.070 INFO
> > >> (searcherExecutor-7-thread-1-processing-n:server54:8080_solr
> > >> x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
> > >> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
> > >> [COLL_shard4_replica2] Registered new searcher
> > >> Searcher@7a4465b1[COLL_shard4_replica2]
> > >>
> > >>
> >
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
> > >> Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
> > >> Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
> > >> Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
> > >> Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
> > >> Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
> > >> Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
> > >> Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
> > >> Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
> > >> Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
> > >> Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
> > >> Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
> > >> Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
> > >> Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
> > >> Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
> > >> Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
> > >> Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
> > >> Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
> > >> Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
> > >> Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
> > >> Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
> > >> Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
> > >> Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
> > >> Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
> > >> Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
> > >> Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
> > >> Uninverting(_hapw(6.6.2):C5/2:delGen=1)
> > >> Uninverting(_hapx(6.6.2):C2/1:delGen=1)
> > >> Uninverting(_hapy(6.6.2):C2/1:delGen=1)
> > >> Uninverting(_hapz(6.6.2):C3/1:delGen=1)
> > >> Uninverting(_haq0(6.6.2):C6/3:delGen=1)
> > >> Uninverting(_haq1(6.6.2):C1)))}
> > >> 2018-09-29 17:30:52.390 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8367ms for sessionid 0x0
> > >> 2018-09-29 17:31:01.302 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8812ms for sessionid 0x0
> > >> 2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper reestablished.
> > >> 2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core states after session expiration.
> > >> 2018-09-29 17:31:14.051 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16) -> (15)
> > >> 2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4 r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request [COLL_shard4_replica2]  webapp=/solr path=/admin/ping params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin} webapp=/solr path=/admin/ping params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin} hits=4989979 status=0 QTime=0
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:
> > >>
> > >> > I saw something like this a year ago, which I reported as a possible
> > >> > bug (https://issues.apache.org/jira/browse/SOLR-10840, which has a full
> > >> > description and stack traces).
> > >> >
> > >> > This occurred very randomly on an AWS instance; moving the index directory
> > >> > to a different file system did not fix the problem. Eventually I cloned our
> > >> > environment to a new AWS instance, which proved to be the solution.
> > >> > Why, I have no idea...
> > >> >
> > >> > -Simon
> > >> >
> > >> > On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <
> [hidden email]
> > >
> > >> > wrote:
> > >> >
> > >> > > Got it. I'll first have the hardware folks check, and if they don't
> > >> > > see/find anything suspicious then I'll return here.
> > >> > >
> > >> > > Wondering if anybody has seen a similar error and was able to confirm
> > >> > > whether it was a hardware fault or something similar.
> > >> > >
> > >> > > Thnx
> > >> > >
> > >> > > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <
> > >> [hidden email]>
> > >> > > wrote:
> > >> > >
> > >> > > > Mind you it could _still_ be Solr/Lucene, but let's check the
> > >> hardware
> > >> > > > first ;)
> > >> > > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <
> > >> [hidden email]>
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > Hi Erick,
> > >> > > > >
> > >> > > > > Thanks so much for your reply.  I'll now look mostly into any
> > >> > possible
> > >> > > > > hardware issues than Solr/Lucene.
> > >> > > > >
> > >> > > > > Thanks again.
> > >> > > > >
> > >> > > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> > >> > > [hidden email]
> > >> > > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > There are several reasons this would "suddenly" start appearing.
> > >> > > > > > 1> Your disk went bad and some sector is no longer faithfully
> > >> > > > > > recording the bits. In this case the checksum will be wrong.
> > >> > > > > > 2> You ran out of disk space sometime and the index was corrupted.
> > >> > > > > > This isn't really a hardware problem.
> > >> > > > > > 3> Your disk controller is going wonky and not reading reliably.
> > >> > > > > >
> > >> > > > > > The "possible hardware issue" message is to alert you that this is
> > >> > > > > > highly unusual and you should at least consider doing integrity
> > >> > > > > > checks on your disk before assuming it's a Solr/Lucene problem.
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > Erick
> > >> > > > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <
> > >> > [hidden email]
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > Hello,
> > >> > > > > > >
> > >> > > > > > > I am still trying to understand the corrupt index exception we saw in our
> > >> > > > > > > logs. What does the hardware problem comment indicates here?  Does that
> > >> > > > > > > mean it caused most likely due to hardware issue?
> > >> > > > > > >
> > >> > > > > > > We never had this problem in last couple of months. The Solr is 6.6.2 and
> > >> > > > > > > ZK: 3.4.10.
> > >> > > > > > >
> > >> > > > > > > Please share your thoughts.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > > Susheel
> > >> > > > > > >
> > >> > > > > > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed *(hardware problem?)* : expected=db243d1a actual=7a00d3d2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs") [slice=_i27s_Lucene50_0.tim])
> > >> > > > > > >
> > >> > > > > > > It suddenly started in the logs and before which there was no such error.
> > >> > > > > > > Searches & ingestions all seems to be working prior to that.
> > >> > > > > > >
> > >> > > > > > > ----
> > >> > > > > > >
> > >> > > > > > > 2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.u.p.StatelessScriptUpdateProcessorFactory update-script#processAdd: newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US
> > >> > > > > > > 2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US to the index; possible analysis error.
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:206)
> > >> > > > > > > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> > >> > > > > > > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > >> > > > > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
> > >> > > > > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
> > >> > > > > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:748)
> > >> > > > > > > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > >> > > > > > > at org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory$ScriptUpdateProcessor.processAdd(StatelessScriptUpdateProcessorFactory.java:380)
> > >> > > > > > > at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:98)
> > >> > > > > > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
> > >> > > > > > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
> > >> > > > > > > at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:306)
> > >> > > > > > > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
> > >> > > > > > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
> > >> > > > > > > at org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:271)
> > >> > > > > > > at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:251)
> > >> > > > > > > at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:173)
> > >> > > > > > > at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:187)
> > >> > > > > > > at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:108)
> > >> > > > > > > at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> > >> > > > > > > at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> > >> > > > > > > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> > >> > > > > > > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> > >> > > > > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> > >> > > > > > > at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> > >> > > > > > > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> > >> > > > > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> > >> > > > > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> > >> > > > > > > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> > >> > > > > > > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> > >> > > > > > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > >> > > > > > > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> > >> > > > > > > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> > >> > > > > > > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> > >> > > > > > > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> > >> > > > > > > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> > >> > > > > > > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> > >> > > > > > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> > >> > > > > > > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> > >> > > > > > > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> > >> > > > > > > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> > >> > > > > > > at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> > >> > > > > > > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> > >> > > > > > > at org.eclipse.jetty.server.Server.handle(Server.java:534)
> > >> > > > > > > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> > >> > > > > > > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> > >> > > > > > > at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> > >> > > > > > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> > >> > > > > > > at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > >> > > > > > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> > >> > > > > > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> > >> > > > > > > at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> > >> > > > > > > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> > >> > > > > > > at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> > >> > > > > > > at java.lang.Thread.run(Thread.java:748)
> > >> > > > > > > Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
> > >> > > > > > > at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> > >> > > > > > > at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> > >> > > > > > > at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1567)
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.updateDocument(DirectUpdateHandler2.java:924)
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:913)
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:302)
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
> > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:194)
> > >> > > > > > > ... 54 more
> > >> > > > > > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=db243d1a actual=7a00d3d2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs") [slice=_i27s_Lucene50_0.tim]))
> > >> > > > > > > at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:419)
> > >> > > > > > > at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:526)
> > >> > > > > > > at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:336)
> > >> > > > > > > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
> > >> > > > > > > at org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterFieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
> > >> > > > > > > at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
> > >> > > > > > > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
> > >> > > > > > > at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> > >> > > > > > > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> > >> > > > > > > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
> > >> > > > > > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
> > >> > > > > > > at org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
> > >> > > > > > > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
> > >> > > > > > > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> > >> > > > > > >
> > >> > > > > > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request [COLL_shard1_replica1]  webapp=/solr path=/update params={wt=javabin&version=2} status=400 QTime=69
> > >> > > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: checksum failed (hardware problem?)

Susheel Kumar-3
Exactly. I have a node with a checksum issue that is still alive, which is
good for us: if it went down, one of the shards would be down, and thus an
outage.

Yes, I agree that there is no built-in way to know when a node has a
checksum issue, and that is why we are putting log monitoring in place that
will alert whenever the keyword "corrupt" or "checksum" appears in the logs.

Thnx
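A minimal sketch of that kind of keyword alert (the log path, the sample log, and the plain `echo` alert are illustrative assumptions, not our actual monitoring setup):

```shell
#!/bin/sh
# Sketch of the log monitoring described above: scan a Solr log for
# corruption keywords and emit an alert when any line matches.
# /var/solr/logs/solr.log is an assumed path; adjust for your install.
LOG="${1:-/var/solr/logs/solr.log}"

scan_log() {
    # -E extended regex, -i case-insensitive: catches both
    # "CorruptIndexException" and "checksum failed (hardware problem?)"
    grep -Ei 'corrupt|checksum' "$1" 2>/dev/null
}

# Self-contained demo: build a small sample log so the script runs anywhere.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
2018-09-03 17:16:49.056 INFO  normal request
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
EOF

if scan_log "$SAMPLE" > /dev/null; then
    echo "ALERT: possible index corruption, matching lines:"
    scan_log "$SAMPLE"
fi
rm -f "$SAMPLE"
```

In a real deployment the `if` body would page someone rather than echo. If a replica does turn out to be corrupt, the usual (destructive) last resort on 6.x is Lucene's CheckIndex tool, something like `java -cp <path-to-lucene-core-jar> org.apache.lucene.index.CheckIndex /app/solr/data/COLL_shard1_replica1/data/index -exorcise`, run against a copy of the index first, since `-exorcise` drops unreadable segments.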

On Mon, Oct 8, 2018 at 5:41 PM Stephen Bianamara <[hidden email]>
wrote:

> Hi Susheel,
>
> Yes, I believe you are correct on fixing a node in place. My org actually
> just cycles instances rather than repairing broken ones.
>
> It's too bad that there's nothing conclusive we can look for to help
> investigate the scope. We'd love to pin this down so that we could take
> something concrete to AWS to investigate if it's a hardware failure (e.g.,
> we found a log indicating ....). I haven't been able to find anything
> which might clarify the matter outside of Solr either. Perhaps it's just
> not realistic at this time.
>
> I'm also curious about another aspect, which is that the nodes don't report
> as unhealthy. Currently a node with a bad checksum will just stay in the
> collection forever. Shouldn't the node go to "down" if it has an
> irreparable checksum?
>
> On Fri, Oct 5, 2018 at 5:25 AM Susheel Kumar <[hidden email]>
> wrote:
>
> > My understanding is that once the index is corrupt, the only way to fix it
> > is using the CheckIndex utility, which will remove the bad segments; only
> > then can we use the index again.
> >
> > It is a bit scary that you see a similar error on 6.6.2, though in our
> > case we know we are going through some hardware problems which likely
> > caused the corruption; there is no concrete evidence that can be used to
> > confirm whether it is hardware or Solr/Lucene.  Are you able to use
> > another AWS instance, as in Simon's case?
> >
> > Thanks,
> > Susheel
> >
> > On Thu, Oct 4, 2018 at 7:11 PM Stephen Bianamara <[hidden email]
> >
> > wrote:
> >
> > > To be more concrete: Is the definitive test of whether or not a core's
> > > index is corrupt to copy it onto a new set of hardware and attempt to
> > write
> > > to it? If this is a definitive test, we can run the experiment and
> update
> > > the report so you have a sense of how often this happens.
> > >
> > > Since this is a SolrCloud node, which has already been removed but whose
> > > data dir was preserved, I believe I can just copy the data directory to a
> > > fresh machine and start a regular non-cloud Solr node hosting this core.
> > > Can you please confirm that this is a definitive test, or whether some
> > > additional aspect is needed to make it definitive?
> > >
> > > Thanks!
> > >
> > > On Wed, Oct 3, 2018 at 2:10 AM Stephen Bianamara <
> [hidden email]
> > >
> > > wrote:
> > >
> > > > Hello All --
> > > >
> > > > As it would happen, we've seen this error on version 6.6.2 very
> > recently.
> > > > This is also on an AWS instance, like Simon's report. The drive
> doesn't
> > > > show any sign of being unhealthy, either from cursory investigation.
> > > FWIW,
> > > > this occurred during a collection backup.
> > > >
> > > > Erick, is there some diagnostic data we can find to help pin this
> down?
> > > >
> > > > Thanks!
> > > > Stephen
> > > >
> > > > On Sun, Sep 30, 2018 at 12:48 PM Susheel Kumar <
> [hidden email]>
> > > > wrote:
> > > >
> > > >> Thank you, Simon. That basically suggests that something related to
> > > >> the environment was causing the checksum failures rather than any
> > > >> Lucene/Solr issue.
> > > >>
> > > >> Erick - I did check with the hardware folks and they are aware of a
> > > >> VMware issue where a VM hosted in an HCI environment goes into a halt
> > > >> state for a minute or so and may lose its connections to disk/network.
> > > >> That may well be the reason for the index corruption, though they have
> > > >> not been able to find anything specific in the logs from the time Solr
> > > >> ran into the issue.
> > > >>
> > > >> Also, I again had an issue where Solr loses its connection with
> > > >> ZooKeeper (Client session timed out, have not heard from server in
> > > >> 8367ms for sessionid 0x0). Does that point to a similar hardware
> > > >> issue? Any suggestions?
> > > >>
> > > >> Thanks,
> > > >> Susheel
> > > >>
> > > >> 2018-09-29 17:30:44.070 INFO
> > > >> (searcherExecutor-7-thread-1-processing-n:server54:8080_solr
> > > >> x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL
> s:shard4
> > > >> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
> > > >> [COLL_shard4_replica2] Registered new searcher
> > > >> Searcher@7a4465b1[COLL_shard4_replica2]
> > > >>
> > > >> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
> > > >> Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
> > > >> Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
> > > >> Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
> > > >> Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
> > > >> Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
> > > >> Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
> > > >> Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
> > > >> Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
> > > >> Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
> > > >> Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
> > > >> Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
> > > >> Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
> > > >> Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
> > > >> Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
> > > >> Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
> > > >> Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
> > > >> Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
> > > >> Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
> > > >> Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
> > > >> Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
> > > >> Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
> > > >> Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
> > > >> Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
> > > >> Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
> > > >> Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
> > > >> Uninverting(_hapw(6.6.2):C5/2:delGen=1)
> > > >> Uninverting(_hapx(6.6.2):C2/1:delGen=1)
> > > >> Uninverting(_hapy(6.6.2):C2/1:delGen=1)
> > > >> Uninverting(_hapz(6.6.2):C3/1:delGen=1)
> > > >> Uninverting(_haq0(6.6.2):C6/3:delGen=1)
> > > >> Uninverting(_haq1(6.6.2):C1)))}
> > > >> 2018-09-29 17:30:52.390 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8367ms for sessionid 0x0
> > > >> 2018-09-29 17:31:01.302 WARN  (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182)) [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 8812ms for sessionid 0x0
> > > >> 2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper reestablished.
> > > >> 2018-09-29 17:31:14.049 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core states after session expiration.
> > > >> 2018-09-29 17:31:14.051 INFO  (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16) -> (15)
> > > >> 2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4 r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request [COLL_shard4_replica2]  webapp=/solr path=/admin/ping params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin} webapp=/solr path=/admin/ping params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin} hits=4989979 status=0 QTime=0
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:
> > > >>
> > > >> > I saw something like this a year ago, which I reported as a possible
> > > >> > bug (https://issues.apache.org/jira/browse/SOLR-10840, which has a
> > > >> > full description and stack traces).
> > > >> >
> > > >> > This occurred very randomly on an AWS instance; moving the index
> > > >> > directory to a different file system did not fix the problem.
> > > >> > Eventually I cloned our environment to a new AWS instance, which
> > > >> > proved to be the solution. Why, I have no idea...
> > > >> >
> > > >> > -Simon
> > > >> >
> > > >> > On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <
> > [hidden email]
> > > >
> > > >> > wrote:
> > > >> >
> > > >> > > Got it. I'll first have the hardware folks check, and if they don't
> > > >> > > see/find anything suspicious then I'll return here.
> > > >> > >
> > > >> > > Wondering if anybody has seen a similar error and was able to
> > > >> > > confirm whether it was a hardware fault or something similar.
> > > >> > >
> > > >> > > Thnx
> > > >> > >
> > > >> > > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <
> > > >> [hidden email]>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Mind you it could _still_ be Solr/Lucene, but let's check the
> > > >> hardware
> > > >> > > > first ;)
> > > >> > > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <
> > > >> [hidden email]>
> > > >> > > > wrote:
> > > >> > > > >
> > > >> > > > > Hi Erick,
> > > >> > > > >
> > > >> > > > > Thanks so much for your reply.  I'll now look mostly into
> any
> > > >> > possible
> > > >> > > > > hardware issues than Solr/Lucene.
> > > >> > > > >
> > > >> > > > > Thanks again.
> > > >> > > > >
> > > >> > > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> > > >> > > [hidden email]
> > > >> > > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > There are several reasons this would "suddenly" start appearing.
> > > >> > > > > > 1> Your disk went bad and some sector is no longer faithfully
> > > >> > > > > > recording the bits. In this case the checksum will be wrong.
> > > >> > > > > > 2> You ran out of disk space sometime and the index was corrupted.
> > > >> > > > > > This isn't really a hardware problem.
> > > >> > > > > > 3> Your disk controller is going wonky and not reading reliably.
> > > >> > > > > >
> > > >> > > > > > The "possible hardware issue" message is to alert you that this is
> > > >> > > > > > highly unusual and you should at least consider doing integrity
> > > >> > > > > > checks on your disk before assuming it's a Solr/Lucene problem.
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Erick
> > > >> > > > > > On Mon, Sep 24, 2018 at 9:26 AM Susheel Kumar <
> > > >> > [hidden email]
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > Hello,
> > > >> > > > > > >
> > > >> > > > > > > I am still trying to understand the corrupt index exception we saw in
> > > >> > > > > > > our logs. What does the hardware problem comment indicates here?  Does
> > > >> > > > > > > that mean it caused most likely due to hardware issue?
> > > >> > > > > > >
> > > >> > > > > > > We never had this problem in last couple of months. The Solr is 6.6.2
> > > >> > > > > > > and ZK: 3.4.10.
> > > >> > > > > > >
> > > >> > > > > > > Please share your thoughts.
> > > >> > > > > > >
> > > >> > > > > > > Thanks,
> > > >> > > > > > > Susheel
> > > >> > > > > > >
> > > >> > > > > > > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed *(hardware problem?)* : expected=db243d1a actual=7a00d3d2 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs") [slice=_i27s_Lucene50_0.tim])
> > > >> > > > > > >
> > > >> > > > > > > It suddenly started in the logs and before which there was no such
> > > >> > > > > > > error. Searches & ingestions all seems to be working prior to that.
> > > >> > > > > > >
> > > >> > > > > > > ----
> > > >> > > > > > >
> > > >> > > > > > > 2018-09-03 17:16:49.056 INFO  (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.u.p.StatelessScriptUpdateProcessorFactory update-script#processAdd: newid=G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US
> > > >> > > > > > > 2018-09-03 17:16:49.057 ERROR (qtp834133664-519872) [c:COLL s:shard1 r:core_node1 x:COLL_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id G31MXMRZESC0CYPR!A-G31MXMRZESC0CYPR.2552019802_1-2552008480_1-en_US to the index; possible analysis error.
> > > >> > > > > > > at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:206)
> > > >> > > > > > > at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
> > > >> > > > > > > at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> > > >> > > > > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:979)
> > > >> > > > > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1192)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.processor.DistributedUpdateProcessor.
> > > >> > > processAdd(DistributedUpdateProcessor.java:748)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.processor.UpdateRequestProcessor.proc
> > > >> > > essAdd(UpdateRequestProcessor.java:55)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > >
> > > >>
> > org.apache.solr.update.processor.StatelessScriptUpdateProcessorFactory$
> > > >> > > ScriptUpdateProcessor.processAdd(StatelessScriptUpdateProces
> > > >> > > sorFactory.java:380)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.handler.loader.JavabinLoader$1.update(Javabi
> > > >> > > nLoader.java:98)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > > >> > >
> ec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:180)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > > >> > > ec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinC
> > > >> > > odec.java:306)
> > > >> > > > > > > at
> > > >> > > > > >
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCode
> > > >> > > c.java:251)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > > >> > > ec$1.readNamedList(JavaBinUpdateRequestCodec.java:122)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinC
> > > >> > > odec.java:271)
> > > >> > > > > > > at
> > > >> > > > > >
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCode
> > > >> > > c.java:251)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCo
> > > >> > > dec.java:173)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCod
> > > >> > > ec.unmarshal(JavaBinUpdateRequestCodec.java:187)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDoc
> > > >> > > s(JavabinLoader.java:108)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > >
> > > >> >
> > >
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRe
> > > >> > > questHandler.java:97)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.handler.ContentStreamHandlerBase.handleReque
> > > >> > > stBody(ContentStreamHandlerBase.java:68)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
> > > >> > > uestHandlerBase.java:173)
> > > >> > > > > > > at
> > org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> > > >> > > > > > > at
> > > >> > > >
> > > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> > > >> > > > > > > at
> > > >> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:
> > > >> > > 529)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > > >> > > atchFilter.java:361)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > > >> > > atchFilter.java:305)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > >> > > r(ServletHandler.java:1691)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHan
> > > >> > > dler.java:582)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > > >> > > Handler.java:143)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa
> > > >> > > ndler.java:548)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.session.SessionHandler.doHandle(
> > > >> > > SessionHandler.java:226)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(
> > > >> > > ContextHandler.java:1180)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHand
> > > >> > > ler.java:512)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.session.SessionHandler.doScope(
> > > >> > > SessionHandler.java:185)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.ContextHandler.doScope(
> > > >> > > ContextHandler.java:1112)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > > >> > > Handler.java:141)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
> > > >> > > ndle(ContextHandlerCollection.java:213)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.HandlerCollection.handle(
> > > >> > > HandlerCollection.java:119)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > > >> > > erWrapper.java:134)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(Rewr
> > > >> > > iteHandler.java:335)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > > >> > > erWrapper.java:134)
> > > >> > > > > > > at
> org.eclipse.jetty.server.Server.handle(Server.java:534)
> > > >> > > > > > > at
> > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.
> > > >> > > java:320)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConne
> > > >> > > ction.java:251)
> > > >> > > > > > > at
> > > >> > > > > > > org.eclipse.jetty.io
> > > >> > > > > >
> .AbstractConnection$ReadCallback.succeeded(AbstractConnectio
> > > >> > > n.java:273)
> > > >> > > > > > > at org.eclipse.jetty.io
> > .FillInterest.fillable(FillInterest.
> > > >> > > java:95)
> > > >> > > > > > > at
> > > >> > > > > > > org.eclipse.jetty.io
> > > >> > > > > >
> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > >> > > .executeProduceConsume(ExecuteProduceConsume.java:303)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > >> > > .produceConsume(ExecuteProduceConsume.java:148)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > >> > > .run(ExecuteProduceConsume.java:136)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Queued
> > > >> > > ThreadPool.java:671)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedT
> > > >> > > hreadPool.java:589)
> > > >> > > > > > > at java.lang.Thread.run(Thread.java:748)
> > > >> > > > > > > Caused by:
> org.apache.lucene.store.AlreadyClosedException:
> > > >> this
> > > >> > > > > > IndexWriter
> > > >> > > > > > > is closed
> > > >> > > > > > > at
> > > >> > > >
> > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> > > >> > > > > > > at
> > > >> > > >
> > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > > org.apache.lucene.index.IndexWriter.updateDocument(IndexWrit
> > > >> > > er.java:1567)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.DirectUpdateHandler2.updateDocument(D
> > > >> > > irectUpdateHandler2.java:924)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocVa
> > > >> > > lues(DirectUpdateHandler2.java:913)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(D
> > > >> > > irectUpdateHandler2.java:302)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUp
> > > >> > > dateHandler2.java:239)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpd
> > > >> > > ateHandler2.java:194)
> > > >> > > > > > > ... 54 more
> > > >> > > > > > > Caused by:
> org.apache.lucene.index.CorruptIndexException:
> > > >> > checksum
> > > >> > > > failed
> > > >> > > > > > > (hardware problem?) : expected=db243d1a actual=7a00d3d2
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> > > >> > > app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > >> > > > > > > [slice=_i27s_Lucene50_0.tim]))
> > > >> > > > > > > at
> > org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.
> > > >> > > java:419)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > > org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecU
> > > >> > > til.java:526)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.chec
> > > >> > > kIntegrity(BlockTreeTermsReader.java:336)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > > >> > > ldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterF
> > > >> > > ieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
> > > >> > > > > > > at
> > > >> > > >
> > > >>
> org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > > >> > > ldsWriter.merge(PerFieldPostingsFormat.java:164)
> > > >> > > > > > > at
> > > >> > > > > >
> > > >> > > >
> > > >> >
> > >
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> > > >> > > > > > > at
> > > >> > > >
> > > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> > > >> > > > > > > at
> > > >> > > >
> > > >>
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
> > > >> > > > > > > at
> > > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:
> > > >> > > 3931)
> > > >> > > > > > > at
> > > >> > > >
> > > >>
> org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(Con
> > > >> > > currentMergeScheduler.java:624)
> > > >> > > > > > > at
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread
> > > >> > > .run(ConcurrentMergeScheduler.java:661)
> > > >> > > > > > >
> > > >> > > > > > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872)
> > [c:COLL
> > > >> > > s:shard1
> > > >> > > > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > > >> > > > > > > [COLL_shard1_replica1]  webapp=/solr path=/update
> > > >> > > > > > > params={wt=javabin&version=2} status=400 QTime=69
> > > >> > > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: checksum failed (hardware problem?)

Stephen Bianamara
Thanks for confirming. Indeed, we are planning on adding log monitoring as
well to work around this issue.

It seems to me that if Solr is unable to recognize an irreparable failure,
then that is a bug. I filed the following issue to track it:


   1. SOLR-12850 <https://issues.apache.org/jira/browse/SOLR-12850> SOLR
   Unaware When Index Has Corrupt Checksum


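The sort of log watch being described can be sketched in a few lines of shell (the sample log line, temp-file handling, and keywords here are assumptions for the demo, not anyone's production setup; in practice LOG would point at Solr's solr.log):

```shell
# Minimal demo of a log watch that flags index-corruption markers.
# The sample line is fabricated so the demo is self-contained.
LOG=$(mktemp)
echo '2018-09-03 17:16:49.057 ERROR ... CorruptIndexException: checksum failed (hardware problem?)' > "$LOG"
if grep -Eiq 'CorruptIndexException|checksum failed' "$LOG"; then
  echo "ALERT: possible index corruption"
fi
rm -f "$LOG"
```

A real deployment would run the grep from cron or a monitoring agent and page on a match rather than just echoing.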
On Tue, Oct 9, 2018 at 6:05 AM Susheel Kumar <[hidden email]> wrote:

> Exactly. I have a node with a checksum issue and it is still alive, which
> is good for us, since if it went down one of the shards would be down and
> thus an outage.
>
> Yes, I agree that we don't get to know when a node has a checksum issue,
> and that's where we are putting in log monitoring, which will alert if the
> keyword "corrupt" or "checksum" is found in the logs.
>
> Thnx
>
> On Mon, Oct 8, 2018 at 5:41 PM Stephen Bianamara <[hidden email]> wrote:
>
> > Hi Susheel,
> >
> > Yes, I believe you are correct on fixing a node in place. My org actually
> > just cycles instances rather than repairing broken ones.
> >
> > It's too bad that there's nothing conclusive we can look for to help
> > investigate the scope. We'd love to pin this down so that, if it is a
> > hardware failure, we could take something concrete to AWS to investigate
> > (e.g., we found a log indicating ....). I haven't been able to find
> > anything outside of SOLR which might clarify the matter either. Perhaps
> > it's just not realistic at this time.
> >
> > I'm also curious about another aspect, which is that the nodes don't
> > report as unhealthy. Currently a node with a bad checksum will just stay
> > in the collection forever. Shouldn't the node go to "down" if it has an
> > irreparable checksum?
> >
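For context on why a bad checksum is irreparable: Lucene stores a CRC32 in each index file's footer and recomputes it on read. A mismatch like expected=db243d1a actual=7a00d3d2 proves the bytes changed but says nothing about which bits, so the damage can be detected but never reversed. A toy illustration (using POSIX cksum, whose CRC polynomial differs from Lucene's CRC32 but has the same one-way property):

```shell
# Two payloads differing by a single byte produce different checksums;
# the checksum alone cannot reconstruct the original bytes.
printf 'segment payload' > good.bin
printf 'segment paxload' > bad.bin   # one corrupted byte
cksum good.bin bad.bin               # the two CRC values differ
rm -f good.bin bad.bin
```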
> > On Fri, Oct 5, 2018 at 5:25 AM Susheel Kumar <[hidden email]> wrote:
> >
> > > My understanding is that once the index is corrupt, the only way to fix
> > > it is using the CheckIndex utility, which will remove some bad segments;
> > > only then can we use the index again.
> > >
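For reference, the utility mentioned above is Lucene's CheckIndex. An invocation sketch (the jar and index paths are assumptions for a 6.6.2 install; stop the core and back up the index directory first, because -exorcise permanently drops unreadable segments and every document in them):

```shell
# Report-only pass:
java -cp /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.2.jar \
  org.apache.lucene.index.CheckIndex /app/solr/data/COLL_shard1_replica1/data/index

# Only if necessary, remove the broken segments (loses their documents):
java -cp /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.6.2.jar \
  org.apache.lucene.index.CheckIndex /app/solr/data/COLL_shard1_replica1/data/index -exorcise
```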
> > > It is a bit scary that you see a similar error on 6.6.2, though in our
> > > case we know we are going through some hardware problem which likely
> > > would have caused the corruption; there is no concrete evidence we can
> > > use to confirm whether it is hardware or Solr/Lucene. Are you able to
> > > use another AWS instance, similar to Simon's case?
> > >
> > > Thanks,
> > > Susheel
> > >
> > > On Thu, Oct 4, 2018 at 7:11 PM Stephen Bianamara <[hidden email]> wrote:
> > >
> > > > To be more concrete: is the definitive test of whether or not a core's
> > > > index is corrupt to copy it onto a new set of hardware and attempt to
> > > > write to it? If this is a definitive test, we can run the experiment
> > > > and update the report so you have a sense of how often this happens.
> > > >
> > > > Since this is a SolrCloud node, which is already removed but whose data
> > > > dir was preserved, I believe I can just copy the data directory to a
> > > > fresh machine and start a regular non-cloud Solr node hosting this
> > > > core. Can you please confirm that this would be a definitive test, or
> > > > whether there is some aspect needed to make it definitive?
> > > >
> > > > Thanks!
> > > >
> > > > On Wed, Oct 3, 2018 at 2:10 AM Stephen Bianamara <[hidden email]> wrote:
> > > >
> > > > > Hello All --
> > > > >
> > > > > As it would happen, we've seen this error on version 6.6.2 very
> > > > > recently. This is also on an AWS instance, like Simon's report. The
> > > > > drive doesn't show any sign of being unhealthy either, from a cursory
> > > > > investigation. FWIW, this occurred during a collection backup.
> > > > >
> > > > > Erick, is there some diagnostic data we can find to help pin this
> > > > > down?
> > > > >
> > > > > Thanks!
> > > > > Stephen
> > > > >
> > > > > On Sun, Sep 30, 2018 at 12:48 PM Susheel Kumar <[hidden email]> wrote:
> > > > >
> > > > >> Thank you, Simon. That basically points to something related to the
> > > > >> environment causing the checksum failures rather than any
> > > > >> Lucene/Solr issue.
> > > > >>
> > > > >> Erick - I did check with the hardware folks and they are aware of a
> > > > >> VMware issue where a VM hosted in an HCI environment comes to a halt
> > > > >> for a minute or so and may be losing its connections to
> > > > >> disk/network. That may well be the reason for the index corruption,
> > > > >> though they have not been able to find anything specific in the
> > > > >> logs from the time Solr ran into the issue.
> > > > >>
> > > > >> Also, I again had an issue where Solr is losing its connection with
> > > > >> ZooKeeper (Client session timed out, have not heard from server in
> > > > >> 8367ms for sessionid 0x0). Does that point to a similar hardware
> > > > >> issue? Any suggestions?
> > > > >>
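If the VM really does stall for close to a minute, no client-side setting will fully hide it, but raising the ZooKeeper client timeout can ride out shorter pauses. A solr.in.sh fragment (the value is an assumption to tune for the environment, not a recommendation):

```shell
# bin/solr.in.sh fragment: allow up to 30s of client silence before
# ZooKeeper expires the Solr node's session (the shipped default in
# this era was commonly 15s).
ZK_CLIENT_TIMEOUT="30000"
```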
> > > > >> Thanks,
> > > > >> Susheel
> > > > >>
> > > > >> 2018-09-29 17:30:44.070 INFO
> > > > >> (searcherExecutor-7-thread-1-processing-n:server54:8080_solr
> > > > >> x:COLL_shard4_replica2 s:shard4 c:COLL r:core_node8) [c:COLL s:shard4
> > > > >> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.SolrCore
> > > > >> [COLL_shard4_replica2] Registered new searcher
> > > > >> Searcher@7a4465b1[COLL_shard4_replica2]
> > > > >> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_7x3f(6.6.2):C826923/317917:delGen=2523)
> > > > >> Uninverting(_83pb(6.6.2):C805451/172968:delGen=2957)
> > > > >> Uninverting(_3ywj(6.6.2):C727978/334529:delGen=2962)
> > > > >> Uninverting(_7vsw(6.6.2):C872110/385178:delGen=2020)
> > > > >> Uninverting(_8n89(6.6.2):C741293/109260:delGen=3863)
> > > > >> Uninverting(_7zkq(6.6.2):C720666/101205:delGen=3151)
> > > > >> Uninverting(_825d(6.6.2):C707731/112410:delGen=3168)
> > > > >> Uninverting(_dgwu(6.6.2):C760421/295964:delGen=4624)
> > > > >> Uninverting(_gs5x(6.6.2):C540942/138952:delGen=1623)
> > > > >> Uninverting(_gu6a(6.6.2):c75213/35640:delGen=1110)
> > > > >> Uninverting(_h33i(6.6.2):c131276/40356:delGen=706)
> > > > >> Uninverting(_h5tc(6.6.2):c44320/11080:delGen=380)
> > > > >> Uninverting(_h9d9(6.6.2):c35088/3188:delGen=104)
> > > > >> Uninverting(_h80h(6.6.2):c11927/3412:delGen=153)
> > > > >> Uninverting(_h7ll(6.6.2):c11284/1368:delGen=205)
> > > > >> Uninverting(_h8bs(6.6.2):c11518/2103:delGen=149)
> > > > >> Uninverting(_h9r3(6.6.2):c16439/1018:delGen=52)
> > > > >> Uninverting(_h9z1(6.6.2):c9428/823:delGen=27)
> > > > >> Uninverting(_h9v2(6.6.2):c933/33:delGen=12)
> > > > >> Uninverting(_ha1c(6.6.2):c1056/1:delGen=1)
> > > > >> Uninverting(_ha6i(6.6.2):c1883/124:delGen=8)
> > > > >> Uninverting(_ha3x(6.6.2):c807/14:delGen=3)
> > > > >> Uninverting(_ha47(6.6.2):c1229/133:delGen=6)
> > > > >> Uninverting(_hapk(6.6.2):c523) Uninverting(_haoq(6.6.2):c279)
> > > > >> Uninverting(_hamr(6.6.2):c311) Uninverting(_hap0(6.6.2):c338)
> > > > >> Uninverting(_hapu(6.6.2):c275) Uninverting(_hapv(6.6.2):C4/2:delGen=1)
> > > > >> Uninverting(_hapw(6.6.2):C5/2:delGen=1)
> > > > >> Uninverting(_hapx(6.6.2):C2/1:delGen=1)
> > > > >> Uninverting(_hapy(6.6.2):C2/1:delGen=1)
> > > > >> Uninverting(_hapz(6.6.2):C3/1:delGen=1)
> > > > >> Uninverting(_haq0(6.6.2):C6/3:delGen=1)
> > > > >> Uninverting(_haq1(6.6.2):C1)))}
> > > > >> 2018-09-29 17:30:52.390 WARN
> > > > >> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server117:2182))
> > > > >> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
> > > > >> server in 8367ms for sessionid 0x0
> > > > >> 2018-09-29 17:31:01.302 WARN
> > > > >> (zkCallback-5-thread-91-processing-n:server54:8080_solr-SendThread(server120:2182))
> > > > >> [   ] o.a.z.ClientCnxn Client session timed out, have not heard from
> > > > >> server in 8812ms for sessionid 0x0
> > > > >> 2018-09-29 17:31:14.049 INFO
> > > > >> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
> > > > >>   ] o.a.s.c.c.ConnectionManager Connection with ZooKeeper
> > > > >> reestablished.
> > > > >> 2018-09-29 17:31:14.049 INFO
> > > > >> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
> > > > >>   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing
> > > > >> core states after session expiration.
> > > > >> 2018-09-29 17:31:14.051 INFO
> > > > >> (zkCallback-5-thread-91-processing-n:server54:8080_solr-EventThread) [
> > > > >>   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (16)
> > > > >> -> (15)
> > > > >> 2018-09-29 17:31:14.144 INFO  (qtp834133664-520378) [c:COLL s:shard4
> > > > >> r:core_node8 x:COLL_shard4_replica2] o.a.s.c.S.Request
> > > > >> [COLL_shard4_replica2]  webapp=/solr path=/admin/ping
> > > > >> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin}
> > > > >> webapp=/solr path=/admin/ping
> > > > >> params={distrib=false&df=wordTokens&_stateVer_=COLL:1246&preferLocalShards=false&qt=/admin/ping&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://server54:8080/solr/COLL_shard4_replica2/|http://server53:8080/solr/COLL_shard4_replica1/&rows=10&version=2&q={!lucene}*:*&NOW=1538242274139&isShard=true&wt=javabin}
> > > > >> hits=4989979 status=0 QTime=0
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Wed, Sep 26, 2018 at 9:44 AM simon <[hidden email]> wrote:
> > > > >>
> > > > >> > I saw something like this a year ago, which I reported as a
> > > > >> > possible bug (https://issues.apache.org/jira/browse/SOLR-10840,
> > > > >> > which has a full description and stack traces).
> > > > >> >
> > > > >> > This occurred very randomly on an AWS instance; moving the index
> > > > >> > directory to a different file system did not fix the problem.
> > > > >> > Eventually I cloned our environment to a new AWS instance, which
> > > > >> > proved to be the solution. Why, I have no idea...
> > > > >> >
> > > > >> > -Simon
> > > > >> >
> > > > >> > On Mon, Sep 24, 2018 at 1:13 PM, Susheel Kumar <[hidden email]> wrote:
> > > > >> >
> > > > >> > > Got it. I'll first have the hardware folks check, and if they
> > > > >> > > don't see or find anything suspicious then I'll return here.
> > > > >> > >
> > > > >> > > Wondering if anybody has seen a similar error and whether they
> > > > >> > > were able to confirm it was a hardware fault or so.
> > > > >> > >
> > > > >> > > Thnx
> > > > >> > >
> > > > >> > > On Mon, Sep 24, 2018 at 1:01 PM Erick Erickson <
> > > > >> [hidden email]>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Mind you it could _still_ be Solr/Lucene, but let's check
> the
> > > > >> hardware
> > > > >> > > > first ;)
> > > > >> > > > On Mon, Sep 24, 2018 at 9:50 AM Susheel Kumar <
> > > > >> [hidden email]>
> > > > >> > > > wrote:
> > > > >> > > > >
> > > > >> > > > > Hi Erick,
> > > > >> > > > >
> > > > >> > > > > Thanks so much for your reply. I'll now look mostly into any
> > > > >> > > > > possible hardware issues rather than Solr/Lucene.
> > > > >> > > > >
> > > > >> > > > > Thanks again.
> > > > >> > > > >
> > > > >> > > > > On Mon, Sep 24, 2018 at 12:43 PM Erick Erickson <
> > > > >> > > [hidden email]
> > > > >> > > > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > There are several reasons this would "suddenly" start appearing:
> > > > >> > > > > > 1> Your disk went bad and some sector is no longer
> > > > >> > > > > > faithfully recording the bits. In this case the checksum
> > > > >> > > > > > will be wrong.
> > > > >> > > > > > 2> You ran out of disk space at some point and the index was
> > > > >> > > > > > corrupted. This isn't really a hardware problem.
> > > > >> > > > > > 3> Your disk controller is going wonky and not reading reliably.
> > > > >> > > > > >
> > > > >> > > > > > The "possible hardware issue" message is to alert you that
> > > > >> > > > > > this is highly unusual and you should at least consider
> > > > >> > > > > > doing integrity checks on your disk before assuming it's a
> > > > >> > > > > > Solr/Lucene problem.
> > > > >> > > > > >
> > > > >> > > > > > Best,
> > > > >> > > > > > Erick
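The integrity checks Erick suggests might look like the following on a Linux host (the device names are assumptions; all of these commands are read-only, so they are safe to run before deciding anything):

```shell
# Read-only disk health triage (replace /dev/sda with the actual device).
sudo smartctl -H /dev/sda                       # SMART overall health verdict
sudo smartctl -l error /dev/sda                 # drive error log
sudo badblocks -sv /dev/sda1                    # read-only surface scan (slow)
dmesg | grep -iE 'i/o error|ata[0-9]|sector'    # kernel-level I/O complaints
```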
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDoc
> > > > >> > > s(JavabinLoader.java:108)
> > > > >> > > > > > > at
> > > > >> > > > > >
> > > > >> > > >
> > > > >> >
> > > >
> > org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRe
> > > > >> > > questHandler.java:97)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.handler.ContentStreamHandlerBase.handleReque
> > > > >> > > stBody(ContentStreamHandlerBase.java:68)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(Req
> > > > >> > > uestHandlerBase.java:173)
> > > > >> > > > > > > at
> > > org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> > > > >> > > > > > > at
> > > > >> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:
> > > > >> > > 529)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > > > >> > > atchFilter.java:361)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDisp
> > > > >> > > atchFilter.java:305)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > >> > > r(ServletHandler.java:1691)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHan
> > > > >> > > dler.java:582)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > > > >> > > Handler.java:143)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHa
> > > > >> > > ndler.java:548)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.session.SessionHandler.doHandle(
> > > > >> > > SessionHandler.java:226)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(
> > > > >> > > ContextHandler.java:1180)
> > > > >> > > > > > > at
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHand
> > > > >> > > ler.java:512)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.session.SessionHandler.doScope(
> > > > >> > > SessionHandler.java:185)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.ContextHandler.doScope(
> > > > >> > > ContextHandler.java:1112)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.ScopedHandler.handle(Scoped
> > > > >> > > Handler.java:141)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
> > > > >> > > ndle(ContextHandlerCollection.java:213)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.HandlerCollection.handle(
> > > > >> > > HandlerCollection.java:119)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > > > >> > > erWrapper.java:134)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(Rewr
> > > > >> > > iteHandler.java:335)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(Handl
> > > > >> > > erWrapper.java:134)
> > > > >> > > > > > > at
> > org.eclipse.jetty.server.Server.handle(Server.java:534)
> > > > >> > > > > > > at
> > > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.
> > > > >> > > java:320)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConne
> > > > >> > > ction.java:251)
> > > > >> > > > > > > at
> > > > >> > > > > > > org.eclipse.jetty.io
> > > > >> > > > > >
> > .AbstractConnection$ReadCallback.succeeded(AbstractConnectio
> > > > >> > > n.java:273)
> > > > >> > > > > > > at org.eclipse.jetty.io
> > > .FillInterest.fillable(FillInterest.
> > > > >> > > java:95)
> > > > >> > > > > > > at
> > > > >> > > > > > > org.eclipse.jetty.io
> > > > >> > > > > >
> > .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > > >> > > .executeProduceConsume(ExecuteProduceConsume.java:303)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > > >> > > .produceConsume(ExecuteProduceConsume.java:148)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> > > > >> > > .run(ExecuteProduceConsume.java:136)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Queued
> > > > >> > > ThreadPool.java:671)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedT
> > > > >> > > hreadPool.java:589)
> > > > >> > > > > > > at java.lang.Thread.run(Thread.java:748)
> > > > >> > > > > > > Caused by:
> > org.apache.lucene.store.AlreadyClosedException:
> > > > >> this
> > > > >> > > > > > IndexWriter
> > > > >> > > > > > > is closed
> > > > >> > > > > > > at
> > > > >> > > >
> > > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> > > > >> > > > > > > at
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.index.IndexWriter.updateDocument(IndexWrit
> > > > >> > > er.java:1567)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.update.DirectUpdateHandler2.updateDocument(D
> > > > >> > > irectUpdateHandler2.java:924)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocVa
> > > > >> > > lues(DirectUpdateHandler2.java:913)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(D
> > > > >> > > irectUpdateHandler2.java:302)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUp
> > > > >> > > dateHandler2.java:239)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpd
> > > > >> > > ateHandler2.java:194)
> > > > >> > > > > > > ... 54 more
> > > > >> > > > > > > Caused by:
> > org.apache.lucene.index.CorruptIndexException:
> > > > >> > checksum
> > > > >> > > > failed
> > > > >> > > > > > > (hardware problem?) : expected=db243d1a
> actual=7a00d3d2
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/
> > > > >> > > app/solr/data/COLL_shard1_replica1/data/index/_i27s.cfs")
> > > > >> > > > > > > [slice=_i27s_Lucene50_0.tim]))
> > > > >> > > > > > > at
> > > org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.
> > > > >> > > java:419)
> > > > >> > > > > > > at
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecU
> > > > >> > > til.java:526)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.chec
> > > > >> > > kIntegrity(BlockTreeTermsReader.java:336)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > > > >> > > ldsReader.checkIntegrity(PerFieldPostingsFormat.java:348)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.codecs.perfield.PerFieldMergeState$FilterF
> > > > >> > > ieldsProducer.checkIntegrity(PerFieldMergeState.java:271)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > >>
> > org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:96)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$Fie
> > > > >> > > ldsWriter.merge(PerFieldPostingsFormat.java:164)
> > > > >> > > > > > > at
> > > > >> > > > > >
> > > > >> > > >
> > > > >> >
> > > >
> > org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > >>
> > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
> > > > >> > > > > > > at
> > > > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:
> > > > >> > > 3931)
> > > > >> > > > > > > at
> > > > >> > > >
> > > > >>
> > org.apache.solr.update.SolrIndexWriter.merge(SolrIndexWriter.java:188)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(Con
> > > > >> > > currentMergeScheduler.java:624)
> > > > >> > > > > > > at
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread
> > > > >> > > .run(ConcurrentMergeScheduler.java:661)
> > > > >> > > > > > >
> > > > >> > > > > > > 2018-09-03 17:16:49.116 INFO  (qtp834133664-519872)
> > > [c:COLL
> > > > >> > > s:shard1
> > > > >> > > > > > > r:core_node1 x:COLL_shard1_replica1] o.a.s.c.S.Request
> > > > >> > > > > > > [COLL_shard1_replica1]  webapp=/solr path=/update
> > > > >> > > > > > > params={wt=javabin&version=2} status=400 QTime=69
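For context on the "checksum failed ... expected=db243d1a actual=7a00d3d2" message: Lucene stores a CRC32 checksum in each index file's footer, and CheckIndex/merge-time integrity checks recompute the CRC32 over the file body and compare it with the stored value. A mismatch means the bytes on disk changed after they were written, which is why Lucene suggests a hardware (disk/RAM) problem. A minimal illustrative sketch of that kind of check (not Lucene's actual API; the function name and data here are made up):

```python
import zlib

def footer_checksum_ok(data: bytes, stored_crc: int):
    """Recompute CRC32 over the file body and compare with the stored footer value."""
    actual = zlib.crc32(data) & 0xFFFFFFFF
    return stored_crc == actual, actual

# At write time the checksum is computed and stored in the footer:
good = b"segment term dictionary bytes"
stored = zlib.crc32(good) & 0xFFFFFFFF

# Intact file: recomputed CRC matches the stored one.
ok, _ = footer_checksum_ok(good, stored)

# A single flipped bit (e.g. from bad RAM or a failing disk) changes the CRC,
# producing exactly the expected=... actual=... mismatch seen in the log.
corrupt = bytes([good[0] ^ 0x01]) + good[1:]
still_ok, actual = footer_checksum_ok(corrupt, stored)
```

Because the CRC is verified only when the file is later read in full (e.g. during a merge, as in this trace), corruption can sit silently in a segment for a while before it surfaces, which matches "it suddenly started in the logs".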