Many Checksum Errors


Dennis Kubes
All,

We are continually experiencing checksum errors when running some jobs
under heavy load (specifically merging segments or crawldbs).  I am not
sure whether this is a hardware or a software problem.  Two questions:
first, is anyone else experiencing a large number of checksum-type
errors on big clusters?  Second, does anyone know whether this is
hardware- or software-related?  Here are some examples.

Dennis Kubes


org.apache.hadoop.fs.ChecksumException: Checksum error:
/d01/hadoop/mapred/local/task_0042_m_001905_0/spill0.out at 79597056
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$UncompressedBytes.reset(SequenceFile.java:427)
        at org.apache.hadoop.io.SequenceFile$UncompressedBytes.access$700(SequenceFile.java:414)
        at org.apache.hadoop.io.SequenceFile$Reader.nextRawValue(SequenceFile.java:1669)
        at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawValue(SequenceFile.java:2585)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2356)
        at org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2230)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:517)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1701)



Map output lost, rescheduling: getMapOutput(task_0042_m_000375_0,4) failed :
org.apache.hadoop.fs.ChecksumException: Checksum error:
/d01/hadoop/mapred/local/task_0042_m_000375_0/file.out at 20267008
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.read(DataInputStream.java:134)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1932)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Re: Many Checksum Errors

Raghu Angadi-2

Can you manually try to read one such file with 'hadoop fs -cat'? If it
is not a transient software error, you should see the checksum error
again. Seeing the error would not confirm a hardware problem, but if you
are able to read the file correctly, then it is most likely a Hadoop bug.
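
If it is easier to script than to eyeball 'fs -cat' output, something
like this minimal sketch exercises the same checked read path through
the FileSystem API (the class name and path argument are just examples;
for the local spill files above you would open FileSystem.getLocal(conf)
instead of the default filesystem):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Re-reads a suspect file end to end; every read() goes through the
// FSInputChecker shown in the stack traces, so a corrupt stored
// checksum surfaces as a ChecksumException with the failing offset.
public class ChecksumCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);      // or FileSystem.getLocal(conf)
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    try {
      FSDataInputStream in = fs.open(new Path(args[0]));
      int n;
      while ((n = in.read(buf)) > 0) {
        total += n;
      }
      in.close();
      System.out.println("OK: read " + total + " bytes, checksums verified");
    } catch (ChecksumException e) {
      System.err.println("Checksum error after ~" + total + " bytes: " + e);
    }
  }
}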

Raghu.

Dennis Kubes wrote:

> All,
>
> We are continually experiencing checksum errors when running some jobs
> under heavy load (specifically merging segments or crawldbs).

Re: Many Checksum Errors

Dennis Kubes
I can read the files through 'hadoop fs -cat'.  Also, once rescheduled,
the errors will most often fix themselves, although sometimes enough of
them occur that a single job will fail.

Dennis Kubes

Raghu Angadi wrote:

> Can you manually try to read one such file with 'hadoop fs -cat'? If it
> is not a transient software error, you should see the checksum error
> again.
Re: Many Checksum Errors

Doug Cutting
In reply to this post by Dennis Kubes
Do you have ECC memory on your nodes?  Nodes without ECC have been known
to trigger high rates of checksum errors.

Doug

Dennis Kubes wrote:

> All,
>
> We are continually experiencing checksum errors when running some jobs
> under heavy load (specifically merging segments or crawldbs).
Re: Many Checksum Errors

Dennis Kubes
Doug,

Do we know whether this is a hardware issue?  If it is possibly a
software issue, I can dedicate some resources to tracking down bugs.  I
would just need a little guidance on where to start looking.

Dennis Kubes

Doug Cutting wrote:

> Do you have ECC memory on your nodes?  Nodes without ECC have been known
> to trigger high rates of checksum errors.
Re: Many Checksum Errors

Doug Cutting
Dennis Kubes wrote:
> Do we know whether this is a hardware issue?  If it is possibly a
> software issue, I can dedicate some resources to tracking down bugs.  I
> would just need a little guidance on where to start looking.

We don't know.  The checksum mechanism is designed to catch hardware
problems.  So one must certainly consider that as a likely cause.  If it
is instead a software bug then it should be reproducible.  Are you
seeing any consistent patterns?  If not, then I'd lean towards hardware.

Michael Stack has some experience tracking down problems with flaky
memory.  Michael, did you use a test program to validate the memory on a
node?

Again, do your nodes have ECC memory?

Doug
Re: Many Checksum Errors

Dennis Kubes


Doug Cutting wrote:

> Dennis Kubes wrote:
>> Do we know whether this is a hardware issue?  If it is possibly a
>> software issue, I can dedicate some resources to tracking down bugs.
>> I would just need a little guidance on where to start looking.
>
> We don't know.  The checksum mechanism is designed to catch hardware
> problems.  So one must certainly consider that as a likely cause.  If it
> is instead a software bug then it should be reproducible.  Are you
> seeing any consistent patterns?  If not, then I'd lean towards hardware.
>
> Michael Stack has some experience tracking down problems with flaky
> memory.  Michael, did you use a test program to validate the memory on a
> node?
>
> Again, do your nodes have ECC memory?

Sorry, I was checking on that.  No, the nodes don't have ECC memory.  I
just priced it out and it is only $20 more per GB to go ECC, so I think
that is what we are going to do.  We are going to run some tests and I
will keep the list updated on the progress.  Thanks for your help.

Dennis Kubes
>
> Doug
Re: Many Checksum Errors

Dennis Kubes
It turns out that ECC memory did the trick.  We replaced all memory on
our 50-node cluster with ECC memory, and it has just completed a
50-million-page crawl and merge with zero errors.  Before, we would
have had 10-20 errors or more on this job.

I still find it interesting that the non-ECC memory passed all burn-in
and hardware tests yet still failed randomly under production
conditions.  I guess a good rule of thumb is that for production Nutch
and Hadoop systems, ECC memory is always the way to go.  Anyway, thanks
for all the help in getting this problem resolved.

Dennis Kubes

Dennis Kubes wrote:

> Sorry, I was checking on that.  No, the nodes don't have ECC memory.  I
> just priced it out and it is only $20 more per GB to go ECC, so I think
> that is what we are going to do.
Re: Many Checksum Errors

Raghu Angadi-2
Dennis Kubes wrote:

> It turns out that ECC memory did the trick.  We replaced all memory on
> our 50-node cluster with ECC memory, and it has just completed a
> 50-million-page crawl and merge with zero errors.  Before, we would
> have had 10-20 errors or more on this job.
>
> I still find it interesting that the non-ECC memory passed all burn-in
> and hardware tests yet still failed randomly under production
> conditions.

This is good validation of how important ECC memory is.  Currently the
HDFS client deletes a block when it notices a checksum error.  Once we
move to block-level CRCs, we should make the Datanode re-validate the
block before deciding to delete it.
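
Roughly this shape, as a hypothetical sketch (this is not the actual
Datanode code, and the real .crc sidecar format has a header and its own
layout; BYTES_PER_SUM just mirrors the io.bytes.per.checksum default):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Hypothetical sketch of "re-validate before delete": re-read a local
// block file in checksum-sized chunks and compare each chunk's CRC32
// with the stored sums. If everything matches, the copy on disk is fine
// and the client's error was likely its own bad memory.
public class RevalidateBlock {
  static final int BYTES_PER_SUM = 512;

  public static boolean blockIsValid(String blockFile, String sumsFile)
      throws IOException {
    FileInputStream data = new FileInputStream(blockFile);
    DataInputStream sums = new DataInputStream(new FileInputStream(sumsFile));
    try {
      byte[] chunk = new byte[BYTES_PER_SUM];
      int n;
      while ((n = readChunk(data, chunk)) > 0) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, n);
        if ((int) crc.getValue() != sums.readInt()) {
          return false;  // this chunk really is corrupt on disk
        }
      }
      return true;       // all stored checksums match: keep the block
    } finally {
      data.close();
      sums.close();
    }
  }

  // Fill buf as far as possible so short reads don't skew the CRC.
  private static int readChunk(FileInputStream in, byte[] buf)
      throws IOException {
    int off = 0;
    while (off < buf.length) {
      int r = in.read(buf, off, buf.length - off);
      if (r < 0) break;
      off += r;
    }
    return off;
  }
}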

Raghu.

Re: Many Checksum Errors

stack-3
In reply to this post by Doug Cutting
Doug Cutting wrote:
> Michael Stack has some experience tracking down problems with flaky
> memory.  Michael, did you use a test program to validate the memory on
> a node?
One of the lads at the Archive used to run CTCS,
http://sourceforge.net/projects/va-ctcs/.  It was good for weeding out
bad hardware.  But we also found that machines that passed multiple CTCS
burn-ins could continue to throw checksum errors (these were non-ECC
machines).

St.Ack
P.S. Pardon the tardy reply.  I have been offline for the last couple
of weeks.
Re: Many Checksum Errors

bigjules
In reply to this post by Doug Cutting
I am also getting intermittent checksum errors during map-reduce jobs.  Mostly they go away when the map or reduce is retried.  One error seems to have made its way into the output of a job.

I cannot get this file from HDFS because of the checksum error.  As you suggest, a faulty memory stick may have caused a corruption of my input file (or the checksum file).

Is this problem rare enough to put it down to faulty memory?  You mention that you have seen it reported before, but I'm wondering whether there have been reports of checksum errors that weren't due to faulty memory (I couldn't find any with a forum search).

I suppose that Dennis Kubes's problems did go away when he replaced his entire cluster's memory with ECC sticks (not all of us have that luxury).

I am running hadoop-0.12.3 on a single Windows Server 2003 machine (using Cygwin) without ECC memory.

Jules



Doug Cutting wrote
Do you have ECC memory on your nodes?  Nodes without ECC have been known
to trigger high rates of checksum errors.

Doug

Re: Many Checksum Errors

Pallavi Palleti
In reply to this post by Dennis Kubes
Hi,
I am also getting checksum errors very frequently in my code.  I am using hadoop-0.13.0.  Below is the checksum error that I am getting.  Can someone please help me in this regard?
org.apache.hadoop.fs.ChecksumException: Checksum error: /tmp/hadoop-user/dfs/dir1/dir2/part-00000 at 0
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:264)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:221)
        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1186)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1171)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1162)
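
For anyone trying to reproduce this, a minimal reader loop like the
following (a sketch against the 0.13 API; it assumes the file's key and
value classes are on the classpath) goes through the same
SequenceFile$Reader initialization and re-verifies checksums on every
record read:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Opens the SequenceFile directly and scans every record; the Reader
// constructor is where the checksum error above is thrown.
public class ScanSeqFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Writable key =
        (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value =
        (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long records = 0;
    while (reader.next(key, value)) {
      records++;                 // each next() reads through the checker
    }
    reader.close();
    System.out.println("read " + records + " records cleanly");
  }
}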

Thanks
       
Dennis Kubes wrote
I can read the files through 'hadoop fs -cat'.  Also, once rescheduled,
the errors will most often fix themselves, although sometimes enough of
them occur that a single job will fail.

Dennis Kubes
