[jira] Created: (NUTCH-170) Crash with multiple temp directories

[jira] Created: (NUTCH-170) Crash with multiple temp directories

Nick Burch (Jira)
Crash with multiple temp directories
------------------------------------

         Key: NUTCH-170
         URL: http://issues.apache.org/jira/browse/NUTCH-170
     Project: Nutch
        Type: Bug
    Reporter: Rod Taylor
    Priority: Critical


A brief read of the code suggested that multiple local directories can be configured with something like the following:

  <property>
    <name>mapred.local.dir</name>
    <value>/local,/local1,/local2</value>
    <description>The local directory where MapReduce stores intermediate
    data files.
    </description>
  </property>
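
For readers unfamiliar with this kind of setting, here is a minimal sketch of how a comma-separated value like the one above could be split into directories and how one directory could be chosen per file. It is an illustration only: the class `LocalDirsSketch` and the methods `parseDirs` and `pickDir` are hypothetical names, and selecting by file-name hash is an assumption, not necessarily what Nutch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not taken from the Nutch source.
public class LocalDirsSketch {

    /** Splits a comma-separated directory list, trimming whitespace and skipping empty entries. */
    static List<String> parseDirs(String value) {
        List<String> dirs = new ArrayList<>();
        for (String part : value.split(",")) {
            String dir = part.trim();
            if (!dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    /** Picks a directory for a file name; the same name always maps to the same directory. */
    static String pickDir(List<String> dirs, String fileName) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(fileName.hashCode(), dirs.size());
        return dirs.get(index);
    }

    public static void main(String[] args) {
        List<String> dirs = parseDirs("/local,/local1,/local2");
        System.out.println(dirs); // prints [/local, /local1, /local2]
        System.out.println(pickDir(dirs, "part-00001"));
    }
}
```

The point of a deterministic choice is that a writer and a later reader of the same file agree on its location; if they disagreed, a data file and its checksum file could end up being looked up in different directories.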

This configuration failed with the exception below during either the generate or the update phase (I am not entirely sure which).

java.lang.ArrayIndexOutOfBoundsException
        at java.util.zip.CRC32.update(CRC32.java:51)
        at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:92)
        at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:156)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
        at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
        at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
        at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
        at org.apache.nutch.segment.SegmentReader$InputFormat$1.next(SegmentReader.java:80)
        at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
        at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
        at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-170) Crash with multiple temp directories

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362355 ]

Rod Taylor commented on NUTCH-170:
----------------------------------

I wish there were an "edit" option in JIRA.  Obviously the failure was within the SegmentReader (see the stack trace), though I don't believe it does anything special when reading the data file off disk.

I do not see, and have never seen, this exception with a single local directory configuration.




[jira] Commented: (NUTCH-170) Crash with multiple temp directories

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362482 ]

Doug Cutting commented on NUTCH-170:
------------------------------------

I have successfully used mapred.local.dir with multiple values on many occasions.

Can you please try to distill this into an easy-to-reproduce test case?  Thanks.




[jira] Commented: (NUTCH-170) Crash with multiple temp directories

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12365169 ]

Rod Taylor commented on NUTCH-170:
----------------------------------

I enabled multiple local directories on 50% of my machines. The error, which does not appear otherwise (we processed a few hundred million pages without error), occurs on all machines, BUT always on part-## files that were created on the machines configured with multiple local directories.

Secondly, the error occurs only during the segread command (org.apache.nutch.segment.SegmentReader), while reading the data from common storage (a NAS in our case).

060203 222553 task_m_4x4zan 0.7223333% /opt/sitesell/sbider_data/nutch/segments/segmentset-2006-02-02254/20060202181143-3/content/part-00001/data:0+21025408
060203 222553 task_m_4x4zan  Problem reading checksum file: java.io.EOFException. Ignoring.
060203 222553 task_m_4x4zan  Error running child
060203 222553 task_m_4x4zan java.lang.ArrayIndexOutOfBoundsException
060203 222553 task_m_4x4zan     at java.util.zip.CRC32.update(CRC32.java:43)
060203 222553 task_m_4x4zan     at org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:92)
060203 222553 task_m_4x4zan     at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:156)
060203 222553 task_m_4x4zan     at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
060203 222553 task_m_4x4zan     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
060203 222553 task_m_4x4zan     at java.io.DataInputStream.readFully(DataInputStream.java:176)
060203 222553 task_m_4x4zan     at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060203 222553 task_m_4x4zan     at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060203 222553 task_m_4x4zan     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
060203 222553 task_m_4x4zan     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
060203 222553 task_m_4x4zan     at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
060203 222553 task_m_4x4zan     at org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
060203 222553 task_m_4x4zan     at org.apache.nutch.segment.SegmentReader$InputFormat$1.next(SegmentReader.java:80)
060203 222553 task_m_4x4zan     at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
060203 222553 task_m_4x4zan     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
060203 222553 task_m_4x4zan     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060203 222553 task_m_4x4zan     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)

060203 222557 task_m_4x4zan done; removing files.
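
The log above shows an EOFException on the checksum file being ignored, followed by an ArrayIndexOutOfBoundsException thrown from java.util.zip.CRC32.update. The thread does not identify the root cause, so the following is only a sketch of one way such an exception can arise in general: CRC32.update(byte[], int, int) rejects a negative length, and InputStream.read returns -1 at end of stream. The class `ChecksumReadSketch` and its methods are hypothetical and are not the Nutch NFSDataInputStream code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical illustration; not taken from the Nutch source.
public class ChecksumReadSketch {

    /** Unsafe: passes the result of read() straight to CRC32, even when it is -1. */
    static int readUnchecked(InputStream in, byte[] buf, CRC32 sum) throws IOException {
        int n = in.read(buf, 0, buf.length);
        sum.update(buf, 0, n); // n == -1 at end of stream: ArrayIndexOutOfBoundsException
        return n;
    }

    /** Safer: updates the checksum only when bytes were actually read. */
    static int readChecked(InputStream in, byte[] buf, CRC32 sum) throws IOException {
        int n = in.read(buf, 0, buf.length);
        if (n > 0) {
            sum.update(buf, 0, n);
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[16];
        try {
            // An empty stream stands in for a truncated data or checksum file.
            readUnchecked(new ByteArrayInputStream(new byte[0]), buf, new CRC32());
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked read failed: " + e);
        }
        int n = readChecked(new ByteArrayInputStream(new byte[0]), buf, new CRC32());
        System.out.println("checked read returned " + n); // prints "checked read returned -1"
    }
}
```

Whether the Checker class in the trace behaves this way is an assumption; the sketch only shows that a length or offset that goes out of range after a short or failed read is enough to produce this exception type from CRC32.update.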

