[jira] [Commented] (NUTCH-2756) Segment Part problem with HDFS on distributed mode

    [ https://issues.apache.org/jira/browse/NUTCH-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992437#comment-16992437 ]

Sebastian Nagel commented on NUTCH-2756:
----------------------------------------

The killed container was one launched speculatively:
{noformat}2019-12-10 06:34:22,872 INFO [DefaultSpeculator background processing] org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator: DefaultSpeculator.addSpeculativeAttempt -- we are speculating task_1575911127307_0231_r_000001
2019-12-10 06:34:22,872 INFO [DefaultSpeculator background processing] org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator: We launched 1 speculations.  Sleeping 15000 milliseconds.
{noformat}
For a small cluster with only a few tasks, speculative execution makes little sense; you can disable it by setting the properties mapreduce.map.speculative and mapreduce.reduce.speculative to false. However, speculative execution shouldn't lead to broken job output.
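For reference, a minimal sketch of the two entries, assuming they are added to mapred-site.xml on the submitting node (setting them in Nutch's conf/nutch-site.xml, which is included in the job configuration, should work as well):
{noformat}
<!-- turn off speculative execution of map and reduce tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
{noformat}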

> Segment Part problem with HDFS on distributed mode
> -------------------------------------------------
>
>                 Key: NUTCH-2756
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2756
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Lucas Pauchard
>            Priority: Major
>         Attachments: 0_byte_file_screenshot.PNG, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, syslog, yarn-env.sh, yarn-site.xml
>
>
> It sometimes happens that parts of the data on HDFS are missing after the parsing step.
> When I take a look at our HDFS, I see a file with 0 bytes (see attachments).
> After that, the CrawlDB complains about this specific (corrupted?) part:
> {panel:title=log_crawl}
> 2019-12-04 22:25:57,454 INFO mapreduce.Job: Task Id : attempt_1575479127636_0047_m_000017_2, Status : FAILED
> Error: java.io.EOFException: hdfs://jobmaster:9000/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 not a SequenceFile
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1964)
>         at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1923)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1872)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1886)
>         at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.initialize(SequenceFileRecordReader.java:54)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:560)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:798)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> {panel}
> When I check the namenode logs, I don't see any error during the writing of the segment part, but one hour later I get the following log entries:
> {panel:title=log_namenode}
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 2], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/parse_data/part-r-00004/index closed.
> 2019-12-04 23:23:13,750 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_attempt_1575479127636_0046_r_000004_1_1307945884_1, pending creates: 1], src=/user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004
> 2019-12-04 23:23:13,750 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* internalReleaseLease: All existing blocks are COMPLETE, lease removed, file /user/hadoop/crawlmultiokhttp/segment/20191204221308/crawl_parse/part-r-00004 closed.
> {panel}
> This issue is hard to reproduce and I can't figure out what the preconditions are. It seems to happen randomly.
> Maybe the problem comes from bad handling when the file is closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)