[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160972#comment-16160972 ]

ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------

sebastian-nagel commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#discussion_r138018731
 
 

 ##########
 File path: src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
 ##########
 @@ -29,73 +29,84 @@
 import org.apache.hadoop.io.Writable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
-import org.apache.hadoop.mapred.FileOutputFormat;
+import org.apache.hadoop.util.Progressable;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapred.InvalidJobConfException;
-import org.apache.hadoop.mapred.OutputFormat;
-import org.apache.hadoop.mapred.RecordWriter;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.SequenceFileOutputFormat;
+import org.apache.hadoop.mapreduce.OutputFormat;
+import org.apache.hadoop.mapreduce.RecordWriter;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.util.Progressable;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseOutputFormat;
 import org.apache.nutch.protocol.Content;
 
 /** Splits FetcherOutput entries into multiple map files. */
-public class FetcherOutputFormat implements OutputFormat<Text, NutchWritable> {
+public class FetcherOutputFormat extends FileOutputFormat<Text, NutchWritable> {
 
-  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
+  @Override
+  public void checkOutputSpecs(JobContext job) throws IOException {
+    Configuration conf = job.getConfiguration();
+    FileSystem fs = FileSystem.get(conf);
     Path out = FileOutputFormat.getOutputPath(job);
     if ((out == null) && (job.getNumReduceTasks() != 0)) {
-      throw new InvalidJobConfException("Output directory not set in JobConf.");
+      throw new InvalidJobConfException("Output directory not set in conf.");
     }
     if (fs == null) {
-      fs = out.getFileSystem(job);
+      fs = out.getFileSystem(conf);
     }
     if (fs.exists(new Path(out, CrawlDatum.FETCH_DIR_NAME)))
       throw new IOException("Segment already fetched!");
   }
 
-  public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
-      final JobConf job, final String name, final Progressable progress)
+  @Override
+  public RecordWriter<Text, NutchWritable> getRecordWriter(TaskAttemptContext context)
           throws IOException {
 
-    Path out = FileOutputFormat.getOutputPath(job);
+    Configuration conf = context.getConfiguration();
+    String name = context.getJobName();//getTaskAttemptID().toString();
+    Path dir = FileOutputFormat.getOutputPath(context);
+    FileSystem fs = dir.getFileSystem(context.getConfiguration());
+    Path out = FileOutputFormat.getOutputPath(context);
 
 Review comment:
  This will change the output folder structure and will probably cause collisions of output folders when run in distributed mode (on a Hadoop cluster). The directory tree of a segment should look as before:
   ```
   crawl/segments/20170816093452/                                                                                                                                
   |-- content                                                                                                                                                  
   |   `-- part-00000
   |       |-- data
   |       `-- index
   |-- crawl_fetch
   |   `-- part-00000
   |       |-- data
   |       `-- index
   |-- crawl_generate
   |   `-- part-00000
   |-- crawl_parse
   |   `-- part-00000
   |-- parse_data
   |   `-- part-00000
   |       |-- data
   |       `-- index
   `-- parse_text
       `-- part-00000
           |-- data
           `-- index
   ```
   
  Some changes are expected from the MapReduce upgrade (part-xxxxx -> part-r-xxxxx), but the tree is now
   ```
   crawl/segments/20170911103223/
   |-- content
   |   `-- FetchData
   |       |-- data
   |       `-- index
   |-- crawl_fetch
   |   `-- FetchData
   |       |-- data
   |       `-- index
   |-- crawl_generate
   |   `-- part-r-00000
   |-- crawl_parse
   |   `-- parse\ crawl
   |       `-- segments
   |           `-- 20170911103223
   |-- parse_data
   |   `-- parse\ crawl
   |       `-- segments
   |           `-- 20170911103223
   |               |-- data
   |               `-- index
   `-- parse_text
       `-- parse\ crawl
           `-- segments
               `-- 20170911103223
                   |-- data
                   `-- index
   ```
   
  which makes the crawl fail, e.g. with
   ```
   CrawlDb update: java.io.FileNotFoundException: File file:.../crawl/segments/20170911103223/crawl_parse/parse crawl/data does not exist
           at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
           at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
           at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
           at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
   ```
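  For reference, the expected `part-r-00000` names come from the partition number of the reduce task, not from the job name. A minimal sketch of that naming convention (plain Java, no Hadoop dependency; `partitionName` is a hypothetical helper mirroring what Hadoop's `FileOutputFormat.getUniqueFile()` produces from the task attempt's type and partition id):
  ```java
  import java.text.NumberFormat;

  public class PartitionNames {
      // Mirrors Hadoop's "part-r-00000" convention: fixed prefix, task type
      // ('m' for map, 'r' for reduce), and a zero-padded 5-digit partition id.
      static String partitionName(char taskType, int partition) {
          NumberFormat fmt = NumberFormat.getInstance();
          fmt.setMinimumIntegerDigits(5);
          fmt.setGroupingUsed(false);
          return "part-" + taskType + "-" + fmt.format(partition);
      }

      public static void main(String[] args) {
          System.out.println(partitionName('r', 0));  // part-r-00000
          System.out.println(partitionName('m', 12)); // part-m-00012
      }
  }
  ```
  In the new API, the `RecordWriter` would derive such a name from `context.getTaskAttemptID().getTaskID()` rather than `context.getJobName()`, which is what produces the broken `parse\ crawl/...` paths above.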
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch is still using the deprecated org.apache.hadoop.mapred API. It needs to be updated to the org.apache.hadoop.mapreduce API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)