[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161253#comment-16161253 ]

ASF GitHub Bot commented on NUTCH-2375:
---------------------------------------

Omkar20895 commented on a change in pull request #221: NUTCH-2375 Upgrading nutch to use org.apache.hadoop.mapreduce
URL: https://github.com/apache/nutch/pull/221#discussion_r138073377
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -368,41 +367,46 @@ public void close() {
     closeReaders();
   }
 
-  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort) throws IOException{
+  private TreeMap<String, LongWritable> processStatJobHelper(String crawlDb, Configuration config, boolean sort)
+          throws IOException, InterruptedException, ClassNotFoundException{
   Path tmpFolder = new Path(crawlDb, "stat_tmp" + System.currentTimeMillis());
 
-  JobConf job = new NutchJob(config);
+  Job job = NutchJob.getInstance(config);
+          config = job.getConfiguration();
   job.setJobName("stats " + crawlDb);
-  job.setBoolean("db.reader.stats.sort", sort);
+  config.setBoolean("db.reader.stats.sort", sort);
 
   FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
-  job.setInputFormat(SequenceFileInputFormat.class);
+  job.setInputFormatClass(SequenceFileInputFormat.class);
 
   job.setMapperClass(CrawlDbStatMapper.class);
   job.setCombinerClass(CrawlDbStatCombiner.class);
   job.setReducerClass(CrawlDbStatReducer.class);
 
   FileOutputFormat.setOutputPath(job, tmpFolder);
-  job.setOutputFormat(SequenceFileOutputFormat.class);
+  job.setOutputFormatClass(SequenceFileOutputFormat.class);
   job.setOutputKeyClass(Text.class);
   job.setOutputValueClass(LongWritable.class);
 
   // https://issues.apache.org/jira/browse/NUTCH-1029
-  job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
-
-  JobClient.runJob(job);
+  config.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
 
+          try {
+            int complete = job.waitForCompletion(true)?0:1;
+          } catch (InterruptedException | ClassNotFoundException e) {
+            LOG.error(StringUtils.stringifyException(e));
+            throw e;
+          }
   // reading the result
   FileSystem fileSystem = tmpFolder.getFileSystem(config);
-  SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(config,
-  tmpFolder);
+  MapFile.Reader[] readers = MapFileOutputFormat.getReaders(tmpFolder, config);
 
 Review comment:
   @sebastian-nagel The new-API SequenceFileOutputFormat (org.apache.hadoop.mapreduce) does not provide a getReaders method. One option is to replicate the old-API SequenceFileOutputFormat.getReaders in a separate utility class under org/apache/nutch/util/. Please let me know your thoughts on this.
   
   The implementation of the old API getReaders can be found [here](https://hadoop.apache.org/docs/r2.6.1/api/src-html/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#line.84). Thanks.
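
   A minimal sketch of what such a utility could look like, assuming the old behaviour (list the output directory, sort the part files by name, open one reader per file) is all that is needed. The class name `SequenceFileReaderUtil` is hypothetical, the argument order simply mirrors `MapFileOutputFormat.getReaders(Path, Configuration)`, and the filter that skips `_`- and `.`-prefixed files is an addition that the old method did not have:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.SequenceFile;

/** Hypothetical helper; the name is illustrative and not part of Hadoop. */
public class SequenceFileReaderUtil {

  /** Assumption: files such as _SUCCESS or hidden dot-files are not part files. */
  private static final PathFilter PART_FILES = new PathFilter() {
    @Override
    public boolean accept(Path path) {
      String name = path.getName();
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };

  /** Opens one SequenceFile.Reader per part file found directly under dir. */
  public static SequenceFile.Reader[] getReaders(Path dir, Configuration conf)
      throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    Path[] names = FileUtil.stat2Paths(fs.listStatus(dir, PART_FILES));

    // Sort by name so that readers[i] corresponds to partition i.
    Arrays.sort(names);

    SequenceFile.Reader[] parts = new SequenceFile.Reader[names.length];
    for (int i = 0; i < names.length; i++) {
      parts[i] = new SequenceFile.Reader(conf, SequenceFile.Reader.file(names[i]));
    }
    return parts;
  }
}
```

   With a helper like this, the call in processStatJobHelper could stay on SequenceFile.Reader[] instead of switching to MapFile.Reader[]. Callers would still be responsible for closing every returned reader.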
 


> Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-2375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2375
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Omkar Reddy
>
> Nutch still uses the deprecated org.apache.hadoop.mapred package. It needs to be updated to use org.apache.hadoop.mapreduce.


