Jérôme Charron wrote:
> In my environment, the crawl command terminates with the following error:
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient
> (JobClient.java:submitJob(273))
> - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread "main" java.io.IOException: Input directory
> /localpath/crawl/crawldb/current in local is invalid.
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Hadoop 0.4.0 by default requires all input directories to exist, where
previous releases did not. So we need to either create an empty
"current" directory or change the InputFormat used in
CrawlDb.createJob() to be one that overrides
InputFormat.areValidInputDirectories(). The former is probably easier.
I've attached a patch. Does this fix things for folks?
Doug
Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java (revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java (working copy)
@@ -65,7 +65,8 @@
if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
}
- public static JobConf createJob(Configuration config, Path crawlDb) {
+ public static JobConf createJob(Configuration config, Path crawlDb)
+ throws IOException {
Path newCrawlDb =
new Path(crawlDb,
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
JobConf job = new NutchJob(config);
job.setJobName("crawldb " + crawlDb);
- job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+ Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+ if (FileSystem.get(job).exists(current)) {
+ job.addInputPath(current);
+ }
job.setInputFormat(SequenceFileInputFormat.class);
job.setInputKeyClass(UTF8.class);
job.setInputValueClass(CrawlDatum.class);
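For reference, the guard the patch adds is the usual check-before-add pattern: only register an input directory if it actually exists, so Hadoop 0.4.0's input validation doesn't reject the job on a fresh crawldb. A minimal standalone sketch of the same pattern, using plain java.nio.file rather than Hadoop's FileSystem API (class and method names here are illustrative, not Nutch code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class InputPathGuard {

    // Keep only candidate input directories that actually exist,
    // mirroring the FileSystem.get(job).exists(current) guard in the patch.
    static List<Path> validInputDirs(List<Path> candidates) {
        List<Path> valid = new ArrayList<>();
        for (Path p : candidates) {
            if (Files.isDirectory(p)) {
                valid.add(p);
            }
        }
        return valid;
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("current");
        Path missing = existing.resolve("does-not-exist");

        List<Path> valid = validInputDirs(List.of(existing, missing));
        // Only the directory that exists is registered as an input path.
        System.out.println(valid.size());  // prints 1
    }
}
```

On a first run, when crawldb/current has never been written, the job simply gets no input paths instead of failing validation; subsequent update runs pick the directory up once it exists.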