Error with Hadoop-0.4.0

Error with Hadoop-0.4.0

Jérôme Charron
Hi,

I encountered some problems with the Nutch trunk version.
It seems to be related to the Hadoop 0.4.0 upgrade and JDK 1.5
(more precisely, since HADOOP-129 and the replacement of File by Path).

In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory
/localpathcrawl/crawldb/current in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code and simply changing line 145 of Injector
to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)),
everything works fine. Taking a closer look at the CrawlDb code, I finally
don't understand why the following line is in the createJob method:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
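To spell out the difference between the two calls: addInputPath appends to the
job's list of input directories, while setInputPath replaces the whole list. A
minimal sketch (a hypothetical stand-in class, not Hadoop's real JobConf) of why
the crawldb "current" directory added in createJob stays on the list with
addInputPath:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JobConf's input-path handling, to illustrate
// append-vs-replace semantics; not the actual Hadoop implementation.
public class InputPathSketch {
    private final List<String> inputPaths = new ArrayList<>();

    void addInputPath(String path) {   // appends to the list
        inputPaths.add(path);
    }

    void setInputPath(String path) {   // replaces the whole list
        inputPaths.clear();
        inputPaths.add(path);
    }

    public static void main(String[] args) {
        InputPathSketch job = new InputPathSketch();
        // createJob() unconditionally adds crawldb/current,
        // which does not exist yet on a fresh crawl:
        job.addInputPath("crawl/crawldb/current");
        // Injector line 145 then adds its temporary directory:
        job.addInputPath("/tmp/inject-temp");
        // Both paths are now job inputs, so Hadoop 0.4.0 rejects the job
        // when crawl/crawldb/current is missing; setInputPath("/tmp/inject-temp")
        // would have dropped the first entry.
        System.out.println(job.inputPaths);
    }
}
```

Running it prints both paths, which is exactly the input set Hadoop 0.4.0
refuses to validate on a fresh crawl.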

Out of curiosity, can a Hadoop guru explain why there is such a
regression?

Does somebody have the same error?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Error with Hadoop-0.4.0

Sami Siren-2
Jérôme Charron wrote:

> Hi,
>
> I encountered some problems with Nutch trunk version.
> In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK
> 1.5
> (more precisely since HADOOP-129 and File replacement by Path).
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux)
configuration, local fs, without problems).

--
 Sami Siren

Re: Error with Hadoop-0.4.0

Jérôme Charron
> > I encountered some problems with the Nutch trunk version.
> > It seems to be related to the Hadoop 0.4.0 upgrade and JDK 1.5
> > (more precisely, since HADOOP-129 and the replacement of File by Path).
> > Does somebody have the same error?
>
> I am not seeing this (just ran inject on a single-machine (Linux)
> configuration, local fs, without problems).

Thanks for your feedback Sami.
The strange thing is that I have exactly the same behavior on two different
boxes!

Jérôme

Re: Error with Hadoop-0.4.0

Stefan Groschupf-2
In reply to this post by Jérôme Charron
Hi Jérôme,

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix it.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:

> I encountered some problems with the Nutch trunk version.
> It seems to be related to the Hadoop 0.4.0 upgrade and JDK 1.5
> (more precisely, since HADOOP-129 and the replacement of File by Path).
> [...]
> Does somebody have the same error?


Re: Error with Hadoop-0.4.0

Jérôme Charron
> I have the same problem in a distributed environment! :-(
> So I think I can confirm this is a bug.

Thanks for this feedback Stefan.


> We should fix that.

What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed neither by Injector.inject()
nor by CrawlDb.update().

If there are no objections, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Error with Hadoop-0.4.0

Stefan Groschupf-2
We tried your suggested fix:
> changing line 145 of Injector to mergeJob.setInputPath(tempDir)
> (instead of mergeJob.addInputPath(tempDir))

and this worked without any problem.

Thanks for catching that, this saved us a lot of time.
Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:

> What I suggest, is simply to remove the line 75 in createJob method from
> CrawlDb :
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> [...]
> If no objection, I will commit this change tomorrow.


Re: Error with Hadoop-0.4.0

Andrzej Białecki-2
In reply to this post by Jérôme Charron
Jérôme Charron wrote:

>
> What I suggest is simply to remove line 75 in the createJob method of
> CrawlDb:
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
> In fact, this method is only used by Injector.inject() and CrawlDb.update(),
> and the input path set in createJob is needed neither by Injector.inject()
> nor by CrawlDb.update().

Hold your horses - it IS needed; otherwise you will lose the original
information from the CrawlDb.

>
> If no objection, I will commit this change tomorrow.

-1.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Error with Hadoop-0.4.0

Andrzej Białecki-2
In reply to this post by Stefan Groschupf-2
Stefan Groschupf wrote:
> We tried your suggested fix:
>> Injector
>> by mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath
>> (tempDir))

I suspect that this is not the right solution - have you actually tested
that the resulting db contains all entries from the input dirs?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Error with Hadoop-0.4.0

Gal Nitzan
In reply to this post by Sami Siren-2
To get the same behavior, just try to inject into a new crawldb that doesn't
exist.

The reason many don't see it is that the crawldb already exists in their
environment.
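The condition can be illustrated with plain JDK code, no Hadoop needed (the
path below is hypothetical; substitute your own crawldb location):

```java
import java.io.File;

// Sketch of the failure condition Gal describes: on a fresh crawl the
// crawldb "current" directory has not been created yet, so a job that
// lists it as an input directory fails Hadoop 0.4.0's validation.
public class FreshCrawlDbCheck {
    public static void main(String[] args) {
        File current = new File("crawl/crawldb/current"); // hypothetical path
        if (current.exists()) {
            System.out.println("existing crawldb: input dir present, job submits");
        } else {
            System.out.println("fresh crawldb: input dir missing, job is invalid");
        }
    }
}
```

Run from a directory without an existing crawldb, it reports the missing
input dir, which is exactly the case the inject error message is complaining
about.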



-----Original Message-----
From: Sami Siren [mailto:[hidden email]]
Sent: Thursday, July 06, 2006 7:23 PM
To: [hidden email]
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:

> I encountered some problems with the Nutch trunk version.
> [...]
> Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (Linux)
configuration, local fs, without problems).

--
 Sami Siren



Re: Error with Hadoop-0.4.0

Doug Cutting
In reply to this post by Jérôme Charron
Jérôme Charron wrote:

> In my environment, the crawl command terminate with the following error:
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient
> (JobClient.java:submitJob(273))
> - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread "main" java.io.IOException: Input directory
> /localpathcrawl/crawldb/current in local is invalid.
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>        at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Hadoop 0.4.0 by default requires all input directories to exist, where
previous releases did not.  So we need to either create an empty
"current" directory or change the InputFormat used in
CrawlDb.createJob() to be one that overrides
InputFormat.areValidInputDirectories().  The former is probably easier.
I've attached a patch.  Does this fix things for folks?

Doug

Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java (revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java (working copy)
@@ -65,7 +65,8 @@
     if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+    throws IOException {
     Path newCrawlDb =
       new Path(crawlDb,
                Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
     JobConf job = new NutchJob(config);
     job.setJobName("crawldb " + crawlDb);
 
-    job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+    Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+    if (FileSystem.get(job).exists(current)) {
+      job.addInputPath(current);
+    }
     job.setInputFormat(SequenceFileInputFormat.class);
     job.setInputKeyClass(UTF8.class);
     job.setInputValueClass(CrawlDatum.class);

Re: Error with Hadoop-0.4.0

Sami Siren-2
In reply to this post by Gal Nitzan
Gal Nitzan wrote:

> To get the same behavior, just try to inject into a new crawldb that
> doesn't exist.
>
> The reason many don't see it is that the crawldb already exists in their
> environment.

True, I was injecting into an existing crawldb.

--
 Sami Siren

Re: Error with Hadoop-0.4.0

Sami Siren-2
In reply to this post by Doug Cutting
Doug Cutting wrote:

> Jérôme Charron wrote:
>
> > In my environment, the crawl command terminate with the following error:
> > Input directory /localpath/crawl/crawldb/current in local is invalid.
> > [...]
>
> Hadoop 0.4.0 by default requires all input directories to exist,
> where previous releases did not. So we need to either create an
> empty "current" directory or change the InputFormat used in
> CrawlDb.createJob() to be one that overrides
> InputFormat.areValidInputDirectories(). The former is probably
> easier. I've attached a patch. Does this fix things for folks?
>

Patch works for me.
--
 Sami Siren


Re: Error with Hadoop-0.4.0

Doug Cutting
Sami Siren wrote:
> Patch works for me.

OK.  I just committed it.

Thanks!

Doug