Nutch java.io.exception

Armel T. Nene-2
 

Hi guys,

 

I am currently running Nutch 0.8.2-dev on MS Windows Vista using Sun JVM 6. I
have set up Nutch in my IDE (NetBeans) and it works great. Afterwards, I
applied NUTCH-61 https://issues.apache.org/jira/browse/NUTCH-61 to my local
version. Now, when I run Nutch within the IDE, all the steps are performed
with no problem. I can view the content of the crawldb, and the segments and
index are fine. If I run it in a loop, the process executes without any problem.

 

I then packaged that version and ran it in a testing environment. At first, no
index was being created. I set the Hadoop logging to debug level, as Nutch
wasn't reporting any errors. Some of the debug lines from Hadoop look
suspicious. Below is an extract:

 

From the log, it looks like the problem occurs in the Inject and Generate
stages. Can anybody help me overcome this problem? I will be glad
to provide a working version of NUTCH-61 once it is tested.

 

2007-04-05 16:35:30,976 INFO  mapred.LocalJobRunner - E:/iDna-nutch-RC1/iDna-nutch-launcher/test/urls/urls:0+55
2007-04-05 16:35:31,073 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-04-05 16:35:31,074 INFO  crawl.FetchSchedule - defaultInterval=7.46496E9
2007-04-05 16:35:31,074 INFO  crawl.FetchSchedule - maxInterval=2592000.0
2007-04-05 16:35:31,084 DEBUG io.SequenceFile - running sort pass
2007-04-05 16:35:31,096 INFO  io.SequenceFile - flushing segment 0
2007-04-05 16:35:31,928 INFO  mapred.JobClient -  map 100%  reduce 0%
2007-04-05 16:35:31,940 INFO  mapred.LocalJobRunner - reduce > reduce
2007-04-05 16:35:32,928 INFO  mapred.JobClient - Job complete: job_ui1cje
2007-04-05 16:35:32,928 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2007-04-05 16:35:32,938 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:86)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:97)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
        at org.apache.nutch.crawl.CrawlDb.createJob(CrawlDb.java:74)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:222)
        at org.apache.nutch.crawl.Injector.main(Injector.java:242)
        at com.idna.nutch.launcher.CrawlerManager.injector(CrawlerManager.java:63)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:209)
2007-04-05 16:35:32,943 INFO  conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
2007-04-05 16:35:32,951 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-default.xml
2007-04-05 16:35:32,961 INFO  conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/mapred-default.xml
2007-04-05 16:35:32,966 INFO  conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/mapred-default.xml
2007-04-05 16:35:32,973 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-site.xml
2007-04-05 16:35:32,980 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/hadoop-site.xml
2007-04-05 16:35:33,040 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:86)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:58)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:182)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:292)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:224)
        at org.apache.nutch.crawl.Injector.main(Injector.java:242)
        at com.idna.nutch.launcher.CrawlerManager.injector(CrawlerManager.java:63)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:209)
2007-04-05 16:35:33,501 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2007-04-05 16:35:33,501 INFO  crawl.FetchSchedule - defaultInterval=7.46496E9
2007-04-05 16:35:33,501 INFO  crawl.FetchSchedule - maxInterval=2592000.0
2007-04-05 16:35:33,508 DEBUG io.SequenceFile - running sort pass
2007-04-05 16:35:33,514 INFO  io.SequenceFile - flushing segment 0
2007-04-05 16:35:33,639 INFO  mapred.LocalJobRunner - reduce > reduce
2007-04-05 16:35:34,120 INFO  mapred.JobClient - Job complete: job_qzwgkh
2007-04-05 16:35:34,429 INFO  crawl.Injector - Injector: done
2007-04-05 16:35:34,439 INFO  crawl.Generator - topN: 100
2007-04-05 16:35:34,439 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Generator.main(Generator.java:416)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)
2007-04-05 16:35:34,443 INFO  conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
2007-04-05 16:35:34,450 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-default.xml
2007-04-05 16:35:34,462 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-site.xml
2007-04-05 16:35:34,468 INFO  conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/hadoop-site.xml
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: starting
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: segment: test/segments/20070405163535
2007-04-05 16:35:35,470 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2007-04-05 16:35:35,471 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:86)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:97)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:309)
        at org.apache.nutch.crawl.Generator.main(Generator.java:417)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)

 

===========================

Armel T. Nene

iDNA Solutions LTD

Tel: +44 (20) 7257 6124

Mobile: +44 (7886)950 483

Web: http://www.idna-solutions.com

Blog: http://blog.idna-solutions.com

 


Re: Nutch java.io.exception

Doğacan Güney-3
Hi,

That is not a problem, AFAIK. Hadoop, for some reason, has code like this
in Configuration's constructor:

    if (LOG.isDebugEnabled()) {
      LOG.debug(StringUtils.stringifyException(new IOException("config()")));
    }

That is what you are seeing. Since that exception is created but not thrown,
it should be harmless. To track down your problem, you may want to run
"readseg -list" after generation and fetch to see if the numbers make sense.
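
To illustrate why such a trace is harmless, here is a minimal standalone sketch of the same pattern. It does not use Hadoop at all: the class ConfigTraceDemo and its stringify and construct methods are hypothetical names made up for this example, and stringify only approximates what a helper like stringifyException presumably does. The point is that an exception object captures a stack trace when it is created, so it can be printed to show who called the constructor without ever being thrown.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical demo class; not part of Hadoop or Nutch.
class ConfigTraceDemo {

    // Render a Throwable's stack trace as a String (assumed to be roughly
    // what a "stringify exception" helper does).
    static String stringify(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    // Stand-in for a constructor that records its caller when debugging
    // is enabled. The exception is only created, never thrown.
    static String construct(boolean debugEnabled) {
        if (debugEnabled) {
            return stringify(new IOException("config()"));
        }
        return "";
    }

    public static void main(String[] args) {
        // Prints a trace that looks like an error in the log...
        System.out.print(construct(true));
        // ...but execution continues normally afterwards.
        System.out.println("still running: no exception was thrown");
    }
}
```

With debugging disabled, construct(false) returns an empty string, which matches the traces showing up only once the log level was switched to debug.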

On 4/10/07, Armel T. Nene <[hidden email]> wrote:

> [...]


--
Doğacan Güney