Just getting started w/tutorial- errors in crawl.log


Just getting started w/tutorial- errors in crawl.log

ohaya
Hi,

I've just gotten nutch installed, and am stepping through the tutorial at:

http://lucene.apache.org/nutch/tutorial8.html

It seems to be working, but I get a number of messages in crawl.log, like:

Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

Then, at the end of the log, I get:

LinkDb: adding segment: file:/opt/nutch-1.0/crawl.test/segments/20090713171413
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

I must have missed something, but being new to this, I can't figure out what is causing the problem.

Thanks,
Jim


Re: Just getting started w/tutorial- errors in crawl.log

Alex McLintock
> but I get a number of messages in crawl.log, like:
>
> Error parsing: http://lucene.apache.org/skin/getMenu.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://lucene.apache.org/skin/getMenu.js
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
>        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

I don't see this as an error to worry about. It is just saying that it
has been directed to fetch a ".js" file but doesn't know how to parse
it for values to index or links to crawl. I don't see the need to do
that with JavaScript, so I would treat this "Error" as a warning.
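
If you'd rather not fetch JavaScript at all, one option is to add js to the suffix-exclusion line in conf/crawl-urlfilter.txt (the filter used by the one-step crawl command). This is only a sketch: the stock suffix list varies between Nutch releases, so check the line in your own file rather than copying this one verbatim:

```
# conf/crawl-urlfilter.txt -- the stock file already carries a
# suffix-exclusion line much like this; appending |js|JS keeps
# .js files from being fetched at all.
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS)$
```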


> Then, at the end of the log, I get:
>
> LinkDb: adding segment: file:/opt/nutch-1.0/crawl.test/segments/20090713171413
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch-1.0/crawl.test/segments/20090713171413/parse_data
>        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>
> I must have missed something, but being new, I can't figure out what is causing that problem?
>
> Thanks,
> Jim

Have you told us what commands you ran? Is the hard disk full? What is
actually in that segment? Does it perhaps contain an aborted run?

Can you simply delete that segment directory, if it doesn't hold much
data that you'd mind losing?

Good luck.

Alex

Re: Just getting started w/tutorial- errors in crawl.log

beats
In reply to this post by ohaya
hi Jim,

What I think your error message says is that Nutch couldn't find a
plugin for parsing that particular content type.

Go to parse-plugins.xml in the conf directory; there you will find the
plugin ids defined for the different content types.

Then add the particular plugin id to the plugin.includes property in
your nutch-site.xml file.

In your case, try adding parse-js.

good luck

Beats
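
Concretely, the change beats describes would look something like the following in conf/nutch-site.xml. This is a sketch only: the <value> shown is modeled on a stock Nutch 1.0 nutch-default.xml and may not match your release, so copy the plugin.includes value from your own nutch-default.xml and add "js" to the parse-(...) group rather than retyping this one.

```xml
<!-- conf/nutch-site.xml: override plugin.includes so the parse-js
     plugin is loaded. Sketch only: start from the plugin.includes
     default in your own nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```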


Re: Just getting started w/tutorial- errors in crawl.log

xiao yang
In reply to this post by ohaya
Hi, Jim

I got the second error too. It happens when the previous crawl was
aborted abnormally.
There should be the following sub-directories in /segments/20090713171413:
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

My solution is to delete the corrupted segment directory and re-crawl.
If the error still occurs, see logs/hadoop.log for details.

Xiao
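
In script form, the clean-up Xiao describes could be sketched like this. It only assumes the crawl.test directory layout from the tutorial in this thread, and it treats a missing parse_data sub-directory as the sign of an aborted run; adjust the path and the check to your own setup:

```shell
#!/bin/sh
# Remove any crawl segment that is missing its parse_data output,
# which indicates the fetch/parse step was aborted mid-run.
clean_segments() {
  for seg in "$1"/segments/*/; do
    [ -d "$seg" ] || continue            # glob matched nothing
    if [ ! -d "${seg}parse_data" ]; then
      echo "removing incomplete segment: $seg"
      rm -rf "$seg"
    fi
  done
}

# "crawl.test" is the -dir value used in the tutorial command.
clean_segments crawl.test
```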


Re: Just getting started w/tutorial- errors in crawl.log

ohaya
In reply to this post by Alex McLintock
Alex (et al),

There was/is plenty of space on the drive (>3GB).

I was trying the command line from the tutorial:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

I'm re-running it now to see what happens. If I get that error again, I'll delete the directories, as you and xiao yang suggested.

Jim
