Crawl and parse exceptions

Matt Zytaruk
I've been having a lot of trouble lately with the newest Nutch source. Both
my crawls and parses are failing. (For our fetches we crawl and parse at
the same time with just the default Nutch config, to get the outlinks and
update the crawldb; later, after the fetch, we do another parse with
custom parse filters.) The exceptions are below.

This exception happens sometimes when crawling (on the linkdb part of
the crawl):

Exception in thread "main" java.io.IOException: Not a file: /user/nutch/segments/20060107130328/parse_data/part-00000/data
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:131)

We also got this for a while (it seems the mapred/system dir is never
being created for some reason):
java.io.IOException: Cannot open filename /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
       at org.apache.nutch.ipc.Client.call(Client.java:294)
       at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
       at $Proxy1.open(Unknown Source)
       at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.openInfo(NDFSClient.java:256)
       at org.apache.nutch.ndfs.NDFSClient$NDFSInputStream.<init>(NDFSClient.java:242)
       at org.apache.nutch.ndfs.NDFSClient.open(NDFSClient.java:79)
       at org.apache.nutch.fs.NDFSFileSystem.openRaw(NDFSFileSystem.java:66)
       at org.apache.nutch.fs.NFSDataInputStream$Checker.<init>(NFSDataInputStream.java:45)
       at org.apache.nutch.fs.NFSDataInputStream.<init>(NFSDataInputStream.java:221)
       at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:160)
       at org.apache.nutch.fs.NutchFileSystem.open(NutchFileSystem.java:149)
       at org.apache.nutch.fs.NDFSFileSystem.copyToLocalFile(NDFSFileSystem.java:221)
       at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:346)
       at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:332)
       at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:232)
       at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:286)
       at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:651)

Then, on parsing, we got this within 10 seconds of the parse starting:

060109 093759 task_m_ltgpnj  Error running child
060109 093759 task_m_ltgpnj java.lang.RuntimeException: java.io.EOFException
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:57)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.getContent(Content.java:124)
060109 093759 task_m_ltgpnj     at org.apache.nutch.crawl.MD5Signature.calculate(MD5Signature.java:33)
060109 093759 task_m_ltgpnj     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:62)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:52)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
060109 093759 task_m_ltgpnj     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060109 093759 task_m_ltgpnj Caused by: java.io.EOFException
060109 093759 task_m_ltgpnj     at java.io.DataInputStream.readFully(DataInputStream.java:268)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readChars(UTF8.java:212)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.UTF8.readString(UTF8.java:204)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.ContentProperties.readFields(ContentProperties.java:169)
060109 093759 task_m_ltgpnj     at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:81)
060109 093759 task_m_ltgpnj     at org.apache.nutch.io.CompressedWritable.ensureInflated(CompressedWritable.java:54)
060109 093759 task_m_ltgpnj     ... 6 more
060109 093802 task_m_txrnu3 done; removing files.
060109 093802 Server connection on port 50050 from 127.0.0.2: exiting
060109 093805 task_m_ltgpnj done; removing files.
060109 093805 Lost connection to JobTracker [crawler-d-03.internal.wavefire.ca/127.0.0.2:8050]. ex=java.lang.NullPointerException  Retrying...

On a different segment we got this instead:
Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /nutch-data/nutch/tmp/nutch/mapred/local/jobTracker/job_tn7u97.xml , nutch-site.xml
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.submitJob(Unknown Source)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:95)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:113)

(I think you usually get this error when you don't pass the right
filenames as arguments, but that is definitely not the case here.)


These are all tasks on segments that worked fine before we changed the
source code (until then we had been working with the source from about the
beginning of December). It's also not a permissions issue, since it all
worked fine before. The only things that have changed are the updated code
and the number of map/reduce tasks in the config. (Side note: what is the
best number of tasks to use for each? We have a set of 2 machines that
work together to crawl, and a set of 3 machines that work together to
parse/index.)

Any help would be much appreciated, as otherwise I am doomed. Thanks
in advance.

-Matt Zytaruk
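
For readers following the parse trace above: the innermost frame of the "Caused by" section is java.io.DataInputStream.readFully, which throws EOFException when the stream ends before the requested number of bytes has been read. The sketch below reproduces that behaviour with plain JDK classes only. It is an illustration under stated assumptions, not Nutch code: the class TruncatedRecordDemo and the length-prefixed record layout are hypothetical, and it does not establish why the stored content in the segment was short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch.
class TruncatedRecordDemo {
    // Writes one record: a 4-byte length followed by the payload bytes.
    static byte[] writeRecord(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(payload.length);
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    // Reads the record back. readFully throws EOFException if the
    // stream ends before the announced number of bytes is available.
    static byte[] readRecord(byte[] stored) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored))) {
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return payload;
        }
    }

    // True if reading the stored bytes fails with EOFException.
    static boolean failsWithEof(byte[] stored) throws IOException {
        try {
            readRecord(stored);
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = writeRecord("hello".getBytes(StandardCharsets.UTF_8));
        // Drop the last two bytes to simulate a record that was cut short.
        byte[] truncated = Arrays.copyOf(full, full.length - 2);
        System.out.println("complete record fails with EOF:  " + failsWithEof(full));
        System.out.println("truncated record fails with EOF: " + failsWithEof(truncated));
    }
}
```

The same exception would also appear if a reader expects a different record layout than the writer produced, so a short read by itself does not distinguish truncated data from a format mismatch.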


Re: Crawl and parse exceptions

Matt Zytaruk
Just a follow-up: I figured out one of the exceptions from my original
message (Exception in thread "main" java.io.IOException: No input
directories specified in: NutchConf..), so no worries there. But the
others are still issues.



Re: Crawl and parse exceptions

Doug Cutting-2
In reply to this post by Matt Zytaruk
Matt Zytaruk wrote:
> Exception in thread "main" java.io.IOException: Not a file:
> /user/nutch/segments/20060107130328/parse_data/part-00000/data
>        at org.apache.nutch.ipc.Client.call(Client.java:294)

This is an error returned from an RPC call.  There should be more
details about this in a slave log, e.g., a better stack trace, some
context, etc.  What do you see there?

> We also got this for awhile (seems like the mapred/system dir is never
> being created for some reason):
> java.io.IOException: Cannot open filename
> /nutch-data/nutch/tmp/nutch/mapred/system/submit_euiwjv/job.xml
>       at org.apache.nutch.ipc.Client.call(Client.java:294)

Again, it would be interesting to see what happened on the other end of
this RPC call.  Please look in the remote log.

Doug
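
When the remote logs are still available, one way to find the other end of a failed call is to pull out every line that a given task wrote. The sketch below assumes the line shape seen in the task tracker excerpt earlier in this thread (date, time, task id, message); the class TaskLogFilter and its method are hypothetical helpers, not part of Nutch, and the pattern may need adjusting for other log layouts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch.
class TaskLogFilter {
    // Assumed line shape, taken from the excerpt in this thread:
    //   "060109 093759 task_m_ltgpnj <message>"
    private static final Pattern TASK_LINE =
        Pattern.compile("^(\\d{6}) (\\d{6}) (task_[a-z]_\\w+)\\s+(.*)$");

    // Returns the messages logged by the given task id, in file order.
    // Lines without a task id (or from other tasks) are skipped.
    static List<String> messagesFor(String taskId, List<String> lines) {
        List<String> messages = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TASK_LINE.matcher(line);
            if (m.matches() && m.group(3).equals(taskId)) {
                messages.add(m.group(4));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "060109 093759 task_m_ltgpnj  Error running child",
            "060109 093802 task_m_txrnu3 done; removing files.",
            "060109 093805 task_m_ltgpnj done; removing files.",
            "060109 093805 Lost connection to JobTracker");
        for (String message : messagesFor("task_m_ltgpnj", lines)) {
            System.out.println(message);
        }
    }
}
```

Filtering by task id keeps the stack trace of one failing task together even when several tasks write to the same log at once.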

Re: Crawl and parse exceptions

Matt Zytaruk
Unfortunately, the logs have since been overwritten by Nutch, so I can't
check them, but I am pretty sure those actually are the messages from
the task tracker log on the remote machine. If I remember correctly, all
that was shown on the master was a short exception saying the child
failed, or something like that. I wish I could be more help, but as I
said, when the jobtracker/tasktrackers were stopped and started, they
overwrote the log.

-Matt Zytaruk
