java.io.IOException: Task process exit with nonzero status


java.io.IOException: Task process exit with nonzero status

Michael-49
Hello.

I'm testing the mapred branch (revision 290934) on two servers. Everything
works fine except in the fetch process: one of the servers consistently
stops fetching earlier than the second server. The logs look something
like this:

050921 232008 task_m_jh0915 0.06203456% 51356 pages, 9688 errors, 8.7 pages/s, 1493 kb/s,
050921 232009 task_m_jh0915 0.06203456% 51356 pages, 9688 errors, 8.7 pages/s, 1493 kb/s,
[ a lot of repeating lines ]
050921 232012 task_m_jh0915 0.06203456% 51357 pages, 9688 errors, 8.7 pages/s, 1492 kb/s,
050921 232012 Task task_m_jh0915 is done.
050921 232012 Task task_m_jh0915 is done.
050921 232013 Server connection on port 41755 from 127.0.0.1: exiting

When I look at the web interface on port 7845, I see:
<tr>
<td>task_m_jh0915</td><td>1.0</td><td>51357 pages, 9688 errors, 8.7 pages/s, 1492 kb/s, </td>
<td>
java.io.IOException: Task process exit with nonzero status.
        at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:132)
        at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)
</td></tr>
<tr>
<td>task_m_dtya1r</td><td>0.4285658</td><td>391226 pages, 12591 errors, 13.0 pages/s, 2271 kb/s, </td>
<td></td></tr>

Can anyone point out what I'm doing wrong?

Michael


Links in a segment

Richard Rodrigues
Hello,

I am developing a search engine for internet forums using Nutch.
I would like to create a page with the most-linked pages in the last crawl.

I would like to know if there is a way to get all outgoing links in a
segment, or all the outgoing links in the db (with a date condition).

Thanks in advance for any suggestions,

Best Regards,

Richard Rodrigues
www.Kelforum.com


Re: Links in a segment

Michael Ji
The simplest way is to use bin/nutch admin.. to dump the
webdb; from the dumped text file of null.link, you can
pick the outlinks for a particular URL (or MD5).

Michael Ji,


Re: Links in a segment

Richard Rodrigues-2

Thank you for your help.
bin/nutch admin could be useful, but I need something based on crawling
date.

I checked the documentation again and I think I will use this command:
bin/nutch segread segments/20050922091545 -dump | grep outlink

This way, I will be able to generate reports based on the dates of the
crawls.

Richard Rodrigues
www.Kelforum.com



How can I recover an aborted fetch process

Gal Nitzan
In reply to this post by Michael Ji
Hi,

In the FAQ there is the following answer, and I really do not understand
it, so I'm sure it is a good candidate for revision :-) .

The answer is as follows:

 >>>>You'll need to touch the file fetcher.done in the segment
directory.<<<<

When a fetch is aborted there is no such file as fetcher.done, at least
not on my system.

 >>>> All the pages that were not crawled will be re-generated for fetch
pretty soon. <<<<

How? (Probably by calling generate?) What will re-generate them?

 >>>> If you fetched lots of pages, and don't want to have to re-fetch
them again, this is the best way.<<<<

Please feel free to elaborate....

Regards,

Gal

Re: How can I recover an aborted fetch process

em-13
You cannot resume a failed fetch.
You can either 1) restart it, or 2) use whatever has been fetched so far.

To perform option 2 you'll need to create 'fetcher.done' in the segment
directory. To do this, simply:

    # cd <your segment directory>
    # touch fetcher.done

The 'touch' command will create the file (size 0 bytes).

Once that's done, run updatedb.
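For anyone driving this step from Java rather than the shell, the
zero-byte marker can be created like this (a sketch; the segment path is
a made-up example, substitute your real one):

```java
import java.io.File;
import java.io.IOException;

public class TouchFetcherDone {
    public static void main(String[] args) throws IOException {
        // Hypothetical segment directory -- substitute your real one.
        File segment = new File("segments/20050922091545");
        segment.mkdirs();                 // only needed for this sketch
        File marker = new File(segment, "fetcher.done");
        marker.createNewFile();           // same effect as `touch fetcher.done`
        // A fresh marker has length 0, just like the shell version.
        System.out.println(marker.getPath() + " length=" + marker.length());
    }
}
```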





Re: How can I recover an aborted fetch process

Gal Nitzan
Thanks EM...

Re: java.io.IOException: Task process exit with nonzero status

Doug Cutting-2
In reply to this post by Michael-49
Michael wrote:
> java.io.IOException: Task process exit with nonzero status.
>         at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:132)
>         at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)

What do you see in that tasktracker's log?  Hopefully there is a more
informative error message there.

What version of the mapred branch are you running?  I fixed a bug a week
and a half ago that could cause this.  There was a filehandle leak that
resulted in this error after a tasktracker had run more than around 800
tasks.  If you have not updated your code recently, please try that.

Doug

Re[2]: java.io.IOException: Task process exit with nonzero status

Michael-49
DC> What version of the mapred branch are you running?  I fixed a bug a week
DC> and a half ago that could cause this.  There was a filehandle leak that
DC> resulted in this error after a tasktracker had run more than around 800
DC> tasks.  If you have not updated your code recently, please try that.

It seems that the new version fixed this problem; I haven't seen this
error anymore. But a new problem arose during the indexing process (I'm
using mapred revision 291801):

I'm trying to index via "./nutch index"; the segments were created by a
slightly modified version of the crawl.Crawl class. With 1-2 segments
everything works fine, but with about 20 segments the task tracker logs
on both servers show a repeating error block:

050926 180831 task_r_o4tt4z Got 1 map output locations.
050926 180831 Client connection to 127.0.0.1:60218: starting
050926 180831 Server connection on port 60218 from 127.0.0.1: starting
050926 180831 Client connection to 127.0.0.1:60218 caught: java.lang.IndexOutOfBoundsException
java.lang.IndexOutOfBoundsException
        at java.io.DataInputStream.readFully(DataInputStream.java:263)
        at org.apache.nutch.mapred.MapOutputFile.readFields(MapOutputFile.java:123)
        at org.apache.nutch.io.ObjectWritable.readObject(ObjectWritable.java:232)
        at org.apache.nutch.io.ObjectWritable.readFields(ObjectWritable.java:60)
        at org.apache.nutch.ipc.Client$Connection.run(Client.java:163)
050926 180831 Client connection to 127.0.0.1:60218: closing
050926 180831 Server handler on 60218 caught: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:106)
        at java.io.DataOutputStream.write(DataOutputStream.java:85)
        at org.apache.nutch.mapred.MapOutputFile.write(MapOutputFile.java:98)
        at org.apache.nutch.io.ObjectWritable.writeObject(ObjectWritable.java:117)
        at org.apache.nutch.io.ObjectWritable.write(ObjectWritable.java:64)
        at org.apache.nutch.ipc.Server$Handler.run(Server.java:213)
050926 180831 Server connection on port 60218 from 127.0.0.1: exiting
050926 180931 task_r_o4tt4z copy failed: task_m_ypindn from goku1.deeptown.net/127.0.0.1:60218
java.io.IOException: timed out waiting for response
        at org.apache.nutch.ipc.Client.call(Client.java:296)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy2.getFile(Unknown Source)
        at org.apache.nutch.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:94)
        at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:61)






Michael


Re[3]: java.io.IOException: Task process exit with nonzero status

Michael-49
I think I found the problem. At MapOutputFile.java:123:

bytesToRead = Math.min((int) unread, buffer.length);

If unread is greater than 2^31, bytesToRead will be negative.
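A minimal, self-contained demonstration of the narrowing cast (the
values below are illustrative, not taken from a real run):

```java
public class NarrowingCastDemo {
    public static void main(String[] args) {
        long unread = (1L << 31) + 10;  // just past 2^31
        int bufferLength = 65536;       // stand-in for buffer.length

        // The (int) cast wraps the long *before* Math.min compares anything:
        int bytesToRead = Math.min((int) unread, bufferLength);
        System.out.println(bytesToRead);  // -2147483638
    }
}
```

DataInputStream.readFully then receives that negative length, which is
what surfaces as the IndexOutOfBoundsException in the trace above.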

Michael


Re: java.io.IOException: Task process exit with nonzero status

Doug Cutting-2
Michael wrote:
> I think i found the problem:
> At MapOutputFile.java:123
> bytesToRead = Math.min((int) unread, buffer.length);
>
> if unread is greater than 2^31, bytesToRead will be negative.

So the fix is to change this to:

bytesToRead = (int)Math.min(unread, buffer.length);

Right?  Does this fix things for you?  If so, I'll commit it.

Thanks,

Doug

Re[2]: java.io.IOException: Task process exit with nonzero status

Michael-49
I'm not sure how Java handles implicit type casting (I'm from the C
world), and I found that Java doesn't support unsigned int, so I did it
like this:

    int bytesToRead = buffer.length;
    if (((int) unread) > 0) {
        bytesToRead = Math.min((int) unread, buffer.length);
    }

Yes, this helped me, though I don't understand why others haven't
experienced this problem.
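For what it's worth, the guarded version above and the single-cast fix
Doug suggested behave the same on the overflowing input; a quick check
(illustrative values, not the real buffer size):

```java
public class FixComparison {
    public static void main(String[] args) {
        long unread = (1L << 31) + 10;   // > 2^31: triggers the original bug
        int bufferLength = 65536;

        // Doug's version: compare in long arithmetic, then narrow the result.
        // The result is at most bufferLength, so it always fits in an int.
        int fixed = (int) Math.min(unread, (long) bufferLength);

        // The guarded version: fall back to buffer.length whenever the
        // narrowed value is not a usable positive count.
        int guarded = bufferLength;
        if ((int) unread > 0) {
            guarded = Math.min((int) unread, bufferLength);
        }

        System.out.println(fixed + " " + guarded);  // 65536 65536
    }
}
```

One subtle difference: if unread wraps all the way past 2^32 (so that
(int) unread comes out as a small positive number), the guard picks a
smaller-than-necessary read size -- still safe, just slower -- while the
long-arithmetic version avoids that edge case entirely.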

Michael


java.io.IOException: Cannot create file (in reduce task)

Gal Nitzan
In reply to this post by Doug Cutting-2
Hello,

I'm testing mapred on one machine only.

Everything worked fine from the start until I got the exception in the
reduce task:

Diagnostic Text

java.io.IOException: Cannot create file
/user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy1.create(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:574)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.<init>(NDFSClient.java:549)
        at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:83)
        at org.apache.nutch.fs.NDFSFileSystem.create(NDFSFileSystem.java:76)
        at org.apache.nutch.fs.NDFSFileSystem.create(NDFSFileSystem.java:71)
        at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:94)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:108)
        at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:76)
        at org.apache.nutch.crawl.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:48)
        at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:245)
        at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:580)

In the jobtracker log:

050928 155253 Server connection on port 8011 from 127.0.0.1: exiting
050928 160814 Server connection on port 8011 from 127.0.0.1: starting
050928 160814 parsing file:/mapred/conf/nutch-default.xml
050928 160814 parsing file:/mapred/conf/mapred-default.xml
050928 160814 parsing /nutch/mapred/local/job_s4isvd.xml
050928 160814 parsing file:/mapred/conf/nutch-site.xml
050928 160814 parsing file:/mapred/conf/nutch-default.xml
050928 160815 parsing file:/mapred/conf/mapred-default.xml
050928 160815 parsing /nutch/mapred/local/job_s4isvd.xml
050928 160815 parsing file:/mapred/conf/nutch-site.xml
050928 160815 Adding task 'task_m_ax7n90' to set for tracker 'tracker_41883'
050928 160821 Task 'task_m_ax7n90' has finished successfully.
050928 160821 Adding task 'task_m_vl2bge' to set for tracker 'tracker_41883'
050928 160827 Task 'task_m_vl2bge' has finished successfully.
050928 160827 Adding task 'task_m_i54kht' to set for tracker 'tracker_41883'
050928 160830 Task 'task_m_i54kht' has finished successfully.
050928 160830 Adding task 'task_m_1eymym' to set for tracker 'tracker_41883'
050928 160833 Task 'task_m_1eymym' has finished successfully.
050928 160833 Adding task 'task_r_w9azpi' to set for tracker 'tracker_41883'
050928 160839 Task 'task_r_w9azpi' has finished successfully.
050928 160839 Server connection on port 8011 from 127.0.0.1: exiting
050928 171406 Task 'task_m_klo24y' has finished successfully.
050928 171406 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171434 Task 'task_r_x48xa3' has been lost.
050928 171434 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171501 Task 'task_r_x48xa3' has been lost.
050928 171501 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171520 Task 'task_r_x48xa3' has been lost.
050928 171520 Adding task 'task_r_x48xa3' to set for tracker 'tracker_41883'
050928 171551 Task 'task_r_x48xa3' has been lost.
050928 171551 Task task_r_x48xa3 has failed 4 times.  Aborting owning
job job_mtzp7h
050928 171552 Server connection on port 8011 from 127.0.0.1: exiting

In namenode log

050928 171547 Server handler on 8009 call error: java.io.IOException:
Cannot create file
/user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
java.io.IOException: Cannot create file
/user/root/crawl-20050927142856/segments/20050928075732/crawl_fetch/part-00000/data
        at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:324)
        at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
        at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)

In fetch log

050928 171526  reduce 47%
050928 171538  reduce 50%
050928 171551  reduce 100%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:309)
        at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
        at org.apache.nutch.crawl.Fetcher.main(Fetcher.java:364)

Any idea, anyone?

Thanks, Gal


Re: java.io.IOException: Cannot create file (in reduce task)

Doug Cutting-2
Thanks for the detailed report.  This is a bug.  The problem is that the
default is not to permit files to be overwritten, but when a reduce
task re-executes (because something failed) it needs to overwrite data.
My guess is that the cause of the initial failure might have been the
same: that this was not your first attempt to fetch this segment, and
you were overwriting the last attempt.  Is that right, or did something
else first cause the reduce task to fail?

I think the fix is to change the filesystem code (local and NDFS) so
that overwriting is permitted by default.  With MapReduce, tasks may be
re-executed, so overwriting is normal.  Application code should add
error checking code at the start to check that output files do not
already exist if we wish to prevent unintentional overwriting.

If there are no objections, I will make this change in the mapred branch.

Doug
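The error checking described above amounts to a guard like the following
(a simplified sketch using plain java.io streams, not the actual
NutchFileSystem API; the file name is hypothetical):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CreateGuardSketch {
    // Overwriting only happens when the caller explicitly allows it;
    // a re-executed reduce task would pass overwrite=true.
    static OutputStream create(File f, boolean overwrite) throws IOException {
        if (f.exists() && !overwrite) {
            throw new IOException("File already exists:" + f);
        }
        return new FileOutputStream(f);
    }

    public static void main(String[] args) throws IOException {
        File out = new File("part-00000.data");  // hypothetical output file
        out.delete();                            // make the sketch repeatable

        try (OutputStream os = create(out, false)) {  // first attempt: ok
            os.write(1);
        }
        try {
            create(out, false);          // retry without overwrite: refused
        } catch (IOException e) {
            System.out.println("refused: " + e.getMessage());
        }
        try (OutputStream os = create(out, true)) {   // re-execution: allowed
            os.write(2);
        }
    }
}
```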


Re: java.io.IOException: Cannot create file (in reduce task)

Gal Nitzan
I believe you are right. When I checked, I noticed the file exists.

However, I ran the fetcher only once on that segment.

Gal

Re: java.io.IOException: Cannot create file (in reduce task)

Doug Cutting-2
Gal Nitzan wrote:
> I believe you are right. When I checked the file I noticed it exists.
>
> However, I run the fetcher only once on that segment.

Please try the attached patch and tell me if it fixes this for you.

Doug

Index: src/java/org/apache/nutch/ndfs/NDFSClient.java
===================================================================
--- src/java/org/apache/nutch/ndfs/NDFSClient.java (revision 292500)
+++ src/java/org/apache/nutch/ndfs/NDFSClient.java (working copy)
@@ -71,14 +71,6 @@
         return new NDFSInputStream(src.toString());
     }
 
-    /**
-     * Create an output stream that writes to all the right places.
-     * Basically creates instance of inner subclass of OutputStream
-     * that handles datanode/namenode negotiation.
-     */
-    public NFSOutputStream create(UTF8 src) throws IOException {
-        return create(src, false);
-    }
     public NFSOutputStream create(UTF8 src, boolean overwrite) throws IOException {
         return new NDFSOutputStream(src, overwrite);
     }
Index: src/java/org/apache/nutch/fs/LocalFileSystem.java
===================================================================
--- src/java/org/apache/nutch/fs/LocalFileSystem.java (revision 292500)
+++ src/java/org/apache/nutch/fs/LocalFileSystem.java (working copy)
@@ -95,13 +95,6 @@
         return new LocalNFSFileInputStream(f);
     }
 
-    /**
-     * Create the file at f.
-     */
-    public NFSOutputStream create(File f) throws IOException {
-        return create(f, false);
-    }
-
     /*********************************************************
      * For create()'s NFSOutputStream.
      *********************************************************/
@@ -128,8 +121,6 @@
       public void write(int b) throws IOException { fos.write(b); }
     }
 
-    /**
-     */
     public NFSOutputStream create(File f, boolean overwrite) throws IOException {
         if (f.exists() && ! overwrite) {
             throw new IOException("File already exists:"+f);
Index: src/java/org/apache/nutch/fs/NutchFileSystem.java
===================================================================
--- src/java/org/apache/nutch/fs/NutchFileSystem.java (revision 292500)
+++ src/java/org/apache/nutch/fs/NutchFileSystem.java (working copy)
@@ -122,10 +122,18 @@
     public abstract NFSInputStream open(File f) throws IOException;
 
     /**
-     * Opens an OutputStream at the indicated File, whether local
-     * or via NDFS.
+     * Opens an OutputStream at the indicated File.
+     * Files are overwritten by default.
      */
-    public abstract NFSOutputStream create(File f) throws IOException;
+    public NFSOutputStream create(File f) throws IOException {
+        return create(f, true);
+    }
+
+    /** Opens an OutputStream at the indicated File.
+     * @param f the file name to open
+     * @param overwrite if a file with this name already exists, then if true,
+     *   the file will be overwritten, and if false an error will be thrown.
+     */
     public abstract NFSOutputStream create(File f, boolean overwrite) throws IOException;
 
     /**
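For reference, the create() semantics this patch introduces can be reproduced in isolation. The small demo below mirrors the LocalFileSystem hunk (exists-check plus overwrite flag); the class and file names are illustrative, not Nutch code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Sketch of the failure mode the patch addresses: create() refusing to
// overwrite means a re-run task dies on the file its first attempt left
// behind. The create() method mirrors LocalFileSystem.create(File, boolean).
public class CreateDemo {
    static FileOutputStream create(File f, boolean overwrite) throws IOException {
        if (f.exists() && !overwrite) {
            throw new IOException("File already exists:" + f);
        }
        return new FileOutputStream(f);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("part-00000", ".data"); // file already exists
        boolean failed = false;
        try {
            create(f, false).close();   // overwrite=false: refuses, like the bug report
        } catch (IOException e) {
            failed = true;
        }
        System.out.println(failed);     // true
        create(f, true).close();        // overwrite=true: the retry succeeds
        f.delete();
    }
}
```

With overwrite defaulting to true in NutchFileSystem.create(File), a re-executed reduce task can recreate its output file instead of aborting.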

Re: java.io.IOException: Task process exit with nonzero status

Doug Cutting-2
In reply to this post by Michael-49
Michael wrote:
> I'm not sure how Java handles automatic type casting (I'm from the C
> world), and I found that Java doesn't support unsigned int, so I did it
> like this:
>
>      int bytesToRead=buffer.length;
>      if(((int)unread)>0)
>      {
>             bytesToRead = Math.min((int) unread, buffer.length);
>      }

I think the patch I proposed is simpler and correct.  I committed it.

> Yes, this helped me, though i don't understand why others haven't
> experienced such problem.

I think the reason that I have not seen it is that I usually run
hundreds of map tasks, and the output of a single map task has never
been greater than 2GB.

Thanks for catching this!

Doug
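The wrap-around behind Michael's workaround is easy to reproduce: casting a long larger than Integer.MAX_VALUE to int yields a negative value, so a plain (int) cast before Math.min picks the wrong bound. A minimal sketch, with illustrative values only:

```java
// Demonstrates why (int) unread misbehaves once a map output passes 2GB.
public class OverflowDemo {
    public static void main(String[] args) {
        long unread = 3L * 1024 * 1024 * 1024; // ~3 GB, past Integer.MAX_VALUE
        // Narrowing a long beyond the int range wraps around to a negative
        // number, so a check like ((int) unread) > 0 silently fails.
        int truncated = (int) unread;
        System.out.println(truncated < 0);                // true
        // Taking the min in long arithmetic first stays within int range:
        int bytesToRead = (int) Math.min(unread, 4096L);
        System.out.println(bytesToRead);                  // 4096
    }
}
```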

New plugin

Gal Nitzan
In reply to this post by Gal Nitzan
Hi,

I have written a small new plugin based on the URLFilter
interface: urlfilter-db.

The purpose of this plugin is to filter by domain, i.e. I would like to
crawl the world but fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS)
and on the back-end a database.

for each url
    filter(url) is called
end for

filter(url)
    get the domain name from the url
    look the domain up in the cache
    if not in the cache, try the database
    if found in the database, cache it and return the url
    otherwise return null
end filter
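The filter() steps above can be sketched roughly as follows; a HashMap stands in for SwarmCache and a HashSet for the JDBC-backed table, and all names are illustrative rather than the plugin's actual API:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the domain-filter logic: cache first, then the
// "database", and only known-good domains get cached.
public class DomainFilterSketch {
    private final Map<String, Boolean> cache = new HashMap<>();
    private final Set<String> allowedDomains; // stands in for the JDBC lookup

    public DomainFilterSketch(Set<String> allowedDomains) {
        this.allowedDomains = allowedDomains;
    }

    /** Returns the url if its domain is allowed, null otherwise. */
    public String filter(String url) {
        String domain;
        try {
            domain = new URL(url).getHost();    // get the domain name from the url
        } catch (MalformedURLException e) {
            return null;                        // unparseable url: filter it out
        }
        if (cache.get(domain) != null) {        // cache hit: domain is allowed
            return url;
        }
        if (allowedDomains.contains(domain)) {  // "database" hit: cache and accept
            cache.put(domain, Boolean.TRUE);
            return url;
        }
        return null;                            // unknown domain: filter it out
    }

    public static void main(String[] args) {
        Set<String> domains = new HashSet<>();
        domains.add("www.example.com");
        DomainFilterSketch f = new DomainFilterSketch(domains);
        System.out.println(f.filter("http://www.example.com/page")); // prints the url
        System.out.println(f.filter("http://other.org/page"));       // prints null
    }
}
```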


The plugin reads the cache size, jdbc driver, connection string, table
to use and domain field from nutch-site.xml

Since I do not have the means to add it to svn, if someone
is interested let me know and I can mail it.

Regards,

Gal

Re: New plugin

John X
Hi, Gal,

Yes, I am interested. You can post the tarball to
http://issues.apache.org/jira/browse/Nutch

Thanks,

John

On Thu, Sep 29, 2005 at 09:53:42PM +0200, Gal Nitzan wrote:

> [...]
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!

Re: New plugin

Gal Nitzan
John X wrote:

> Hi, Gal,
>
> Yes, I am interested. You can post the tarball to
> http://issues.apache.org/jira/browse/Nutch
>
> Thanks,
>
> John
>
> On Thu, Sep 29, 2005 at 09:53:42PM +0200, Gal Nitzan wrote:
>> [...]

Done. Enjoy: http://issues.apache.org/jira/browse/NUTCH-100

Regards, Gal