nutch 0.9, mergesegs error


nutch 0.9, mergesegs error

John Mendenhall
I am running nutch 0.9.
I have run nutch mergesegs many times before.
The last couple of times I have run it, I get the
following errors:

-----
Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906
SegmentMerger:   adding /var/nutch/crawl/segments/20080128132506
SegmentMerger:   adding /var/nutch/crawl/segments/20080129200011
SegmentMerger:   adding /var/nutch/crawl/segments/20080130000011
SegmentMerger:   adding /var/nutch/crawl/segments/20080130040010
SegmentMerger:   adding /var/nutch/crawl/segments/20080130080011
SegmentMerger:   adding /var/nutch/crawl/segments/20080130120010
SegmentMerger:   adding /var/nutch/crawl/segments/20080130155010
SegmentMerger:   adding /var/nutch/crawl/segments/20080130193010
SegmentMerger:   adding /var/nutch/crawl/segments/20080130231010
SegmentMerger:   adding /var/nutch/crawl/segments/20080131030010
SegmentMerger:   adding /var/nutch/crawl/segments/20080131070010
SegmentMerger:   adding /var/nutch/crawl/segments/20080131110011
SegmentMerger:   adding /var/nutch/crawl/segments/20080131150010
SegmentMerger:   adding /var/nutch/crawl/segments/20080131190011
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
task_0001_m_000075_0: Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
task_0001_m_000075_0:   at org.apache.hadoop.ipc.Client.call(Client.java:473)
task_0001_m_000075_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
task_0001_m_000075_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
task_0001_m_000075_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
task_0001_m_000080_0: Exception in thread "main" java.net.SocketException: Socket closed
task_0001_m_000080_0:   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
task_0001_m_000080_0:   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:189)
task_0001_m_000080_0:   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
task_0001_m_000080_0:   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
task_0001_m_000080_0:   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:324)
task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client.call(Client.java:461)
task_0001_m_000080_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
task_0001_m_000080_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
task_0001_m_000080_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
task_0001_m_000072_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Client).
task_0001_m_000072_1: log4j:WARN Please initialize the log4j system properly.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:590)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:638)
-----

nutch mergesegs returns with a status code of 1.
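
For reference, the invocation is along these lines (a sketch assuming
the standard Nutch 0.9 mergesegs usage of an output directory plus
-dir; not my exact command line):

-----
bin/nutch mergesegs /var/nutch/crawl/mergesegs_dir -dir /var/nutch/crawl/segments
-----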

I have tried to work out why the log4j warning is happening.
All other runs seem fine, and log4j seems to be set up
correctly everywhere else it is needed.

Where do I need to look to find out why nutch mergesegs is
crashing?
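
As far as I understand the Hadoop layout (an assumption about the
default setup, not something I have confirmed in the docs), the
stdout/stderr of those task_0001_m_* children should end up on each
tasktracker node under the Hadoop log directory, e.g.:

-----
# on each tasktracker node; paths assume the default hadoop.log.dir
ls logs/userlogs/task_0001_m_000075_0/
cat logs/userlogs/task_0001_m_000075_0/stderr
-----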

Why is log4j not finding the log4j.properties file?
The nutch script in nutch/bin already adds the conf
dir to the classpath.
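
My working assumption about the warning: the conf dir on the client
classpath does not automatically reach the child JVMs that the
tasktracker spawns, since those build their classpath from the
tasktracker and the job jar. If so, a log4j.properties has to be
visible to the children as well. A minimal sketch of such a file:

-----
# minimal log4j.properties sketch; assumes it sits somewhere the task
# child JVMs can see it (e.g. each node's conf dir or the job jar)
log4j.rootLogger=INFO,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
-----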

Thanks in advance for any assistance you can provide.

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services

Re: nutch 0.9, mergesegs error

John Mendenhall
On Tue, 05 Feb 2008, John Mendenhall wrote:

> I am running nutch 0.9.
> I have run nutch mergesegs many times before.
> The last couple of times I have run it, I get the
> following errors:
>
> -----
> Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906
> SegmentMerger:   adding /var/nutch/crawl/segments/20080128132506
> SegmentMerger:   adding /var/nutch/crawl/segments/20080129200011
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130000011
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130040010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130080011
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130120010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130155010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130193010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080130231010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080131030010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080131070010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080131110011
> SegmentMerger:   adding /var/nutch/crawl/segments/20080131150010
> SegmentMerger:   adding /var/nutch/crawl/segments/20080131190011
> SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> task_0001_m_000075_0: Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
> task_0001_m_000075_0:   at org.apache.hadoop.ipc.Client.call(Client.java:473)
> task_0001_m_000075_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> task_0001_m_000075_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
> task_0001_m_000075_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
> task_0001_m_000080_0: Exception in thread "main" java.net.SocketException: Socket closed
> task_0001_m_000080_0:   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
> task_0001_m_000080_0:   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:189)
> task_0001_m_000080_0:   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> task_0001_m_000080_0:   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> task_0001_m_000080_0:   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:324)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client.call(Client.java:461)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> task_0001_m_000080_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
> task_0001_m_000080_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
> task_0001_m_000072_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Client).
> task_0001_m_000072_1: log4j:WARN Please initialize the log4j system properly.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:590)
>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:638)
> -----
>
> nutch mergesegs returns with a status code of 1.
>
> I have tried to work out why the log4j warning is happening.
> All other runs seem fine, and log4j seems to be set up
> correctly everywhere else it is needed.
>
> Where do I need to look to find out why nutch mergesegs is
> crashing?
>
> Why is log4j not finding the log4j.properties file?
> The nutch script in nutch/bin already adds the conf
> dir to the classpath.
>
> Thanks in advance for any assistance you can provide.

Any thoughts on the above issues?
I have not received any responses.

If anyone knows where I should start debugging this,
please let me know.

Thanks!

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services

Re: nutch 0.9, mergesegs error

John Mendenhall
In reply to this post by John Mendenhall
On Tue, 05 Feb 2008, John Mendenhall wrote:

> -----
> Merging 14 segments to /var/nutch/crawl/mergesegs_dir/20080201220906
> SegmentMerger:   adding /var/nutch/crawl/segments/20080128132506
> SegmentMerger:   adding ...
> SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> task_0001_m_000075_0: Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
> task_0001_m_000075_0:   at org.apache.hadoop.ipc.Client.call(Client.java:473)
> task_0001_m_000075_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> task_0001_m_000075_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
> task_0001_m_000075_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
> task_0001_m_000080_0: Exception in thread "main" java.net.SocketException: Socket closed
> task_0001_m_000080_0:   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
> task_0001_m_000080_0:   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:189)
> task_0001_m_000080_0:   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> task_0001_m_000080_0:   at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> task_0001_m_000080_0:   at java.io.DataOutputStream.flush(DataOutputStream.java:106)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:324)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.Client.call(Client.java:461)
> task_0001_m_000080_0:   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
> task_0001_m_000080_0:   at org.apache.hadoop.mapred.$Proxy0.reportDiagnosticInfo(Unknown Source)
> task_0001_m_000080_0:   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1454)
> task_0001_m_000072_1: log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Client).
> task_0001_m_000072_1: log4j:WARN Please initialize the log4j system properly.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:590)
>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:638)
> -----
>
> nutch mergesegs returns with a status code of 1.
>
> I have tried to work out why the log4j warning is happening.
> All other runs seem fine, and log4j seems to be set up
> correctly everywhere else it is needed.
>
> Where do I need to look to find out why nutch mergesegs is
> crashing?
>
> Why is log4j not finding the log4j.properties file?
> The nutch script in nutch/bin already adds the conf
> dir to the classpath.
>
> Thanks in advance for any assistance you can provide.
>
> JohnM

I modified the configuration to use less memory,
and I rebooted all the servers.
Then I reran the indexing, and it worked.

I currently have 3 servers, one of which serves as both
master and slave.  Each has a different amount of memory
available and a different processor type.

What is the rule of thumb for setting the heap size,
and the child-process heap sizes, for each server?
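
For reference, the knobs I believe are involved (names as in the
Hadoop 0.1x line bundled with Nutch 0.9; the values below are
placeholders, not recommendations):

-----
# conf/hadoop-env.sh: heap, in MB, for the Hadoop daemons on this node
export HADOOP_HEAPSIZE=1000

<!-- conf/hadoop-site.xml: heap for each map/reduce child JVM -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
-----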

Thanks!

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services