Distributed fetching only happening on one node?

Distributed fetching only happening on one node?

brainstorm-2-2
Hi,

I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
best-suited network topology for internet crawling (the frontend being a
network bottleneck), but I think it's fine for testing purposes.

I'm having issues with the fetch mapreduce job:

According to ganglia monitoring (network traffic) and the hadoop
administrative interfaces, the fetch phase is only being executed on the
frontend node, where I launched "nutch crawl". The previous nutch phases
were distributed neatly across all nodes:

Jobid                  User    Name                                                     Map %    Maps  Maps done  Reduce %  Reduces  Reduces done
job_200807131223_0001  hadoop  inject urls                                              100.00%  2     2          100.00%   1        1
job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               100.00%  3     3          100.00%   1        1
job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%  3     3          100.00%   1        1
job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%  4     4          100.00%   2        2

I've checked that:

1) Nodes have internet connectivity and the firewall settings are not blocking traffic
2) There's enough space on the local disks
3) The proper processes are running on the nodes

frontend-node:
==========

[root@cluster ~]# jps
29232 NameNode
29489 DataNode
29860 JobTracker
29778 SecondaryNameNode
31122 Crawl
30137 TaskTracker
10989 Jps
1818 TaskTracker$Child

leaf nodes:
========

[root@cluster ~]# cluster-fork jps
compute-0-1:
23929 Jps
15568 TaskTracker
15361 DataNode
compute-0-2:
32272 TaskTracker
32065 DataNode
7197 Jps
2397 TaskTracker$Child
compute-0-3:
12054 DataNode
19584 Jps
14824 TaskTracker$Child
12261 TaskTracker

4) Logs show the fetching process taking place only on the head node:

2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
http://valleycycles.net/
2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
robots.txt for http://www.getting-forward.org/:
java.net.UnknownHostException: www.getting-forward.org
2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
robots.txt for http://www.getting-forward.org/:
java.net.UnknownHostException: www.getting-forward.org

What am I missing? Why are there no fetching instances on the leaf nodes? I
used the following custom script to launch a pristine crawl each time:

#!/bin/sh

# 1) Leaves HDFS safe mode
# 2) Removes the previous crawled content and seed url list from HDFS
# 3) Clears old logs and uploads a fresh seed url list
# 4) Performs a clean crawl

#export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JAVA_HOME=/usr/java/jdk1.5.0_10

# Fall back to defaults when no positional arguments are given
CRAWL_DIR=${1:-crawl-ecxi}
URL_DIR=${2:-urls}

echo $CRAWL_DIR
echo $URL_DIR

echo "Leaving safe mode..."
./hadoop dfsadmin -safemode leave

echo "Removing seed urls directory and previous crawled content..."
./hadoop dfs -rmr $URL_DIR
./hadoop dfs -rmr $CRAWL_DIR

echo "Removing past logs"

rm -rf ../logs/*

echo "Uploading seed urls..."
./hadoop dfs -put ../$URL_DIR $URL_DIR

#echo "Entering safe mode..."
#./hadoop dfsadmin -safemode enter

echo "******************"
echo "* STARTING CRAWL *"
echo "******************"

./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
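
For reference, a usage sketch (it assumes the script above is saved as clean-crawl.sh in nutch's bin/ directory, next to the nutch and hadoop wrappers it calls; the script name and location are only examples):

cd nutch/bin

# use the defaults: crawl dir crawl-ecxi, seed dir urls (read from ../urls)
./clean-crawl.sh

# or override both positionally; the seed dir is expected one level above bin/
./clean-crawl.sh crawl-test urls-test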


The next step I'm considering to fix the problem is to install
nutch+hadoop as described in this past nutch-user mail:

http://www.mail-archive.com/nutch-user@.../msg10225.html

As I don't know whether that is still current practice on trunk (the archived
mail is from Wed, 02 Jan 2008), I wanted to ask if there's another way to fix
it or if someone is already working on it... I haven't found a matching
bug on JIRA :_/

Re: Distributed fetching only happening on one node?

brainstorm-2-2
Boiling the problem down, I'm stuck on this:

2008-07-14 16:43:24,976 WARN  dfs.DataNode -
192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
192.168.0.252:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
        at java.lang.Thread.run(Thread.java:595)

I've checked that the firewall settings between the nodes and the frontend
were not blocking packets, and they aren't... does anyone know why this
happens? If not, could you suggest a convenient way to debug it?

Thanks!


RE: Distributed fetching only happening on one node?

Patrick Markiewicz
Hi brain,
        If I were you, I would download wireshark
(http://www.wireshark.org/download.html) to see what is happening at the
network layer and see if that provides any clues.  A socket exception
that you don't expect is usually due to one side of the conversation not
understanding the other side.  If you have 4 machines, then you have 4
possible places where default firewall rules could be causing an issue.
If it is not the firewall rules, the NAT rules could be a potential
source of error.  Also, even a router hardware error could cause a
problem.
        If you understand TCP, just make sure that you see all the
correct TCP stuff happening in wireshark.  If you don't understand
wireshark's display, let me know, and I'll pass on some quickstart
information.
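
For example, something along these lines on the frontend would narrow it down to the DataNode traffic (a sketch; it assumes the default DataNode transfer port 50010 and that eth0 is the cluster-facing interface):

# show only DataNode transfer traffic that carries a TCP reset
tcpdump -i eth0 -nn 'tcp port 50010 and (tcp[tcpflags] & tcp-rst != 0)'

# or capture everything on that port to a file and open it in wireshark later,
# using the display filter:  tcp.port == 50010 && tcp.flags.reset == 1
tcpdump -i eth0 -nn -w dfs-50010.pcap 'tcp port 50010'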

        If you already know all of this, I don't have any way to help
you, as it looks like you're trying to accomplish something trickier
with nutch than I have ever attempted.

Patrick


Re: Distributed fetching only happening on one node?

brainstorm-2-2
Yep, I know about wireshark, and I wanted to avoid it for debugging this
issue (in case there was a simple solution or a known bug)...

I just launched wireshark on the frontend with the filter tcp.port == 50010,
and now I'm diving into the tcp stream... let's see if I see the light
(an RST flag somewhere?). Thanks anyway for replying ;)

Just for the record, the phase that stalls is the fetcher, during the reduce:

Jobid                  User    Name                                      Map %    Maps  Maps done  Reduce %  Reduces  Reduces done
job_200807151723_0005  hadoop  fetch crawl-ecxi/segments/20080715172458  100.00%  2     2          16.66%    1        0

It's stuck on 16%, no traffic, no crawling, but still "running".
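
A quick way to see whether the reduce tasks are doing anything at all is to tail the tasktracker logs on every node (a sketch using Rocks' cluster-fork; the log location is an assumption, so adjust HADOOP_HOME to the actual install path):

# the same install path on every node is assumed
export HADOOP_HOME=/path/to/hadoop
cluster-fork "tail -n 20 $HADOOP_HOME/logs/*tasktracker*.log"
# per-task stdout/stderr usually ends up under $HADOOP_HOME/logs/userlogs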


Re: Distributed fetching only happening on one node?

brainstorm-2-2
While I was looking at the DFS wireshark trace (and the corresponding RSTs),
the crawl continued to the next step... it seems that this WARNING is
actually slowing down the whole crawling process: it took 36 minutes to
complete the previous fetch with just a 3-url seed file :-!!!

I just posted a couple of exceptions/questions regarding DFS to the hadoop
core mailing list.

PS: As a side note, the following error caught my attention:

Fetcher: starting
Fetcher: segment: crawl-ecxi/segments/20080715172458
Too many fetch-failures
task_200807151723_0005_m_000000_0: Fetcher: threads: 10
task_200807151723_0005_m_000000_0: fetching http://upc.es/
task_200807151723_0005_m_000000_0: fetching http://upc.edu/
task_200807151723_0005_m_000000_0: fetching http://upc.cat/
task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
with: org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException: upc.cat

Unknown host?! Just try "http://upc.cat" in your browser; it *does*
exist, it just gets redirected to www.upc.cat :-/
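
One quick check is whether the name resolves at all from the compute nodes, which sit behind the frontend's NAT (a sketch using Rocks' cluster-fork; getent is assumed to be available on the nodes):

# resolve on the frontend
getent hosts upc.cat www.upc.cat

# resolve on every compute node
cluster-fork "getent hosts upc.cat www.upc.cat"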


Re: Distributed fetching only happening on one node?

brainstorm-2-2
Ok, the DFS warnings problem is solved; it seems the hadoop-0.17.1 patch
fixes the warnings... BUT, on a 7-node nutch cluster:

1) Fetching is only happening on *one* node despite several values
tested for these settings:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
export HADOOP_HEAPSIZE

I've played with the mapreduce settings (hadoop-site.xml) as advised at:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

But nutch keeps crawling using only one node instead of seven... does
anybody know why?

I've had a look at the code, searching for:

conf.setNumMapTasks(int num)

but found no calls to it, so I guess the number of mappers & reducers is
not limited programmatically.

2) Even on a single node, the fetching is really slow: 1 url or page
per second, at most.

Can anybody shed some light on this? Pointing out which class/code I
should look into to modify this behaviour would also help.

Does anybody have a distributed nutch crawling cluster working with all
the nodes fetching during the fetch phase?

I even ran some numbers with the wordcount example, using the 7 nodes at
100% cpu usage and a 425MB parsed-text file (a rough invocation sketch
follows the table):

maps  reduces  heapsize  time
2     2        500       3m43.049s
4     4        500       4m41.846s
8     8        500       4m29.344s
16    16       500       3m43.672s
32    32       500       3m41.367s
64    64       500       4m27.275s
128   128      500       4m35.233s
256   256      500       3m41.916s

2     2        2000      4m31.434s
4     4        2000
8     8        2000
16    16       2000      4m32.213s
32    32       2000
64    64       2000
128   128      2000
256   256      2000      4m38.310s
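
Something along these lines reproduces one row of the table (a sketch; the examples jar path/name and the -m/-r switches for the map and reduce counts are assumptions for this hadoop line, and HADOOP_HEAPSIZE is the value set in conf/hadoop-env.sh):

# time one wordcount run with 16 maps and 16 reduces
# (input/ holds the 425MB parsed-text file already uploaded to HDFS)
time ./hadoop jar ../hadoop-0.17.1-examples.jar wordcount \
    -m 16 -r 16 input wordcount-out-16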

Thanks in advance,
Roman


Re: Distributed fetching only happening on one node?

Alexander Aristov
Hi

1. You should set the
mapred.map.tasks
and
mapred.reduce.tasks
parameters. They are set to 2 and 1 by default.

2. You can specify the number of threads used for fetching. There is also a
parameter that slows down fetching from a single host, the so-called polite
fetching, so that you don't DOS the site. A sketch of both is below.
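
For reference, a sketch of what those entries could look like, in the same property format as hadoop-site.xml / nutch-site.xml (the values are only illustrative, not recommendations; the polite per-host delay itself is fetcher.server.delay):

<!-- hadoop-site.xml: default map/reduce task counts per job (illustrative values) -->
<property>
  <name>mapred.map.tasks</name>
  <value>14</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>

<!-- nutch configuration: number of fetcher threads (illustrative value) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>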

So check your configuration.

Alex


--
Best Regards
Alexander Aristov

Re: Distributed fetching only happening on one node?

brainstorm-2-2
Right, I had already checked it with mapred.map.tasks set to 2 and
mapred.reduce.tasks set to 1.

I've also played with several values for the following settings:

<property>
  <name>fetcher.server.delay</name>
  <value>1.5</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

Only one node executes the fetch phase anyway :_(

Thanks for the hint, though... any more ideas?

On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
<[hidden email]> wrote:

> Hi
>
> 1. You should have set
> mapred.map.tasks
> and
> mapred.reduce.tasks parameters They are set to 2 and 1 by default.
>
> 2. You can specify number of threads to perform fetching. Also there is a
> parameter that slows down fetching from one URL,so called polite fetching to
> not DOS the site.
>
> So check you configuration.
>
> Alex
>

Re: Distributed fetching only happening in one node ?

brainstorm-2-2
Correction: only 2 nodes are doing the map operation on fetch (nodes 7 and 2).


Re: Distributed fetching only happening in one node ?

Alexander Aristov
Still not clear.

What values do you have for mapred.map.tasks and mapred.reduce.tasks now?
Also check the hadoop-site.xml file, as it may affect your configuration.
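
If you are not sure which values actually take effect, you can dump them with a
few lines against the plain Hadoop Configuration API (a quick sketch; the class
name is made up and it assumes hadoop-site.xml is on the classpath):

// Illustrative sketch only: print the effective map/reduce task settings.
import org.apache.hadoop.conf.Configuration;

public class DumpTaskSettings {
  public static void main(String[] args) {
    // Loads the default resources; hadoop-site.xml is picked up from the classpath.
    Configuration conf = new Configuration();
    System.out.println("mapred.map.tasks    = " + conf.get("mapred.map.tasks"));
    System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
  }
}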

Alexander




--
Best Regards
Alexander Aristov

Re: Distributed fetching only happening in one node ?

brainstorm-2-2
Sure, I tried mapred.map.tasks and mapred.reduce.tasks with values 2 and 1
respectively *in the past*, with the same results. Right now I have 32 for
both: same results, since those settings are just a hint for Nutch.
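
To be explicit about what I mean by "just a hint", here is a minimal,
illustrative sketch (not Nutch code; the class name is made up; old
org.apache.hadoop.mapred API): the map count set on the job is only advisory,
because the InputFormat's getSplits() decides the real number of map tasks,
while the reduce count is taken as given.

// Illustrative sketch only (not Nutch code): setNumMapTasks() is a hint,
// because the InputFormat's getSplits() decides the actual number of maps,
// while setNumReduceTasks() is honoured as given.
import org.apache.hadoop.mapred.JobConf;

public class TaskCountHint {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumMapTasks(32);    // hint: actual maps == number of input splits
    conf.setNumReduceTasks(1);  // taken literally by the framework
    System.out.println("map hint = " + conf.getNumMapTasks()
        + ", reduces = " + conf.getNumReduceTasks());
  }
}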

Regarding the number of threads *per host*, I tried 10 and 20 in the past,
same results.

I appreciate your support Alexander, thank you :)


Re: Distributed fetching only happening in one node ?

Andrzej Białecki-2
brainstorm wrote:
> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
> values 2 and 1 respectively *in the past*, same results. Right now, I
> have 32 for both: same results as those settings are just a hint for
> nutch.
>
> Regarding number of threads *per host* I tried with 10 and 20 in the
> past, same results.

Indeed, the default number of maps and reduces can be changed for any
particular job - the number of maps is adjusted according to the number
of input splits (InputFormat.getSplits()), and the number of reduces can
be adjusted programmatically in the application.

Back to your issue: I suspect that your fetchlist is highly homogeneous,
i.e. it contains URLs from a single host. Nutch makes sure that all URLs
from a single host end up in a single map task, in order to enforce the
politeness settings, so that's probably why you see only a single map task
fetching all the URLs.
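
To illustrate with a deliberately simplified sketch (this is not the actual
Nutch partitioning code; the class name and URLs are just examples): when the
partition key is the host, every URL from the same host hashes into the same
bucket, i.e. the same map task, no matter how many task slots are free.

// Simplified illustration of partition-by-host (not the actual Nutch code).
// All URLs that share a host land in the same bucket, i.e. the same map task.
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostPartitionSketch {
  public static void main(String[] args) throws Exception {
    String[] urls = { "http://upc.es/", "http://upc.es/about",
                      "http://valleycycles.net/", "http://www.getting-forward.org/" };
    int numPartitions = 4;  // pretend the cluster offers 4 map slots
    Map<Integer, List<String>> buckets = new HashMap<Integer, List<String>>();
    for (String u : urls) {
      String host = new URL(u).getHost();
      int part = (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
      if (!buckets.containsKey(part)) {
        buckets.put(part, new ArrayList<String>());
      }
      buckets.get(part).add(u);
    }
    // A fetchlist dominated by one host collapses into a single bucket.
    System.out.println(buckets);
  }
}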


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Distributed fetching only happening in one node ?

brainstorm-2-2
Andrzej, thanks for your advice... I was using a 20MB URL list
provided by our customers; I have to write a script to determine how
homogeneous the input seed URL file is (something like the sketch below).
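
A rough, untested sketch of the kind of check I have in mind (the class name
is made up; it counts seed URLs per host and assumes one well-formed URL per
line):

// Rough sketch: count how many seed URLs each host contributes. If one host
// dominates the list, the fetchlist will collapse into very few map tasks.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class SeedHostStats {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));  // e.g. urls.txt
    Map<String, Integer> perHost = new HashMap<String, Integer>();
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0) continue;
      String host = new URL(line).getHost();
      Integer n = perHost.get(host);
      perHost.put(host, n == null ? 1 : n + 1);
    }
    in.close();
    System.out.println(perHost.size() + " distinct hosts: " + perHost);
  }
}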

As a preliminary test, I've run a crawl using the integrated Nutch DMOZ
parser (as suggested in the official Nutch tutorial), which I assume
chooses URLs in a more heterogeneous fashion. Is the resulting URL
list a random enough sample ? ... In fact, being a directory, the
number of repeated URLs should be low, shouldn't it ?

The bad news is that I'm getting the same results: just two nodes[1] are
actually fetching :_( So I guess the problem is somewhere else (I
already left the number of maps & reduces at 2 and 1, as suggested in
this thread).

Any further ideas/tests/fixes ?

Thanks a lot for your patience and support,
Roman

[1] one of them being the frontend (invariably) and the other one, a
random node on each new crawl.


Re: Distributed fetching only happening in one node ?

brainstorm-2-2
In reply to this post by Andrzej Białecki-2
On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[hidden email]> wrote:

> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.



For now, my focus is on using the Nutch command-line tool:

$ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5

I assume (perhaps incorrectly) that Nutch will determine the number
of maps & reduces dynamically. Is that true, or should I switch to a
custom-coded crawler using the Nutch API ?

Btw, having a look at "getSplits", I suspect that Fetcher does
precisely what I don't want it to do: it does not split its inputs... so
the fewer input splits there are, the fewer maps will be spread across
the nodes in the fetch phase. Am I wrong ?:

public class Fetcher
(...)

    /** Don't split inputs, to keep things polite. */
    public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
      Path[] files = listPaths(job);
      FileSystem fs = FileSystem.get(job);
      InputSplit[] splits = new InputSplit[files.length];
      for (int i = 0; i < files.length; i++) {
        splits[i] = new FileSplit(files[i], 0,
            fs.getFileStatus(files[i]).getLen(), (String[])null);
      }
      return splits;
    }
  }
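
If I read that right, the number of fetch map tasks is fixed when the segment
is generated: one split, and hence one map, per fetchlist part file. A quick,
untested sketch of what I mean (the class name is made up and the path is just
an example from my setup):

// Sketch: count the part files under a segment's crawl_generate directory.
// One part file means one Fetcher input split, i.e. one fetch map task, so
// this is the upper bound on how many nodes can fetch that segment in parallel.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountFetchlistParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. crawl-ecxi/segments/20080713123547/crawl_generate
    Path crawlGenerate = new Path(args[0]);
    FileStatus[] parts = fs.listStatus(crawlGenerate);
    System.out.println(parts.length + " fetchlist part(s), so at most "
        + parts.length + " fetch map task(s)");
  }
}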


Thanks in advance,
Roman




Re: Distributed fetching only happening in one node ?

brainstorm-2-2
I was wondering... if I split the input urls like this:

url1.txt url2.txt ... urlN.txt

Will this input spread map jobs to N nodes ? Right now I'm using just
one (big) urls.txt file (and only 2 nodes are actually fetching).

Thanks in advance,
Roman

On Wed, Aug 6, 2008 at 11:51 PM, brainstorm <[hidden email]> wrote:

> On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <[hidden email]> wrote:
>> brainstorm wrote:
>>>
>>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>>> values 2 and 1 respectively *in the past*, same results. Right now, I
>>> have 32 for both: same results as those settings are just a hint for
>>> nutch.
>>>
>>> Regarding number of threads *per host* I tried with 10 and 20 in the
>>> past, same results.
>>
>> Indeed, the default number of maps and reduces can be changed for any
>> particular job - the number of maps is adjusted according to the number of
>> input splits (InputFormat.getSplits()), and the number of reduces can be
>> adjusted programmatically in the application.
>
>
>
> For now, my focus is on using nutch commandline tool:
>
> $ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5
>
> I assume (perhaps incorrectly), that nutch will determine the number
> of maps & reduces dynamically. Is it true or should I switch to a
> custom coded crawler using nutch API ?
>
> Btw, having a look at "getSplits", I suspect that Fetcher does
> precisely what I don't want it to do: does not split inputs... then,
> the less input splits, the less maps will be spread on nodes on fetch
> phase, am I wrong ?:
>
> public class Fetcher
> (...)
>
>    /** Don't split inputs, to keep things polite. */
>    public InputSplit[] getSplits(JobConf job, int nSplits)
>      throws IOException {
>      Path[] files = listPaths(job);
>      FileSystem fs = FileSystem.get(job);
>      InputSplit[] splits = new InputSplit[files.length];
>      for (int i = 0; i < files.length; i++) {
>        splits[i] = new FileSplit(files[i], 0,
>            fs.getFileStatus(files[i]).getLen(), (String[])null);
>      }
>      return splits;
>    }
>  }
>
>
> Thanks in advance,
> Roman
>
>
>
>> Back to your issue: I suspect that your fetchlist is highly homogenous, i.e.
>> contains urls from a single host. Nutch makes sure that all urls from a
>> single host end up in a single map task, to ensure the politeness settings,
>> so that's probably why you see only a single map task fetching all urls.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: Distributed fetching only happening in one node ?

Alexander Aristov
It should make no difference, as all urls from all the files in the
directory are injected first.

Alex
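
In other words, the seed files are flattened at inject time, before the
Generator ever runs. A rough sketch of that step, using example paths
(the local seeds/ directory is just a stand-in for wherever the
url*.txt files live):

# every url*.txt under the seed directory is merged into the same crawldb,
# so the later fetchlist partitioning never sees the individual seed files
$ bin/hadoop dfs -put seeds urls          # seeds/ may hold url1.txt ... urlN.txt
$ bin/nutch inject crawl-dmoz/crawldb urls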

2008/8/8 brainstorm <[hidden email]>

> I was wondering... if I split the input urls like this:
>
> url1.txt url2.txt ... urlN.txt
>
> Will this input spread map jobs to N nodes ? Right now I'm using just
> one (big) urls.txt file (just 2 nodes actually fetching).
>
> Thanks in advance,
> Roman
>



--
Best Regards
Alexander Aristov

Re: Distributed fetching only happening in one node ?

Andrzej Białecki-2
In reply to this post by brainstorm-2-2
brainstorm wrote:
> I was wondering... if I split the input urls like this:
>
> url1.txt url2.txt ... urlN.txt
>
> Will this input spread map jobs to N nodes ? Right now I'm using just

No, it won't - because these files are first added to a crawldb, and
only then Generator creates partial fetchlists out of the whole crawldb.

Here's how it works:

* Generator first prepares the list of candidate urls for fetching

* then it applies limits, e.g. the maximum number of urls per host

* and finally partitions the fetchlist so that all urls from the same
host end up in the same partition. The number of output partitions from
Generator is equal to the default number of map tasks. Why? Because
Fetcher will create one map task per partition in the fetchlist.

So - please check how many part-NNNNN files you have in the generated
fetchlist.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
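
Expressed as commands, the sequence above looks roughly like this
(example paths; the -numFetchers option is assumed to be available in
this Generator version, and without it the partition count falls back
to the default number of map tasks, as described above):

# build a fetchlist with an explicit number of partitions; one fetch
# map task will later be created per part-NNNNN file in crawl_generate
$ bin/nutch generate crawl-dmoz/crawldb crawl-dmoz/segments -numFetchers 4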


Re: Distributed fetching only happening in one node ?

soila
In reply to this post by Andrzej Białecki-2
Hi Andrzej,

I am experiencing similar problems distributing the fetch across multiple nodes. I am crawling a single host on an intranet and I would like to know how I can modify nutch's behavior so that it distributes the fetch over multiple nodes.

Soila
Andrzej Bialecki wrote
brainstorm wrote:
> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
> values 2 and 1 respectively *in the past*, same results. Right now, I
> have 32 for both: same results as those settings are just a hint for
> nutch.
>
> Regarding number of threads *per host* I tried with 10 and 20 in the
> past, same results.

Indeed, the default number of maps and reduces can be changed for any
particular job - the number of maps is adjusted according to the number
of input splits (InputFormat.getSplits()), and the number of reduces can
be adjusted programmatically in the application.

Back to your issue: I suspect that your fetchlist is highly homogenous,
i.e. contains urls from a single host. Nutch makes sure that all urls
from a single host end up in a single map task, to ensure the politeness
settings, so that's probably why you see only a single map task fetching
all urls.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Distributed fetching only happening in one node ?

Jordan Mendler
I am looking to do the same thing. If anyone finds a way, please post here.

Thanks,
Jordan

On Sun, Aug 10, 2008 at 11:31 AM, soila <[hidden email]> wrote:

>
> Hi Andrzej,
>
> I am experiencing similar problems distributing the fetch across multiple
> nodes. I am crawling a single host in an intranet and I would like to know
> how I can modify nutch's behavior so that it distributes the search over
> multiple nodes.
>
> Soila

Re: Distributed fetching only happening in one node ?

brainstorm-2-2
In reply to this post by Andrzej Białecki-2
On Fri, Aug 8, 2008 at 1:18 PM, Andrzej Bialecki <[hidden email]> wrote:

> brainstorm wrote:
>>
>> It was wondering... if I split the input urls like this:
>>
>> url1.txt url2.txt ... urlN.txt
>>
>> Will this input spread map jobs to N nodes ? Right now I'm using just
>
> No, it won't - because these files are first added to a crawldb, and only
> then Generator creates partial fetchlists out of the whole crawldb.
>
> Here's how it works:
>
> * Generator first prepares the list of candidate urls for fetching
>
> * then it applies limits e.g. maximum number of urls per host
>
> * and finally partitions the fetchlist so that all urls from the same host
> end up in the same partition. The number of output partitions from Generator
> is equal to the default number of map tasks. Why? because Fetcher will
> create one map task per each partition in the fetchlist.



Somebody said that mapred.map.tasks=2 was OK for a 7-node cluster
setup, but using greater values for mapred.map.tasks (tested from 2
up to 256) does not alter the output or fix the problem: no additional
part-XXXXX files are generated and no additional nodes
participate in the fetch phase :/

What should I do ?



> So - please check how many part-NNNNN files you have in the generated
> fetchlist.



Here is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you can see, just one part-NNNNN file is generated... and in the conf
file (nutch-site.xml) mapred.map.tasks is set to 2 (the default value, as
suggested in previous emails).



Thanks for your support ! ;)
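
One more thing worth checking against that segment (a suggestion only;
the path is copied from the listing above): content/ is the fetcher's
output, while the partition count that actually drives the number of
fetch map tasks is the Generator output under crawl_generate.

# the number of part-NNNNN files here equals the number of fetch map
# tasks (per the getSplits code quoted earlier); note that a fetchlist
# dominated by a single host can still collapse into a single useful
# partition, as pointed out earlier in the thread
$ bin/hadoop dfs -ls /user/hadoop/crawl-dmoz/segments/20080806192122/crawl_generate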


>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>