Hadoop 'wordcount' program hanging in the Reduce phase.


Gaurav Agarwal
Hi Everyone!
I am a new Hadoop user trying to set up a small cluster with Hadoop (the 2 March build) on Ubuntu 6.10 (Edgy), but I am facing some issues.

I am trying to run the Hadoop 'wordcount' example program that comes bundled with it. I can run the program successfully on a single-node cluster (that is, using only my local machine), but when I try to run the same program on a cluster of two machines, it hangs in the 'reduce' phase.


Settings:

Master Node: 192.168.1.150 (dennis-laptop)
Slave Node: 192.168.1.201 (traal)

The user account on both Master and Slave is named: Hadoop

Password-less ssh login to Slave from the Master is working.

JAVA_HOME is set appropriately in the hadoop-env.sh file on both Master/Slave.
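
(For anyone reproducing this setup: passwordless login can be enabled roughly as follows. This is only an illustration; the key type and paths are assumptions, not necessarily the exact commands used here.)

# Illustration only: on the master, generate a key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Append the master's public key to the slave's authorized_keys
cat ~/.ssh/id_rsa.pub | ssh hadoop@192.168.1.201 'cat >> ~/.ssh/authorized_keys'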

MASTER

1) conf/slaves
localhost
hadoop@192.168.1.201

2) conf/master
localhost

3) conf/hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
         <name>fs.default.name</name>
         <value>192.168.1.150:50000</value>
    </property>

    <property>
         <name>mapred.job.tracker</name>
         <value>192.168.1.150:50001</value>
    </property>

    <property>
         <name>dfs.replication</name>
         <value>2</value>
    </property>
</configuration>

SLAVE

1) conf/slaves
localhost

2) conf/master
hadoop@192.168.1.150

3) conf/hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
         <name>fs.default.name</name>
         <value>192.168.1.150:50000</value>
    </property>

    <property>
         <name>mapred.job.tracker</name>
         <value>192.168.1.150:50001</value>
    </property>

    <property>
         <name>dfs.replication</name>
         <value>2</value>
    </property>
</configuration>


CONSOLE OUTPUT
bin/hadoop jar hadoop-*-examples.jar wordcount -m 10 -r 2 input output
07/03/06 23:17:17 INFO mapred.InputFormatBase: Total input paths to process : 1
07/03/06 23:17:18 INFO mapred.JobClient: Running job: job_0001
07/03/06 23:17:19 INFO mapred.JobClient:  map 0% reduce 0%
07/03/06 23:17:29 INFO mapred.JobClient:  map 20% reduce 0%
07/03/06 23:17:30 INFO mapred.JobClient:  map 40% reduce 0%
07/03/06 23:17:32 INFO mapred.JobClient:  map 80% reduce 0%
07/03/06 23:17:33 INFO mapred.JobClient:  map 100% reduce 0%
07/03/06 23:17:42 INFO mapred.JobClient:  map 100% reduce 3%
07/03/06 23:17:43 INFO mapred.JobClient:  map 100% reduce 5%
07/03/06 23:17:44 INFO mapred.JobClient:  map 100% reduce 8%
07/03/06 23:17:52 INFO mapred.JobClient:  map 100% reduce 10%
07/03/06 23:17:53 INFO mapred.JobClient:  map 100% reduce 13%
07/03/06 23:18:03 INFO mapred.JobClient:  map 100% reduce 16%


The only exception I can see from the log files is in the 'TaskTracker' log file:

2007-03-06 23:17:32,214 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 Copying task_0001_m_000002_0 output from traal.
2007-03-06 23:17:32,221 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 Copying task_0001_m_000001_0 output from dennis-laptop.
2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 copy failed: task_0001_m_000002_0 from traal
2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: File /tmp/hadoop-hadoop/mapred/local/task_0001_r_000000_0/map_2.out-0 not created
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:301)
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:262)

2007-03-06 23:17:32,369 WARN org.apache.hadoop.mapred.TaskRunner: task_0001_r_000000_0 adding host traal to penalty box, next contact in 99 seconds

I am attaching the master log files just in case anyone wants to check them.

Any help will be greatly appreciated!

-gaurav

hadoop-hadoop-tasktracker-dennis-laptop.log
hadoop-hadoop-jobtracker-dennis-laptop.log
hadoop-hadoop-namenode-dennis-laptop.log
hadoop-hadoop-datanode-dennis-laptop.log

RE: Hadoop 'wordcount' program hanging in the Reduce phase.

Richard Yang-3
Hi Gaurav,

Does this error always happen?
Our settings are similar. My logs contain some IOException error messages: not able to obtain certain blocks, not able to create a new block. Although the program sometimes hung, in most cases the jobs were able to complete with correct results.
By the way, I am running the grep sample program on version 0.11.2.

Best Regards
 
Richard Yang
[hidden email]
[hidden email]
 
 

Re: Hadoop 'wordcount' program hanging in the Reduce phase.

张茂森
In reply to this post by Gaurav Agarwal
In my opinion, you should make the conf files on the master and the slave node identical. In particular, the conf/slaves file should be the same across your small cluster.
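
For example, the master's conf files could be pushed to the slave with something like this (the remote Hadoop path is only a guess; adjust it to your installation):

# run on the master, from the Hadoop directory; assumes Hadoop lives in ~/hadoop on the slave
scp conf/slaves conf/master conf/hadoop-site.xml hadoop@192.168.1.201:~/hadoop/conf/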

Re: Hadoop 'wordcount' program hanging in the Reduce phase.

jaylac
In reply to this post by Gaurav Agarwal

Hi Gaurav,

I'm also a beginner, but I'll share my views; they may not be correct.

You said you created a user named "Hadoop" on both systems, but in the slaves file you have written hadoop@192.168.1.201. Isn't it case sensitive? Try changing it to Hadoop@192.168.1.201 on both systems.

Also, try using ports 9010 and 9011 in the hadoop-site.xml file, as sketched below.

These might be in no way related to your problem, but still, try them and let me know.
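
Concretely, that suggestion would mean changing the two addresses in hadoop-site.xml on both machines to something like this (only an illustration of the idea, reusing your master's IP):

    <!-- illustration only -->
    <property>
         <name>fs.default.name</name>
         <value>192.168.1.150:9010</value>
    </property>

    <property>
         <name>mapred.job.tracker</name>
         <value>192.168.1.150:9011</value>
    </property>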

Regards,
Jaya


RE: Hadoop 'wordcount' program hanging in the Reduce phase.

Gaurav Agarwal
In reply to this post by Richard Yang-3
Hi Richard,

I am facing this error very consistently. I have also tried another nightly build (4 March), but it gave the same exception.

thanks,
gaurav




Re: Hadoop 'wordcount' program hanging in the Reduce phase.

Brian Wedel-2
I am experimenting on a small cluster as well (4 machines) and I had
success with the following configuration:

 - configuration files on both the master and slaves are the same
 - in the master/slave lists I only used the IP address (not localhost) and omitted the user prefix, e.g. (hadoop@)
 - in the fs.default.name configuration variable, use hdfs://<host>:<port> (I don't know if this is necessary, but it seems you can specify other types of filesystems; not sure which is the default) -- see the sketch after this list
 - use the 0.12.0 release; I was using 0.11.2 and was getting some odd errors that disappeared when I upgraded
 - I don't run a datanode daemon on the same machine as the namenode -- this was a problem when I was trying the hadoop-streaming contributed package for scripting. Not sure if it matters for the examples
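
As a sketch of the fs.default.name point, reusing the master address from this thread (I have not verified this exact value against your cluster):

    <!-- sketch only: note the hdfs:// scheme prefix -->
    <property>
         <name>fs.default.name</name>
         <value>hdfs://192.168.1.150:50000</value>
    </property>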

This configuration worked for me.
-Brian


Re: Hadoop 'wordcount' program hanging in the Reduce phase.

Gaurav Agarwal
In reply to this post by jaylac
Thanks for the reply. The user name is created properly as 'hadoop'. That is not the problem, as the jobtracker is able to start tasks on the slave machine.

I played around and observed that if I make the remote machine the only slave (as opposed to the master node also acting as one of the slaves), the tasks run fine.
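
(That is, the master's conf/slaves was reduced to contain only the remote machine:)

hadoop@192.168.1.201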

It could be that making a node function as both slave and master is a bad idea (although I do not see any reason why not). I will try to get access to more slave machines and see whether my guess is correct.

thanks,
gaurav

Re: Hadoop 'wordcount' program hanging in the Reduce phase.

Gaurav Agarwal
In reply to this post by Brian Wedel-2
Hi Brian,

I tried the configuration changes you suggested, but they did not work for me. (I am beginning to get the feeling that making a node function as both master and slave is a bad idea!)

Could you run this experiment for me on your cluster?

Config: 2-node cluster.
Node 1: acts as both master and slave.
Node 2: acts as slave only.

An input file of ~5 MB.

Run the word count example program using the command:
bin/hadoop jar hadoop-0.12.0-examples.jar wordcount -m 4 input output

I really appreciate your help. Thanks in advance!

-gaurav


Re: Re: Hadoop 'wordcount' program hanging in the Reduce phase.

Gaurav Agarwal
In reply to this post by 张茂森
Hi, I tried that; same problem. Thanks!

Re: Hadoop 'wordcount' program hanging in the Reduce phase.

Gaurav Agarwal
In reply to this post by Gaurav Agarwal
Problem resolved!!

This looks like a bug in version 0.12.0 (there is a thread in the developer area about a race condition that results in hung reduce jobs). I moved back to 0.11.2 and the problem was resolved!
Thanks a lot to all of you, and especially to Jaya for pointing it out.

regards
gaurav
