Reduce hangs


Reduce hangs

Yunhong Gu1

Hi,

If someone knows how to fix the problem described below, please help me
out. Thanks!

I am testing Hadoop on a 2-node cluster and the "reduce" phase always hangs at
some stage, even on different clusters. My OS is Debian Linux, kernel
2.6 (AMD Opteron w/ 4GB memory). The Hadoop version is 0.15.2 and the Java
version is 1.5.0_01-b08.

I simply ran "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench", and when
the map stage finishes, the reduce stage hangs somewhere in the
middle, sometimes at 0%. I also tried every other MapReduce program I could
find in the example jar package, but they all hang.

The log file simply prints
2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801181424_0004_r_000000_0 0.0% reduce > copy >

forever.

The program does work if I start Hadoop on only a single node.
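A quick way to confirm the reducer is genuinely stalled rather than just slow is to check whether the reported progress figure ever changes across the TaskTracker log lines. A small POSIX-shell sketch, reusing the log excerpt above as sample input:

```shell
# Extract the percentage from each TaskTracker progress line and count the
# distinct values seen; a single distinct value over many lines means the
# copy phase is stuck. The sample input is the log excerpt from this message.
log='2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >'

distinct=$(printf '%s\n' "$log" | grep -o '[0-9][0-9.]*%' | sort -u | wc -l | tr -d ' ')
echo "distinct progress values: $distinct"   # 1 => stalled
```

Pointing the same pipeline at the live TaskTracker log (instead of the embedded sample) distinguishes "stuck at 0.0% forever" from a copy phase that is merely crawling.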

Below is my hadoop-site.xml configuration:

<configuration>

<property>
    <name>fs.default.name</name>
    <value>10.0.0.1:60000</value>
</property>

<property>
    <name>mapred.job.tracker</name>
    <value>10.0.0.1:60001</value>
</property>

<property>
    <name>dfs.data.dir</name>
    <value>/raid/hadoop/data</value>
</property>

<property>
    <name>mapred.local.dir</name>
    <value>/raid/hadoop/mapred</value>
</property>

<property>
   <name>hadoop.tmp.dir</name>
   <value>/raid/hadoop/tmp</value>
</property>

<property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx1024m</value>
</property>

<property>
   <name>mapred.tasktracker.tasks.maximum</name>
   <value>4</value>
</property>

<!--
<property>
   <name>mapred.map.tasks</name>
   <value>7</value>
</property>

<property>
   <name>mapred.reduce.tasks</name>
   <value>3</value>
</property>
-->

<property>
   <name>fs.inmemory.size.mb</name>
   <value>200</value>
</property>

<property>
   <name>dfs.block.size</name>
   <value>134217728</value>
</property>

<property>
   <name>io.sort.factor</name>
   <value>100</value>
</property>

<property>
   <name>io.sort.mb</name>
   <value>200</value>
</property>

<property>
   <name>io.file.buffer.size</name>
   <value>131072</value>
</property>

</configuration>


Re: Reduce hangs

Miles Osborne
I had the same problem.  If I recall, the fix is to add the following to
your hadoop-site.xml file:

<property>
   <name>mapred.reduce.copy.backoff</name>
   <value>5</value>
</property>

See HADOOP-1984.

Miles



Re: Reduce hangs

Yunhong Gu1

Hi, Miles,

Thanks for your information. I applied this but the problem still exists.
By the way, when this happens, the CPUs are idle and doing nothing.

Yunhong


Re: Reduce hangs

Miles Osborne
I think it takes a while to actually work, so be patient!

Miles


Re: Reduce hangs

Jason Venner-2
In reply to this post by Yunhong Gu1
When this was happening to us, there was a block replication error: one
node was in an endless loop trying to replicate a block to another node
which would not accept it. In our case most of the cluster was idle,
but a CPU on the machine trying to send the block was heavily used.

We were never able to isolate the cause, and it stopped happening for us
when we upgraded to 0.15.1.

---
Attributor is hiring Hadoop Wranglers, contact if interested.


Re: Reduce hangs

Yunhong Gu1

I am using 0.15.2, and in my case the CPUs on both nodes are idle. It looks
like the program is trapped in a synchronization deadlock or some
waiting state from which it will never be awakened.

Yunhong


Re: Reduce hangs

Yunhong Gu1
In reply to this post by Miles Osborne

The program "mrbench" takes 1 second on a single node, so I think waiting
for a minute should be long enough. I also restarted Hadoop after I
updated the config file.

Yunhong



Re: Reduce hangs

Konstantin Shvachko
In reply to this post by Yunhong Gu1
Looks like we still have this unsolved mysterious problem:

http://issues.apache.org/jira/browse/HADOOP-1374

Could it be related to HADOOP-1246? Arun?

Thanks,
--Konstantin


Re: Reduce hangs

Yunhong Gu1


Yes, it looks like HADOOP-1374.

The program actually failed after a while:


gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-test.jar mrbench
MRBenchmark.0.0.2
08/01/18 18:53:08 INFO mapred.MRBench: creating control file: 1 numLines,
ASCENDING sortOrder
08/01/18 18:53:08 INFO mapred.MRBench: created control file:
/benchmarks/MRBench/mr_input/input_-450753747.txt
08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
input=/benchmarks/MRBench/mr_input
output=/benchmarks/MRBench/mr_output/output_1843693325
08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/18 18:53:09 INFO mapred.JobClient: Running job: job_200801181852_0001
08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
task_200801181852_0001_m_000001_0, Status : FAILED
Too many fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
08/01/18 19:08:35 INFO mapred.JobClient: Job complete: job_200801181852_0001
08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1
DataLines       Maps    Reduces AvgTime (milliseconds)
1               2       1       926333
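As a sanity check on the numbers (a sketch, just arithmetic on the output above): the reported AvgTime of 926333 ms matches the wall-clock gap between job submission at 18:53:09 and completion at 19:08:35, so essentially the entire run was spent waiting out the fetch-failure timeout rather than doing work:

```shell
avg_ms=926333                       # AvgTime from the mrbench summary above
start=$((18 * 3600 + 53 * 60 + 9))  # 18:53:09 job submission, in seconds
end=$((19 * 3600 + 8 * 60 + 35))    # 19:08:35 job completion, in seconds
echo "wall clock: $((end - start)) s, AvgTime: $((avg_ms / 1000)) s"
# prints: wall clock: 926 s, AvgTime: 926 s
```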




RE: Reduce hangs

Devaraj Das
Hi Yunhong,
As per the output it seems the job ran to successful completion (albeit with
some failures)...
Devaraj
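One more thing worth checking: "Too many fetch-failures" on a small cluster is often a name-resolution problem rather than a Hadoop bug, since the reducer fetches map output over HTTP from the hostname the other TaskTracker reports, and if that name does not resolve the copy phase spins until the timeout seen above. A minimal sketch of the kind of consistency check worth running on every node (the hostnames "master" and "slave1" and the /etc/hosts content are invented for illustration; substitute the real ones):

```shell
# Hypothetical /etc/hosts content for the two nodes in this thread; the
# names are made up for illustration. The loop verifies that every cluster
# hostname appears exactly once, which is what the shuffle's HTTP fetches
# depend on.
hosts='10.0.0.1 master
10.0.0.2 slave1'

for name in master slave1; do
  count=$(printf '%s\n' "$hosts" | grep -cw "$name")
  [ "$count" -eq 1 ] && echo "$name: ok" || echo "$name: MISSING or duplicated"
done
```

If a name is missing or maps to different addresses on different nodes, fixing /etc/hosts (or DNS) on every node and restarting the daemons is worth trying before digging into the JIRAs above.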

> -----Original Message-----
> From: Yunhong Gu1 [mailto:[hidden email]]
> Sent: Saturday, January 19, 2008 8:56 AM
> To: [hidden email]
> Subject: Re: Reduce hangs
>
>
>
> Yes, it looks like HADOOP-1374
>
> The program actually failed after a while:
>
>
> gu@ncdm-8:~/hadoop-0.15.2$ ./bin/hadoop jar
> hadoop-0.15.2-test.jar mrbench
> MRBenchmark.0.0.2
> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
> 1 numLines, ASCENDING sortOrder
> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
> /benchmarks/MRBench/mr_input/input_-450753747.txt
> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
> input=/benchmarks/MRBench/mr_input
> output=/benchmarks/MRBench/mr_output/output_1843693325
> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
> paths to process : 1
> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
> job_200801181852_0001
> 08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
> 08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
> 08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> task_200801181852_0001_m_000001_0, Status : FAILED Too many
> fetch-failures
> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> outputncdm15
> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> outputncdm15
> 08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
> job_200801181852_0001
> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
> 08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
> 08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
> 08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1
> DataLines       Maps    Reduces AvgTime (milliseconds)
> 1               2       1       926333
>
>
>
> On Fri, 18 Jan 2008, Konstantin Shvachko wrote:
>
> > Looks like we still have this unsolved mysterious problem:
> >
> > http://issues.apache.org/jira/browse/HADOOP-1374
> >
> > Could it be related to HADOOP-1246? Arun?
> >
> > Thanks,
> > --Konstantin
> >
> > [original post quoted above; trimmed]
>


RE: Reduce hangs

Yunhong Gu1

Oh, so it is the task running on the other node (ncdm-15) that fails, and
Hadoop re-runs it on the local node (ncdm-8). (I only have two nodes, ncdm-8
and ncdm-15. Both the namenode and the jobtracker run on ncdm-8, and the
program is also started from ncdm-8.)

>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id : task_200801181852_0001_m_000001_0, Status : FAILED Too many fetch-failures
>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15

Any idea why the task fails, and why it takes Hadoop so long to detect
the failure?

Thanks
Yunhong

On Sat, 19 Jan 2008, Devaraj Das wrote:

> Hi Yunhong,
> As per the output it seems the job ran to successful completion (albeit with
> some failures)...
> Devaraj
>
>> [earlier messages quoted above; trimmed]

RE: Reduce hangs

Yunhong Gu1

Hi, all

Just to keep this topic updated :) I am still trying to figure out what
happened.

In my 2-node configuration (namenode and jobtracker on node-1, with both
nodes running a datanode and a tasktracker), the reduce task may
sometimes (but rarely) complete for programs that need little CPU time
(e.g., mrbench), but for computation-heavy programs it never finishes.
When the reduce blocks, it always stalls at 16%.

Eventually I get this error output:
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPxxxxxxxxx
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPxxxxxxxxx
08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
task_200801181145_0005_r_000000_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!

I found that "IPxxxxxxxx" is not the network address Hadoop should be
reading results from. The servers I use have two network interfaces, and
I am using the other one: I explicitly put the 10.0.0.x addresses in all
the configuration files.
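
A quick sanity check (a generic sketch, not something from this thread;
it assumes a Debian-style box with `getent` available, and my
understanding, which you should verify against your Hadoop version, is
that a tasktracker advertises its DNS hostname and reducers fetch map
output from whatever address that name resolves to):

```shell
# Run this on each node. A TaskTracker advertises the name returned by
# `hostname`; reducers then fetch map output from whichever address that
# name resolves to, so it should resolve to the 10.0.0.x interface.
hostname
getent hosts "$(hostname)" || echo "hostname does not resolve locally"
# If this prints an address on the wrong interface (or nothing), pinning
# the hostname to the 10.0.0.x address in /etc/hosts on every node is
# one workaround.
```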

Might this be the reason for the reduce failure? The map phase does work,
though.
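
If the second interface is the culprit, Hadoop of this vintage has
DNS-related properties that may let you pin the daemons to one interface.
A sketch (property names as I recall them from the 0.15-era
hadoop-default.xml, and "eth1" is only a placeholder; please check both
against your copy before relying on this):

```xml
<!-- In hadoop-site.xml on every node: make the datanode and
     tasktracker derive their advertised hostname from the
     interface carrying the 10.0.0.x addresses. -->
<property>
  <name>dfs.datanode.dns.interface</name>
  <value>eth1</value>
</property>

<property>
  <name>mapred.tasktracker.dns.interface</name>
  <value>eth1</value>
</property>
```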

Thanks
Yunhong


On Sat, 19 Jan 2008, Yunhong Gu1 wrote:

>
> Oh, so it is the task running on the other node (ncdm-15) fails and Hadoop
> re-run the task on the local node (ncdm-8). (I only have two nodes, ncdm-8
> and ncdm-15. Both namenode and jobtracker are running on ncdm-8. The program
> is also started on ncdm-8).
>
>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>> task_200801181852_0001_m_000001_0, Status : FAILED Too many fetch-failures
>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
>
> Any ideas why the task would fail? And why it takes so long for Hadoop to
> detect the failure?
>
> Thanks
> Yunhong
>
> [earlier messages quoted above; trimmed]