question on Hadoop configuration for non cpu intensive jobs - 0.15.1


question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Jason Venner-2
We run two flavors of jobs through Hadoop. The first flavor is a
simple merge sort, where very little happens in the mapper or the
reducer.
The second flavor is very compute intensive.

In the first type, each of our map tasks consumes its (default-sized)
64 MB input split in a few seconds, so quite a bit of the elapsed time
is spent in job setup and shutdown.

We have tried reducing the number of splits by increasing the block
size to 5x and 10x the default 64 MB, but then we constantly hit
out-of-memory errors and timeouts. At this point each JVM is getting
768 MB, and I can't readily allocate more without dipping into swap.

What suggestions do people have for this case?
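
For reference, the knobs we have been turning look roughly like this (a
sketch against the 0.15-era JobConf properties; the values are the ones we
are currently running with, not a recommendation):

import org.apache.hadoop.mapred.JobConf;

public class MergeSortJobSetup {
  // Applies our split/heap related settings to a job. Property names are
  // the 0.15-era ones; "MergeSortJobSetup" is just a name for this sketch.
  public static JobConf configure(JobConf conf) {
    // A larger DFS block size means fewer, bigger input splits, but it only
    // affects files written after the change (block size is fixed per file).
    conf.set("dfs.block.size", Long.toString(5L * 64 * 1024 * 1024));

    // Heap for each child task JVM; we are currently at 768 MB.
    conf.set("mapred.child.java.opts", "-Xmx768m");
    return conf;
  }
}

The kind of failures we then see: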

07/12/25 11:49:59 INFO mapred.JobClient: Task Id : task_200712251146_0001_m_000002_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1763)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1663)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1709)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:174)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

07/12/25 11:51:35 INFO mapred.JobClient: Task Id : task_200712251146_0001_r_000038_0, Status : FAILED
java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:484)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
        at org.apache.hadoop.dfs.$Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:269)
        at org.apache.hadoop.dfs.DFSClient.createNamenode(DFSClient.java:147)
        at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:161)
        at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:65)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:159)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:118)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:90)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1759)


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Ted Dunning-3

What are your mappers doing that they run out of memory?  Or is it your
reducers?

Often, you can write this sort of program so that you don't have higher
memory requirements for larger splits.
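
For what it's worth, a record-at-a-time mapper keeps its memory use flat no
matter how big the split is, since nothing is accumulated per split. A
sketch only (pre-generics mapred interfaces, roughly as they look in 0.15):

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits each record as soon as it is read; a 640 MB split should need no
// more heap than a 64 MB one.
public class PassThroughMapper extends MapReduceBase implements Mapper {
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    output.collect(key, value);
  }
}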


On 12/25/07 1:52 PM, "Jason Venner" <[hidden email]> wrote:

> We have tried reducing the number of splits by increasing the block
> sizes to 10x and 5x 64meg, but then we constantly have out of memory
> errors and timeouts. At this point each jvm is getting 768M and I can't
> readily allocate more without dipping into swap.


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Rui Shi
In reply to this post by Jason Venner-2
Hi,

I ran into a similar problem too. I had to keep the split size smaller to work around it.
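
If it helps, my understanding of FileInputFormat in this release is that the
split size works out to roughly max(mapred.min.split.size,
min(totalBytes / numMapTasks, blockSize)), so asking for more map tasks is
one way to push the splits back down without rewriting the input at a
smaller block size. A sketch only (class name and task count are made up):

import org.apache.hadoop.mapred.JobConf;

public class SmallerSplits {
  public static void apply(JobConf conf) {
    // A hint, not a hard limit; more requested map tasks generally means
    // smaller splits, and splits never exceed a block anyway.
    conf.setNumMapTasks(400);
  }
}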

-Rui

----- Original Message ----
From: Ted Dunning <[hidden email]>
To: [hidden email]
Sent: Tuesday, December 25, 2007 1:56:16 PM
Subject: Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1



What are your mappers doing that they run out of memory?  Or is it your
reducers?

Often, you can write this sort of program so that you don't have higher
memory requirements for larger splits.


On 12/25/07 1:52 PM, "Jason Venner" <[hidden email]> wrote:

> We have tried reducing the number of splits by increasing the block
> sizes to 10x and 5x 64meg, but then we constantly have out of memory
> errors and timeouts. At this point each jvm is getting 768M and I can't
> readily allocate more without dipping into swap.

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Jason Venner-2
In reply to this post by Ted Dunning-3
My mapper in this case is the identity mapper, and the reducer gets
about 10 values per key and makes a collect decision based on the data
in the values.
The reducer is very close to a no-op, and uses little memory beyond
the values themselves.
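
Schematically the reducer is shaped like this; a sketch only, with the real
decision logic replaced by a placeholder:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Looks at the ~10 values for a key and emits at most one of them; nothing
// is held beyond the value currently being examined.
public class CollectDecisionReducer extends MapReduceBase implements Reducer {
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Writable value = (Writable) values.next();
      if (shouldCollect(value)) {   // placeholder for the real test
        output.collect(key, value);
        return;
      }
    }
  }

  private boolean shouldCollect(Writable value) {
    return true;                    // placeholder decision
  }
}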

I believe the problem is in the amount of buffering in the output files.

The quandary we have is that the jobs run very poorly at the standard
input split size, since the mean time to finish a split is so small,
while large split sizes bring gigantic memory requirements.

Time to play with parameters again ... since the answer doesn't appear
to be in working memory for the list.



Ted Dunning wrote:

> What are your mappers doing that they run out of memory?  Or is it your
> reducers?
>
> Often, you can write this sort of program so that you don't have higher
> memory requirements for larger splits.
>
>
> On 12/25/07 1:52 PM, "Jason Venner" <[hidden email]> wrote:
>
>  
>> We have tried reducing the number of splits by increasing the block
>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
>> errors and timeouts. At this point each jvm is getting 768M and I can't
>> readily allocate more without dipping into swap.
>>    
>
>  

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Ted Dunning-3


This sounds like a bug.

The memory requirements for hadoop itself shouldn't change with the split
size.  At the very least, it should adapt correctly to whatever the memory
limits are.

Can you build a version of your program that works from random data so that
you can file a bug?  If you contact me off-line, I can help build a random
data generator that matches your input reasonably well.
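
Something in this direction is what I have in mind; just a sketch, with
made-up record shapes and counts:

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Writes a SequenceFile of random records so a failing job can be
// reproduced without the real data.
public class RandomSeqFileGenerator {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);           // e.g. /user/you/random.seq
    Random rand = new Random();
    byte[] payload = new byte[1000];        // tune to match your records

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, BytesWritable.class);
    try {
      for (long i = 0; i < 1000000; i++) {  // tune to match your volume
        rand.nextBytes(payload);
        writer.append(new LongWritable(i), new BytesWritable(payload));
      }
    } finally {
      writer.close();
    }
  }
}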


On 12/25/07 2:52 PM, "Jason Venner" <[hidden email]> wrote:

> My mapper in this case is the identity mapper, and the reducer gets
> about 10 values per key and makes a collect decision based on the data
> in the values.
> The reducer is very close to a no-op, and uses very little additional
> memory than the values.
>
> I believe the problem is in the amount of buffering in the output files.
>
> The quandary we have is the jobs run very poorly with the standard input
> split size as the mean time to finishing a split is very small, vrs
> gigantic memory requirements for large split sizes.
>
> Time to play with parameters again ... since the answer doesn't appear
> to be in working memory for the list.
>
>
>
> Ted Dunning wrote:
>> What are your mappers doing that they run out of memory?  Or is it your
>> reducers?
>>
>> Often, you can write this sort of program so that you don't have higher
>> memory requirements for larger splits.
>>
>>
>> On 12/25/07 1:52 PM, "Jason Venner" <[hidden email]> wrote:
>>
>>  
>>> We have tried reducing the number of splits by increasing the block
>>> sizes to 10x and 5x 64meg, but then we constantly have out of memory
>>> errors and timeouts. At this point each jvm is getting 768M and I can't
>>> readily allocate more without dipping into swap.
>>>    
>>
>>  


RE: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Devaraj Das
I am also interested in the test demonstrating OOM for large split sizes
(if this is true then it is indeed a bug). Sort and spill-to-disk should
happen as soon as io.sort.mb worth of key/value data has been collected.
I am assuming you didn't increase io.sort.mb when you increased the
split size.
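
For reference, that buffer is controlled per job like this (100 MB is the
default in this release, as far as I recall):

import org.apache.hadoop.mapred.JobConf;

public class SortBufferSetting {
  public static void apply(JobConf conf) {
    // Size in MB of the in-memory buffer that map output is collected into
    // before it is sorted and spilled to disk.
    conf.set("io.sort.mb", "100");
  }
}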

Thanks,
Devaraj

> -----Original Message-----
> From: Ted Dunning [mailto:[hidden email]]
> Sent: Wednesday, December 26, 2007 4:31 AM
> To: [hidden email]
> Subject: Re: question on Hadoop configuration for non cpu
> intensive jobs - 0.15.1
>
>
>
> This sounds like a bug.
>
> The memory requirements for hadoop itself shouldn't change
> with the split size.  At the very least, it should adapt
> correctly to whatever the memory limits are.
>
> Can you build a version of your program that works from
> random data so that you can file a bug?  If you contact me
> off-line, I can help build a random data generator that
> matches your input reasonably well.


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Jason Venner-2
Our OOM was being caused by a damaged sequence data file. We had assumed
that sequence files carry checksums, which appears to be incorrect.
The deserializer was reading a bad record length out of the file and
trying to allocate 4 GB of RAM.
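
For anyone hitting the same thing, a small scanner along these lines can
locate the bad region (a rough sketch, not our exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;

// Reads a SequenceFile record by record and reports the byte offset near
// which reading blows up, so the damaged region can be located.
public class SeqFileScanner {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();
    long records = 0;
    try {
      while (reader.next(key, value)) {
        records++;
      }
      System.out.println("OK: " + records + " records");
    } catch (Throwable t) {
      System.out.println("Failed after " + records + " records, near byte "
          + reader.getPosition() + ": " + t);
    } finally {
      reader.close();
    }
  }
}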

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Eric Baldeschwieler
I created HADOOP-2497 to describe this bug.

Was your sequence file stored on HDFS?  Because HDFS does provide  
checksums.

On Dec 28, 2007, at 7:20 AM, Jason Venner wrote:

> Our OOM was being caused by a damaged sequence data file. We had  
> assumed that the sequence files had checksums, which appears to be  
> in correct.
> The deserializer was reading a bad length out of the file and  
> trying to allocate 4gig of ram.


Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Jason Venner-2
Yes, our sequence files are stored in HDFS.

Some of them are constructed via the FileUtil.copyMerge routine, and some
are the output of a mapper or a reducer; all of them are in HDFS.
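
The merge step is roughly the following (paths are placeholders); as far as
I can tell, copyMerge concatenates the raw bytes of the part files rather
than doing anything SequenceFile-aware:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0]: directory of part files, args[1]: merged output file.
    FileUtil.copyMerge(fs, new Path(args[0]),
                       fs, new Path(args[1]),
                       false,   // do not delete the source files
                       conf,
                       null);   // no separator string between files
  }
}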


Eric Baldeschwieler wrote:

> I created HADOOP-2497 to describe this bug.
>
> Was your sequence file stored on HDFS?  Because HDFS does provide
> checksums.
>
> On Dec 28, 2007, at 7:20 AM, Jason Venner wrote:
>
>> Our OOM was being caused by a damaged sequence data file. We had
>> assumed that the sequence files had checksums, which appears to be in
>> correct.
>> The deserializer was reading a bad length out of the file and trying
>> to allocate 4gig of ram.
>

RE: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

Runping Qi-2
In reply to this post by Devaraj Das


I have encountered similar problems many times too, especially when the
input data is compressed.
I had to raise the heap size to around 700 MB to avoid OOM problems in
the mappers.
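
Concretely, something like this per job (the default child heap in this
release is only -Xmx200m, as far as I recall):

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapSetting {
  public static void apply(JobConf conf) {
    // JVM options for each map/reduce child task.
    conf.set("mapred.child.java.opts", "-Xmx700m");
  }
}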

Runping


> -----Original Message-----
> From: Devaraj Das [mailto:[hidden email]]
> Sent: Friday, December 28, 2007 3:28 AM
> To: [hidden email]
> Subject: RE: question on Hadoop configuration for non cpu intensive
> jobs - 0.15.1
>
> I am also interested in the test demonstrating OOM for large split sizes
> (if this is true then it is indeed a bug). Sort & Spill-to-disk should
> happen as soon as io.sort.mb amount of key/value data is collected. I am
> assuming that you didn't change (increased) the value of io.sort.mb when
> you increased the split size..
>
> Thanks,
> Devaraj