Strange behavior - One reduce out of N reduces always fails.


Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
Hi there,

Howdy. I've been using Hadoop to parse and index XML
documents. It's a two-step process similar to Nutch: I
parse the XML and write field-value tuples to a file.

I read this file and index the field-value pairs in
the next step.

Everything works fine, but one reduce out of N always
fails in the last step, when merging segments. It fails
with one or more of the following:
- Task failed to report status for 608 seconds. Killing.
- java.lang.OutOfMemoryError: GC overhead limit exceeded

I've tried various configuration combinations, and it
always fails at the 4th reduce in an 8-reduce
configuration and at the first one in a 4-reduce config.

Environment:
SUSE Linux, 64-bit
Java 6 (Java 5 also fails)
Hadoop 0.11.2
Lucene 2.1 (Lucene 2.0 also fails)

Configuration:
I have about 128 maps and 8 reduces, so I get 8
partitions of my index. It runs on a 4-node cluster of
dual-processor machines with 64 GB of RAM each.

Number of documents: 1.65 million, each about 10 KB in
size.

I ran with 4 or 8 task trackers per node, with a 4 GB
heap for the JobTracker, the TaskTrackers, and the child
JVMs.

mergeFactor set to 50 and maxBufferedDocs at 1000.
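
(In code, that corresponds to roughly the following IndexWriter setup -
a sketch only; the directory path, analyzer and variable names are
placeholders for whatever the indexer actually uses:)

    // Sketch of the Lucene 2.1 writer settings described above.
    // Classes: org.apache.lucene.index.IndexWriter,
    //          org.apache.lucene.store.FSDirectory,
    //          org.apache.lucene.analysis.standard.StandardAnalyzer
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/path/to/index-part", true),  // create
        new StandardAnalyzer(), true);
    writer.setMergeFactor(50);        // merge once 50 segments accumulate
    writer.setMaxBufferedDocs(1000);  // flush a new segment every 1000 docs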

I fail to understand what's going on. When I run the
job individually, it works with the same settings.

Why would all the reduces work except this one?

I'd appreciate it if anyone can share their experience.

Thanks,
Ven



 

Re: Strange behavior - One reduce out of N reduces always fails.

Andrzej Białecki-2
Venkat Seeth wrote:

> Hi there,
>
> Howdy. I've been using hadoop to parse and index XML
> documents. Its a 2 step process similar to Nutch. I
> parse the XML and create field-value tuples written to
> a file.
>
> I read this file and index the field-value pairs in
> the next step.
>
> Everything works fine but always one reduce out of N
> fails in the last step when merging segments. It fails
> with one or more of the following:
> - Task failed to report status for 608 seconds.
> Killing.
> - java.lang.OutOfMemoryError: GC overhead limit
> exceeded
>  

Perhaps you are running with too large a heap, as strange as it may
sound ... If I understand this message correctly, the JVM is complaining
that GC is taking too many resources.

This may also be related to the ulimit on this account ...


> Configuration:
> I have about 128 maps and 8 reduces so I get to create
> 8 partitions of my index. It runs on a 4 node cluster
> with 4-Dual-proc 64GB machines.
>  

I think that with this configuration you could increase the number of
reduces, to decrease the amount of data each reduce task has to handle.
In your current config you run at most 2 reduces per machine.

> Number of documents: 1.65 million each about 10K in
> size.
>
> I ran with 4 or 8 task trackers per node with 4 GB
> Heap for Job, Task trackers and the child JVMs.
>
> mergeFactor set to 50 and maxBufferedDocs at 1000.
>
> I fail to understand whats going on. When I run the
> job individually, it works with the same settings.
>
> Why would all jobs work where in only one fails.
>  

You can also use IsolationRunner to re-run individual tasks under a
debugger and see where they fail.
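
Keeping the failed task's files around is the main prerequisite for
that - roughly like this (a sketch; the property name is from the
0.11-era docs, and the exact IsolationRunner invocation may differ on
your install):

    // Sketch: keep the working files of failed tasks so they can be
    // re-run in isolation (org.apache.hadoop.mapred.JobConf).
    JobConf conf = new JobConf(Indexer.class);
    conf.set("keep.failed.task.files", "true");
    // Then, on the node where the task failed, from the task's work
    // directory under the local mapred dir:
    //   bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml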

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Strange behavior - One reduce out of N reduces always fails.

Devaraj Das
In reply to this post by Venkat Seeth
While this could be a JVM/GC issue, as Andrzej pointed out, it could also
be due to a very large key/value being read from the map output. Do you
have an estimate of the sizes? Attached is a quick-hack patch to log the
sizes of the key/values read from the sequence files. Please apply this
patch on hadoop-0.11.2 and check in the userlogs which key/value it is
failing for (if at all).
Thanks,
Devaraj.


Re: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
In reply to this post by Andrzej Białecki-2
Hi Andrzej,

Thanks for your quick response.

Please find my comments below.

> Perhaps you are running with too large a heap, as strange as it may
> sound ... If I understand this message correctly, the JVM is
> complaining that GC is taking too many resources.

I started with the defaults, a 200m heap and maxBufferedDocs at 100, but
I got a "too many open files" error. Then I increased maxBufferedDocs to
2000 and got an OOM. Hence I went through a series of changes and
arrived at the conclusion that, irrespective of the config, one reduce
always fails.

> This may also be related to the ulimit on this account ...

I checked, and it has a limit of 1024. The number of segments generated
was around 500 for 1 million docs in each part.
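
(Back-of-the-envelope: if those segments are in Lucene's non-compound
format at roughly 7-8 files per segment, 500 segments is on the order of
3,500 files, so a merge touching many of them at once could easily run
past a 1024-descriptor limit - which would fit the "too many open files"
errors I saw earlier.)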

> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to handle.

Ideally I want a partition of 10-15 million docs per reduce, since I
want to index 100 million. I can try with 10 or 12 reduces. But even
with 8, one fails, and in isolation it works fine with the same
settings.

> In your current config you run at most 2 reduces per machine.

True. Why do you say so? I've set 4 tasks/node, but I was at 8 too and
faced the same issue.

> You can also use IsolationRunner to re-run individual tasks under a
> debugger and see where they fail.

I tried with mapred.job.tracker = local and things fly without errors. I
also tried the same with a slave and they work too. Locally on Windows
using Cygwin, it works too.

Any thoughts are greatly appreciated. I'm doing a
proof-of-concept and this is really a big hurdle.

Thanks,
Venkat


RE: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
In reply to this post by Devaraj Das
Hi Devaraj,

Thanks for your response.

> Do you have an estimate of the sizes?

Number of entries: 1,080,746
Field-value pairs per entry: min 20, max 3,116, avg 66

These are small documents, and yes, the full-text content for each
document can be big. I've also set MaxFieldLength to 10000 so that I
don't index very large values, as suggested for Lucene.

The reduce always fails while merging segments. I do see a large line
in the Log4j output, which consists of

Typically, the reduce that fails is ALWAYS VERY SLOW compared to the
other N - 1.

Can I log the Key-Value pair sizes in the reduce part
of the indexer?

Again,

Thanks,
Venkat


RE: Strange behavior - One reduce out of N reduces always fails.

Devaraj Das
Hi Venkat,
You forgot to paste the log output in your reply. The patch that I sent will
log the key/value sizes in the Reducers as well. See if you get helpful
hints with that.
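
If you want to add your own logging in the Indexer's reduce()
independently of the patch, a minimal sketch would be something like
this (assuming your values are Writables, e.g. your TuplesWritable, and
LOG is whatever logger the class already has):

    // Sketch: log the serialized size of each value seen in reduce().
    // Uses org.apache.hadoop.io.DataOutputBuffer and
    // org.apache.hadoop.io.Writable.
    private static long serializedSize(Writable w) throws IOException {
      DataOutputBuffer buf = new DataOutputBuffer();
      w.write(buf);                // serialize into an in-memory buffer
      return buf.getLength();      // number of bytes written
    }

    // inside reduce(key, values, output, reporter):
    //   while (values.hasNext()) {
    //     Writable v = (Writable) values.next();
    //     LOG.info("key=" + key + ", value bytes=" + serializedSize(v));
    //     ...
    //   }
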
Thanks,
Devaraj.


RE: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
Hi Devaraj,

The log file for the key-value pairs is huge. If you can tell me what
you are looking for, I can mine it and send the relevant information.

 343891695 2007-02-20 18:37 seq.log

This time around I get the following error:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3209)
        at java.lang.String.<init>(String.java:216)
        at java.lang.StringBuffer.toString(StringBuffer.java:585)
        at org.apache.log4j.WriterAppender.checkEntryConditions(WriterAppender.java:176)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:156)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
        at org.apache.log4j.Category.callAppenders(Category.java:203)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.debug(Category.java:257)
        at com.gale.searchng.workflow.model.TuplesWritable.readFields(TuplesWritable.java:127)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:199)
        at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:160)
        at com.gale.searchng.workflow.indexer.Indexer.reduce(Indexer.java:152)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:324)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)

Thanks,
Venkat


Re: Strange behavior - One reduce out of N reduces always fails.

Andrzej Białecki-2
Venkat Seeth wrote:

> Hi Devraj,
>
> The log file for key-value pairs are huge? If you can
> tell me what are you looking for I can mine and send
> the relevant information.
>
>  343891695 2007-02-20 18:37 seq.log
>
> This time aroung I get the following error:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>  


This message is so curious that I went and googled it. Additional
information on it is here:

http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html

Look at point 3 - the time-limit setting seems to be related to this
error message (from reading the JVM sources ;) ).

All in all, your GC is overworked .. ;)
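
Also, judging by the stack you posted, the allocation that finally blew
up was Log4j rendering a very large DEBUG message inside
TuplesWritable.readFields(). Raising that logger above DEBUG (or
truncating what gets logged there) would take that pressure off - a
sketch, assuming the logger is simply named after the class:

    // Sketch (org.apache.log4j): silence the per-tuple debug output.
    Logger.getLogger(TuplesWritable.class).setLevel(Level.INFO);

or the equivalent line in the log4j.properties used by the task
trackers.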

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
Hi Andrzej,

Thanks again. It's interesting. I did create the child JVMs with
200 MB, 1 GB, 2 GB, 3 GB, and finally 4 GB of heap, with min and max set
to the same value.

The error is reproducible.

Then again, 7 out of 8 reduces complete successfully in 10 minutes, and
this one takes more than 30 minutes to fail.

This is the strange thing and I'm out of all options
now. :-(

Let me play around more with GC settings.
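
For the record, the knob I'll be playing with is the child JVM options
in the job config - along these lines (a sketch; the values are just
examples, and the last flag only disables the Java 6 "GC overhead limit"
check rather than fixing the underlying pressure):

    // Sketch: pass heap / GC options to the map-reduce child JVMs
    // (org.apache.hadoop.mapred.JobConf).
    conf.set("mapred.child.java.opts",
             "-Xms2048m -Xmx2048m -verbose:gc -XX:-UseGCOverheadLimit");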

Thanks,
Venkat



RE: Strange behavior - One reduce out of N reduces always fails.

Mahadev Konar
In reply to this post by Venkat Seeth
It does look like the value for a particular key is huge. Does your
map-reduce job fail for the same key/value pair, or is it
non-deterministic?

Regards
Mahadev


RE: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
I haven't determined which key-value pair causes this one yet. I need
to find that out.

Thanks,
Venkat


Re: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
In reply to this post by Andrzej Białecki-2
Hi Andrzej,

A quick question on your suggestion.

>> Configuration:
>> I have about 128 maps and 8 reduces so I get to create 8 partitions
>> of my index.

> I think that with this configuration you could increase the number of
> reduces, to decrease the amount of data each reduce task has to handle.
> In your current config you run at most 2 reduces per machine.

You suggested increasing the number of reduces. I had come up with 8
partitions for my index, each containing about 10 million documents.

Are you saying I could create 32 partitions and then later merge them
into a smaller number of partitions?

If I have a huge number of partitions, I do not know how it will affect
federating search across that large number of indexes and merging the
results from those searches.

Any thoughts are greatly appreciated.

Thanks,
Venkat


Re: Strange behavior - One reduce out of N reduces always fails.

Andrzej Białecki-2
Venkat Seeth wrote:

> Hi Andrzej,
>
> A quick question on your suggestion.
>
>>> Configuration:
>>> I have about 128 maps and 8 reduces so I get to create 8 partitions
>>> of my index.
>
>> I think that with this configuration you could increase the number of
>> reduces, to decrease the amount of data each reduce task has to
>> handle. In your current config you run at most 2 reduces per machine.
>
> You suggested increasing the number of reduces. I had come up with 8
> partitions for my index, each containing about 10 million documents.
>
> Are you saying I could create 32 partitions and then later merge them
> into a smaller number of partitions?
>
> If I have a huge number of partitions, I do not know how it will
> affect federating search across that large number of indexes and
> merging the results from those searches.
>
> Any thoughts are greatly appreciated.


The only reason I suggested increasing the number of reduces is to get
you past the memory problems. From the search performance point of view,
you should definitely merge the partial indexes.
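
The merge step itself is just Lucene's addIndexes() - roughly like this
with the 2.1 API (a sketch; paths and the analyzer are placeholders):

    // Sketch: merge the per-reduce partial indexes into a single index.
    // Classes: org.apache.lucene.index.IndexWriter,
    //          org.apache.lucene.store.Directory / FSDirectory,
    //          org.apache.lucene.analysis.standard.StandardAnalyzer
    Directory[] parts = new Directory[] {
        FSDirectory.getDirectory("/indexes/part-00000", false),
        FSDirectory.getDirectory("/indexes/part-00001", false),
        // ... one entry per reduce partition
    };
    IndexWriter merged = new IndexWriter(
        FSDirectory.getDirectory("/indexes/merged", true),
        new StandardAnalyzer(), true);
    merged.addIndexes(parts);
    merged.optimize();
    merged.close();

Until the parts are merged you would be searching them through something
like MultiSearcher, and merging hits from many small indexes adds
overhead - hence the suggestion.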

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Strange behavior - One reduce out of N reduces always fails.

Venkat Seeth
Thank you Sami Siren, Andrzej Bialecki, Devaraj Das and Mahadev Konar
for your inputs. I was finally able to get past 1 million with two
changes:

1. Reduced the document size significantly.
2. Increased the file-handle limit from 1024 to 4096.

These two did the magic.

I was able to successfully process 5 million docs. I'm planning a test
to process 25 million. I'll keep things posted.

Thanks,
Venkat
