Fwd: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.


Fwd: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

bmdevelopment
Hi, sorry for the cross-post, but I'm just trying to see if anyone else
has had this issue before.
Thanks


---------- Forwarded message ----------
From: bmdevelopment <[hidden email]>
Date: Fri, Jun 25, 2010 at 10:56 AM
Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
To: [hidden email]


Hello,
Thanks so much for the reply.
See inline.

On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[hidden email]> wrote:

> Hi,
>
>> I've been getting the following error when trying to run a very simple
>> MapReduce job.
>> Map finishes without problem, but error occurs as soon as it enters
>> Reduce phase.
>>
>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>
>> I am running a 5 node cluster and I believe I have all my settings correct:
>>
>> * ulimit -n 32768
>> * DNS/RDNS configured properly
>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>
>> The program is very simple - just counts a unique string in a log file.
>> See here: http://pastebin.com/5uRG3SFL
>>
>> When I run it, the job fails and I get the following output:
>> http://pastebin.com/AhW6StEb
>>
>> However, it runs fine when I do *not* use substring() on the value (see
>> the map function in the code above).
>>
>> This runs fine and completes successfully:
>>            String str = val.toString();
>>
>> This causes an error and fails:
>>            String str = val.toString().substring(0,10);
>>
>> Please let me know if you need any further information.
>> It would be greatly appreciated if anyone could shed some light on this problem.
>
> It's notable that changing the code to use a substring is
> causing a difference. Assuming it is consistent and not a red herring,

Yes, this has been consistent over the last week. I was running 0.20.1
first and then upgraded to 0.20.2, but the results have been exactly the
same.

> can you look at the counters for the two jobs using the JobTracker web
> UI - things like map records, bytes, etc. - and see if there is a
> noticeable difference?

OK, so here is the first job, which uses write.set(value.toString()); and
has *no* errors:
http://pastebin.com/xvy0iGwL

And here is the second job, which uses
write.set(value.toString().substring(0, 10)); and fails:
http://pastebin.com/uGw6yNqv

And here is yet another run, where I used a longer, and therefore unique,
string via write.set(value.toString().substring(0, 20)); this makes every
line unique, similar to the first job. It still fails:
http://pastebin.com/GdQ1rp8i

> Also, are the two programs being run against
> the exact same input data?

Yes, exactly the same input: a single CSV file with 23K lines.
Using a shorter string leads to more identical keys and therefore more
combining/reducing, but going by the above it seems to fail whether the
substring/key is entirely unique (23,000 combine output records) or
mostly the same (9 combine output records).
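
(An aside that may help rule out a map-side variable: String.substring(0, 10)
throws StringIndexOutOfBoundsException on any line shorter than 10 characters
and would fail the map task, whereas plain toString() cannot. A minimal
guarded sketch, assuming a mapper shaped like the pastebin code - the class
name and counting details here are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper shaped like the code under discussion.
    public class PrefixCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text write = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String str = value.toString();
            // Cap the prefix at the line length so short lines cannot
            // throw StringIndexOutOfBoundsException and fail the task.
            write.set(str.substring(0, Math.min(10, str.length())));
            context.write(write, ONE);
        }
    }

If every line is at least 10 characters this changes nothing, but it removes
one variable from the comparison.)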

>
> Also, since the cluster size is small, you could also look at the
> tasktracker logs on the machines where the maps have run to see if
> there are any failures when the reduce attempts start failing.

Here is the TT log from the last failed job. I do not see anything
besides the shuffle failure, but there
may be something I am overlooking or simply do not understand.
http://pastebin.com/DKFTyGXg

Thanks again!

>
> Thanks
> Hemanth
>
Reply | Threaded
Open this post in threaded view
|

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Deepak Diwakar
Hey friends,

I got stuck setting up an HDFS cluster and am getting this error while
running the simple wordcount example (I did this two years back without
any problem).

I am currently testing on hadoop-0.20.1 with 2 nodes, following the
instructions from
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

I checked the firewall settings and /etc/hosts, and there is no issue
there. Also, master and slave are accessible both ways.

The input size is also very low (~3 MB), so there should not be any issue
with ulimit (which is, by the way, 4096).
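
(For reference, a quick way to sanity-check those items from each node; the
hostnames master/slave1 are illustrative, and the commands are ordinary
Linux tools, nothing Hadoop-specific:

    ulimit -n                    # open-file limit in the current shell
    hostname -f                  # the name this node reports for itself
    getent hosts master slave1   # how each name actually resolves here
    ssh slave1 hostname -f       # confirm resolution from the other side

If the names each node reports and the names the others resolve do not line
up, the reduce-side shuffle fetches have nothing consistent to connect to.)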

I would be really thankful if anyone can guide me to resolve this.

Thanks & regards,
- Deepak Diwakar,





Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Chen He
Hey Deepak,

Try keeping the /etc/hosts file the same across all your cluster nodes and
see whether the problem disappears.
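
(For illustration, "the same" here means every node carries an identical set
of entries; the names and addresses below are hypothetical:

    192.168.0.1    master
    192.168.0.2    slave1
    192.168.0.3    slave2

That way any node can resolve any other node by the name it advertises.)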




--
Best Wishes!

--
Chen He
(402)613-9298
PhD. student of CSE Dept.
Research Assistant of Holland Computing Center
University of Nebraska-Lincoln
Lincoln NE 68588

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

C.V.Krishnakumar
In reply to this post by Deepak Diwakar
Hi Deepak,

You could refer to this too: http://markmail.org/message/mjq6gzjhst2inuab#query:MAX_FAILED_UNIQUE_FETCHES+page:1+mid:ubrwgmddmfvoadh2+state:results
I tried those instructions and they are working for me.
Regards,
Krishna


Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

C.V.Krishnakumar
Hi Deepak,

Maybe I did not make my mail clear: I had tried the instructions in the
blog you mentioned, and they are working for me.
Did you change the /etc/hosts file at any point?

Regards,
Krishna



Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Deepak Diwakar
Thanks Krishna and Chen.

Yes, the problem was in /etc/hosts. In fact, each node's file contained a
unique identifier (necromancer, rocker, etc.), which was the only
difference in /etc/hosts among the nodes. Once I put the same identifier
on all of them, it worked.
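
(To illustrate the kind of mismatch described, with hypothetical addresses:
one node's /etc/hosts might have carried only its own pet name,

    192.168.0.2    necromancer

while another carried only

    192.168.0.3    rocker

so no node could resolve the names its peers advertised, which is presumably
why the reduce-side fetches of map output kept failing. The fix is for every
node to carry the same complete set of entries.)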


Thanks & regards
- Deepak Diwakar,



