A couple of usability problems

A couple of usability problems

Nathan Wang
Hi,
I have a couple of usability problems that I think the development team could address.
I'm currently running a job that takes a whole day to finish.
 
1) Adjusting input set dynamically
At the start, I had 9090 gzipped input data files for the job,
    07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input paths to process : 9090

Then I realized there were 3 files that were bad (couldn't be gunzipped).  
So, I removed them by doing,
    bin/hadoop  dfs  -rm  srcdir/FILExxx.gz

20 hours later, the job had failed, and I found a few errors in the log:
    org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz

Is it possible that the runtime could adjust the input data set accordingly?

2) Checking the output directory first
I started my job with the standard command line,
    bin/hadoop  jar  myjob.jar  srcdir  resultdir

Then, after many long hours, the job was about to finish with
    ...INFO mapred.JobClient:  map 100% reduce 100%
But, it ended up with
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...resultdir already exists

Can we check the existence of the output directory at the very beginning, to save us a day?
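
In the meantime, I suppose a pre-check in the job driver works as a user-side guard. A minimal sketch (my own code, not part of Hadoop; args[1] is assumed to be the output directory):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJob {  // placeholder for my driver class
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJob.class);
            FileSystem fs = FileSystem.get(conf);
            Path resultDir = new Path(args[1]);  // "resultdir" on my command line
            // Fail fast, before any tasks run, if the output directory exists.
            if (fs.exists(resultDir)) {
                System.err.println("Output directory " + resultDir + " already exists");
                System.exit(1);
            }
            // ... set input/output paths and submit the job as usual ...
        }
    }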

Thanks,
Nathan

RE: A couple of usability problems

Devaraj Das
> 20 hours later, the job had failed, and I found a few errors in the log:
>     org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename ...FILExxx.gz
>
> Is it possible that the runtime could adjust the input data set accordingly?

If I remember right, the packaged inputformat/recordreader implementations in
hadoop don't do this check, but you could implement your own
inputformat/recordreader, or subclass an existing recordreader. Taking
org.apache.hadoop.mapred.LineRecordReader as an example for this discussion,
you could have the open failure ignored in the LineRecordReader's constructor.
The next(K,V) method would then have to check whether the input stream is null
and return false on the very first call (apart from possibly some other
things). So, in summary, it is possible.
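
For instance, one could catch the open failure in a custom InputFormat instead
of touching LineRecordReader itself. A rough sketch (SkipMissingTextInputFormat
is a made-up name, and exact signatures vary a bit across releases):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SkipMissingTextInputFormat extends TextInputFormat {
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            try {
                return super.getRecordReader(split, job, reporter);
            } catch (IOException e) {
                // The file went away after job planning; hand back an empty
                // reader so the map task sees zero records instead of dying.
                return new RecordReader<LongWritable, Text>() {
                    public boolean next(LongWritable key, Text value) { return false; }
                    public LongWritable createKey() { return new LongWritable(); }
                    public Text createValue() { return new Text(); }
                    public long getPos() { return 0L; }
                    public float getProgress() { return 1.0f; }
                    public void close() { }
                };
            }
        }
    }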

> Then, after many long hours, the job was about to finish with
>     ...INFO mapred.JobClient:  map 100% reduce 100%
> But, it ended up with
>     Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory ...resultdir already exists

This should be handled by OutputFormat.checkOutputSpecs, and if you used an
OutputFormat implementation supplied in hadoop that extends
org.apache.hadoop.mapred.OutputFormatBase, this check would immediately fail
jobs whose output directory already exists. Which outputformat are you using?
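
For reference, the packaged check amounts to roughly the following (a sketch
from memory, not the exact hadoop source):

    // inside an OutputFormatBase-style class (old mapred API)
    public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
        Path outDir = job.getOutputPath();
        if (outDir != null) {
            FileSystem fs = outDir.getFileSystem(job);
            // Thrown before any tasks run, so a stale output directory
            // fails the job up front.
            if (fs.exists(outDir)) {
                throw new FileAlreadyExistsException(
                    "Output directory " + outDir + " already exists");
            }
        }
    }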

Re: A couple of usability problems

Ted Dunning-3
In reply to this post by Nathan Wang

My jobs seem to do that.  I am surprised yours do not.

What version of hadoop are you running? I am using 0.13.1.


On 9/25/07 10:30 AM, "Nathan Wang" <[hidden email]> wrote:

>
> Can we check the existence of the output directory at the very beginning, to
> save us a day?

Re: A couple of usability problems

Nathan Wang
In reply to this post by Nathan Wang
I didn't set the OutputFormat.  By default, it should be TextOutputFormat, I believe.
And I'm running the latest version, 0.14.0.
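
For completeness, setting it explicitly would look like this (a sketch; MyJob is a placeholder for my driver class):

    // in the job driver (old mapred API)
    JobConf conf = new JobConf(MyJob.class);
    conf.setOutputFormat(TextOutputFormat.class);  // the default, stated explicitly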

Re: A couple of usability problems

Owen O'Malley-4
In reply to this post by Nathan Wang
On Sep 25, 2007, at 10:30 AM, Nathan Wang wrote:

> 1) Adjusting input set dynamically
> At the start, I had 9090 gzipped input data files for the job,
>     07/09/24 10:26:06 INFO mapred.FileInputFormat: Total input  
> paths to process : 9090
>
> Then I realized there were 3 files that were bad (couldn't be  
> gunzipped).
> So, I removed them by doing,
>     bin/hadoop  dfs  -rm  srcdir/FILExxx.gz
>
> 20 hours later, the job was failed.  And, I found a few errors in  
> the log:
>     org.apache.hadoop.ipc.RemoteException: java.io.IOException:  
> Cannot open filename ...FILExxx.gz
>
> Is it possible that the runtime could adjust the input data set  
> accordingly?

As Devaraj pointed out, this is possible, but in general I think it is
correct to make this an error. The planning for the job must happen at the
beginning, before the job is launched, and once a map has been assigned a
file, a mapper that can't read its assigned input is a fatal problem. If
failures are tolerable for your application, you can set the percentage of
mappers and reducers that can fail before the job is killed.
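
In JobConf terms, that looks roughly like this (setter names from memory; check the API of your release):

    // in the job driver, before submitting (old mapred API)
    JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder
    // Tolerate up to 5% failed map or reduce tasks before the job is killed.
    conf.setMaxMapTaskFailuresPercent(5);
    conf.setMaxReduceTaskFailuresPercent(5);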

> Can we check the existence of the output directory at the very  
> beginning, to save us a day?

It does already. That was done back before 0.1 in HADOOP-3. Was your  
program launching two jobs or something? Very strange.

-- Owen

Re: A couple of usability problems

Torsten Curdt
In reply to this post by Ted Dunning-3
Something we noticed too; that changed with our upgrade to 0.14.0.

On 25.09.2007, at 21:36, Ted Dunning wrote:

>
> My jobs seem to do that.  I am surprised yours do not.
>
> What version of hadoop are you running?  I am using 0.13.1
>
>
> On 9/25/07 10:30 AM, "Nathan Wang" <[hidden email]> wrote:
>
>>
>> Can we check the existence of the output directory at the very beginning,
>> to save us a day?
>