Hadoop MapReduce: using NFS as the filesystem

Hadoop MapReduce: using NFS as the filesystem

Jon Blower
Hi all,

(I hope this is the place for Hadoop questions - please direct me if not!
Thanks to those who pointed out to me that MapReduce/DFS has moved to the
Hadoop project.)

I'd like to use Hadoop to do distributed searches through data files that
are stored on a networked disk which we access through NFS.  All the
"worker" machines in our pool have access to this disk already so I don't
want to use DFS.  How can I set up Hadoop such that the workers read from
and write to this networked disk?

My first thought was that the NFS disk looks like a local disk from the
point of view of the workers.  So I added these properties to
conf/hadoop-site.xml:

    <property>  
        <name>fs.default.name</name>
        <value>local</value>
    </property>  
    <property>
        <name>mapred.job.tracker</name>  
        <value>localhost:9001</value>  
    </property>

I.e. I set it to use a "local" filesystem but started the job tracker
running.  My "conf/slaves" file just contains "localhost".  I then ran
bin/start-all.sh and tried to submit a job:

    bin/hadoop org.apache.hadoop.examples.WordCount in out

where "in" is a directory containing input files and "out" is a directory
for the output files.  This doesn't work: I see a log message "map 0% reduce
0%" and nothing further happens.

I can run the above job in standalone operation (i.e. with fs.default.name
set to "local" and the "mapred.job.tracker" unset) with no problems: this
worked straight out of the box.


Any suggestions would be much appreciated.

Thanks, Jon


--------------------------------------------------------------
Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
Technical Director         Tel: +44 118 378 8741 (ESSC)
Reading e-Science Centre   Fax: +44 118 378 6413
ESSC                       Email: [hidden email]
University of Reading
3 Earley Gate
Reading RG6 6AL, UK
--------------------------------------------------------------

Re: Hadoop MapReduce: using NFS as the filesystem

Stefan Groschupf-2
Jon,
there is also a hadoop user mailing list.

It is not clear to me what you are planning to do, but in general
Hadoop's tasktrackers and jobtrackers require a running DFS.
What you can do is write a map task that reads from the local disk you
mentioned, but you will not get the word-count demo to work with a
local file system.
The general problem is that Hadoop requires DFS for processing and does
not allow multiple file systems at once.

I remember a discussion where people talked about introducing a kind of
protocol syntax like dfs://path and file://path.
I would find that very useful for Nutch as well, since it would allow
keeping an index on the local file system and the segment data in the
DFS. :-) That is a serious problem for search performance and
scalability today.
Feel free to raise this improvement suggestion again on the Hadoop
mailing list; the developers may find it sensible as well.

Does that help?
Stefan




---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net


Re: Hadoop MapReduce: using NFS as the filesystem

Doug Cutting
Stefan Groschupf wrote:
> in general Hadoop's tasktrackers and jobtrackers require a running DFS.

Stefan: that should not be the case.  One should be able to run things
entirely out of the "local" filesystem.  Absolute pathnames may be
required for input and output directories, but that's a bug that we can fix.

Doug

Re: Hadoop MapReduce: using NFS as the filesystem

Jeff Ritchie
I was having some success with PVFS2.  The jobtracker and tasktrackers
were set up to use the 'local' file system.

mapred.local.dir was on the hard drive of each machine, e.g. /tmp/hadoop.
mapred.system.dir was on the PVFS2 mount, with the same path for all
tasktrackers and the jobtracker, e.g. /mnt/pvfs2/hadoop/system.
mapred.temp.dir was on the PVFS2 mount, with the same path for all
tasktrackers and the jobtracker, e.g. /mnt/pvfs2/hadoop/temp.
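
A minimal hadoop-site.xml sketch of those settings (using the example
paths above; adjust the mount points to your own setup) might be:

    <property>
        <name>fs.default.name</name>
        <value>local</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/tmp/hadoop</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/mnt/pvfs2/hadoop/system</value>
    </property>
    <property>
        <name>mapred.temp.dir</name>
        <value>/mnt/pvfs2/hadoop/temp</value>
    </property>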

It worked out pretty well except for the performance of the PVFS2
cluster.  When I decided to switch to the Hadoop DFS I noticed that
things were more stable (the tasktrackers stopped timing out) and that
my reduce tasks completed more quickly.

There may have been some things I could have done to the storage cluster
to increase its performance, but I decided it was quicker to try out the
Hadoop DFS.

Jeff

RE: Hadoop MapReduce: using NFS as the filesystem

Jon Blower
In reply to this post by Doug Cutting

Just to be clear - does this mean that I don't have to run DFS at all, and I
can get all input data from (and write all output data to) an NFS drive?
DFS is unnecessary for my particular app (unless it brings other benefits
that I'm not aware of).

Jon

Re: Hadoop MapReduce: using NFS as the filesystem

Stefan Groschupf-2
In reply to this post by Doug Cutting
Doug,
thanks for the clarification; I missed that.
Having absolute path names would be great, but if I remember correctly
this was already discussed and planned on the Hadoop developer
mailing list.

Stefan

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com


Re: Hadoop MapReduce: using NFS as the filesystem

Raghavendra Prabhu
In reply to this post by Jon Blower
Hi Jon

The thing is, when you mount a filesystem and the mapred directory is
present on the mounted space, Hadoop will write to that folder, so the
writes effectively go over the network.

So you can have this mounted on the tasktrackers, I guess.

Am I right, guys?

The DFS manages content without needing a shared filesystem in place; it
indirectly mimics a networked file system on top of your existing one.

Hope that answers your question. Please correct me if I am wrong.

Rgds
Prabhu


RE: Hadoop MapReduce: using NFS as the filesystem

Jon Blower
Hi,

I got MapReduce working with an NFS filesystem pretty easily in the end
(thanks to Jeff Ritchie) by having the following properties set in
hadoop-site.xml:

fs.default.name = local (doesn't use DFS at all)
mapred.local.dir = /tmp/hadoop/mapred/local (on the local hard disk of the
workers)
mapred.system.dir = /nfs-share/hadoop/tmp/mapred/system (on the NFS disk)
mapred.temp.dir = /nfs-share/hadoop/tmp/temp
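
In hadoop-site.xml form, that is (a sketch with the paths above;
substitute your own NFS mount):

    <property>
        <name>fs.default.name</name>
        <value>local</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/tmp/hadoop/mapred/local</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/nfs-share/hadoop/tmp/mapred/system</value>
    </property>
    <property>
        <name>mapred.temp.dir</name>
        <value>/nfs-share/hadoop/tmp/temp</value>
    </property>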

As Doug pointed out, you need to specify *full paths* to input and output
data, e.g.:

bin/hadoop org.apache.hadoop.examples.WordCount /nfs-share/hadoop/in
/nfs-share/hadoop/out

The WordCount example works fine but the Grep example does not.  I think
this is because the Grep example runs two jobs (a grep job and a sort job).
The grep job works fine (provided that full paths are specified for input
and output data) but the sort job does not - I think this is because the
system does not specify full paths for the files for the sort job.  Would it
be easy to fix this?
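
For reference, a full-path invocation of the Grep example would look
something like the following (I'm assuming the class sits alongside
WordCount in org.apache.hadoop.examples, and the trailing regular
expression is only a placeholder; check the example's usage message for
the exact arguments):

    bin/hadoop org.apache.hadoop.examples.Grep /nfs-share/hadoop/in \
        /nfs-share/hadoop/out 'dfs[a-z.]+'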

Thanks,
Jon

P.S. Should this conversation be moved to the hadoop mailing list?

--------------------------------------------------------------
Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
Technical Director         Tel: +44 118 378 8741 (ESSC)
Reading e-Science Centre   Fax: +44 118 378 6413
ESSC                       Email: [hidden email]
University of Reading
3 Earley Gate
Reading RG6 6AL, UK
--------------------------------------------------------------  
