Configuration and Hadoop cluster setup

Configuration and Hadoop cluster setup

alakshman
I am trying to run Hadoop on a cluster of 3 nodes. The namenode and the
jobtracker web UIs work. I have the namenode running on node A and the job
tracker running on node B. Is it true that the namenode and jobtracker cannot
run on the same box? Also, if I want to run the examples on the cluster, is
there anything special that needs to be done? When I run the example
WordCount on a machine C (which is a task tracker and not a job tracker), the
LocalJobRunner is invoked all the time. I am guessing this means that the
map tasks are running locally. How can I distribute this on the cluster?
Please advise.

Thanks
Avinash

Re: Configuration and Hadoop cluster setup

Dennis Kubes


Phantom wrote:
> I am trying to run Hadoop on a cluster of 3 nodes. The namenode and the
> jobtracker web UI work. I have the namenode running on node A and job
> tracker running on node B. Is it true that namenode and jobtracker cannot
> run on the same box ?

The namenode and the jobtracker can most definitely run on the same box.
As far as I know, this is the preferred configuration.

> Also if I want to run the examples on the cluster is
> there anything special that needs to be done. When I run the example
> WordCount on a machine C (which is a task tracker and not a job tracker)
> the
> LocalJobRunner is invoked all the time. I am guessing this means that the
> map tasks are running locally. How can I distribute this on the cluster ?
> Please advice.

Are the conf files on machine C the same as on the namenode/jobtracker?
Are they pointing to the namenode and jobtracker, or are they pointing to
local in the hadoop-site.xml file?  Also, we have found it easier
(although not necessarily better) to start tasks on the namenode server.

It would be helpful to have more information about what is happening and
about your setup, as that would help me and others on the list debug what
may be occurring.

Dennis Kubes

>
> Thanks
> Avinash
>

Re: Configuration and Hadoop cluster setup

alakshman
Yes, the files are the same and I am starting the tasks on the namenode
server. I also figured out what my problem was with respect to not being able
to start the namenode and job tracker on the same machine: I had to reformat
the file system. But all this still doesn't cause the WordCount sample
to run in a distributed fashion. I can tell this because the LocalJobRunner
is being used. Do I need to specify the config file to the running instance
of the program? If so, how do I do that?

Thanks
A


RE: Configuration and Hadoop cluster setup

Mahadev Konar
Hi,
  When you run the job, you need to set the environment variable
HADOOP_CONF_DIR to the configuration directory that has the configuration
file pointing to the right jobtracker.

Regards
Mahadev


RE: Configuration and Hadoop cluster setup

Vishal Shah-3
Hi Avinash,

  Can you share your hadoop-site.xml, mapred-default.xml and slaves files?
Most probably, you have not set the jobtracker properly in the
hadoop-site.xml conf file. Check the mapred.job.tracker property in
your file. It should look something like this:

<property>
  <name>mapred.job.tracker</name>
  <value>fully.qualified.domainname:40000</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

-vishal.


Re: Configuration and Hadoop cluster setup

alakshman
In reply to this post by Mahadev Konar
I tried this. Before running the WordCount sample I did an export
HADOOP_CONF_DIR=<my conf dir>. It doesn't seem to help; I still see the
LocalJobRunner being used.

Thanks
Avinash


Re: Configuration and Hadoop cluster setup

alakshman
In reply to this post by Vishal Shah-3
Here is a copy of my hadoop-site.xml. What am I doing wrong?

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>dev030.sctm.com:9000</value>
        </property>

        <property>
                <name>dfs.name.dir</name>
                <value>/tmp/hadoop</value>
        </property>

        <property>
                <name>mapred.job.tracker</name>
                <value>dev030.sctm.com:50029</value>
        </property>

        <property>
                <name>mapred.job.tracker.info.port</name>
                <value>50030</value>
        </property>

        <property>
                <name>mapred.min.split.size</name>
                <value>65536</value>
        </property>

        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>

</configuration>



RE: Configuration and Hadoop cluster setup

Hairong Kuang
Have you tried Mahadev's suggestion? You need to set HADOOP_CONF_DIR to
the directory in which your hadoop-site.xml is located, or use
hadoop --config <conf_dir> to submit your job.

Hairong


Re: Configuration and Hadoop cluster setup

alakshman
In reply to this post by alakshman
At last I managed to get this working the way I wanted. I had to modify the
sample to set the property explicitly: I did
jobConf.set("mapred.job.tracker", "<host:port>").
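
For reference, the change amounted to roughly the following in the WordCount
driver (just a sketch; the host:port values are the ones from my
hadoop-site.xml, and the fs.default.name line is only needed if the default
filesystem isn't being picked up from the conf either):

    import org.apache.hadoop.examples.WordCount;
    import org.apache.hadoop.mapred.JobConf;

    // Force the cluster settings on the JobConf instead of relying on
    // hadoop-site.xml being found on the classpath.
    JobConf jobConf = new JobConf(WordCount.class);
    jobConf.set("mapred.job.tracker", "dev030.sctm.com:50029");
    jobConf.set("fs.default.name", "dev030.sctm.com:9000");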

If my map job is going to process a file, does it have to be in HDFS, and if
so, how do I get it there? Is there any resource I can read to get a better
understanding?

Thanks
Avinash


Re: Configuration and Hadoop cluster setup

Doug Cutting
Phantom wrote:
> If my Map job is going to process a file does it have to be in HDFS

No, but they usually are.  Job inputs are resolved relative to the
default filesystem.  So, if you've configured the default filesystem to
be HDFS, and you pass a filename that's not qualified by a filesystem as
the input to your job, then your input should be in HDFS.

But inputs don't have to be in the default filesystem nor must they be
in HDFS.  They need to be in a filesystem that's available to all nodes.
They could be in NFS, S3, or Ceph instead of HDFS.  They could even be
in a non-default HDFS system.
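
For example (the host and paths below are only illustrative):

    import org.apache.hadoop.fs.Path;

    // A scheme-less path is resolved against the default filesystem.
    Path unqualified = new Path("/user/alakshman/input/test2.dat");
    // A fully qualified path names the filesystem explicitly, whatever
    // the default filesystem is.
    Path qualified = new Path("hdfs://dev030.sctm.com:9000/user/alakshman/input/test2.dat");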

> and if so how do I get it there ?

If HDFS is configured as your default filesystem:

   bin/hadoop fs -put localFileName nameInHdfs
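
If you would rather do the copy from Java, a rough equivalent (the
destination path below is just a placeholder) is:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local file into whatever filesystem fs.default.name points at.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/home/alakshman/test2.dat"),
                         new Path("/user/alakshman/input/test2.dat"));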

Doug
Reply | Threaded
Open this post in threaded view
|

Re: Configuration and Hadoop cluster setup

Avinash Lakshman-2
I am trying to run the WordCount sample against a file on my local file
system. So I kick start my program as
"java -D/home/alakshman/hadoop-0.12.3/conf org.apache.hadoop.examples.WordCount -m 10 -r 4 ~/test2.dat /tmp/out-dir".
When I run this, I get the following in the jobtracker log file (what should
I be doing to fix this):

2007-05-25 14:41:32,733 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_m_000000_3: java.lang.IllegalArgumentException: Wrong FS: file:/home/alakshman/test2.dat, expected: hdfs://dev030.sctm.facebook.com:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:216)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.getPath(DistributedFileSystem.java:110)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
        at org.apache.hadoop.fs.FilterFileSystem.exists(FilterFileSystem.java:168)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:331)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)




Re: Configuration and Hadoop cluster setup

yu-yang chen
In reply to this post by alakshman
I think you have not included your nodes A, B, and C in your conf/slaves
file; that may be why.
Your hadoop-site.xml seems OK to me.

yu-yang


Re: Configuration and Hadoop cluster setup

Koji Noguchi
In reply to this post by Doug Cutting
Doug,

I may be wrong, but last time I tried (on 0.12.3), MapReduce didn't work
with a non-default filesystem as input.
(Output worked fine.)

https://issues.apache.org/jira/browse/HADOOP-71
https://issues.apache.org/jira/browse/HADOOP-1107

Mine failed with "org.apache.hadoop.mapred.InvalidInputException: Input
path does not exist".
It basically checked the default filesystem instead of the one passed in.

Koji


Doug Cutting wrote:
> But inputs don't have to be in the default filesystem nor must they be
> in HDFS.  They need to be in a filesystem that's available to all
> nodes.  They could be in NFS, S3, or Ceph instead of HDFS.  They could
> even be in a non-default HDFS system.


Re: Configuration and Hadoop cluster setup

Dennis Kubes
In reply to this post by alakshman
I don't know if this will make a difference or not:

    <property>
                 <name>fs.default.name</name>
                 <value> dev030.sctm.com:9000</value>
         </property>

         <property>
                 <name>mapred.job.tracker</name>
                 <value> dev030.sctm.com:50029 </value>
         </property>

Your fs.default.name and mapred.job.tracker variables both seem to have
spaces (or an unprintable character) in front of the values.  Can you
try removing these and seeing if the WordCount works correctly?

Dennis Kubes


Re: Configuration and Hadoop cluster setup

alakshman
In reply to this post by Koji Noguchi
Is there a workaround? I want to run the WordCount sample against a file on
my local filesystem. If this is not possible, do I need to put my file into
HDFS and then point my program at that location?

Thanks
Avinash


Re: Configuration and Hadoop cluster setup

Doug Cutting
In reply to this post by Koji Noguchi
Koji Noguchi wrote:
> I may be wrong, but last time I tried (on 0.12.3), MapRed didn't work
> for non-default filesystem as an input.
> (output worked fine.)
>
> https://issues.apache.org/jira/browse/HADOOP-71
> https://issues.apache.org/jira/browse/HADOOP-1107

You're probably right.  That is a bug.  It's partly fixed by:

https://issues.apache.org/jira/browse/HADOOP-1226

This causes all paths from DFS to be fully qualified, fixing
HADOOP-1107, I think.  The SequenceFile bug may still be outstanding.
We should try to fix that too for 0.14.

Doug


Re: Configuration and Hadoop cluster setup

Doug Cutting
In reply to this post by alakshman
Phantom wrote:
> Is there a workaround ? I want to run the WordCount sample against a
> file on
> my local filesystem. If this is not possible do I need to put my file into
> HDFS and then point that location to my program ?

Is your local filesystem accessible to all nodes in your system?

Doug

Re: Configuration and Hadoop cluster setup

alakshman
Yes it is.

Thanks
A



Re: Configuration and Hadoop cluster setup

alakshman
Either I am totally confused or this configuration stuff is confusing the
hell out of me. I am pretty sure it is the former. I am looking for advice
here as to how I should do this. I have my fs.default.name set to
hdfs://<host>:<port>. In my JobConf setup I set the same value for
fs.default.name. Now I have two options, and I would appreciate it if some
expert could tell me which option I should take and why.

(1) Set my fs.default.name to hdfs://<host>:<port> and also specify it
in the JobConf configuration. Copy my sample input file into HDFS using
"bin/hadoop fs -put" from my local file system. I then need to specify this
file to my WordCount sample as input. Should I specify this file with the
hdfs:// prefix?

(2) Set my fs.default.name to file://<host>:<port> and also specify it
in the JobConf configuration. Just specify the input path to the WordCount
sample, and everything should work if the path is available to all machines
in the cluster?

Which way should I go?

Thanks
Avinash


RE: Configuration and Hadoop cluster setup

Mahadev Konar
Hi Avinash,
  The way MapReduce works in a distributed environment is:
1) Set up the cluster in distributed fashion as specified in the wiki:
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
2) Run MapReduce jobs with the command:
bin/hadoop jar job.jar
Before doing this you need to set the HADOOP_CONF_DIR env variable pointing
to the conf directory that contains the distributed configuration.

The input files need to be uploaded to HDFS first, and then in your JobConf
you need to call job.setInputPath(tempDir), where tempDir is the input
directory for the MapReduce job and the directory where you uploaded the
files. You can take a look at the examples in the Hadoop examples directory
for this.
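
A rough sketch of what the driver ends up looking like (the paths are
placeholders, and the mapper/reducer/output-type setup is omitted; see the
WordCount example for those details):

    import org.apache.hadoop.examples.WordCount;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Launched via bin/hadoop with HADOOP_CONF_DIR set, so hadoop-site.xml
    // (and with it mapred.job.tracker and fs.default.name) is picked up from
    // the classpath and points at the real cluster rather than "local".
    JobConf job = new JobConf(WordCount.class);
    job.setJobName("wordcount");

    // tempDir is the HDFS directory the input was uploaded to, e.g. with:
    //   bin/hadoop fs -put ~/test2.dat /user/alakshman/input
    Path tempDir = new Path("/user/alakshman/input");
    job.setInputPath(tempDir);
    job.setOutputPath(new Path("/tmp/out-dir"));

    JobClient.runJob(job);
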
Hope this helps.

Regards
Mahadev
