Nutch on a shared filesystem

Nutch on a shared filesystem

rishi pathak
Hi,
          Our setup has two data nodes with 16 cores each. We are trying to
set up Nutch to use a shared local filesystem instead of HDFS. With a single
tasktracker it works fine, but with more than one tasktracker it throws an
error and exits. The error relates to the temp data dir for map/reduce tasks.


# mapred-site.xml:

<configuration>

 <property>
    <name>mapred.job.tracker</name>
    <value>yc1.cn:9001</value>
 </property>

 <property>
    <name>mapred.system.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredSystemDir/</value>
 </property>

 <property>
    <name>mapred.local.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredLocalDir/</value>
    <!--<value>/tmp/</value> -->
 </property>

 <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>
 </property>

 <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>16</value>
 </property>

 <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredClusterLocalDir/</value>
 </property>

</configuration>
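One detail worth noting: the inject temp path in the error below comes from mapred.temp.dir, which defaults to ${hadoop.tmp.dir}/mapred/temp, and hadoop.tmp.dir in turn defaults to the node-local /tmp/hadoop-${user.name}. A hedged sketch of a possible fix is to point that root at the shared area as well, so every tasktracker resolves the same temp path (the directory name below is an assumption; the property normally goes in core-site.xml):

```xml
<!-- Sketch, not a verified fix: redirect Hadoop's temp root from the
     node-local default (/tmp/hadoop-${user.name}) to the shared Lustre
     mount, so all tasktrackers resolve the same inject temp path.
     The exact directory name is an assumption. -->
<property>
   <name>hadoop.tmp.dir</name>
   <value>/home/internal/sysadmin/nazgul/hadoop/tmp</value>
</property>
```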



# Error ########

java.io.IOException: The temporary job-output directory file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary doesn't exist!
        at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:204)
        at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:234)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:433)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Injector: Merging injected urls into crawl db.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)



--
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India
Re: Nutch on a shared filesystem

Alex McLintock
I'm not sure if you can do this (I would recommend HDFS instead of a shared
area), but can you insert the hostname of the node into the temp dir? That
might stop separate nodes from messing up each other's temp areas.

(However, I am guessing here.)
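For what it's worth, Hadoop's XML config does not expand a hostname variable, so acting on this suggestion would likely mean generating per-node paths outside Hadoop, e.g. in a wrapper script run on each node before the tasktracker starts. A rough sketch (the directory root and the substitution step are assumptions, not an established recipe):

```shell
# Sketch: build a per-node local/temp directory keyed on the hostname,
# so tasktrackers sharing one mount never write into the same path.
NODE_TMP="/home/internal/sysadmin/nazgul/hadoop/local/$(hostname)"
mkdir -p "$NODE_TMP"
echo "per-node dir: $NODE_TMP"
# This path would then be substituted into mapred.local.dir (for example,
# via sed on a mapred-site.xml template) before starting the daemon.
```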


On 17 January 2011 08:21, rishi pathak <[hidden email]> wrote:

> # Error ########
>
> java.io.IOException: The temporary job-output directory
> file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
> doesn't exist!
>
>
Re: Nutch on a shared filesystem

rishi pathak
Hello Alex,
                 We have tried the setup with HDFS and it worked fine. The
shared filesystem discussed here is a Lustre parallel filesystem, mounted on
all the compute nodes (tasktrackers). The problem, as it seems to me, is not
different nodes messing each other up, but temp data written by a tasktracker
on one node being accessed from another. The directory
/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary does exist
on the second node.
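One way to double-check where each directory really lives is to compare the filesystems backing them: the inject temp path sits under /tmp, which is usually node-local disk even when the home area is on Lustre. A small sketch, run on each node (paths taken from the config above; only directories that exist are checked):

```shell
# Sketch: print the filesystem backing each relevant directory. If /tmp
# reports a local device while the hadoop dirs report the Lustre mount,
# the inject temp data is being split across node-local disks.
for d in /tmp /home/internal/sysadmin/nazgul/hadoop/dfs/local; do
    if [ -d "$d" ]; then
        df -P "$d" | tail -1
    fi
done
```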


On Mon, Jan 17, 2011 at 1:59 PM, Alex McLintock <[hidden email]> wrote:

> I'm not sure if you can do this (I would recommend HDFS instead of a shared
> area) but can you insert the hostname of the node into the temp dir? That
> might stop separate nodes from messing up each others temp areas.



--
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India