What is the meaning of this exception ?`


What is the meaning of this exception ?`

alakshman
I see a whole lot of this in my namenode log. Is this benign, or is there
something more sinister going on? Why would this be happening?

2007-06-21 00:16:24,782 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 9000, call rollEditLog() from 10.16.159.101:48099: error:
java.io.IOException: Attempt to roll edit log but edits.new exists
java.io.IOException: Attempt to roll edit log but edits.new exists
        at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:467)
        at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java
:3239)
        at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:544)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:341)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:573)
2007-06-21 00:21:24,818 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit
Log

Thanks
Avinash

Re: What is the meaning of this exception ?`

Erdong (Roger) CHEN
I also see this error. My guess is that it takes some time to initialize.
After you start Hadoop, wait a few minutes before running your jobs.

Erdong


RE: What is the meaning of this exception ?`

Dhruba Borthakur-2
In reply to this post by alakshman
This message gets printed if a previous attempt to merge the transaction log
with the fsimage fails.

The SecondaryNameNode periodically tries to merge the fsimage and the
fsedits file and creates a new fsimage. If this process fails in the middle
(for whatever reason), the NameNode is left with both fsedits and fsedits.new.
Future attempts by the SecondaryNameNode to merge fsimage and fsedits then
print the message that you are seeing.

The log message is not harmful per se, but if you want to get rid of it, you
would have to restart your NameNode.

Thanks,
dhruba
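For illustration, the guard that produces this exception can be sketched as follows. This is a simplified stand-in, not the actual FSEditLog source; the directory handling is illustrative:

```java
import java.io.File;
import java.io.IOException;

// Illustrative sketch (not the real Hadoop code): a roll refuses to start
// if a previous checkpoint attempt left an edits.new file behind, which is
// why every later roll attempt logs the same IOException.
public class EditLogRollSketch {
    // Returns the new edits file the roll would start writing to.
    static File rollEditLog(File dir) throws IOException {
        File editsNew = new File(dir, "edits.new");
        if (editsNew.exists()) {
            throw new IOException("Attempt to roll edit log but edits.new exists");
        }
        return editsNew;
    }
}
```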



Examples of chained MapReduce?

James Kennedy-6
I was wondering if anyone knows of any examples of truly chained, truly
distributed MapReduce jobs.

So far I've had trouble finding examples of MapReduce jobs that are
kicked off by some one-time process and that in turn kick off other
MapReduce jobs long after the initial driver process is dead. This
would be more distributed and fault tolerant, since it removes the
dependency on a driver process.

I looked at the Nutch crawl code, for example, which iteratively builds up
a url db using successive MapReduces up to a certain depth. But this is
all done from within a for loop in a single process, even though each
individual MapReduce is distributed.

Also, I notice that both Google's and Hadoop's examples of the distributed
sort fail to deal with the fact that the result is multiple sorted
files... this isn't a complete sort, since the output files still need to
be merge-sorted, don't they? To complete the algorithm, could the
Reducer kick off a subsequent merge-sort MapReduce on the result files?
Or maybe there's something I'm not understanding...
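The merge step the question alludes to is a standard k-way merge; a minimal in-memory sketch, with lists standing in for the sorted part files (part-00000, part-00001, ...):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal k-way merge sketch: each input list stands in for one sorted
// reducer output file.
public class KWayMerge {
    public static List<String> merge(List<List<String>> sortedParts) {
        // Heap entries: {partIndex, offsetWithinPart}, ordered by current key.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> sortedParts.get(a[0]).get(a[1])
                          .compareTo(sortedParts.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedParts.size(); i++) {
            if (!sortedParts.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(sortedParts.get(top[0]).get(top[1]));
            // Advance within the part this key came from.
            if (top[1] + 1 < sortedParts.get(top[0]).size()) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```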

Re: Examples of chained MapReduce?

Doug Cutting
James Kennedy wrote:
> So far what I've had trouble finding examples of MapReduce jobs that are
> kicked-off by some one time process that in turn kick off other
> MapReduce jobs long after the initial driver process is dead.  This
> would be more distributed and fault tolerant since it removes dependency
> on a driver process.

Yes, but it wouldn't be that much more fault tolerant.  The biggest
cause of failures isn't any particular node failing, but that some node
among many fails.  A driver program only fails if the particular node
running it fails.  If the MTBF of a particular node is ~1 year, that's
probably okay for a driver program, since driver programs only need to
run for hours or days at most.  However, it's a problem if you have 1000
nodes, see 3+ failures on average per day, and require that all nodes
stay up for the duration of a job.

> Also, I notice that both Google and Hadoop's example of the distributed
> sort fails to deal with the fact that the result is multiple sorted
> files... this isn't a complete sort since the output files still need to
> be merge-sorted don't they?  To complete the algorithm, could the
> Reducer kick of a subsequent merge sort MapReduce on the result files?  
> Or maybe there's something I'm not understanding...

Yes, MapReduce doesn't actually do a full sort.  It produces a set of
sorted partitions.  Sometimes the partition function can arrange things
so that this is in fact a full sort, but frequently it is just a hash
function.  Google mentions this in the original MapReduce paper:

    We guarantee that within a given partition, the intermediate
    key/value pairs are processed in increasing key order.
    This ordering guarantee makes it easy to generate
    a sorted output file per partition, which is useful when
    the output file format needs to support efficient random
    access lookups by key, or users of the output find it convenient
    to have the data sorted. (from page 6)

and

    Our partitioning function for this benchmark has builtin
    knowledge of the distribution of keys. In a general
    sorting program, we would add a pre-pass MapReduce
    operation that would collect a sample of the keys and
    use the distribution of the sampled keys to compute splitpoints
    for the final sorting pass. (from page 9)

Doug
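The sampling pre-pass described in the second quote can be sketched as follows; the class names and the even-spaced sampling scheme here are illustrative, not Google's or Hadoop's actual code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the "pre-pass": sort a sample of the keys and pick
// numPartitions-1 split points, so that range-partitioning by those points
// makes the concatenation of per-partition sorted outputs a total sort.
public class SplitPointSampler {
    public static List<String> splitPoints(List<String> sampledKeys, int numPartitions) {
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            // Evenly spaced quantiles of the sample.
            splits.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return splits;
    }

    // Range partitioner: a key goes to the first partition whose split
    // point is greater than the key.
    public static int partition(String key, List<String> splits) {
        for (int i = 0; i < splits.size(); i++) {
            if (key.compareTo(splits.get(i)) < 0) return i;
        }
        return splits.size();
    }
}
```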

RE: Examples of chained MapReduce?

Devaraj Das
In reply to this post by James Kennedy-6

By the way, in Hadoop the limitation (if you want to call it that) is that
each reducer copies all of its map outputs to the local disk(s) of the node
where it is running (for merging and, later on, reducing). So if you have just
one reducer and plenty of map output, the node running the reducer might just
run out of disk space. Imagine a case where you are sorting 10TB of data and
generating one sorted output file.
Another point worth noting is that having more reducers makes the final
output generation more parallel, since multiple reducers run at the same
time and each operates on a smaller chunk of the map outputs.
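In the mapred API of that era, the reducer count is a job configuration setting; a hedged config fragment (assuming the old org.apache.hadoop.mapred.JobConf API):

```java
import org.apache.hadoop.mapred.JobConf;

// Config fragment: more reducers means more sorted part files, but no
// single node has to hold (or merge) all of the map output.
JobConf conf = new JobConf();
conf.setNumReduceTasks(16);  // 16 sorted part files instead of 1
```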



Re: What is the meaning of this exception ?`

Konstantin Shvachko
In reply to this post by Dhruba Borthakur-2
This is related to HADOOP-1076.
--Konstantin



Re: Examples of chained MapReduce?

James Kennedy-6
In reply to this post by Doug Cutting
Ah, ok. I think I did read about that partition function in the Google
paper, but I hadn't quite put it together yet.  So a pre-pass can make the
partition function work such that the final output is fully sorted.  So
no further MapReduce is necessary, and that was a bad example.

But back to my original question... Doug suggests that dependence on a
driver process is acceptable.  But has anyone needed true MapReduce
chaining or tried it successfully?  Or is it generally accepted that a
multi-MapReduce algorithm should always be driven by a single process?



Re: Examples of chained MapReduce?

Andrzej Białecki-2
James Kennedy wrote:
> But back to my original question... Doug suggests that dependence on a
> driver process is acceptable.  But has anyone needed true MapReduce
> chaining or tried it successfully?  Or is it generally accepted that a
> multi-MapReduce algorithm should always be driven by a single process?

I would argue that this functionality is outside the scope of Hadoop. As
far as I understand your question, you need orchestration, which
involves the ability to record the state of previously executed map-reduce
jobs, and to start the next map-reduce jobs based on that state,
possibly a long time after the first job completes and from a different
process.

I'm frequently facing this problem, and so far I've been using a
poor-man's workflow system consisting of a bunch of cron jobs, shell
scripts, and simple marker files to record the current state of the data.
In a similar way you can implement advisory application-level locking,
using lock files.

Example: adding a new batch of pages to a Nutch index involves many
steps, starting with fetchlist generation, fetching, parsing, updating
the db, extraction of link information, and indexing. Each of these
steps consists of one (or several) map-reduce jobs, and the input to the
next jobs depends on the output of previous jobs. What you referred to
in your previous email was a single-app driver for this workflow, called
Crawl. But I'm using the slightly modified individual tools, which on
successful completion create marker files (e.g. fetching.done). Other
tools check for the existence of these files, and either perform their
function or exit (if I want to run updatedb from a segment that is
fetched but not parsed).
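The marker-file convention described above can be sketched as follows. Apart from fetching.done, the names here are illustrative rather than Nutch's actual ones:

```java
import java.io.File;
import java.io.IOException;

// Sketch of the marker-file workflow: a step runs only if its
// predecessor's ".done" marker exists, and drops its own marker on success.
public class MarkerFileStep {
    public static boolean runStep(File segmentDir, String prereq, String step,
                                  Runnable work) throws IOException {
        if (prereq != null && !new File(segmentDir, prereq + ".done").exists()) {
            return false;  // predecessor not finished: exit quietly
        }
        work.run();        // the actual map-reduce job(s) would run here
        new File(segmentDir, step + ".done").createNewFile();
        return true;
    }
}
```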

To summarize this long answer - I think that this functionality belongs
in the application layer built on top of Hadoop, and IMHO we are better
off not implementing it in the Hadoop proper.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Examples of chained MapReduce?

Ted Dunning-3

We are still evaluating Hadoop for use in our main-line analysis systems,
but we already have the problem of workflow scheduling.

Our solution for that was to implement a simpler version of Amazon's Simple
Queue Service.  This allows us to have multiple redundant workers for some
tasks, or to throttle a task down for others.

The basic idea is that queues contain XML tasks.  Tasks are read from the
queue by workers, but are kept in a holding pen for a queue specific time
period after they are read.  If the task completes normally, the worker will
delete the task, but if the timeout expires before the worker completes the
task, it is added back to the queue.

Workers are structured as a triple of scripts that are executed by a manager
process.  These are: a pre-condition that can determine if any work should be
done (usually this is a check for available local disk space or available
CPU cycles), an item qualification (this is done with a particular item, in
case the work is subject to resource reservation), and a worker script.

Even this tiny little framework suffices for quite complex workflows and
work constraints.  It would be very easy to schedule map-reduce tasks via a
similar mechanism.
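The queue semantics described (a read hides the task in a holding pen for a timeout; a delete completes it; an expired timeout puts it back on the queue) can be sketched as a minimal single-process version; this is illustrative, not the actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Minimal visibility-timeout queue sketch. Time is passed in explicitly so
// the behavior is deterministic.
public class HoldingPenQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private final Map<String, Long> holdingPen = new HashMap<>(); // task -> deadline
    private final long timeoutMillis;

    public HoldingPenQueue(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public void put(String task) { queue.addLast(task); }

    // A worker reads a task; it stays invisible until its deadline passes.
    public String read(long now) {
        requeueExpired(now);
        String task = queue.pollFirst();
        if (task != null) holdingPen.put(task, now + timeoutMillis);
        return task;
    }

    // A worker that finished normally deletes the task for good.
    public boolean delete(String task) { return holdingPen.remove(task) != null; }

    private void requeueExpired(long now) {
        holdingPen.entrySet().removeIf(e -> {
            if (e.getValue() <= now) { queue.addLast(e.getKey()); return true; }
            return false;
        });
    }
}
```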



RE: Examples of chained MapReduce?

Devaraj Das
I haven't confirmed this, but I vaguely remember that the resource schedulers
(Torque, Condor) provide a feature that lets one submit a DAG of jobs. The
resource manager doesn't invoke a node in the DAG unless all nodes pointing
to it have successfully finished (or something like that), and the resource
scheduler framework does the bookkeeping to take care of failed jobs.
In Hadoop there is an effort, "Integration of Hadoop with batch schedulers":
https://issues.apache.org/jira/browse/HADOOP-719
I am not sure whether it handles the use case where one submits a chain of
jobs, but I think it potentially can.
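The DAG behavior described (invoke a job only when everything it depends on has finished) can be sketched as follows; the job names in the example are illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a DAG job runner: each job runs only after all jobs it depends
// on have completed; a cycle means no job can ever become runnable.
public class DagRunner {
    // deps maps each job to the list of jobs it depends on.
    public static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> finished = new ArrayList<>();
        Set<String> done = new HashSet<>();
        while (done.size() < deps.size()) {
            boolean progress = false;
            for (Map.Entry<String, List<String>> e : deps.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    finished.add(e.getKey());  // submit the map-reduce job here
                    done.add(e.getKey());
                    progress = true;
                }
            }
            if (!progress) throw new IllegalStateException("cycle in job graph");
        }
        return finished;
    }
}
```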




Re: Examples of chained MapReduce?

James Kennedy-6
Ok guys, thank you for your responses.
