NDFS / map tasks

Byron Miller-2
Could NDFS be easily modified so that the master node
sends each Map task to the data replica/task node that
actually has the data locally, alleviating network
traffic load?
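
To make the question concrete, here is a rough sketch of what I mean by
locality-aware assignment. It is not NDFS code: `assign_map_tasks`, the
block-to-replica map and the per-node slot counts are all made-up names
for illustration. The idea is just that the master prefers a node that
already holds a replica of the block and only falls back to a remote
node when no replica holder has a free slot.

```python
from typing import Dict, List


def assign_map_tasks(block_locations: Dict[str, List[str]],
                     free_slots: Dict[str, int]) -> Dict[str, str]:
    """Map each block to a node, preferring nodes that hold a replica.

    block_locations: block id -> nodes holding a replica (hypothetical input)
    free_slots:      node -> number of free task slots (hypothetical input)
    """
    slots = dict(free_slots)  # work on a copy, leave the caller's dict alone
    assignment: Dict[str, str] = {}

    # Handle blocks with the fewest replicas first, since they have the
    # fewest chances of getting a local slot.
    for block in sorted(block_locations, key=lambda b: len(block_locations[b])):
        local = [n for n in block_locations[block] if slots.get(n, 0) > 0]
        if local:
            # Data-local: pick the replica holder with the most free slots.
            node = max(local, key=lambda n: slots[n])
        else:
            # No replica holder is free: fall back to any node with capacity.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                break  # cluster is full; remaining tasks have to wait
            node = max(remote, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[block] = node
    return assignment
```

With blocks `blk_1` on A and B, `blk_2` on A and `blk_3` on C, and C
having no free slot, only `blk_3` would have to read its data over the
network.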

In a scenario like this the master node could be
set up the way Google does it, so that when the job is
nearing completion it spawns retries of the
existing map tasks on other nodes to try to complete
the job in case certain nodes are failing for whatever
reason (especially if you're processing 64 MB chunks).
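
As a sketch of the retry idea (again hypothetical: `pick_backup_tasks`,
the task dictionaries and the 90% threshold are my own assumptions, not
anything that exists in NDFS or that I know Google uses), the master
could wait until most of the job is done and then schedule a second
copy of each still-running map task on a different idle node:

```python
from typing import Dict, List


def pick_backup_tasks(tasks: Dict[str, dict],
                      idle_nodes: List[str],
                      completion_threshold: float = 0.9) -> Dict[str, str]:
    """Return task id -> node for duplicate attempts of straggling tasks.

    tasks: task id -> {"state": "done" or "running", "node": node name}
    Duplicates are only scheduled once the fraction of finished tasks
    reaches completion_threshold (an assumed tuning knob).
    """
    if not tasks:
        return {}
    done = sum(1 for t in tasks.values() if t["state"] == "done")
    if done / len(tasks) < completion_threshold:
        return {}  # job is not near completion yet

    available = list(idle_nodes)
    backups: Dict[str, str] = {}
    for task_id, task in tasks.items():
        if task["state"] != "running":
            continue
        # Never retry on the node that may be the one that is failing.
        candidates = [n for n in available if n != task["node"]]
        if not candidates:
            continue
        node = candidates[0]
        available.remove(node)
        backups[task_id] = node
    return backups
```

Whichever attempt finishes first wins and the other one is discarded,
so a slow or dying node cannot hold up the end of the job.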

It would also seem that, because of the smaller chunk
size, you could run more tasks even on a single
node. With today's hardware we could impose an NDFS
"file system" container even on a "local" node-based
system, so you could get the benefit of being
aware of multiple local volumes and using them
in your storage definition. Something like this on a
local system with multiple disk drives would try to
use all of the I/O channels, CPUs and such (for
example, a 32-thread Sun T2000 server with
multiple attached disks could process quite a
load in smaller concurrent tasks rather than a few
larger ones).

It appears Google chops things into 64 MB tasks (the
same size as the GFS block size), and doing
something like that in NDFS may make things a
bit quicker to read/write and ease network I/O
throughput (especially if the only traffic is NDFS
replica traffic and updates rather than actual
serial I/O reading remote data).
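
For what it's worth, the splitting itself is trivial. A minimal sketch,
assuming one map task per block-sized chunk (`split_into_chunks` and
`CHUNK_SIZE` are illustrative names, not NDFS identifiers):

```python
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, matching the block size mentioned above


def split_into_chunks(file_size: int,
                      chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes.

    Every chunk is chunk_size bytes except possibly the last one,
    which holds whatever remains.
    """
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]
```

A 200 MB file would come out as three full 64 MB chunks plus one 8 MB
chunk, so four map tasks, each of which could be handed to a node that
holds that block.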