record termination and MapReduce

record termination and MapReduce

Toby DiPasquale
Hi all,

I have a question about the MapReduce and NDFS implementations. When
writing records into an NDFS file, how does one make sure that records
terminate cleanly on block boundaries such that a Map job's input does not
span multiple physical blocks?

It also appears as if NDFS does not have an explicit "record append"
operation. Is this the case?

--
Toby DiPasquale
Senior Software Engineer
Symantec Corporation

Re: record termination and MapReduce

Doug Cutting
Toby DiPasquale wrote:
> I have a question about the MapReduce and NDFS implementations. When
> writing records into an NDFS file, how does one make sure that records
> terminate cleanly on block boundaries such that a Map job's input does not
> span multiple physical blocks?

We do not currently guarantee that.  A task's input may span multiple
blocks.  We try to split the input into block-sized chunks, but the last
few records of a split (up to the first sync mark past the split point)
may lie in the next block.  So a small amount of i/o happens over the
network, but the vast majority stays local.
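
Roughly, a record reader applies that rule as sketched below.  This uses
the current Hadoop SequenceFile API (org.apache.hadoop.io.SequenceFile);
the Nutch-era class names differed, so treat the names as illustrative
rather than as the actual implementation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SplitReaderSketch {

  // Read the records that belong to one block-sized split
  // [start, start + length) of a SequenceFile.
  static void readSplit(Configuration conf, Path file, long start, long length)
      throws IOException {
    long end = start + length;
    FileSystem fs = file.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      // Skip ahead to the first sync mark at or after the split start;
      // records before that point belong to the previous split.
      if (start > 0) {
        reader.sync(start);
      }
      LongWritable key = new LongWritable();
      Text value = new Text();
      while (true) {
        long pos = reader.getPosition();
        if (!reader.next(key, value)) {
          break;                               // end of file
        }
        // Once we have read past 'end' and crossed a sync mark, the
        // record just read belongs to the next split, so stop without
        // processing it.  The trailing records between 'end' and that
        // sync mark may physically live in the next block, which is
        // where the small amount of network i/o comes from.
        if (pos >= end && reader.syncSeen()) {
          break;
        }
        // process(key, value) ...
      }
    } finally {
      reader.close();
    }
  }
}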

> It also appears as if NDFS does not have an explicit "record append"
> operation. Is this the case?

Yes.  DFS is currently write-once.
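
That is, a file is created, written, and closed in one pass; there is no
call to reopen it and append further records.  A minimal sketch against
the current Hadoop FileSystem API (the path and class names here are
illustrative, not the NDFS-era client API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/records.dat");   // hypothetical path

    // A DFS file is written exactly once: create, write, close.
    FSDataOutputStream out = fs.create(file);
    try {
      out.writeBytes("record-1\n");
      out.writeBytes("record-2\n");
    } finally {
      out.close();
    }

    // There is no append operation; to add records you must rewrite
    // the file, or write the new records to a separate file.
  }
}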

Please note that the MapReduce and DFS code has moved from Nutch to the
Hadoop project.  Such questions are more appropriately asked there.

Doug