Global information in mapreduce

Global information in mapreduce

Ilya Vishnevsky
Hello! My question is about MapReduce. Is it possible to pass some global
information to the map function? For example, I have a set of words and
a large set of documents. I want the map function to receive each document
as its value and emit a (word, frequency) pair for each word in the set,
where "frequency" is the frequency of that word in the document. To do
this, the map function needs access to the set of words each time it runs.
Is that possible?

Re: Global information in mapreduce

Alejandro Abdelnur-2
You could write your word set to a file in DFS, somewhere outside
the input directory, and read it at map initialization time (within the
configure() method). You could pass the path to the file as a
configuration property.
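
A minimal sketch of that idea, in plain Java so it runs without a Hadoop
installation. The class WordSetCounter and its methods are hypothetical
names, not part of Hadoop. The assumption is that configure() would open
the DFS file named by the configuration property and pass the stream to
load(), and that map() would call count() on each document and emit the
resulting pairs.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: holds the global word set and counts its words in a document. */
final class WordSetCounter {
    private final Set<String> words = new HashSet<>();

    /**
     * Reads one word per line. In a mapper this would run once per task,
     * from configure(), with a reader opened on the word-set file.
     */
    void load(Reader source) {
        new BufferedReader(source).lines()
                .map(line -> line.trim().toLowerCase())
                .filter(word -> !word.isEmpty())
                .forEach(words::add);
    }

    /**
     * Counts how often each word of the set occurs in one document.
     * This is the per-record work that map() would do before emitting.
     * Words of the set that never occur are omitted rather than reported as 0.
     */
    Map<String, Integer> count(String document) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (words.contains(token)) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }
}
```

For example, after loading the words hadoop, map and reduce, calling
count("Map tasks feed reduce tasks; map output is sorted.") gives
map=2 and reduce=1. Loading in configure() rather than in map() means the
file is read once per task instead of once per document.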

HTH

Alejandro

On 3/19/07, Ilya Vishnevsky <[hidden email]> wrote:
> Hello! My question is about MapReduce. Is it possible to pass some global
> information to the map function? For example, I have a set of words and
> a large set of documents. I want the map function to receive each document
> as its value and emit a (word, frequency) pair for each word in the set,
> where "frequency" is the frequency of that word in the document. To do
> this, the map function needs access to the set of words each time it runs.
> Is that possible?
>
>

Re: Global information in mapreduce

Owen O'Malley-5

On Mar 19, 2007, at 10:08 PM, Alejandro Abdelnur wrote:

> You could write your word set to a file in DFS, somewhere outside
> the input directory, and read it at map initialization time (within the
> configure() method). You could pass the path to the file as a
> configuration property.

On a side note, if the files are large (or the maps short), it can
make sense to use the local file cache. See
org.apache.hadoop.filecache.DistributedCache; in particular, look at
setCacheFiles. Basically, you configure it with a URL, and the task
tracker copies the file down and caches it locally.
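
A rough sketch of the task side of this, again in plain Java so it runs
without Hadoop. Once the file is cached, the task sees it as an ordinary
local file, so reading it needs no DFS access. LocalWordSet is a
hypothetical helper, and the Hadoop calls mentioned in the comments are
how I would expect the wiring to look, not something verified against a
particular release.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: loads a word set from a file already on local disk. */
final class LocalWordSet {
    // Assumed wiring (exact signatures vary by Hadoop version):
    //   job setup: DistributedCache.setCacheFiles(uris, conf) registers the DFS file
    //   task side: DistributedCache.getLocalCacheFiles(conf) returns the local copies
    // The path handed to read() would be one of those local copies.
    static Set<String> read(Path localCopy) {
        try {
            Set<String> words = new HashSet<>();
            for (String line : Files.readAllLines(localCopy, StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            return words;
        } catch (IOException e) {
            // Fail the task early if the cached copy cannot be read.
            throw new UncheckedIOException(e);
        }
    }
}
```

The benefit over reading from DFS in every task is that the copy is made
once per node and reused by the tasks that run there.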

-- Owen