Pairwise similarity using map reduce

Madhav Sharan
Hi Hadoop users,

I have a set of vectors stored in .txt files on HDFS. The goal is to take every pair of vectors and compute the similarity between them.
  1. We generate the pairs of vector paths with a Python script and feed them as input to the MR job. Each input line is a comma-separated pair of paths to vector files: "/path/to/vec1, path/to/vec2".
  2. Each mapper task then gets a (Path1, Path2) pair and computes the similarity.
To do this, the mapper reads the file at Path1 via the HDFS API and then the file at Path2 via the HDFS API. So each file is read many times over because of the pairwise computation.
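For reference, the mapper is roughly equivalent to this simplified sketch (class and helper names are illustrative, and cosine similarity is just a stand-in for our actual metric):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch, not our real code. Each input line is
// "/path/to/vec1, /path/to/vec2"; both vector files are re-opened and
// re-read from HDFS on every map() call, which is why every file ends
// up being read once per pair it appears in.
public class PairSimilarityMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] paths = value.toString().split(",");
    FileSystem fs = FileSystem.get(context.getConfiguration());

    double[] v1 = readVector(fs, new Path(paths[0].trim()));
    double[] v2 = readVector(fs, new Path(paths[1].trim()));

    context.write(value, new DoubleWritable(cosineSimilarity(v1, v2)));
  }

  // Reads one whitespace-separated vector from a small text file on HDFS.
  private double[] readVector(FileSystem fs, Path p) throws IOException {
    try (BufferedReader r =
        new BufferedReader(new InputStreamReader(fs.open(p)))) {
      String[] tokens = r.readLine().trim().split("\\s+");
      double[] v = new double[tokens.length];
      for (int i = 0; i < tokens.length; i++) {
        v[i] = Double.parseDouble(tokens[i]);
      }
      return v;
    }
  }

  // Placeholder similarity metric.
  private static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}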

I am trying to find a way to read each file only once, so that my mapper tasks receive the contents of the files rather than the file paths.

Can someone please share any technique they have used in the past that might help?

Thanks
--
Madhav Sharan

Re: Pairwise similarity using map reduce

zhiyuan yang
This is a standard Cartesian product pattern. Although I don't know how it's usually solved in MapReduce, I'm currently working on Cartesian product support in Apache Tez (https://issues.apache.org/jira/browse/TEZ-2104). There you can have the following computation:

Map1   Map2
    \   /
   Reducer

where Map1 tasks handle splits (say A1, A2) from file1, Map2 tasks handle splits (say B1, B2) from file2, and the reducer tasks get all combinations of the splits generated by Map1 and Map2 (A1B1, A1B2, A2B1, A2B2).

Each mapper reads its file only once, and its output is sent to the reducers multiple times over the network. You can also control the speed of processing by controlling the number of reducers.
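If you need to stay on plain MapReduce, a rough way to emulate the same dataflow is the bucket-and-replicate trick used for theta joins: records from each side are assigned to buckets, left records are replicated across all right buckets and vice versa, so the reducer for key (i, j) sees left bucket i together with right bucket j. An untested sketch follows; the bucket count B and all class names are placeholders of mine, not an existing API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Untested sketch of a Cartesian product in one MapReduce job.
public class CartesianProductJob {

  static final int B = 4; // buckets per side; B*B reduce keys in total

  // A left record in bucket i is replicated to keys (i, 0..B-1).
  public static class LeftMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      int i = (int) (offset.get() % B); // any deterministic bucketing works
      for (int j = 0; j < B; j++) {
        ctx.write(new Text(i + ":" + j), new Text("L\t" + record));
      }
    }
  }

  // A right record in bucket j is replicated to keys (0..B-1, j).
  public static class RightMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      int j = (int) (offset.get() % B);
      for (int i = 0; i < B; i++) {
        ctx.write(new Text(i + ":" + j), new Text("R\t" + record));
      }
    }
  }

  // The reducer for key (i, j) buffers left bucket i, then emits every
  // left-right pair; the similarity would be computed here instead.
  public static class PairReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<>();
      List<String> right = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("L\t")) left.add(s.substring(2));
        else right.add(s.substring(2));
      }
      for (String l : left) {
        for (String r : right) {
          ctx.write(new Text(l), new Text(r));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cartesian product");
    job.setJarByClass(CartesianProductJob.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, LeftMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, RightMapper.class);
    job.setReducerClass(PairReducer.class);
    job.setNumReduceTasks(B * B);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Like the Tez version, this replicates each record B times over the network in exchange for reading each input file only once; in your case the two inputs could simply be the same vector file.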

Thanks!
Zhiyuan

