M/R over two matrices, and computing the median


M/R over two matrices, and computing the median

Shannon Quinn
Hi all,

Two quick questions:

1) If I'm in a Mapper, and I'm trying to access two matrices of data (the
rows of one of them form the VectorWritables that are the input to the
Mapper; the other is a Path argument to the cache), how could I access the
same row in both matrices simultaneously? My first instinct is to use the
IntWritable key input and simply access that same row from the saved Path,
but I'm not sure how the SequenceFile index schemes are set up. For example,
if I have two DistributedRowMatrices, would the same key reference the same
row in both?

2) I looked through the Mahout math package and nothing stood out: is there
an easy way for computing the median value of a Vector?

Thanks!

Shannon

Re: M/R over two matrices, and computing the median

Jake Mannix
Hi Shannon,

On Fri, Jul 30, 2010 at 8:54 AM, Shannon Quinn <[hidden email]> wrote:

1) If I'm in a Mapper, and I'm trying to access two matrices of data (the
> rows of one of them form the VectorWritables that are the input to the
> Mapper; the other is a Path argument to the cache), how could I access the
> same row in both matrices simultaneously? My first instinct is to use the
> IntWritable key input and simply access that same row from the saved Path,
> but I'm not sure how the SequenceFile index schemes are set up. For example,
> if I have two DistributedRowMatrices, would the same key reference the same
> row in both?
>

Accessing a separate SequenceFile from within a Mapper is *way inefficient*
(orders of magnitude slower).

You want to do a map-side join.  This is what is done in MatrixMultiplyJob:
your Mapper gets IntWritable as key, and the value is a Pair of
VectorWritables, one from each matrix.
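[An in-memory sketch of what a map-side (inner) join hands the mapper: for each shared integer key, both row vectors at once. This is only a toy stand-in for the real mechanism (CompositeInputFormat over co-sorted SequenceFiles); the class and method names here are invented for illustration.]

```java
import java.util.Arrays;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Toy illustration of a map-side inner join: for each row index present
 * in both "matrices", the mapper would see both rows together. The
 * in-memory sorted maps stand in for two co-sorted SequenceFiles.
 */
public class MapSideJoinSketch {

  static SortedMap<Integer, double[][]> join(SortedMap<Integer, double[]> a,
                                             SortedMap<Integer, double[]> b) {
    SortedMap<Integer, double[][]> out = new TreeMap<>();
    for (Integer row : a.keySet()) {
      double[] other = b.get(row);
      if (other != null) {                     // inner join: shared keys only
        out.put(row, new double[][] {a.get(row), other});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    SortedMap<Integer, double[]> a = new TreeMap<>();
    SortedMap<Integer, double[]> b = new TreeMap<>();
    a.put(0, new double[] {1, 2}); a.put(1, new double[] {3, 4});
    b.put(1, new double[] {5, 6}); b.put(2, new double[] {7, 8});
    // Only row 1 exists in both matrices.
    System.out.println(Arrays.deepToString(join(a, b).get(1))); // prints [[3.0, 4.0], [5.0, 6.0]]
  }
}
```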


> 2) I looked through the Mahout math package and nothing stood out: is there
> an easy way for computing the median value of a Vector?


Do you want the median of the non-zero entries (of a sparse vector), or the
true median?  Either way, there's no canned impl of this on the Vector
classes.  It would probably be pretty nice to have an efficient
(linear-time) implementation of this, however.

  -jake
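[For reference, an expected-linear-time median along the lines Jake mentions can be done with quickselect. A self-contained sketch follows; it works on a plain double[] rather than Mahout's Vector API, so you'd feed it the vector's values, or only its non-zero values for the sparse case.]

```java
import java.util.Random;

/** Expected-linear-time median via quickselect with random pivots. */
public class QuickMedian {
  private static final Random RND = new Random();

  public static double median(double[] v) {
    double[] copy = v.clone();                 // don't clobber the caller's data
    int n = copy.length;
    if (n % 2 == 1) {
      return select(copy, n / 2);
    }
    return (select(copy, n / 2 - 1) + select(copy, n / 2)) / 2.0;
  }

  // k-th smallest (0-based), iterative quickselect.
  private static double select(double[] a, int k) {
    int lo = 0, hi = a.length - 1;
    while (lo < hi) {
      int p = partition(a, lo, hi, lo + RND.nextInt(hi - lo + 1));
      if (k == p) return a[k];
      if (k < p) hi = p - 1; else lo = p + 1;
    }
    return a[lo];
  }

  private static int partition(double[] a, int lo, int hi, int pivotIdx) {
    double pivot = a[pivotIdx];
    swap(a, pivotIdx, hi);
    int store = lo;
    for (int i = lo; i < hi; i++) {
      if (a[i] < pivot) swap(a, i, store++);
    }
    swap(a, store, hi);
    return store;
  }

  private static void swap(double[] a, int i, int j) {
    double t = a[i]; a[i] = a[j]; a[j] = t;
  }

  public static void main(String[] args) {
    System.out.println(median(new double[] {3, 1, 4, 1, 5}));    // prints 3.0
    System.out.println(median(new double[] {3, 1, 4, 1, 5, 9})); // prints 3.5
  }
}
```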

Re: M/R over two matrices, and computing the median

Ted Dunning
In reply to this post by Shannon Quinn
I recently committed OnlineSummarizer that could be easily adapted to
estimate the median of a vector.
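[OnlineSummarizer estimates quantiles from a stream of samples. As a dependency-free illustration of the running-median idea, here is an exact two-heap version; Mahout's class uses a different, constant-memory estimation algorithm, so this is only a stand-in that needs no Mahout jar.]

```java
import java.util.Collections;
import java.util.PriorityQueue;

/**
 * Exact running median: the lower half of the samples sits in a max-heap,
 * the upper half in a min-heap, so the median is always at the heap tops.
 * Unlike an online estimator, this holds every sample it has seen.
 */
public class RunningMedian {
  private final PriorityQueue<Double> lower =
      new PriorityQueue<>(Collections.reverseOrder()); // max-heap
  private final PriorityQueue<Double> upper = new PriorityQueue<>(); // min-heap

  public void add(double x) {
    if (lower.isEmpty() || x <= lower.peek()) lower.add(x); else upper.add(x);
    // Rebalance so sizes differ by at most one, lower holding any extra.
    if (lower.size() > upper.size() + 1) upper.add(lower.poll());
    else if (upper.size() > lower.size()) lower.add(upper.poll());
  }

  /** Median of all samples added so far (assumes at least one sample). */
  public double median() {
    if (lower.size() == upper.size()) return (lower.peek() + upper.peek()) / 2.0;
    return lower.peek();
  }

  public static void main(String[] args) {
    RunningMedian rm = new RunningMedian();
    for (double x : new double[] {5, 1, 9, 3, 7}) rm.add(x);
    System.out.println(rm.median()); // prints 5.0
  }
}
```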

On Fri, Jul 30, 2010 at 8:54 AM, Shannon Quinn <[hidden email]> wrote:

> 2) I looked through the Mahout math package and nothing stood out: is there
> an easy way for computing the median value of a Vector?
>

Re: M/R over two matrices, and computing the median

Jake Mannix
Awesome!  We should, like, document this stuff, so that we all know where to
find it! :)

  -jake

On Fri, Jul 30, 2010 at 1:45 PM, Ted Dunning <[hidden email]> wrote:

> I recently committed OnlineSummarizer that could be easily adapted to
> estimate the median of a vector.
>
> On Fri, Jul 30, 2010 at 8:54 AM, Shannon Quinn <[hidden email]> wrote:
>
> > 2) I looked through the Mahout math package and nothing stood out: is
> there
> > an easy way for computing the median value of a Vector?
> >
>

Re: M/R over two matrices, and computing the median

Ted Dunning
What a buzz.

Where would you suggest?  I built it (and the OnlineAuc companion) to
support the on-line learning stuff.  It will, of course, be described in the
example programs there and in the Manning book.

Where else?  It isn't like they have an independent story.

On Fri, Jul 30, 2010 at 2:13 PM, Jake Mannix <[hidden email]> wrote:

> Awesome!  We should, like, document this stuff, so that we all know where
> to
> find it! :)
>

Re: M/R over two matrices, and computing the median

Jake Mannix
Oh, I was just backhandedly referring to our need for a nice consistent wiki
story.  It's tricky - there are a lot of little tools which are pretty
general inside of Mahout.  Some are well publicized and known (and used
everywhere), like LogLikelihood; others, like this stats stuff, as well as
the general-purpose functional methods on Vector (aggregate() and
assign(Vector, BinaryFunction)), are not.

  -jake

On Fri, Jul 30, 2010 at 2:39 PM, Ted Dunning <[hidden email]> wrote:

> What a buzz.
>
> Where would you suggest?  I built it (and the OnlineAuc companion) to
> support the on-line learning stuff.  It will, of course, be described in
> the
> example programs there and in the Manning book.
>
> Where else?  It isn't like they have an independent story.
>
> On Fri, Jul 30, 2010 at 2:13 PM, Jake Mannix <[hidden email]>
> wrote:
>
> > Awesome!  We should, like, document this stuff, so that we all know where
> > to
> > find it! :)
> >
>

Re: M/R over two matrices, and computing the median

Ted Dunning
Point me to where you think it should go and I will write a page or three.

On Fri, Jul 30, 2010 at 2:44 PM, Jake Mannix <[hidden email]> wrote:

> Oh, I was just backhandedly referring to our need for a nice consistent
> wiki
> story.  It's tricky - there are a lot of little tools which are pretty
> general inside
> of Mahout, and some are well publicized and known (and used everywhere)
> like LogLikelihood, and others, like this stats stuff, as well as the
> general
> purpose functional methods on Vector (aggregate() and
> assign(Vector, BinaryFunction) ), which are not.
>
>  -jake
>
> On Fri, Jul 30, 2010 at 2:39 PM, Ted Dunning <[hidden email]>
> wrote:
>
> > What a buzz.
> >
> > Where would you suggest?  I built it (and the OnlineAuc companion) to
> > support the on-line learning stuff.  It will, of course, be described in
> > the
> > example programs there and in the Manning book.
> >
> > Where else?  It isn't like they have an independent story.
> >
> > On Fri, Jul 30, 2010 at 2:13 PM, Jake Mannix <[hidden email]>
> > wrote:
> >
> > > Awesome!  We should, like, document this stuff, so that we all know
> where
> > > to
> > > find it! :)
> > >
> >
>

Re: M/R over two matrices, and computing the median

Shannon Quinn
In reply to this post by Jake Mannix
>
>
> Accessing a separate SequenceFile from within a Mapper is *way inefficient*
> (orders of magnitude slower).
>
> You want to do a map-side join.  This is what is done in MatrixMultiplyJob:
> your Mapper gets IntWritable as key, and the value is a Pair of
> VectorWritables, one from each matrix.
>

Excellent. Any idea what the Hadoop 0.20.2 equivalent of
CompositeInputFormat is? :)

Re: M/R over two matrices, and computing the median

Jake Mannix
On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <[hidden email]> wrote:
>
> Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> CompositeInputFormat is? :)
>

Ah, there is that part.  Hmm... it's really really annoying to not have that
in 0.20.2.

This is actually why I haven't migrated the distributed matrix stuff to the
newest Hadoop API - map-side join is pretty seriously useful sometimes.

Does the old CompositeInputFormat work with the new API, does anyone know?

  -jake

Re: M/R over two matrices, and computing the median

Shannon Quinn
CompositeInputFormat implements a hadoop.mapred.join interface, whereas
job.setInputFormatClass() is expecting a class that extends a hadoop.io
class. Also, TupleWritable is in the deprecated hadoop.mapred package, too.

Still hunting around the API for the newer equivalent; there has to be a way
of doing this?

On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <[hidden email]> wrote:

> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <[hidden email]> wrote:
> >
> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> > CompositeInputFormat is? :)
> >
>
> Ah, there is that part.  Hmm... it's really really annoying to not have
> that
> in 0.20.2.
>
> This is actually why I haven't migrated the distributed matrix stuff to the
> newest
> Hadoop API - map-side join is pretty seriously useful sometimes.
>
> Does the old CompositeInputFormat work with the new API, does anyone know?
>
>  -jake
>

Re: M/R over two matrices, and computing the median

Shannon Quinn
Ok, turns out those same classes do exist in the new API, but weren't
included in the 0.20.2 release for some reason - they're in a much more
recent SVN commit:

http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/join/

I've had to check out the latest mapreduce revision and build it manually; it
looks like the "zipgroupfileset" jar attribute is my best bet for flattening
and merging the new mapreduce jar with the overall core hadoop jar.
Hopefully this won't cause any flagrant problems...

Shannon

On Mon, Aug 2, 2010 at 6:24 PM, Shannon Quinn <[hidden email]> wrote:

> CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> job.setInputFormatClass() is expecting a class that extends a hadoop.io
> class. Also, TupleWritable is in the deprecated hadoop.mapred package, too.
>
> Still hunting around the API for the newer equivalent; there has to be a
> way of doing this?
>
> On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <[hidden email]> wrote:
>
>> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <[hidden email]> wrote:
>> >
>> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
>> > CompositeInputFormat is? :)
>> >
>>
>> Ah, there is that part.  Hmm... it's really really annoying to not have
>> that
>> in 0.20.2.
>>
>> This is actually why I haven't migrated the distributed matrix stuff to
>> the
>> newest
>> Hadoop API - map-side join is pretty seriously useful sometimes.
>>
>> Does the old CompositeInputFormat work with the new API, does anyone know?
>>
>>  -jake
>>
>
>

Re: M/R over two matrices, and computing the median

Sean Owen
In reply to this post by Shannon Quinn
What I ended up doing in this case, IIRC, is to use another phase to
convert inputs 1 and 2 into some contrived new single Writable format.
Then both sets of input are merely fed into one mapper. So I'd
literally have Writable classes that contained, inside, either a
FooWritable or BarWritable. A little ugly but not bad.
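[A rough sketch of Sean's tagged single-record idea: one record type carrying a tag byte that says which input it came from, plus the payload. A real version would implement org.apache.hadoop.io.Writable; plain Java streams are used here so the round-trip runs without Hadoop on the classpath, and the tag constants are invented.]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * One "contrived single record" that can wrap a row from either input,
 * labelled with its source. Serialization mirrors the Writable
 * write/readFields contract: tag byte, then length, then values.
 */
public class TaggedRow {
  public static final byte FROM_A = 0;
  public static final byte FROM_B = 1;

  byte tag;
  double[] row;

  TaggedRow() {}
  TaggedRow(byte tag, double[] row) { this.tag = tag; this.row = row; }

  void write(DataOutputStream out) {
    try {
      out.writeByte(tag);
      out.writeInt(row.length);
      for (double d : row) out.writeDouble(d);
      out.flush();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  void readFields(DataInputStream in) {
    try {
      tag = in.readByte();
      row = new double[in.readInt()];
      for (int i = 0; i < row.length; i++) row[i] = in.readDouble();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    new TaggedRow(FROM_B, new double[] {1.5, 2.5}).write(new DataOutputStream(bytes));

    TaggedRow back = new TaggedRow();
    back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(back.tag + " " + back.row[1]); // prints 1 2.5
  }
}
```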

On Mon, Aug 2, 2010 at 3:24 PM, Shannon Quinn <[hidden email]> wrote:

> CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> job.setInputFormatClass() is expecting a class that extends a hadoop.io
> class. Also, TupleWritable is in the deprecated hadoop.mapred package, too.
>
> Still hunting around the API for the newer equivalent; there has to be a way
> of doing this?
>
> On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <[hidden email]> wrote:
>
>> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <[hidden email]> wrote:
>> >
>> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
>> > CompositeInputFormat is? :)
>> >
>>
>> Ah, there is that part.  Hmm... it's really really annoying to not have
>> that
>> in 0.20.2.
>>
>> This is actually why I haven't migrated the distributed matrix stuff to the
>> newest
>> Hadoop API - map-side join is pretty seriously useful sometimes.
>>
>> Does the old CompositeInputFormat work with the new API, does anyone know?
>>
>>  -jake
>>
>

Re: M/R over two matrices, and computing the median

Shannon Quinn
Right, that's the concept I'd had in mind, but to me it always seems to come
down to having access to two distinct vectors at the same time, and I'm not
sure how you would do that. In my case, both the dimensions and the data
types of the two vectors are identical, so we're talking a merged vector of
floats that's simply twice as long as the original; but how to gain access
to the two original vectors at the same time is beyond me.

But still, the data types I need that would do this for me are in a newer
Hadoop commit; I'm just trying to figure out how to build that commit
manually and integrate it into the core Hadoop .jar file.

Any suggestions that would speed along either of these options are most
welcome.

Shannon

On Tue, Aug 3, 2010 at 11:50 AM, Sean Owen <[hidden email]> wrote:

> What I ended up doing in this case, IIRC, is to use another phase to
> convert inputs 1 and 2 into some contrived new single Writable format.
> Then both sets of input are merely fed into one mapper. So I'd
> literally have Writable classes that contained, inside, either a
> FooWritable or BarWritable. A little ugly but not bad.
>
> On Mon, Aug 2, 2010 at 3:24 PM, Shannon Quinn <[hidden email]> wrote:
> > CompositeInputFormat implements a hadoop.mapred.join interface, whereas
> > job.setInputFormatClass() is expecting a class that extends a hadoop.io
> > class. Also, TupleWritable is in the deprecated hadoop.mapred package, too.
> >
> > Still hunting around the API for the newer equivalent; there has to be a
> way
> > of doing this?
> >
> > On Mon, Aug 2, 2010 at 6:20 PM, Jake Mannix <[hidden email]>
> wrote:
> >
> >> On Mon, Aug 2, 2010 at 3:13 PM, Shannon Quinn <[hidden email]>
> wrote:
> >> >
> >> > Excellent. Any idea what the Hadoop 0.20.2 equivalent for
> >> > CompositeInputFormat is? :)
> >> >
> >>
> >> Ah, there is that part.  Hmm... it's really really annoying to not have
> >> that
> >> in 0.20.2.
> >>
> >> This is actually why I haven't migrated the distributed matrix stuff to
> the
> >> newest
> >> Hadoop API - map-side join is pretty seriously useful sometimes.
> >>
> >> Does the old CompositeInputFormat work with the new API, does anyone
> know?
> >>
> >>  -jake
> >>
> >
>

Re: M/R over two matrices, and computing the median

Sean Owen
You want row N from matrix A and B?

Map A to (row # -> row vector) and likewise for B. Both are input paths.
Then the reducer has, for each row, both row vectors.

You can add a custom Writable with more info about, say, which vector
is which if you like.
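[The map/shuffle/reduce flow Sean describes, simulated in plain Java: both matrices are "mapped" to (row index -> row vector), the shuffle groups by row index, and the "reducer" sees both row vectors per key. The element-wise sum in the reducer is only a placeholder for whatever per-row computation is actually needed.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory simulation of a reduce-side join over two equal-size matrices. */
public class ReduceSideJoinSketch {

  static Map<Integer, double[]> join(double[][] a, double[][] b) {
    // "Map" + "shuffle": group rows from both inputs under their row index.
    Map<Integer, List<double[]>> grouped = new HashMap<>();
    for (int i = 0; i < a.length; i++) {
      grouped.computeIfAbsent(i, k -> new ArrayList<>()).add(a[i]);
    }
    for (int i = 0; i < b.length; i++) {
      grouped.computeIfAbsent(i, k -> new ArrayList<>()).add(b[i]);
    }

    // "Reduce": each key now carries one row from each matrix.
    Map<Integer, double[]> out = new HashMap<>();
    for (Map.Entry<Integer, List<double[]>> e : grouped.entrySet()) {
      double[] x = e.getValue().get(0), y = e.getValue().get(1);
      double[] sum = new double[x.length];
      for (int j = 0; j < x.length; j++) sum[j] = x[j] + y[j];
      out.put(e.getKey(), sum);
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[][] b = {{10, 20}, {30, 40}};
    System.out.println(Arrays.toString(join(a, b).get(1))); // prints [33.0, 44.0]
  }
}
```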

On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <[hidden email]> wrote:

> Right, that's the concept I'd had in mind, but to me it always seem to come
> down to having access to two distinct vectors at the same time, and I'm not
> sure how you would do that. In my case, both the dimensions and the data
> types of the two vectors are identical, so we're talking a merged vector of
> floats that's simply twice as long as the original, but how to gain access
> to the two original vectors at the same time is beyond me.
>
> But still, the data types I need that would do this for me are in a newer
> Hadoop commit, I'm just trying to figure out how to build the commit
> manually and integrate it to the core Hadoop .jar file.
>
> Any suggestions that would speed along either of these options are most
> welcome.
>
> Shannon

Re: M/R over two matrices, and computing the median

Shannon Quinn
And the key in each map() would correspond to the row in whichever
SequenceFile it's parsing, so as long as the two files line up their keys,
I'll have exactly two VectorWritables (or whatever Writable) per key in the
Reducer.

Oy. That's about as simple as it gets. Thank you very much!!

Shannon

On Tue, Aug 3, 2010 at 1:17 PM, Sean Owen <[hidden email]> wrote:

> You want row N from matrix A and B?
>
> Map A to (row # -> row vector) and likewise for B. Both are input paths.
> Then the reducer has, for each row, both row vectors.
>
> You can add a custom Writable with more info about, say, which vector
> is which if you like.
>
> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <[hidden email]> wrote:
> > Right, that's the concept I'd had in mind, but to me it always seem to
> come
> > down to having access to two distinct vectors at the same time, and I'm
> not
> > sure how you would do that. In my case, both the dimensions and the data
> > types of the two vectors are identical, so we're talking a merged vector
> of
> > floats that's simply twice as long as the original, but how to gain
> access
> > to the two original vectors at the same time is beyond me.
> >
> > But still, the data types I need that would do this for me are in a newer
> > Hadoop commit, I'm just trying to figure out how to build the commit
> > manually and integrate it to the core Hadoop .jar file.
> >
> > Any suggestions that would speed along either of these options are most
> > welcome.
> >
> > Shannon
>

Re: M/R over two matrices, and computing the median

Ted Dunning
In reply to this post by Shannon Quinn
Well if both vectors are the same size, then a map-reduce on vector number
is the natural solution here.

A map-side join is only useful when one or the other operand is relatively
small.

On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <[hidden email]> wrote:

> but how to gain access
> to the two original vectors at the same time is beyond me.
>

Re: M/R over two matrices, and computing the median

Shannon Quinn
Here's my next question, then: within the Mapper itself, how do I know the
source SequenceFile of the VectorWritable I'm currently holding, A or B?

On Tue, Aug 3, 2010 at 1:33 PM, Ted Dunning <[hidden email]> wrote:

> Well if both vectors are the same size, then a map-reduce on vector number
> is the natural solution here.
>
> A map-side join is only useful when one or the other operand is relatively
> small.
>
> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <[hidden email]> wrote:
>
> > but how to gain access
> > to the two original vectors at the same time is beyond me.
> >
>

Re: M/R over two matrices, and computing the median

Shannon Quinn
I don't know how dirty of a hack this might be, but what about this:

Store the Path.getName() of the two separate Paths prior to the job, then
within the Mapper, look up context.getWorkingDirectory().getName() and
compare it to the two variables set before. Whichever one matches, you know
you're working with that specific SequenceFile.

Would that even work?

On Tue, Aug 3, 2010 at 3:50 PM, Shannon Quinn <[hidden email]> wrote:

> Here's my next question, then: within the Mapper itself, how do I know the
> source SequenceFile of the VectorWritable I'm currently holding, A or B?
>
>
> On Tue, Aug 3, 2010 at 1:33 PM, Ted Dunning <[hidden email]> wrote:
>
>> Well if both vectors are the same size, then a map-reduce on vector number
>> is the natural solution here.
>>
>> A map-side join is only useful when one or the other operand is relatively
>> small.
>>
>> On Tue, Aug 3, 2010 at 10:12 AM, Shannon Quinn <[hidden email]> wrote:
>>
>> > but how to gain access
>> > to the two original vectors at the same time is beyond me.
>> >
>>
>
>

Re: M/R over two matrices, and computing the median

Ted Dunning
Yes.  The general principle is to label the records coming in with the file
that they came from.  That can be done in the same data structure that
provides the polymorphism that you want.  The records then carry their
labels into the reducer, where you can inspect them and decide what to do.
It is also common to sort in some clever way so that you know the order
of the labels.
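[One way to picture the label-ordering point outside of Hadoop: sort the grouped values by their source label so the reducer can rely on matrix A's row arriving before matrix B's. A real job would typically get this ordering from a composite key plus a secondary sort rather than an in-reducer sort; the labels and record type here are invented for illustration.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Ordering grouped values by source label so the reducer knows which is which. */
public class LabelOrdering {
  static class Labeled {
    final String label;   // e.g. "A" or "B", set when the record was read
    final double[] row;
    Labeled(String label, double[] row) { this.label = label; this.row = row; }
  }

  // Returns the grouped values in label order: A-first, B-second.
  static List<Labeled> inLabelOrder(List<Labeled> values) {
    List<Labeled> sorted = new ArrayList<>(values);
    sorted.sort(Comparator.comparing(l -> l.label));
    return sorted;
  }

  public static void main(String[] args) {
    List<Labeled> values = new ArrayList<>();
    values.add(new Labeled("B", new double[] {5, 6}));
    values.add(new Labeled("A", new double[] {1, 2}));
    List<Labeled> ordered = inLabelOrder(values);
    System.out.println(ordered.get(0).label + ordered.get(1).label); // prints AB
  }
}
```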

On Tue, Aug 3, 2010 at 2:09 PM, Shannon Quinn <[hidden email]> wrote:

> Store the Path.getName() of the two separate Paths prior to the job, then
> within the Mapper, look up context.getWorkingDirectory().getName() and
> compare it to the two variables set before. Whichever one matches, you know
> you're working with that specific SequenceFile.
>
> Would that even work?
>