Quantcast

Matrix multiplication in Hadoop

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Matrix multiplication in Hadoop

Mike Spreitzer
Who is doing multiplication of large dense matrices using Hadoop?  What is
a good way to do that computation using Hadoop?

Thanks,
Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

jconwell
I'm not sure, but I would suspect that Mahout has some low level map/reduce
jobs for this.  You might start there.


On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer <[hidden email]> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike




--

Thanks,
John C
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

tdp2110
In reply to this post by Mike Spreitzer
I wrote up a basic algorithm for this here:
http://math.columbia.edu/~tpeters/teh-codez/hadoop/hadoop-matrix-mult.html
It's almost certainly not optimal, but might get you some ideas.

Here is another approach http://www.norstad.org/matrix-multiply/index.html

Cheers,

Tom

On Fri, Nov 18, 2011 at 11:59 AM, Mike Spreitzer <[hidden email]>wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Michael Segel
In reply to this post by Mike Spreitzer
Is Hadoop the best tool for doing large matrix math.
Sure you can do it, but, aren't there better tools for these types of problems?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <[hidden email]> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Mike Spreitzer
That's also an interesting question, but right now I am studying Hadoop
and want to know how well dense MM can be done in Hadoop.

Thanks,
Mike



From:   Michel Segel <[hidden email]>
To:     "[hidden email]" <[hidden email]>
Date:   11/18/2011 12:34 PM
Subject:        Re: Matrix multiplication in Hadoop



Is Hadoop the best tool for doing large matrix math.
Sure you can do it, but, aren't there better tools for these types of
problems?


Sent from a remote device. Please excuse any typos...

Mike Segel

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Ayon Sinha
In reply to this post by Michael Segel
I'd really be interested in a comparison of Numpy/Octave/Matlab kind of tools with a Hadoop (lets say 4-10 large cloud servers) implementation with growing size of the matrix. I want to know the scale at which Hadoop really starts to pull away. 
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



________________________________
From: Michel Segel <[hidden email]>
To: "[hidden email]" <[hidden email]>
Sent: Friday, November 18, 2011 9:33 AM
Subject: Re: Matrix multiplication in Hadoop

Is Hadoop the best tool for doing large matrix math.
Sure you can do it, but, aren't there better tools for these types of problems?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <[hidden email]> wrote:

> Who is doing multiplication of large dense matrices using Hadoop?  What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Matrix multiplication in Hadoop

Michael Segel
In reply to this post by Mike Spreitzer

Ok Mike,

First I admire that you are studying Hadoop.

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try and find a problem which can easily be broken in to a series of parallel tasks where there is minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase.
Yes it can be done. But not the best use of resources.

> To: [hidden email]
> CC: [hidden email]
> Subject: Re: Matrix multiplication in Hadoop
> From: [hidden email]
> Date: Fri, 18 Nov 2011 12:39:00 -0500
>
> That's also an interesting question, but right now I am studying Hadoop
> and want to know how well dense MM can be done in Hadoop.
>
> Thanks,
> Mike
>
>
>
> From:   Michel Segel <[hidden email]>
> To:     "[hidden email]" <[hidden email]>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
>
>
>
> Is Hadoop the best tool for doing large matrix math.
> Sure you can do it, but, aren't there better tools for these types of
> problems?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
     
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Edward Capriolo
In reply to this post by Ayon Sinha
A problem with matrix multiplication in hadoop is that hadoop is row
oriented for the most part. I have thought about this use case however and
you can theoretically turn a 2D matrix into a 1D matrix and then that fits
into the row oriented nature of hadoop. Also being that the typical mapper
can have fairly large chunks of memory like 1024MB I have done work like
this before were I loaded such datasets into memory to process them. That
usage does not really fit the map reduce model.

I have been wanting to look at:
http://www.scidb.org/

Edward
On Fri, Nov 18, 2011 at 1:48 PM, Ayon Sinha <[hidden email]> wrote:

> I'd really be interested in a comparison of Numpy/Octave/Matlab kind of
> tools with a Hadoop (lets say 4-10 large cloud servers) implementation with
> growing size of the matrix. I want to know the scale at which Hadoop really
> starts to pull away.
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
>
>
> ________________________________
> From: Michel Segel <[hidden email]>
> To: "[hidden email]" <[hidden email]>
> Sent: Friday, November 18, 2011 9:33 AM
> Subject: Re: Matrix multiplication in Hadoop
>
> Is Hadoop the best tool for doing large matrix math.
> Sure you can do it, but, aren't there better tools for these types of
> problems?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Nov 18, 2011, at 10:59 AM, Mike Spreitzer <[hidden email]> wrote:
>
> > Who is doing multiplication of large dense matrices using Hadoop?  What
> is
> > a good way to do that computation using Hadoop?
> >
> > Thanks,
> > Mike
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Matrix multiplication in Hadoop

Mike Spreitzer
In reply to this post by Michael Segel
Well, this mismatch may tell me something interesting about Hadoop. Matrix
multiplication has a lot of inherent parallelism, so from very crude
considerations it is not obvious that there should be a mismatch.  Why is
matrix multiplication ill-suited for Hadoop?

BTW, I looked into the Mahout documentation some, and did not find matrix
multiplication there.  It might be hidden inside one of the advertised
algorithms; I looked at the documentation for a few, but did not notice
mention of MM.

Thanks,
Mike



From:   Michael Segel <[hidden email]>
To:     <[hidden email]>
Date:   11/18/2011 01:49 PM
Subject:        RE: Matrix multiplication in Hadoop




Ok Mike,

First I admire that you are studying Hadoop.

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try and find a
problem which can easily be broken in to a series of parallel tasks where
there is minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin
to taking a normalized relational model and trying to run it as is in
HBase.
Yes it can be done. But not the best use of resources.

> To: [hidden email]
> CC: [hidden email]
> Subject: Re: Matrix multiplication in Hadoop
> From: [hidden email]
> Date: Fri, 18 Nov 2011 12:39:00 -0500
>
> That's also an interesting question, but right now I am studying Hadoop
> and want to know how well dense MM can be done in Hadoop.
>
> Thanks,
> Mike
>
>
>
> From:   Michel Segel <[hidden email]>
> To:     "[hidden email]" <[hidden email]>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
>
>
>
> Is Hadoop the best tool for doing large matrix math.
> Sure you can do it, but, aren't there better tools for these types of
> problems?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
 
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Mike Davis
On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
>  Why is matrix multiplication ill-suited for Hadoop?

IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
specific SIMD instructions and, by extension, optimized BLAS routines.
Running a large MM task using intel's MKL rather than relying on generic
compiler optimization is orders of magnitude faster on a single multicore
processor. I see almost no way that Hadoop could win such a CPU intensive
task against an mpi cluster with even a tenth of the nodes running with a
decently tuned BLAS library. Racing even against a single CPU might be
difficult, given the i/o overhead.

Still, it's a reasonably common problem and we shouldn't murder the good in
favor of the best. I'm certain a MM/LinAlg Hadoop library with even
mediocre performance, wrt C, would get used.

--
Mike Davis
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: Matrix multiplication in Hadoop

Tim Broberg-2
Perhaps this is a good candidate for a native library, then?

________________________________________
From: Mike Davis [[hidden email]]
Sent: Friday, November 18, 2011 7:39 PM
To: [hidden email]
Subject: Re: Matrix multiplication in Hadoop

On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
>  Why is matrix multiplication ill-suited for Hadoop?

IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
specific SIMD instructions and, by extension, optimized BLAS routines.
Running a large MM task using intel's MKL rather than relying on generic
compiler optimization is orders of magnitude faster on a single multicore
processor. I see almost no way that Hadoop could win such a CPU intensive
task against an mpi cluster with even a tenth of the nodes running with a
decently tuned BLAS library. Racing even against a single CPU might be
difficult, given the i/o overhead.

Still, it's a reasonably common problem and we shouldn't murder the good in
favor of the best. I'm certain a MM/LinAlg Hadoop library with even
mediocre performance, wrt C, would get used.

--
Mike Davis

The information and any attached documents contained in this message
may be confidential and/or legally privileged.  The message is
intended solely for the addressee(s).  If you are not the intended
recipient, you are hereby notified that any use, dissemination, or
reproduction is strictly prohibited and may be unlawful.  If you are
not the intended recipient, please contact the sender immediately by
return e-mail and destroy all copies of the original message.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Edward Capriolo
Sounds like a job for next gen map reduce native libraries and gpu's. A
modern day Dr frankenstein for sure.

On Saturday, November 19, 2011, Tim Broberg <[hidden email]> wrote:

> Perhaps this is a good candidate for a native library, then?
>
> ________________________________________
> From: Mike Davis [[hidden email]]
> Sent: Friday, November 18, 2011 7:39 PM
> To: [hidden email]
> Subject: Re: Matrix multiplication in Hadoop
>
> On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
>>  Why is matrix multiplication ill-suited for Hadoop?
>
> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> specific SIMD instructions and, by extension, optimized BLAS routines.
> Running a large MM task using intel's MKL rather than relying on generic
> compiler optimization is orders of magnitude faster on a single multicore
> processor. I see almost no way that Hadoop could win such a CPU intensive
> task against an mpi cluster with even a tenth of the nodes running with a
> decently tuned BLAS library. Racing even against a single CPU might be
> difficult, given the i/o overhead.
>
> Still, it's a reasonably common problem and we shouldn't murder the good
in

> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> mediocre performance, wrt C, would get used.
>
> --
> Mike Davis
>
> The information and any attached documents contained in this message
> may be confidential and/or legally privileged.  The message is
> intended solely for the addressee(s).  If you are not the intended
> recipient, you are hereby notified that any use, dissemination, or
> reproduction is strictly prohibited and may be unlawful.  If you are
> not the intended recipient, please contact the sender immediately by
> return e-mail and destroy all copies of the original message.
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Chen He
In reply to this post by Tim Broberg-2
Did you try Hama?

There are may methods.

1) use Hadoop MPI which allows you use MPI MM code based on Hadoop;

2) Hama is designed for MM

3) Use pure Hadoop Java MapReduce;

I did this before but may not be optimal algorithm. Put your first matrix
in DistributedCache and take second matrix line as inputsplit. For each
line, use a mapper to let a array multply the first matrix in
DistributedCache. Use reducer to collect the result matrix. This algorithm
is limited by your DistributedCache size. It is suitable for a small matrix
to multiply a huge matrix.

Chen
On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg <[hidden email]> wrote:

> Perhaps this is a good candidate for a native library, then?
>
> ________________________________________
> From: Mike Davis [[hidden email]]
> Sent: Friday, November 18, 2011 7:39 PM
> To: [hidden email]
> Subject: Re: Matrix multiplication in Hadoop
>
>  On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
> >  Why is matrix multiplication ill-suited for Hadoop?
>
> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
> specific SIMD instructions and, by extension, optimized BLAS routines.
> Running a large MM task using intel's MKL rather than relying on generic
> compiler optimization is orders of magnitude faster on a single multicore
> processor. I see almost no way that Hadoop could win such a CPU intensive
> task against an mpi cluster with even a tenth of the nodes running with a
> decently tuned BLAS library. Racing even against a single CPU might be
> difficult, given the i/o overhead.
>
> Still, it's a reasonably common problem and we shouldn't murder the good in
> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> mediocre performance, wrt C, would get used.
>
> --
> Mike Davis
>
> The information and any attached documents contained in this message
> may be confidential and/or legally privileged.  The message is
> intended solely for the addressee(s).  If you are not the intended
> recipient, you are hereby notified that any use, dissemination, or
> reproduction is strictly prohibited and may be unlawful.  If you are
> not the intended recipient, please contact the sender immediately by
> return e-mail and destroy all copies of the original message.
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Chen He
In reply to this post by Edward Capriolo
Right, I agree with Edward Capriolo, Hadoop + GPGPU is a better choice.



On Sat, Nov 19, 2011 at 10:53 AM, Edward Capriolo <[hidden email]>wrote:

> Sounds like a job for next gen map reduce native libraries and gpu's. A
> modern day Dr frankenstein for sure.
>
> On Saturday, November 19, 2011, Tim Broberg <[hidden email]> wrote:
> > Perhaps this is a good candidate for a native library, then?
> >
> > ________________________________________
> > From: Mike Davis [[hidden email]]
> > Sent: Friday, November 18, 2011 7:39 PM
> > To: [hidden email]
> > Subject: Re: Matrix multiplication in Hadoop
> >
> > On Friday, November 18, 2011, Mike Spreitzer <[hidden email]>
> wrote:
> >>  Why is matrix multiplication ill-suited for Hadoop?
> >
> > IMHO, a huge issue here is the JVM's inability to fully support cpu
> vendor
> > specific SIMD instructions and, by extension, optimized BLAS routines.
> > Running a large MM task using intel's MKL rather than relying on generic
> > compiler optimization is orders of magnitude faster on a single multicore
> > processor. I see almost no way that Hadoop could win such a CPU intensive
> > task against an mpi cluster with even a tenth of the nodes running with a
> > decently tuned BLAS library. Racing even against a single CPU might be
> > difficult, given the i/o overhead.
> >
> > Still, it's a reasonably common problem and we shouldn't murder the good
> in
> > favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> > mediocre performance, wrt C, would get used.
> >
> > --
> > Mike Davis
> >
> > The information and any attached documents contained in this message
> > may be confidential and/or legally privileged.  The message is
> > intended solely for the addressee(s).  If you are not the intended
> > recipient, you are hereby notified that any use, dissemination, or
> > reproduction is strictly prohibited and may be unlawful.  If you are
> > not the intended recipient, please contact the sender immediately by
> > return e-mail and destroy all copies of the original message.
> >
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Tommaso Teofili
In reply to this post by Chen He
I agree Hama (and BSP model) could be a good option, plus Hama also
supports MR nextgen now [1].
I know MM has been implemented with Hama in the past so it may be worth
asking on the mailing list.

My 2 cents,
Tommaso

[1] : http://svn.apache.org/repos/asf/incubator/hama/trunk/yarn/


2011/11/19 He Chen <[hidden email]>

> Did you try Hama?
>
> There are may methods.
>
> 1) use Hadoop MPI which allows you use MPI MM code based on Hadoop;
>
> 2) Hama is designed for MM
>
> 3) Use pure Hadoop Java MapReduce;
>
> I did this before but may not be optimal algorithm. Put your first matrix
> in DistributedCache and take second matrix line as inputsplit. For each
> line, use a mapper to let a array multply the first matrix in
> DistributedCache. Use reducer to collect the result matrix. This algorithm
> is limited by your DistributedCache size. It is suitable for a small matrix
> to multiply a huge matrix.
>
> Chen
> On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg <[hidden email]>
> wrote:
>
> > Perhaps this is a good candidate for a native library, then?
> >
> > ________________________________________
> > From: Mike Davis [[hidden email]]
> > Sent: Friday, November 18, 2011 7:39 PM
> > To: [hidden email]
> > Subject: Re: Matrix multiplication in Hadoop
> >
> >  On Friday, November 18, 2011, Mike Spreitzer <[hidden email]>
> wrote:
> > >  Why is matrix multiplication ill-suited for Hadoop?
> >
> > IMHO, a huge issue here is the JVM's inability to fully support cpu
> vendor
> > specific SIMD instructions and, by extension, optimized BLAS routines.
> > Running a large MM task using intel's MKL rather than relying on generic
> > compiler optimization is orders of magnitude faster on a single multicore
> > processor. I see almost no way that Hadoop could win such a CPU intensive
> > task against an mpi cluster with even a tenth of the nodes running with a
> > decently tuned BLAS library. Racing even against a single CPU might be
> > difficult, given the i/o overhead.
> >
> > Still, it's a reasonably common problem and we shouldn't murder the good
> in
> > favor of the best. I'm certain a MM/LinAlg Hadoop library with even
> > mediocre performance, wrt C, would get used.
> >
> > --
> > Mike Davis
> >
> > The information and any attached documents contained in this message
> > may be confidential and/or legally privileged.  The message is
> > intended solely for the addressee(s).  If you are not the intended
> > recipient, you are hereby notified that any use, dissemination, or
> > reproduction is strictly prohibited and may be unlawful.  If you are
> > not the intended recipient, please contact the sender immediately by
> > return e-mail and destroy all copies of the original message.
> >
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Michael Segel
In reply to this post by Edward Capriolo
You really don't need to wait...

If you're going to go down this path you can use a jni wrapper to do the c/c++ code for the gpu...
You can do that now...

If you want to go beyond the 1D you can do it but you have to get a bit creative... but it's doable...


Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 19, 2011, at 10:53 AM, Edward Capriolo <[hidden email]> wrote:

> Sounds like a job for next gen map reduce native libraries and gpu's. A
> modern day Dr frankenstein for sure.
>
> On Saturday, November 19, 2011, Tim Broberg <[hidden email]> wrote:
>> Perhaps this is a good candidate for a native library, then?
>>
>> ________________________________________
>> From: Mike Davis [[hidden email]]
>> Sent: Friday, November 18, 2011 7:39 PM
>> To: [hidden email]
>> Subject: Re: Matrix multiplication in Hadoop
>>
>> On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
>>> Why is matrix multiplication ill-suited for Hadoop?
>>
>> IMHO, a huge issue here is the JVM's inability to fully support cpu vendor
>> specific SIMD instructions and, by extension, optimized BLAS routines.
>> Running a large MM task using intel's MKL rather than relying on generic
>> compiler optimization is orders of magnitude faster on a single multicore
>> processor. I see almost no way that Hadoop could win such a CPU intensive
>> task against an mpi cluster with even a tenth of the nodes running with a
>> decently tuned BLAS library. Racing even against a single CPU might be
>> difficult, given the i/o overhead.
>>
>> Still, it's a reasonably common problem and we shouldn't murder the good
> in
>> favor of the best. I'm certain a MM/LinAlg Hadoop library with even
>> mediocre performance, wrt C, would get used.
>>
>> --
>> Mike Davis
>>
>> The information and any attached documents contained in this message
>> may be confidential and/or legally privileged.  The message is
>> intended solely for the addressee(s).  If you are not the intended
>> recipient, you are hereby notified that any use, dissemination, or
>> reproduction is strictly prohibited and may be unlawful.  If you are
>> not the intended recipient, please contact the sender immediately by
>> return e-mail and destroy all copies of the original message.
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

bejoy.hadoop
In reply to this post by Mike Spreitzer
Hey Mike
          In mahout one place where   matrix multiplication is used is in  Collaborative Filtering distributed implementation. The recommendations here are generated by the multiplication of a cooccurence matrix with a user vector. This user vector is treated as a single column matrix and then the matrix multiplication takes place in there.

Regards
Bejoy K S

-----Original Message-----
From: Mike Spreitzer <[hidden email]>
Date: Fri, 18 Nov 2011 14:52:05
To: <[hidden email]>
Reply-To: [hidden email]
Subject: RE: Matrix multiplication in Hadoop

Well, this mismatch may tell me something interesting about Hadoop. Matrix
multiplication has a lot of inherent parallelism, so from very crude
considerations it is not obvious that there should be a mismatch.  Why is
matrix multiplication ill-suited for Hadoop?

BTW, I looked into the Mahout documentation some, and did not find matrix
multiplication there.  It might be hidden inside one of the advertised
algorithms; I looked at the documentation for a few, but did not notice
mention of MM.

Thanks,
Mike



From:   Michael Segel <[hidden email]>
To:     <[hidden email]>
Date:   11/18/2011 01:49 PM
Subject:        RE: Matrix multiplication in Hadoop




Ok Mike,

First I admire that you are studying Hadoop.

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try and find a
problem which can easily be broken in to a series of parallel tasks where
there is minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin
to taking a normalized relational model and trying to run it as is in
HBase.
Yes it can be done. But not the best use of resources.

> To: [hidden email]
> CC: [hidden email]
> Subject: Re: Matrix multiplication in Hadoop
> From: [hidden email]
> Date: Fri, 18 Nov 2011 12:39:00 -0500
>
> That's also an interesting question, but right now I am studying Hadoop
> and want to know how well dense MM can be done in Hadoop.
>
> Thanks,
> Mike
>
>
>
> From:   Michel Segel <[hidden email]>
> To:     "[hidden email]" <[hidden email]>
> Date:   11/18/2011 12:34 PM
> Subject:        Re: Matrix multiplication in Hadoop
>
>
>
> Is Hadoop the best tool for doing large matrix math.
> Sure you can do it, but, aren't there better tools for these types of
> problems?
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Stephen Boesch
Hi,
   there are two solutions suggested that take advantage of either (a) a
vector x matrix (your CF / Mahout example )  or (b) a small matrix x large
matrix (an earlier suggestion of putting the  small matrix into the
Distributed Cache).  Not clear yet on good approaches of (c)  large matrix
x large matrix.


2011/11/19 <[hidden email]>

> Hey Mike
>          In mahout one place where   matrix multiplication is used is in
>  Collaborative Filtering distributed implementation. The recommendations
> here are generated by the multiplication of a cooccurence matrix with a
> user vector. This user vector is treated as a single column matrix and then
> the matrix multiplication takes place in there.
>
> Regards
> Bejoy K S
>
> -----Original Message-----
> From: Mike Spreitzer <[hidden email]>
> Date: Fri, 18 Nov 2011 14:52:05
> To: <[hidden email]>
> Reply-To: [hidden email]
> Subject: RE: Matrix multiplication in Hadoop
>
> Well, this mismatch may tell me something interesting about Hadoop. Matrix
> multiplication has a lot of inherent parallelism, so from very crude
> considerations it is not obvious that there should be a mismatch.  Why is
> matrix multiplication ill-suited for Hadoop?
>
> BTW, I looked into the Mahout documentation some, and did not find matrix
> multiplication there.  It might be hidden inside one of the advertised
> algorithms; I looked at the documentation for a few, but did not notice
> mention of MM.
>
> Thanks,
> Mike
>
>
>
> From:   Michael Segel <[hidden email]>
> To:     <[hidden email]>
> Date:   11/18/2011 01:49 PM
> Subject:        RE: Matrix multiplication in Hadoop
>
>
>
>
> Ok Mike,
>
> First I admire that you are studying Hadoop.
>
> To answer your question... not well.
>
> Might I suggest that if you want to learn Hadoop, you try and find a
> problem which can easily be broken in to a series of parallel tasks where
> there is minimal communication requirements between each task?
>
> No offense, but if I could make a parallel... what you're asking is akin
> to taking a normalized relational model and trying to run it as is in
> HBase.
> Yes it can be done. But not the best use of resources.
>
> > To: [hidden email]
> > CC: [hidden email]
> > Subject: Re: Matrix multiplication in Hadoop
> > From: [hidden email]
> > Date: Fri, 18 Nov 2011 12:39:00 -0500
> >
> > That's also an interesting question, but right now I am studying Hadoop
> > and want to know how well dense MM can be done in Hadoop.
> >
> > Thanks,
> > Mike
> >
> >
> >
> > From:   Michel Segel <[hidden email]>
> > To:     "[hidden email]" <[hidden email]>
> > Date:   11/18/2011 12:34 PM
> > Subject:        Re: Matrix multiplication in Hadoop
> >
> >
> >
> > Is Hadoop the best tool for doing large matrix math.
> > Sure you can do it, but, aren't there better tools for these types of
> > problems?
> >
> >
> > Sent from a remote device. Please excuse any typos...
> >
> > Mike Segel
> >
>
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Lance Norskog-2
Look for uses of the DistributedRowMatrix in the Mahout code. The existing
Mahout jobs are generally end-to-end algorithm implementations which do
things like matrix multiplication in the middle. Also, the Mahout
algorithms generally prefer to use sparse data for distributed work.

What is a "large" matrix? You may find that you really don't need to go to
the effort of using Hadoop.

Lance

On Sat, Nov 19, 2011 at 3:07 PM, Stephen Boesch <[hidden email]> wrote:

> Hi,
>   there are two solutions suggested that take advantage of either (a) a
> vector x matrix (your CF / Mahout example )  or (b) a small matrix x large
> matrix (an earlier suggestion of putting the  small matrix into the
> Distributed Cache).  Not clear yet on good approaches of (c)  large matrix
> x large matrix.
>
>
> 2011/11/19 <[hidden email]>
>
> > Hey Mike
> >          In mahout one place where   matrix multiplication is used is in
> >  Collaborative Filtering distributed implementation. The recommendations
> > here are generated by the multiplication of a cooccurence matrix with a
> > user vector. This user vector is treated as a single column matrix and
> then
> > the matrix multiplication takes place in there.
> >
> > Regards
> > Bejoy K S
> >
> > -----Original Message-----
> > From: Mike Spreitzer <[hidden email]>
> > Date: Fri, 18 Nov 2011 14:52:05
> > To: <[hidden email]>
> > Reply-To: [hidden email]
> > Subject: RE: Matrix multiplication in Hadoop
> >
> > Well, this mismatch may tell me something interesting about Hadoop.
> Matrix
> > multiplication has a lot of inherent parallelism, so from very crude
> > considerations it is not obvious that there should be a mismatch.  Why is
> > matrix multiplication ill-suited for Hadoop?
> >
> > BTW, I looked into the Mahout documentation some, and did not find matrix
> > multiplication there.  It might be hidden inside one of the advertised
> > algorithms; I looked at the documentation for a few, but did not notice
> > mention of MM.
> >
> > Thanks,
> > Mike
> >
> >
> >
> > From:   Michael Segel <[hidden email]>
> > To:     <[hidden email]>
> > Date:   11/18/2011 01:49 PM
> > Subject:        RE: Matrix multiplication in Hadoop
> >
> >
> >
> >
> > Ok Mike,
> >
> > First I admire that you are studying Hadoop.
> >
> > To answer your question... not well.
> >
> > Might I suggest that if you want to learn Hadoop, you try and find a
> > problem which can easily be broken in to a series of parallel tasks where
> > there is minimal communication requirements between each task?
> >
> > No offense, but if I could make a parallel... what you're asking is akin
> > to taking a normalized relational model and trying to run it as is in
> > HBase.
> > Yes it can be done. But not the best use of resources.
> >
> > > To: [hidden email]
> > > CC: [hidden email]
> > > Subject: Re: Matrix multiplication in Hadoop
> > > From: [hidden email]
> > > Date: Fri, 18 Nov 2011 12:39:00 -0500
> > >
> > > That's also an interesting question, but right now I am studying Hadoop
> > > and want to know how well dense MM can be done in Hadoop.
> > >
> > > Thanks,
> > > Mike
> > >
> > >
> > >
> > > From:   Michel Segel <[hidden email]>
> > > To:     "[hidden email]" <[hidden email]
> >
> > > Date:   11/18/2011 12:34 PM
> > > Subject:        Re: Matrix multiplication in Hadoop
> > >
> > >
> > >
> > > Is Hadoop the best tool for doing large matrix math.
> > > Sure you can do it, but, aren't there better tools for these types of
> > > problems?
> > >
> > >
> > > Sent from a remote device. Please excuse any typos...
> > >
> > > Mike Segel
> > >
> >
> >
> >
>



--
Lance Norskog
[hidden email]
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Matrix multiplication in Hadoop

Mike Spreitzer
I am looking at large dense matrix multiplication as an example problem
for a class of middleware.  I am also interested in sparse matrices, but
am taking things one step at a time.

There is a paper in IEEE CloudCom '10 about Hama, including a matrix
multiplication technique.  It is essentially the same as what is called
"technique 4" in the 2009 monograph by John Norstad cited early in this
thread.  Which means that, despite the fact that Hama touts the virtues of
BSP (a position with which I am very sympathetic), this technique doesn't
really take advantage of the extra features that BSP has over MapReduce.
Note also that this technique creates intermediate data of much greater
volume than the input.  For example, if each matrix is stored as an NxN
grid of blocks, the intermediate data (the blocks paired up, awaiting
multiplication) is a factor of N larger than the input.  I have heard
people saying that N may be rather larger than sqrt(number of machines)
because in some circumstances N has to be chosen before the number of
available machines is known and you want to be able to divide the NxN load
among your machines rather evenly.  Even if N is like sqrt(number of
machines) this is still an unwelcome amount of bloat.  In comparison, the
SUMMA technique does matrix multiplication but its intermediate data
volume is no greater than the input.

Thanks,
Mike
12
Loading...