Who is doing multiplication of large dense matrices using Hadoop? What is
a good way to do that computation using Hadoop? Thanks, Mike |
I'm not sure, but I would suspect that Mahout has some low level map/reduce jobs for this. You might start there.

On Fri, Nov 18, 2011 at 8:59 AM, Mike Spreitzer <[hidden email]> wrote:
> Who is doing multiplication of large dense matrices using Hadoop? What is
> a good way to do that computation using Hadoop?
>
> Thanks,
> Mike

--
Thanks,
John C |
In reply to this post by Mike Spreitzer
I wrote up a basic algorithm for this here:
http://math.columbia.edu/~tpeters/teh-codez/hadoop/hadoop-matrix-mult.html

It's almost certainly not optimal, but it might give you some ideas.

Here is another approach:
http://www.norstad.org/matrix-multiply/index.html

Cheers,
Tom |
In reply to this post by Mike Spreitzer
Is Hadoop the best tool for doing large matrix math?
Sure, you can do it, but aren't there better tools for these types of problems?

Sent from a remote device. Please excuse any typos...

Mike Segel |
That's also an interesting question, but right now I am studying Hadoop and want to know how well dense MM can be done in Hadoop.

Thanks,
Mike |
In reply to this post by Michael Segel
I'd really be interested in a comparison of NumPy/Octave/Matlab-style tools with a Hadoop implementation (let's say 4-10 large cloud servers) as the size of the matrix grows. I want to know the scale at which Hadoop really starts to pull away.

-Ayon |
In reply to this post by Mike Spreitzer
Ok Mike,

First, I admire that you are studying Hadoop.

To answer your question... not well.

Might I suggest that if you want to learn Hadoop, you try to find a problem which can easily be broken into a series of parallel tasks where there are minimal communication requirements between each task?

No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase. Yes, it can be done. But it's not the best use of resources. |
In reply to this post by Ayon Sinha
A problem with matrix multiplication in Hadoop is that Hadoop is row oriented for the most part. I have thought about this use case, however, and you can theoretically turn a 2D matrix into a 1D matrix, which then fits the row-oriented nature of Hadoop.

Also, since a typical mapper can have fairly large chunks of memory, like 1024MB, I have done work like this before where I loaded such datasets into memory to process them. That usage does not really fit the map reduce model.

I have been wanting to look at: http://www.scidb.org/

Edward |
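As a rough illustration of the flattening idea in the message above (a sketch only, not code from Hadoop or from anyone in this thread): a dense matrix can be written as one text record per row, or fully flattened in row-major order so that element (i, j) sits at offset i * cols + j. All function names and the tab/comma record layout below are made up for this example.

```python
def flatten_row_major(matrix):
    """Flatten a dense 2D matrix (list of equal-length rows) into a 1D list."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    flat = [value for row in matrix for value in row]
    return rows, cols, flat


def element_at(flat, cols, i, j):
    """Recover element (i, j) from the row-major 1D layout."""
    return flat[i * cols + j]


def to_row_records(matrix):
    """Serialize each row as 'row_index<TAB>v0,v1,...' (hypothetical layout)."""
    return [
        "%d\t%s" % (i, ",".join(repr(float(v)) for v in row))
        for i, row in enumerate(matrix)
    ]


def parse_row_record(line):
    """Inverse of to_row_records for a single line."""
    key, values = line.split("\t")
    return int(key), [float(v) for v in values.split(",")]
```

One record per row keeps each line self-describing, which is what a line-oriented input format wants; the fully flat layout is more compact but needs the column count carried alongside it.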
In reply to this post by Michael Segel
Well, this mismatch may tell me something interesting about Hadoop. Matrix multiplication has a lot of inherent parallelism, so from very crude considerations it is not obvious that there should be a mismatch. Why is matrix multiplication ill-suited for Hadoop?

BTW, I looked into the Mahout documentation some, and did not find matrix multiplication there. It might be hidden inside one of the advertised algorithms; I looked at the documentation for a few, but did not notice mention of MM.

Thanks,
Mike |
On Friday, November 18, 2011, Mike Spreitzer <[hidden email]> wrote:
> Why is matrix multiplication ill-suited for Hadoop?

IMHO, a huge issue here is the JVM's inability to fully support CPU-vendor-specific SIMD instructions and, by extension, optimized BLAS routines. Running a large MM task using Intel's MKL rather than relying on generic compiler optimization is orders of magnitude faster on a single multicore processor. I see almost no way that Hadoop could win such a CPU-intensive task against an MPI cluster with even a tenth of the nodes running a decently tuned BLAS library. Racing even against a single CPU might be difficult, given the I/O overhead.

Still, it's a reasonably common problem and we shouldn't murder the good in favor of the best. I'm certain a MM/LinAlg Hadoop library with even mediocre performance, wrt C, would get used.

--
Mike Davis |
Perhaps this is a good candidate for a native library, then? |
Sounds like a job for next-gen MapReduce, native libraries, and GPUs. A modern-day Dr. Frankenstein for sure.

On Saturday, November 19, 2011, Tim Broberg <[hidden email]> wrote:
> Perhaps this is a good candidate for a native library, then? |
In reply to this post by Tim Broberg-2
Did you try Hama?

There are many methods:

1) Use Hadoop MPI, which lets you run MPI MM code on top of Hadoop.
2) Use Hama, which is designed for MM.
3) Use pure Hadoop Java MapReduce.

I did the third one before, but it may not be the optimal algorithm. Put your first matrix in the DistributedCache and feed the second matrix to the job line by line, one row per input record. For each line, a mapper multiplies that row (as an array) by the first matrix held in the DistributedCache. A reducer collects the result matrix. This algorithm is limited by your DistributedCache size, so it is suitable for multiplying a small matrix by a huge matrix.

Chen |
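To make the cached-matrix approach above a little more concrete, here is a small in-memory simulation. It is only a sketch under assumptions: it does not use the Hadoop API, the names (`map_row`, `reduce_collect`, `multiply_streamed_by_cached`) are invented, and the cached matrix is passed in as an ordinary argument where a real job would read it from the DistributedCache. Each "mapper" call multiplies one streamed row by the whole cached matrix, so the result is (streamed matrix) x (cached matrix).

```python
def map_row(row_index, row, cached):
    """Multiply one streamed row (1 x k) by the cached matrix (k x n)."""
    n = len(cached[0])
    out = [0.0] * n
    for k, x in enumerate(row):
        for j in range(n):
            out[j] += x * cached[k][j]
    # Key the output by row index so the rows can be reassembled in order.
    return row_index, out


def reduce_collect(mapped):
    """Collect mapper outputs into the result matrix, ordered by row index."""
    return [row for _, row in sorted(mapped)]


def multiply_streamed_by_cached(big_rows, cached):
    """Simulate the job: one map call per row of the big matrix, then collect."""
    mapped = [map_row(i, row, cached) for i, row in enumerate(big_rows)]
    return reduce_collect(mapped)
```

Because every map call needs the entire cached matrix, memory per task grows with that matrix, which is the size limit mentioned in the message. No data is exchanged between map calls, so the big matrix can have arbitrarily many rows.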
In reply to this post by Edward Capriolo
Right, I agree with Edward Capriolo: Hadoop + GPGPU is a better choice. |
In reply to this post by Chen He
I agree Hama (and the BSP model) could be a good option, plus Hama also supports MR nextgen now [1]. I know MM has been implemented with Hama in the past, so it may be worth asking on the mailing list.

My 2 cents,
Tommaso

[1] : http://svn.apache.org/repos/asf/incubator/hama/trunk/yarn/ |
In reply to this post by Edward Capriolo
You really don't need to wait...

If you're going to go down this path, you can use a JNI wrapper to call the C/C++ code for the GPU... You can do that now...

If you want to go beyond 1D, you can do it, but you have to get a bit creative... but it's doable...

Mike Segel |
In reply to this post by Mike Spreitzer
Hey Mike

In Mahout, one place where matrix multiplication is used is the distributed implementation of Collaborative Filtering. The recommendations there are generated by multiplying a co-occurrence matrix by a user vector. The user vector is treated as a single-column matrix, and the matrix multiplication is then carried out on that.

Regards
Bejoy K S |
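The matrix-times-vector step described above can be sketched as a map and a reduce over the columns of the co-occurrence matrix. This is an illustrative in-memory simulation, not Mahout's code: the function names and the dict-based sparse layout are assumptions made for the example. Each "map" call scales one column by the user's preference for that item, and the "reduce" sums the partial products per row.

```python
from collections import defaultdict


def map_column(j, column, user_vector):
    """Emit (row index, partial product) pairs for one column of the matrix.

    `column` maps row index -> co-occurrence count; `user_vector` maps
    item index -> preference. Items the user has no preference for emit nothing.
    """
    weight = user_vector.get(j, 0.0)
    if weight == 0.0:
        return []
    return [(i, value * weight) for i, value in column.items()]


def reduce_sum(pairs):
    """Sum the partial products for each row index."""
    totals = defaultdict(float)
    for i, partial in pairs:
        totals[i] += partial
    return dict(totals)


def recommend(cooccurrence_columns, user_vector):
    """Compute (co-occurrence matrix) x (user vector) column by column."""
    emitted = []
    for j, column in cooccurrence_columns.items():
        emitted.extend(map_column(j, column, user_vector))
    return reduce_sum(emitted)
```

Working column by column means a sparse user vector skips most of the matrix entirely, which fits the later remark in this thread that Mahout generally prefers sparse data for distributed work.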
Hi,

Two of the solutions suggested so far take advantage of special cases: (a) vector x matrix (your CF / Mahout example) or (b) small matrix x large matrix (the earlier suggestion of putting the small matrix into the DistributedCache). It is not yet clear what a good approach is for (c) large matrix x large matrix. |
Look for uses of the DistributedRowMatrix in the Mahout code. The existing Mahout jobs are generally end-to-end algorithm implementations which do things like matrix multiplication in the middle. Also, the Mahout algorithms generally prefer to use sparse data for distributed work.

What is a "large" matrix? You may find that you really don't need to go to the effort of using Hadoop.

Lance

--
Lance Norskog
[hidden email] |
I am looking at large dense matrix multiplication as an example problem for a class of middleware. I am also interested in sparse matrices, but am taking things one step at a time.

There is a paper in IEEE CloudCom '10 about Hama, including a matrix multiplication technique. It is essentially the same as what is called "technique 4" in the 2009 monograph by John Norstad cited early in this thread. This means that, although Hama touts the virtues of BSP (a position with which I am very sympathetic), this technique doesn't really take advantage of the extra features that BSP has over MapReduce.

Note also that this technique creates intermediate data of much greater volume than the input. For example, if each matrix is stored as an NxN grid of blocks, the intermediate data (the blocks paired up, awaiting multiplication) is a factor of N larger than the input. I have heard people say that N may be rather larger than sqrt(number of machines), because in some circumstances N has to be chosen before the number of available machines is known, and you want to be able to divide the NxN load among your machines rather evenly. Even if N is about sqrt(number of machines), this is still an unwelcome amount of bloat. In comparison, the SUMMA technique does matrix multiplication with an intermediate data volume no greater than the input.

Thanks,
Mike |
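The factor-of-N bloat described above can be checked with a small in-memory simulation of the block-pairing idea. This is a sketch under assumptions, not the Hama or Norstad code: blocks are plain nested lists, the names (`pair_up`, `multiply_and_sum`) are invented, and the "shuffle" is just a dictionary keyed by (i, j, k). With an NxN grid, each of the 2N^2 input blocks is copied N times, so 2N^3 blocks are emitted.

```python
def block_multiply(x, y):
    """Plain dense product of two blocks given as lists of rows."""
    inner, cols = len(y), len(y[0])
    return [
        [sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
        for row in x
    ]


def block_add(x, y):
    """Element-wise sum of two blocks of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def pair_up(a_blocks, b_blocks, n):
    """'Map' phase: copy every block to each (i, j, k) product that needs it."""
    shuffled = {}
    emitted = 0
    for (i, k), block in a_blocks.items():
        for j in range(n):  # A[i][k] is needed by every C[i][j]
            shuffled.setdefault((i, j, k), {})["a"] = block
            emitted += 1
    for (k, j), block in b_blocks.items():
        for i in range(n):  # B[k][j] is needed by every C[i][j]
            shuffled.setdefault((i, j, k), {})["b"] = block
            emitted += 1
    return shuffled, emitted


def multiply_and_sum(shuffled):
    """Multiply each paired (A[i][k], B[k][j]) and sum over k into C[i][j]."""
    result = {}
    for (i, j, k), pair in sorted(shuffled.items()):
        product = block_multiply(pair["a"], pair["b"])
        if (i, j) in result:
            result[(i, j)] = block_add(result[(i, j)], product)
        else:
            result[(i, j)] = product
    return result
```

For a 2x2 grid, `pair_up` emits 16 block copies from 8 input blocks, a ratio of 2; for a 3x3 grid it emits 54 from 18, a ratio of 3. The ratio always equals N, matching the estimate in the message.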