

hi all devs,
I have read through the dev mailing list and have a rough idea of the
progress of Mahout. I have read the Google paper and the NIPS paper.
As for the NIPS paper, "Map-Reduce for ML on Multicore", I have some
questions.
1. Sect. 4.1, Algorithm Time Complexity Analysis.
The paper assumes m >> n, i.e. that there are many more training
instances than features, and its datasets do have very few features. But
this may not be true for many tasks, e.g. text classification, where
feature dimensions can reach 10^4 to 10^5. Will the analysis still hold
then?
2. Sect. 4.1, too.
"reduce phase can minimize communication by combining data as it's
passed back; this accounts for the logP factor". Could you help me
figure out how logP is calculated?
3. Sect. 5.4, Results and Discussion.
"SVM gets about 13.6% speed up on average over 16 cores": is it 13.6%
or 13.6x? From Figure 2, it seems it should be 13.6x.
4. Sect. 5.4, too.
"Finally, the above are runs on multiprocessor machines." Whether
multiprocessor or multicore, everything runs on a single machine with
shared memory. But MapReduce is really meant for multiple machines,
which involves much higher inter-machine communication cost. So might
the results of the paper be questionable?
Maybe some of these questions are a complete misinterpretation. Please
help me get a full understanding of the paper, thanks.


On Tuesday 25 March 2008, Hao Zheng wrote:
> 1. Sect. 4.1, Algorithm Time Complexity Analysis.
> The paper assumes m >> n, i.e. that there are many more training
> instances than features, and its datasets do have very few features. But
> this may not be true for many tasks, e.g. text classification, where
> feature dimensions can reach 10^4 to 10^5. Will the analysis still hold?
What I could directly read from the paper in the very same section: the
analysis will not hold in this case for those algorithms that require matrix
inversions or eigen decompositions, as long as these operations are not
executed in parallel. The authors did not implement parallel versions of
these operations; the reason they state is the fact that in their datasets
m >> n.
The authors state themselves that there is extensive research on parallelising
eigen decomposition and matrix inversion as well, so if we assume that we do
have a matrix package that can do these operations in a distributed way, IMHO
the analysis in the paper should still hold even for algorithms that require
these steps.
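To make this concrete, here is a minimal sketch (my own toy code, not Mahout and not the paper's implementation) of the paper's "summation form" idea for linear regression via the normal equations: the per-example statistics parallelize over the m examples, while the final n x n solve is exactly the sequential matrix operation that stops being cheap once n gets large.

```python
import numpy as np

def map_partial_sums(X_chunk, y_chunk):
    # Each "mapper" computes partial sufficient statistics over its
    # slice of the m training examples.
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk

def reduce_and_solve(partials):
    # The "reducer" sums the n x n / n-vector partials and solves the
    # normal equations locally -- this last step is the sequential
    # O(n^3) part that dominates once n is no longer small.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
m, n, P = 10_000, 5, 4                      # m >> n, P "cores"
X = rng.normal(size=(m, n))
w_true = np.arange(1.0, n + 1)
y = X @ w_true                              # noiseless toy targets
chunks = np.array_split(np.arange(m), P)    # one slice per core
partials = [map_partial_sums(X[idx], y[idx]) for idx in chunks]
w = reduce_and_solve(partials)
```

With m >> n the per-chunk work dominates and parallelizes cleanly; with n around 10^4 to 10^5, as in text classification, the final solve (and any eigen decomposition) becomes the bottleneck unless that step is parallelized too.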
> 2. Sect. 4.1, too.
> "reduce phase can minimize communication by combining data as it's
> passed back; this accounts for the logP factor". Could you help me
> figure out how logP is calculated?
Anyone else who can help out here?
> 3. Sect. 5.4, Results and Discussion.
> "SVM gets about 13.6% speed up on average over 16 cores": is it 13.6%
> or 13.6x? From Figure 2, it seems it should be 13.6x.
The axes on the graphs do not have clear titles, but I would agree that it
should be 13.6x as well.
> 4. Sect. 5.4, too.
> "Finally, the above are runs on multiprocessor machines." Whether
> multiprocessor or multicore, everything runs on a single machine with
> shared memory.
The main motivation for the paper was the rise of multi-core machines, which
call for parallel algorithms even when one does not have a cluster available.
> But MapReduce is really meant for multiple machines, which involves much
> higher inter-machine communication cost. So might the results of the paper
> be questionable?
I think you should not expect to get the exact same speedups on multimachine
clusters. Still, I think one can expect faster computation for large datasets
even in this setting. What do others think?
Isabel

There is no TRUTH. There is no REALITY. There is no CONSISTENCY. There are
no ABSOLUTE STATEMENTS. I'm very probably wrong.
\ _,,,,,_ Web: < http://www.isabeldrost.de>
/,`.'`' . ;;;,_
,4 ) ),_..;\ ( `''
'''(_/' `'\_) (fL) IM: <xmpp:// [hidden email]>


On Mar 25, 2008, at 4:11 PM, Isabel Drost wrote:
>
>
>> 2. Sect. 4.1, too.
>> "reduce phase can minimize communication by combining data as it's
>> passed back; this accounts for the logP factor". Could you help me
>> figure out how logP is calculated?
>
> Anyone else who can help out here?
Isn't this just log(P) where P is the number of cores? The version of
the paper I have says log(P) not logP, so maybe there is a typo?
From earlier in 4.1:
"We assume that the dimension of the inputs is n (i.e., x ∈ R^n), that we
have m training examples, and that there are P cores"
Grant


It's log(P); I just don't know how log(P) is obtained.
On 3/26/08, Grant Ingersoll < [hidden email]> wrote:
> Isn't this just log(P) where P is the number of cores? The version of
> the paper I have says log(P) not logP, so maybe there is a typo?


Isabel, thanks for your answer.
For the 4th question: maybe we can still gain some speedup on
multimachine clusters, but I suspect we should also explicitly
consider the communication cost, which is nontrivial in such a setting.
What do you think?
On 3/26/08, Isabel Drost < [hidden email]> wrote:
> I think you should not expect to get the exact same speedups on multimachine
> clusters. Still I think one can expect faster computation for large datasets
> even in this setting. What do others think?


On Wednesday 26 March 2008, Hao Zheng wrote:
> for the 4th question: maybe we can still gain some speedup on
> multimachine clusters, but I suspect we should also explicitly
> consider the communication cost, which is nontrivial in such a setting.
> What do you think?
I agree with you. I could imagine that it is not as easy as for a multicore
machine to arrive at a correct estimate. There would be variables like the
amount of information sent over the network, the latency of the network,
bandwidth and the like...
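As a toy illustration of those variables (invented numbers, nothing measured), one can plug latency and bandwidth into a simple "alpha-beta" communication model and compare the same job on shared memory versus a gigabit network:

```python
import math

def comm_cost(nbytes, latency_s, bandwidth_bps):
    # Alpha-beta model: a fixed per-message latency plus transfer time.
    return latency_s + nbytes / bandwidth_bps

def parallel_time(t_seq, p, nbytes, latency_s, bandwidth_bps):
    # Compute shrinks by p; a tree combine adds log2(p) communication
    # steps on the critical path.
    return t_seq / p + math.log2(p) * comm_cost(nbytes, latency_s, bandwidth_bps)

t_seq = 10.0                 # hypothetical sequential running time (s)
nbytes = 8 * 100 * 100       # say, a 100x100 matrix of doubles per combine
p = 16

multicore = parallel_time(t_seq, p, nbytes, 1e-6, 10e9)    # shared memory
cluster = parallel_time(t_seq, p, nbytes, 1e-3, 125e6)     # roughly 1 GbE

speedup_multicore = t_seq / multicore
speedup_cluster = t_seq / cluster    # lower: latency and bandwidth bite
```

With a 10-second job the gap here is small, but shrink t_seq or grow the data passed per combine and the cluster numbers degrade quickly, which is why the paper's multicore speedups should not be carried over to clusters unchanged.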
Isabel

Any circuit design must contain at least one part which is obsolete, two parts
which are unobtainable, and three parts which are still under development.


On Mar 26, 2008, at 12:06 AM, Hao Zheng wrote:
> it's log(P). I just don't know how log(P) is obtained.
>
I am doing some guesswork here, but I think the log(P) factor arises
from combining the data that is passed back from the various
processors. Imagine a tree of processors with depth log(P); if at each
level the communication amounts to some function of n and m, f(m,n),
then the total communication cost would be
f(m,n) log(P).
~ Niranjan
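Niranjan's tree picture can be sketched like this (a serial toy simulation of the combine tree, not how an actual MapReduce runtime schedules it): P partial results are merged pairwise, all pairs in a round proceeding concurrently on real hardware, so the critical path sees one combine per level and there are ceil(log2(P)) levels.

```python
def tree_combine(values, combine):
    # Pairwise reduction tree: each while-iteration is one "level" in
    # which all pairs would combine concurrently on real hardware.
    levels = 0
    while len(values) > 1:
        merged = [combine(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd partial carries to the next level
            merged.append(values[-1])
        values = merged
        levels += 1
    return values[0], levels

# 16 "cores" each hold one partial sum; combining takes log2(16) = 4 levels.
total, levels = tree_combine(list(range(16)), lambda a, b: a + b)
```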


Let it be a tree of depth log(P); then, if the cost f(m,n) is paid at every
node, the total communication cost will be
f(m,n)*(1/2+1/4+1/8+...)*P = f(m,n)*P, not f(m,n)*log(P).
Do I see what you mean?
2008/3/27 Niranjan Balasubramanian < [hidden email]>:
> Imagine tree of processors with depth log (P) and at each
> level the communication amounts to some function of n and m, f(m,n),
> then the total communication cost would be
> f(m,n) log (P).


The cost is f(m,n) at each level of the tree (not at each node), because the
transfers within a level happen in parallel across the pairs of processors.
Thus you have f(m,n) x the number of levels, which gives f(m,n) log(P).
I might be way off here, but this is what I guess.
~ Niranjan
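The disagreement comes down to where the cost is charged. A toy tally (assuming a binary combine tree and a unit cost f per transfer, with made-up numbers) puts the two readings side by side:

```python
import math

def per_node_total(p, f):
    # Per-node reading: every internal node of the combine tree pays f,
    # and a binary tree over p leaves has p - 1 internal nodes
    # (p/2 + p/4 + ... + 1 = p - 1), so the total grows like p.
    return f * (p - 1)

def per_level_total(p, f):
    # Per-level reading: transfers within a level overlap in time, so
    # the critical path pays f once per level, i.e. log2(p) times.
    return f * int(math.log2(p))

f, p = 1.0, 16
nodes = per_node_total(p, f)     # total work done anywhere in the system
path = per_level_total(p, f)     # elapsed communication on the critical path
```

Both tallies are correct about different quantities: the total data moved scales with P, but the wall-clock communication time, which is what the paper's speedup analysis cares about, scales with log(P).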
On Mar 27, 2008, at 12:08 PM, Hao Zheng wrote:
> Let it be a tree of depth log(P); then, if the cost f(m,n) is paid at every
> node, the total communication cost will be
> f(m,n)*(1/2+1/4+1/8+...)*P = f(m,n)*P, not f(m,n)*log(P).
> Do I see what you mean?


If the cost is per level, you are right; maybe that is the exact
answer. But I am still not clear on how to implement it. Could anyone
provide a more detailed description?
2008/3/28 Niranjan Balasubramanian < [hidden email]>:
> The cost is f(m,n) at each level in the tree (not at each node).
> Thus, you have f(m,n) x the number of levels which gives f(m,n) log (P).
>
> I might be way off here but this is what I guess.
> ~ Niranjan

