some question about the nips paper

Hao Zheng-2
Hi all devs,

I have read through the dev mailing list and have a rough idea of the
progress of Mahout. I have also read the Google paper and the NIPS paper.
As for the NIPS paper, "Map-Reduce for Machine Learning on Multicore",
I have some questions.

1. Sect. 4.1, Algorithm Time Complexity Analysis.
The paper assumes m >> n, that is, there are many more training
instances than features. Its datasets do indeed have very few features,
but this may not be true for many tasks, e.g. text classification, where
the feature dimension can reach 10^4-10^5. Will the analysis still hold
then?

2. Sect. 4.1, too.
"reduce phase can minimize communication by combining data as it's
passed back; this accounts for the logP factor". Could you help me
figure out how log(P) is calculated?

3. Sect. 5.4, Results and Discussion.
"SVM gets about 13.6% speed up on average over 16 cores" -- is it 13.6%
or 13.6 (i.e., a 13.6x speedup)? From Figure 2, it seems it should be 13.6x.

4. Sect. 5.4, too.
"Finally, the above are runs on multiprocessor machines." Whether
multiprocessor or multicore, it still runs on a single machine with
shared memory. But M/R is really meant for multiple machines, which
involves much more inter-machine communication cost. So could the
results of the paper be questionable in that setting?

Maybe some of these questions are based on a misinterpretation. Please
help me get a full understanding of the paper, thanks.

Re: some question about the nips paper

Isabel Drost-3
On Tuesday 25 March 2008, Hao Zheng wrote:
> 1. Sect. 4.1 Algorithm Time Complexity Analysis.
> the paper assumes m >> n, that is, the training instances are much
> larger than the features. its datasets do have very few features. but
> this may not be true for many tasks, e.g. text classification, where
> feature dimensions will reach 10^4-10^5. then will the analysis still
> hold?

What I could read directly from the paper in the very same section: the
analysis will not hold in this case for those algorithms that require matrix
inversions or eigen decompositions, as long as these operations are not
executed in parallel. The authors did not implement parallel versions of
these operations; the reason they state is that in their datasets m >> n.

The authors state themselves that there is extensive research on parallelising
eigen decomposition and matrix inversion as well - so if we assume that we do
have a matrix package that can do these operations in a distributed way, IMHO
the analysis in the paper should still hold even for algorithms that require
these steps.
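
To add a rough sense of scale (my own estimate, not taken from the paper):
for an algorithm like linear regression, the reduce side has to solve an
n x n system, so forming the normal equations costs on the order of m*n^2
operations and the serial inversion on the order of n^3. With only a few
dozen features the n^3 term is negligible next to m*n^2, which is why
leaving it serial is harmless in the paper's experiments. With n around
10^4-10^5, as in text classification, n^3 reaches roughly 10^12-10^15 and
would dominate unless that step is parallelised or avoided altogether.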


> 2. Sect. 4.1, too.
> "reduce phase can minimize communication by combining data as it's
> passed back; this accounts for the logP factor", Could you help me
> figure out how logP is calculated.

Anyone else who can help out here?


> 3. Sect 5.4 Results and Discussion
> "SVM gets about 13.6% speed up on average over 16 cores", it's 13.6%
> or 13.6? From figure 2, it seems should be 13.6?

The axes on the graphs do not have clear titles, but I would agree that it
should be 13.6 as well.


> 4. Sect 5.4, too.
> "Finally, the above are runs on multiprocessor machines." No matter
> multiprocess or multicore, it runs on a single machine which have a
> share memory.

The main motivation for the paper was the rise of multi-core machines, which
call for parallel algorithms even when one does not have a cluster available.


> But actually, M/R is for multi-machine, which will involve much more cost on
> inter-machine communication. So the results of the paper may be
> questionable?

I think you should not expect exactly the same speedups on multi-machine
clusters. Still, I think one can expect faster computation for large datasets
even in that setting. What do others think?



Isabel



Re: some question about the nips paper

Grant Ingersoll-2

On Mar 25, 2008, at 4:11 PM, Isabel Drost wrote:
>
>
>> 2. Sect. 4.1, too.
>> "reduce phase can minimize communication by combining data as it's
>> passed back; this accounts for the logP factor", Could you help me
>> figure out how logP is calculated.
>
> Anyone else who can help out here?

Isn't this just log(P) where P is the number of cores?  The version of  
the paper I have says log(P) not logP, so maybe there is a typo?

From earlier in 4.1:
"We assume that the dimension of the inputs is n (i.e., x ∈ R^n), that we
have m training examples, and that there are P cores."


-Grant



Re: some question about the nips paper

Hao Zheng-2
It's log(P). I just don't know how log(P) is obtained.


Re: some question about the nips paper

Hao Zheng-2
In reply to this post by Isabel Drost-3
Isabel, thanks for your answer.
For the 4th question: maybe we can still gain some speedup on
multi-machine clusters, but I suspect that we should also explicitly
consider the communication cost, which is non-trivial in such a setting.
What do you think?


Re: some question about the nips paper

Isabel Drost-3
On Wednesday 26 March 2008, Hao Zheng wrote:
> for the 4th question, maybe we can still gain some speedups on
> multi-machine clusters. But I suspect that we should also explicitly
> consider the communication cost, which is non-trivial in such setting.
> What do you think?

I agree with you. I could imagine that it is not as easy as on a multi-core
machine to get an accurate estimate. There would be variables like the
amount of information sent over the network, network latency, bandwidth and
the like...
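
Just to make my hand-waving slightly more concrete, the back-of-the-envelope
model I have in mind would be something like the sketch below (entirely my
own guess with made-up parameters, not something from the paper or measured
on a real cluster):

    import math

    def estimated_wall_time(p, compute_secs, combine_bytes,
                            latency_secs, bandwidth_bytes_per_sec):
        """Crude cluster model: perfect speedup of the map work over p
        machines, plus a log2(p)-deep combine tree where every level pays
        one network round trip and one transfer of the combined data."""
        levels = math.ceil(math.log2(p))
        communication = levels * (latency_secs
                                  + combine_bytes / bandwidth_bytes_per_sec)
        return compute_secs / p + communication

On a multi-core machine the latency and bandwidth terms are so small that
the second part hardly matters; over a real network they can easily dominate
for small datasets.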

Isabel



Re: some question about the nips paper

Niranjan Balasubramanian
In reply to this post by Hao Zheng-2
On Mar 26, 2008, at 12:06 AM, Hao Zheng wrote:
> it's log(P). I just don't know how log(P) is obtained.
>

I am doing some guesswork here, but I think the log(P) factor arises from
combining the data that is passed back from the various processors. Imagine
a tree of processors with depth log(P); if at each level the communication
amounts to some function of m and n, f(m,n), then the total communication
cost would be

f(m,n) log(P).
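
To make that concrete, here is a rough Python sketch (purely my own
illustration of the guess above, not Mahout or Hadoop code) of such a
pairwise combine; the number of sequential rounds it takes comes out as
log2(P):

    def tree_reduce(partials, combine):
        """Combine P partial results pairwise; each while-iteration is one
        'level' of the tree, and all merges within a level could run in
        parallel on different cores."""
        levels = 0
        while len(partials) > 1:
            merged = [combine(a, b)
                      for a, b in zip(partials[0::2], partials[1::2])]
            if len(partials) % 2:   # odd element is carried up unchanged
                merged.append(partials[-1])
            partials = merged
            levels += 1
        return partials[0], levels

    # e.g. summing 16 partial gradients:
    # total, levels = tree_reduce(list_of_16_gradients, lambda a, b: a + b)
    # levels == 4 == log2(16); each level adds one combine of size f(m,n)
    # to the critical path, hence f(m,n) * log(P) communication time.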

~ Niranjan


Re: some question about the nips paper

Hao Zheng-2
If it is a tree of depth log(P), then the total communication cost will be
f(m,n) * (1/2 + 1/4 + 1/8 + ...) * P ≈ f(m,n) * P, not f(m,n) * log(P).
Do I understand what you mean correctly?


Re: some question about the nips paper

Niranjan Balasubramanian
The cost is f(m,n) at each level in the tree (not at each node).
Thus you have f(m,n) times the number of levels, which gives f(m,n) log(P).

I might be way off here but this is what I guess.
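
To put numbers on it (again my own back-of-the-envelope reading, not from
the paper): with P = 16 there are 8 + 4 + 2 + 1 = 15 merges in total, which
is where a factor of roughly P appears if you charge every merge separately.
But the 8 merges of the first level run at the same time on different
processors, the next 4 likewise, and so on, so on the critical path you only
wait for log2(16) = 4 merge steps of cost f(m,n) each. The total work summed
over all processors is indeed about f(m,n) * P, while the wall-clock
communication time is f(m,n) * log(P).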
~ Niranjan





Re: some question about the nips paper

Hao Zheng-2
If the cost is per level, you are right; maybe that is the exact answer.
But I am still not clear on how this level-wise combining would be
implemented. Could anyone provide a more detailed description?
