

Dear Apache Community,
I am looking to perform a linear regression on a rather large amount
of data in my Hadoop cluster. It is part of my master's thesis at
Harvard University.
After perusing the docs on the Mahout site, it seems like the
following algorithms haven't been implemented yet:
Locally Weighted Linear Regression
Linear Regression
Logistic Regression
Basically, there is a stock market phenomenon which I'm trying to
predict. It is called a short squeeze. I have about 16,000 data points,
each a stock and a point in time where the phenomenon has occurred. I'm
trying to develop a predictive model in a Hadoop cluster.
The accuracy of the model doesn't matter much at this point. The goal,
and what would make my prof happy, is to see the cluster grinding away,
doing some relevant but perhaps not totally correct mathematical
operations. Read: if it's a linear regression I'll be happy, but if it
isn't possible yet I don't mind.
Can anyone suggest something I can use? I've downloaded Mahout 0.2 and
searched through it, but nothing for performing regressions has jumped
out at me.
Thank you.
Best,
Rajat


We don't have these right now. We had a Summer of Code student start on
Logistic Regression, but she didn't complete the project.
Can you say more about your problem? Are you saying that you have 16,000
predictor variables sampled in time and one prediction variable (presence of
short squeeze)? Or is it possible for short squeezes to be applied to
individual equities so that you have 16,000 time series each annotated with
whether a short squeeze occurred?
If the former, then you have a much bigger problem than just doing the
regression. If the latter, then you might be able to use some online
learning software like Vowpal Wabbit to do your job.
Can you say more?
On Mon, Dec 7, 2009 at 3:04 PM, Rajat Banerjee wrote:

Ted Dunning, CTO
DeepDyve


Dear Ted, Thanks for your prompt reply.
There are 16,000 rows of data. There are only four significant
variables in my hypothesis. The regression shouldn't be too nasty.
I've looked at some non-distributed libraries and they seem capable,
but I would love to get it started in Hadoop since that's my end goal.
Single-threaded:
http://www.ee.ucl.ac.uk/~mflanaga/java/Regression.html#sumgl
Thanks. Best,
Rajat
On Mon, Dec 7, 2009 at 6:21 PM, Ted Dunning wrote:


If there are only that few data points, you should just use R.
On Mon, Dec 7, 2009 at 3:29 PM, Rajat Banerjee wrote:


If you only have 4 variables and 16k rows, why do you need anything even
close to Hadoop? This is a problem which could be regressed on an
iPhone, couldn't it?
jake
On Mon, Dec 7, 2009 at 3:29 PM, Rajat Banerjee wrote:


Yeah, this is exactly what R was meant for. Running regressions in R is a
one-liner in many cases. There's no reason to burden yourself with an added
layer of complexity in using Hadoop, especially given the size of your data
set. I'd imagine with the kind of dataset you're dealing with, even a 1000x
increase in data (16 million rows) could be handled by a single machine (and
if you wanted to use R, you still have other options to get around its
memory constraints).
On Mon, Dec 7, 2009 at 6:33 PM, Jake Mannix wrote:

Zaki Rahaman
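
To make the single-machine point concrete, here is a minimal sketch of an ordinary least-squares fit on 16,000 rows with four variables, using only the Python standard library. It is an illustration, not anyone's actual code from this thread: the data is synthetic, and the names `fit_ols`, `solve_linear_system`, and `true_coefs` are made up for the example. It solves the normal equations (X^T X) b = X^T y directly, which is adequate for a handful of well-behaved variables.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Return [intercept, b1, ..., bk] from the normal equations."""
    k = len(rows[0]) + 1
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for features, y in zip(rows, targets):
        x = [1.0] + list(features)  # leading 1.0 models the intercept
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return solve_linear_system(xtx, xty)


# Synthetic stand-in for the real data set (assumption: 4 numeric predictors).
rng = random.Random(0)
true_coefs = [0.5, 2.0, -1.0, 0.25, 3.0]  # intercept followed by 4 slopes
rows = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16000)]
targets = [
    true_coefs[0]
    + sum(c * v for c, v in zip(true_coefs[1:], row))
    + rng.gauss(0, 0.1)
    for row in rows
]

coefs = fit_ols(rows, targets)
print([round(c, 3) for c in coefs])
```

The accumulation loop touches each row once, so the cost grows linearly with the number of rows; the final solve is a 5x5 system regardless of how many rows there are.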


Thanks, gentlemen, you were right.
Just finished coding it up in non-distributed mode with an external
library. It didn't take much time to run at all. Tomorrow I'll scale
it up to 16k rows and see how it does. Adding this small number of
rows makes absolutely no difference in running time.
Running the regression with 15 data points.
Operation #1 finished in 00:00:00.453.
Operation #2 finished in 00:00:00.375.
Running the regression with 30 data points.
Operation #1 finished in 00:00:00.516.
Operation #2 finished in 00:00:00.437.
Running the regression with 50 data points.
Operation #1 finished in 00:00:00.782.
Operation #2 finished in 00:00:00.453.
Running the regression with 75 data points.
Operation #1 finished in 00:00:00.516.
Operation #2 finished in 00:00:00.422.
Running the regression with 100 data points.
Operation #1 finished in 00:00:00.781.
Operation #2 finished in 00:00:00.657.


Hey Rajat,
> After perusing the docs on the Mahout site, it seems like the
> following algorithms haven't been implemented yet:
> Locally Weighted Linear Regression
> Linear Regression
Implementing LWLR was an initial goal of the project, since LWLR is also
mentioned in the Stanford paper that talks about doing machine learning in
a MapReduce way. I said I would look into implementing it a long time ago
(maybe a year or even one and a half) but so far just haven't gotten
around to actually doing it. I don't think that it would be too much work,
maybe a weekend and a few evenings. I probably just should try to get my
shit together and just implement it. Now there would be a bit more
motivation, knowing that there's someone who would actually use it.
Linear Regression is just a degenerate case of LWLR where all weights are
equal to 1.
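
The relationship between the two can be sketched as follows. This is an illustration only, not Mahout code: it assumes the common Gaussian-kernel form of LWLR, where each training row gets weight exp(-d^2 / (2 tau^2)) based on its distance d from the query point, and the names `weighted_fit`, `lwlr_predict`, and `tau` are chosen for the example. Passing a weight of 1 for every row turns the same routine into plain linear regression.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_fit(rows, targets, weights):
    """Solve (X^T W X) theta = X^T W y for theta = [intercept, slopes...]."""
    k = len(rows[0]) + 1
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for features, y, w in zip(rows, targets, weights):
        x = [1.0] + list(features)
        for i in range(k):
            xtwy[i] += w * x[i] * y
            for j in range(k):
                xtwx[i][j] += w * x[i] * x[j]
    return solve(xtwx, xtwy)


def predict(theta, query):
    return theta[0] + sum(t * q for t, q in zip(theta[1:], query))


def lwlr_predict(rows, targets, query, tau):
    """Fit a line weighted toward rows near `query`, then evaluate it there."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(row, query)) / (2 * tau * tau))
        for row in rows
    ]
    return predict(weighted_fit(rows, targets, weights), query)


# Toy data on a curve (y = x^2), where a single global line fits poorly.
rows = [[i / 100] for i in range(-100, 101)]
targets = [row[0] ** 2 for row in rows]

# All weights equal to 1: ordinary linear regression.
global_theta = weighted_fit(rows, targets, [1.0] * len(rows))
global_pred = predict(global_theta, [0.5])

# Gaussian weights centred on the query: locally weighted regression.
local_pred = lwlr_predict(rows, targets, [0.5], tau=0.1)

print(global_pred, local_pred)  # the true value at x = 0.5 is 0.25
```

Note that LWLR refits for every query point, which is part of why it is more expensive than a single global fit.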
> Basically, there is a stock market phenomenon which I'm trying to
> predict. It is called a short squeeze. I have about 16,000 data points,
> each a stock and a point in time where the phenomenon has occurred. I'm
> trying to develop a predictive model in a hadoop cluster.
As others have already pointed out, you wouldn't see a noticeable
difference when using Mahout to do this. It could easily be done on a
single machine. However, if it's not about this particular problem but
about implementing it in principle and showing that a speedup is possible,
it would make sense to implement it using Mahout/Hadoop. But for just
solving the regression problem I would just code it in Matlab (a one-liner
using the \ operator).
Alex
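
For anyone who does want the proof-of-principle version, here is a rough sketch of how linear regression decomposes into map and reduce steps. It is plain Python standing in for a cluster, not Hadoop or Mahout code, and the function names (`partial_sums`, `combine`, `solve`) are invented for the example. The idea is that X^T X and X^T y are sums over rows, so each chunk of data can compute its own partial sums independently (map), the partial sums are added together (reduce), and only the tiny final system is solved in one place.

```python
import random
from functools import reduce


def partial_sums(chunk):
    """Map step: X^T X and X^T y for one chunk of (features, y) rows."""
    k = len(chunk[0][0]) + 1
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for features, y in chunk:
        x = [1.0] + list(features)  # leading 1.0 models the intercept
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(left, right):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(left[0], right[0])]
    xty = [a + b for a, b in zip(left[1], right[1])]
    return xtx, xty


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Synthetic data: 16,000 rows, 4 predictors (an assumption for illustration).
rng = random.Random(1)
data = []
for _ in range(16000):
    f = [rng.uniform(-1, 1) for _ in range(4)]
    y = 1.0 + 2.0 * f[0] - 1.0 * f[1] + 0.5 * f[2] + rng.gauss(0, 0.1)
    data.append((f, y))

# Pretend each chunk lives on a different worker.
chunks = [data[i:i + 4000] for i in range(0, len(data), 4000)]
theta_chunked = solve(*reduce(combine, map(partial_sums, chunks)))

# Same computation in a single pass, for comparison.
theta_single = solve(*partial_sums(data))

print([round(t, 3) for t in theta_chunked])
```

Both routes produce the same coefficients up to floating-point rounding, which is the property that makes this kind of fit easy to distribute. With 16,000 rows the chunking buys nothing, as the thread already concluded.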

PGP Public Key: http://www.tuilmenau.de/~alhain/ahans.asc
Fingerprint: E110 4CA3 288A 93F3 5237 E904 A85B 4B18 CFDC 63E3

