Mahout/Taste covariance between two items

Tamas Jambor-2
hi guys,
just wondering if you have a method implemented that calculates the covariance between two items, and the variance of an item. I looked at itemSimilarities, but that does something different.

thanks
Tama

Re: Mahout/Taste covariance between two items

Sean Owen
Yes. Look at PearsonCorrelationSimilarity. It implements
ItemSimilarity so it can compute a Pearson correlation between ratings
for two items. Pearson is the covariance divided by the product of the
standard deviations. So, just multiply the similarity value you get by
the product of the standard deviations of the two items' preference values.

The variance of each item's preference values is simply the square of
the standard deviation, if that's what you mean.

You can use RunningAverageAndStdDev to help compute standard deviation
if you like.
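
The relationship described above (covariance = Pearson correlation times both standard deviations, variance = standard deviation squared) can be sketched in plain Java. This is only an illustration under stated assumptions: it does not call Mahout, the class and method names are made up for this example, and it uses the sample (n - 1) standard deviation. When applying the idea to Mahout's similarity value, the standard deviations presumably need to be taken over the same set of users who rated both items, which is an assumption worth checking against the code.

```java
// Plain-Java sketch; hypothetical names, no Mahout classes involved.
class CovarianceFromPearson {

    // Arithmetic mean of the values.
    static double mean(double[] v) {
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double sampleStdDev(double[] v) {
        double m = mean(v);
        double sumSquares = 0.0;
        for (double d : v) {
            sumSquares += (d - m) * (d - m);
        }
        return Math.sqrt(sumSquares / (v.length - 1));
    }

    // Pearson correlation of two equal-length arrays.
    static double pearson(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Covariance recovered as correlation times both standard deviations.
    static double covariance(double[] x, double[] y) {
        return pearson(x, y) * sampleStdDev(x) * sampleStdDev(y);
    }

    // Variance is the square of the standard deviation.
    static double variance(double[] v) {
        double sd = sampleStdDev(v);
        return sd * sd;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 5, 4, 5};
        System.out.println("covariance = " + covariance(x, y));
        System.out.println("variance(x) = " + variance(x));
    }
}
```

Because the sample standard deviations divide by n - 1, the value recovered this way is the sample covariance.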


Re: Mahout/Taste covariance between two items

Tamas Jambor-2
great. thanks a lot.


Re: Mahout/Taste covariance between two items

Tamas Jambor-2
hi. I tried to figure out how you calculate pearson correlation, but it looks like you use this formula:

sumXY / sqrt(sumX2 * sumY2)

where sumXY = sumXY - meanY * sumX;
sumX2 = sumX2 - meanX * sumX;
sumY2 = sumY2 - meanY * sumY;

i don't really understand how you got these equations. could you explain it to me? I thought pearson correlation would be like this

E[(x_i - meanX)(y_i - meanY)] / (sdX * sdY)

for my project I would need to get sample correlation coefficient which would be something like this:

sum(x_i-meanX)(y_i-meanY)/(N-1)

could that just be derived from the values you've already calculated?

thanks a lot.


Re: Mahout/Taste covariance between two items

Sean Owen
On Fri, Nov 27, 2009 at 1:41 AM, jamborta <[hidden email]> wrote:
>
> hi. I tried to figure out how you calculate pearson correlation, but it looks
> like you use this formula:
>
> sumXY / sqrt(sumX2 * sumY2)

Yes that's right -- this is what Pearson reduces to when the means of X
and Y are 0. And they are here -- the implementation 'centers' the
data.

> where sumXY = sumXY - meanY * sumX;
> sumX2 = sumX2 - meanX * sumX;
> sumY2 = sumY2 - meanY * sumY;

You see the lines commented out there? Those are the full forms of the
expressions, which may make more sense. This is centering the data,
making the mean 0.

This is a simplification based on the observation that, for example,
sumX * meanY = sumY * meanX = n * meanY * meanX.
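
For reference, the simplification can be written out step by step. This is a worked derivation added for illustration, using n for the number of rating pairs and a bar for the mean:

```latex
\begin{align*}
\sum_i (x_i - \bar{x})(y_i - \bar{y})
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i - \bar{x}\sum_i y_i + n\,\bar{x}\,\bar{y} \\
  &= \sum_i x_i y_i - \bar{y}\sum_i x_i
  && \text{since } \bar{x}\sum_i y_i = n\,\bar{x}\,\bar{y}, \\[4pt]
\sum_i (x_i - \bar{x})^2
  &= \sum_i x_i^2 - 2\bar{x}\sum_i x_i + n\,\bar{x}^2 \\
  &= \sum_i x_i^2 - \bar{x}\sum_i x_i
  && \text{since } n\,\bar{x}^2 = \bar{x}\sum_i x_i .
\end{align*}
```

The first result corresponds to sumXY - meanY * sumX and the second to sumX2 - meanX * sumX; the expression for sumY2 follows in the same way.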

>
> i don't really understand how you got these equations. could you explain it
> to me? I thought pearson correlation would be like this
>
> E[(x_i - meanX)(y_i - meanY)] / (sdX * sdY)

That's right, that's the expression for a population correlation, but
we can really only compute a sample Pearson correlation coefficient,
yes:


> for my project I would need to get sample correlation coefficient which
> would be something like this:
>
> sum(x_i-meanX)(y_i-meanY)/(N-1)

Yeah that's fine too, this is another way of expressing the formula,
though you're missing the two standard deviations in the denominator.
It'll be clearer if I note that the means of X and Y are 0.
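
The formula discussed in this message, sumXY / sqrt(sumX2 * sumY2) with the three sums centered, can be sketched as follows. This is a self-contained illustration in plain Java, not a copy of the Mahout source; the class name and the "centered" variable names are invented for the example.

```java
// Plain-Java sketch of Pearson correlation from running sums.
// Names are hypothetical; this is not the Mahout implementation.
class CenteredPearson {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;

        // Centering shortcut: equivalent to summing over (x_i - meanX), (y_i - meanY).
        double centeredSumXY = sumXY - meanY * sumX;
        double centeredSumX2 = sumX2 - meanX * sumX;
        double centeredSumY2 = sumY2 - meanY * sumY;

        return centeredSumXY / Math.sqrt(centeredSumX2 * centeredSumY2);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 5, 4, 5};
        System.out.println("pearson = " + pearson(x, y));
    }
}
```

Any 1/(N-1) factor would appear in both the numerator and the denominator, so it cancels; that is why no N shows up in this form.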

Re: Mahout/Taste covariance between two items

Tamas Jambor-2
thank you. much clearer now.

so for my purpose this will do:

sumXY / (N - 1)

given that the data is 'centered'?

which hopefully would be the covariance of X and Y


Re: Mahout/Taste covariance between two items

Sean Owen
I'm not so familiar with this formula but you seem to be missing
something in the denominator... it's got to normalize somehow. I think
I said divide by standard deviation but that's not quite it. What you
are really summing are the products of z-scores.  See
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

But I think you should just use the formulation given in the code. It
should give the same result. At least I hope these aren't different
definitions of Pearson!
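
The z-score remark can be made concrete with the standard definition of the sample correlation coefficient. This worked equation is added for illustration, with s_X and s_Y denoting the sample standard deviations:

```latex
\begin{align*}
r_{xy}
  &= \frac{1}{N-1}\sum_{i=1}^{N}
     \left(\frac{x_i - \bar{x}}{s_X}\right)\left(\frac{y_i - \bar{y}}{s_Y}\right)
   = \frac{\dfrac{1}{N-1}\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{s_X\, s_Y} \\[6pt]
  &= \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i}(x_i - \bar{x})^2 \,\sum_{i}(y_i - \bar{y})^2}},
  \qquad
  s_X = \sqrt{\frac{\sum_{i}(x_i - \bar{x})^2}{N-1}} .
\end{align*}
```

The numerator of the middle expression is the sample covariance, so sumXY / (N - 1) on centered data is the covariance, and dividing it by s_X * s_Y gives the same value as sumXY / sqrt(sumX2 * sumY2).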


Re: Mahout/Taste covariance between two items

Tamas Jambor-2
i really just want to get the sample covariance which is:

sum(X_i - meanX)(Y_i - meanY) / (N - 1)

this is just

 pearson_x,y * sdX * sdY

i think sumXY / (N - 1) should be the right one.
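
A minimal sketch of that calculation, assuming sumXY is the centered sum from earlier in the thread and N is the number of rating pairs. It is plain Java with hypothetical names, not Mahout code:

```java
// Plain-Java sketch of sample covariance as centered sumXY / (N - 1).
// Names are hypothetical; this does not use any Mahout class.
class SampleCovariance {

    static double sampleCovariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
        }
        double meanY = sumY / n;

        // Centered sum of products: sum((x_i - meanX) * (y_i - meanY)).
        double centeredSumXY = sumXY - meanY * sumX;

        // Note the parentheses: divide by (n - 1), not by n and then subtract 1.
        return centeredSumXY / (n - 1);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2, 4, 5, 4, 5};
        System.out.println("sample covariance = " + sampleCovariance(x, y));
    }
}
```

Passing the same array twice gives the sample variance of that item's values.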
