RecommenderJob in mahout-0.4 returning 1.0 score for each recommendation

Jordi Abad
Hi,

I'm running RecommenderJob (mahout-0.4) on Hadoop like this:

hadoop-0.20 jar /mahout-distribution-0.4/mahout-core-0.4-job.jar
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
-Dmapred.input.dir=input -Dmapred.output.dir=output -s
SIMILARITY_TANIMOTO_COEFFICIENT -b true

The job works fine but when I examine the result I get things like:

12    [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,168:1.0,173:1.0,180:1.0,199:1.0]
14    [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,14:1.0,21:1.0,22:1.0,23:1.0]
...

I can't understand why each recommendation gets a score of 1.0. It doesn't
matter which SimilarityClass I set. I always get a score of 1.0.

My input file is a "boolean file" (1391374 rows) with values like:

1,6496241
1,4368916
1,4922226
1,4958662
...

If I run
"org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" job
over the same file I get good results for items.

Any ideas?

Thanks in advance.

Sean Owen
This is because all the ratings are implicitly 1.0 when there are no ratings.

But I actually think this is symptomatic of a problem, since I note
that those recommendations are quite suspiciously in order by item ID.
I am not sure the current state of the distributed recommender is
compatible with boolean data, but I am not an expert here --

Sebastian, can we discuss what might be going on here? In the
non-distributed code, items are given a "fake" estimated preference,
which is not actually an estimated preference (because that would
always be 1.0) but some other number that functions as a score --
average similarity to other items, for example. This is used for
ranking and is also returned as an "estimated preference" even though
it's not.

Can we do something like that here? Or is it already working this way
if certain values / options are set?

On Fri, Nov 26, 2010 at 6:26 PM, Jordi Abad <[hidden email]> wrote:

> I can't understand why each recommendation gets a score of 1.0. It doesn't
> matter which SimilarityClass I set. I always get a score of 1.0.

Sebastian Schelter-4
Hi Jordi,

That's because you are computing recommendations on *boolean* data (-b true).
No weight is attached to the preferences in that case: you either know
that a user likes something or you don't. As a result, you cannot assign
a weight to a computed recommendation either. That's where the 1.0s come
from.

Things might be clearer if we take a look at the math:

u = a user
i = an item not yet rated by u
N = all items similar to i

Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n)) /
sum(all n from N: abs(similarity(i,n)))

If all ratings have value 1 (because we use boolean data), the result of
the prediction can also only be 1.
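
The formula above can be checked numerically. What follows is a minimal Python sketch, not Mahout code: the function name `predict` and the similarity values are made up for illustration. It assumes non-negative similarities (as with Tanimoto), in which case the numerator and denominator coincide whenever every rating is 1.

```python
def predict(similarities, ratings):
    """Weighted average of the user's ratings, weighted by item similarity."""
    numerator = sum(s * r for s, r in zip(similarities, ratings))
    denominator = sum(abs(s) for s in similarities)
    return numerator / denominator


# Hypothetical similarities between candidate item i and three items n in N.
similarities = [0.8, 0.3, 0.05]

# Boolean data: every rating is 1, so the weights cancel out.
boolean_score = predict(similarities, [1.0, 1.0, 1.0])

# Real ratings: the weights now influence the result.
rated_score = predict(similarities, [5.0, 3.0, 1.0])

print(boolean_score, rated_score)
```

With real ratings the similarities shift the prediction; with boolean data they drop out entirely, which is why changing the similarity measure does not change the score.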

--sebastian



On 26.11.2010 at 19:26, Jordi Abad wrote:

> I can't understand why each recommendation gets a score of 1.0. It doesn't
> matter which SimilarityClass I set. I always get a score of 1.0.

Sean Owen
But is it then ranking the recommendations by the estimated pref? If
it's always 1, then the ordering is not meaningful.

Maybe it is; I just haven't looked at your changes in much detail
since you made them, although they looked broadly correct and proper.

On Fri, Nov 26, 2010 at 6:33 PM, Sebastian Schelter <[hidden email]> wrote:
> If all ratings have value 1 (because we use boolean data), the result of
> the prediction can also only be 1.

Sebastian Schelter-4
Hi Sean,

The prediction computation for boolean data is done in
AggregateAndRecommendReducer.reduceBooleanData().

It computes *all* possible items to recommend for the current user and
then writes out only the first n, with n being the number specified in
the --numRecommendations parameter given to RecommenderJob.
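
One possible reading of why the output looks sorted by item ID, sketched in Python under stated assumptions: if every candidate carries the same score and the candidates are visited in ascending item-ID order (an assumption about iteration order that this thread does not confirm), then keeping the first n yields the lowest item IDs. The function name `first_n` and the data are hypothetical, not taken from Mahout.

```python
def first_n(candidates, n):
    """Keep the n highest-scoring (item_id, score) pairs; ties keep input order."""
    # sorted() is stable, so pairs with equal scores stay in their original order.
    return sorted(candidates, key=lambda pair: -pair[1])[:n]


# All candidates score 1.0, as in the boolean case discussed above.
candidates = [(item_id, 1.0) for item_id in (1, 2, 3, 5, 6, 11, 168, 173)]

top = first_n(candidates, 3)
print(top)
```

When all scores tie, the ranking carries no information, and the result is decided purely by the order in which candidates arrive.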

Can you point me to the code where the non-distributed code handles the
problem of ranking them? We could certainly emulate that behaviour in
the distributed code too.

--sebastian



On 26.11.2010 at 19:35, Sean Owen wrote:

> But is it then ranking the recommendations by the estimated pref? If
> it's always 1, then the ordering is not meaningful.

Sean Owen
The behavior difference is fairly simple. Instead of a weighted
average of preferences (which will always equal 1.0), compute some
other function of those weights -- for example, the average of the
weights.

See GenericBooleanPrefItemBasedRecommender. It's actually just summing
the weights. This is nearly the same thing since the number of items
participating in the average is the same for all estimates. *Nearly*
the same since some can be NaN.
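
As a rough illustration of scoring by a function of the weights, here is a minimal Python sketch. It is not the code of GenericBooleanPrefItemBasedRecommender; the function name `score_boolean`, the item IDs, and the similarity values are hypothetical, and skipping NaN entries is an assumption made for this sketch.

```python
import math


def score_boolean(similarities):
    """Sum the similarities to the user's items, skipping NaN entries."""
    valid = [s for s in similarities if not math.isnan(s)]
    return sum(valid)


# Hypothetical candidates mapped to their similarities with the user's items.
candidates = {
    101: [0.9, 0.4, float("nan")],
    102: [0.2, 0.1, 0.3],
}

# Rank candidates by summed similarity, highest first.
ranked = sorted(candidates, key=lambda item: score_boolean(candidates[item]), reverse=True)
print(ranked)
```

Unlike the weighted average, this score varies between candidates, so the resulting order is meaningful even though every underlying preference is 1.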

It's an open question whether there aren't better functions of the
weights to use, but this is a fine start, IMHO.


On Fri, Nov 26, 2010 at 6:45 PM, Sebastian Schelter <[hidden email]> wrote:

> Can you point me to the code where the non-distributed code handles the
> problem of ranking them? We could certainly emulate that behaviour in
> the distributed code too.

Jordi Abad
Hi,

I applied the changes from MAHOUT-553 (thanks Sebastian!) to mahout-0.4.
Everything makes sense now. I've tried it with different similarities
(SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT,
SIMILARITY_UNCENTERED_COSINE) and it works fine (i.e. I get good
recommendations with different scores), but when I tried
SIMILARITY_PEARSON_CORRELATION, I got an empty part-00000 file. Is that
normal?

On Fri, Nov 26, 2010 at 7:50 PM, Sean Owen <[hidden email]> wrote:

> See GenericBooleanPrefItemBasedRecommender. It's actually just summing
> the weights.

Sebastian Schelter-4
Pearson correlation and boolean data don't fit together: all co-occurring
ratings have the value 1, so the compared vectors are identical and no
correlation can be computed.
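
To make this concrete, here is a small Python sketch of the Pearson correlation (a generic textbook implementation, not Mahout's). For constant vectors the variance is zero, so the denominator vanishes and the result is undefined. That would be consistent with an empty output file, although the thread does not confirm the exact mechanism inside the job.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either vector has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(var_x * var_y)
    if denominator == 0.0:
        # Constant vectors carry no variation to correlate.
        return float("nan")
    return cov / denominator


# Boolean data: every co-occurring rating is 1.
boolean_result = pearson([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])

# Real ratings with variation yield a defined correlation.
rated_result = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

print(boolean_result, rated_result)
```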

--sebastian

On 28.11.2010 at 11:28, Jordi Abad wrote:

> [...] when I tried SIMILARITY_PEARSON_CORRELATION, I got an empty part-00000
> file. Is that normal?

Jordi Abad
Ok Sebastian, thanks for the explanation. I'll study each similarity in more
detail.

On Sun, Nov 28, 2010 at 11:37 AM, Sebastian Schelter <[hidden email]> wrote:

> Pearson correlation and boolean data don't fit together: all co-occurring
> ratings have the value 1, so the compared vectors are identical and no
> correlation can be computed.