|
Hi,
I'm running a RecommenderJob (mahout-0.4 version) over hadoop like this:

    hadoop-0.20 jar /mahout-distribution-0.4/mahout-core-0.4-job.jar \
      org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
      -Dmapred.input.dir=input -Dmapred.output.dir=output \
      -s SIMILARITY_TANIMOTO_COEFFICIENT -b true

The job works fine but when I examine the result I get things like:

    12 [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,168:1.0,173:1.0,180:1.0,199:1.0]
    14 [1:1.0,2:1.0,3:1.0,5:1.0,6:1.0,11:1.0,14:1.0,21:1.0,22:1.0,23:1.0]
    ...

I can't understand why each recommendation gets 1.0 of score. It doesn't matter which SimilarityClass I set. I always get a score of 1.0.

My input file is a "boolean file" (1391374 rows) with values like:

    1,6496241
    1,4368916
    1,4922226
    1,4958662
    ...

If I run the "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob" job over the same file I get good results for items.

Any ideas?

Thanks in advance.
|
This is because all the ratings are implicitly 1.0 when there are no ratings.
But I actually think this is symptomatic of a problem, since I note that those recommendations are suspiciously ordered by item ID. I am not sure the current state of the distributed recommender is compatible with boolean data, but I am not an expert here. Sebastian, can we discuss what might be going on?

In the non-distributed code, items are given a "fake" estimated preference, which is not actually an estimated preference (because that would always be 1.0) but some other number that functions as a score: average similarity to other items, for example. This is used for ranking and is also returned as the "estimated preference" even though it isn't one. Can we do something like that here? Or does it already work this way if certain values or options are set?

On Fri, Nov 26, 2010 at 6:26 PM, Jordi Abad <[hidden email]> wrote:
> I can't understand why each recommendation gets 1.0 of score. It doesn't
> matter which SimilarityClass I set. I always get a score of 1.0.
|
Hi Jordi,
That's because you compute recommendations on *boolean* data (-b true). There is no weight involved in the preferences then: you either know that a user likes something or you don't. As a result, you cannot assign a weight to a computed recommendation either. That's where the 1.0s are coming from.

Things might be clearer if we take a look at the math:

    u = a user
    i = an item not yet rated by u
    N = all items similar to i

    Prediction(u,i) = sum(all n from N: similarity(i,n) * rating(u,n))
                      / sum(all n from N: abs(similarity(i,n)))

If all ratings have value 1 (because we use boolean data), the result of the Prediction can also only be 1.

--sebastian

On 26.11.2010 19:26, Jordi Abad wrote:
> I can't understand why each recommendation gets 1.0 of score. It doesn't
> matter which SimilarityClass I set. I always get a score of 1.0.
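To make the formula above concrete, here is a minimal sketch in plain Python. It is not Mahout's code; the function name `predict` and the similarity values are made up for illustration, and the similarities are assumed to be non-negative (as they are for the Tanimoto coefficient). With every rating equal to 1, the numerator and denominator are the same sum, so the prediction is 1.0 regardless of how similar the items are.

```python
def predict(similarities, ratings):
    """Similarity-weighted average of a user's ratings (hypothetical helper)."""
    numerator = sum(s * r for s, r in zip(similarities, ratings))
    denominator = sum(abs(s) for s in similarities)
    return numerator / denominator


# Illustrative similarity values, not taken from any real data set.
strong_neighbours = [0.9, 0.5, 0.1]
weak_neighbours = [0.2, 0.1, 0.05]

# Boolean data: every known preference is implicitly 1.0.
boolean_ratings = [1.0, 1.0, 1.0]
score_strong = predict(strong_neighbours, boolean_ratings)
score_weak = predict(weak_neighbours, boolean_ratings)

# Graded ratings, for contrast: the prediction now depends on the weights.
score_graded = predict(strong_neighbours, [5.0, 3.0, 1.0])

print(score_strong, score_weak, score_graded)
```

Both boolean scores come out as 1.0 even though one candidate has much stronger neighbours than the other, which is why the score carries no ranking information in this setting.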
|
But is it then ranking the recommendations by the estimated pref? If it's always 1, then the ordering is not meaningful.

Maybe it is; I just haven't looked at your changes in much detail since you made them, although they looked broadly correct and proper.

On Fri, Nov 26, 2010 at 6:33 PM, Sebastian Schelter <[hidden email]> wrote:
> If all ratings have value 1 (because we use boolean data) the result of
> the Prediction can also only be 1.
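The concern about ordering can be illustrated with a small Python sketch. This is only one plausible reading of why the output appears sorted by item ID, not a description of what the Mahout job does internally; the helper `top_n` and the candidate list are hypothetical. When every candidate has the same score, a stable sort by score leaves the candidates in their original order, so "the first n" are simply the first n that were seen.

```python
def top_n(candidates, n):
    """Return the n highest-scoring (item_id, score) pairs (hypothetical helper)."""
    # sorted() is stable, also with reverse=True: ties keep their input order.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Candidates arrive in item-ID order, all with the constant boolean score.
tied = [(1, 1.0), (2, 1.0), (3, 1.0), (5, 1.0), (6, 1.0), (11, 1.0), (168, 1.0)]
tied_result = [item for item, _ in top_n(tied, 3)]

# With distinct scores (made-up values) the ranking becomes meaningful.
scored = [(1, 0.2), (2, 0.1), (3, 0.4), (5, 0.9), (6, 0.3), (11, 0.7), (168, 0.8)]
scored_result = [item for item, _ in top_n(scored, 3)]

print(tied_result, scored_result)
```

In the tied case the result is just the lowest item IDs, which matches the pattern in the output quoted at the start of the thread.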
|
Hi Sean,
The prediction computation for boolean data is done in AggregateAndRecommendReducer.reduceBooleanData().

It computes *all* possible items to recommend for the current user and then writes out only the first n, with n being the number specified in the --numRecommendations parameter given to RecommenderJob.

Can you point me to where the non-distributed code handles the problem of ranking them? We could certainly emulate that behaviour in the distributed code too.

--sebastian

On 26.11.2010 19:35, Sean Owen wrote:
> But is it then ranking the recommendations by the estimated pref? If
> it's always 1, then the ordering is not meaningful.
|
The behavior difference is fairly simple. Instead of a weighted average of preferences (which will always equal 1.0), compute some other function of those weights, for example the average of the weights.

See GenericBooleanPrefItemBasedRecommender. It actually just sums the weights. This is nearly the same thing, since the number of items participating in the average is the same for all estimates. *Nearly* the same, since some can be NaN.

It's an open question whether there are better functions of the weights to use, but this is a fine start, IMHO.

On Fri, Nov 26, 2010 at 6:45 PM, Sebastian Schelter <[hidden email]> wrote:
> Can you point me to the code where the non-distributed code handles the
> problem of ranking them? We could certainly emulate that behaviour in
> the distributed code too.
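The idea of scoring by the summed weights can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the code of GenericBooleanPrefItemBasedRecommender: the helper names, the decision to skip NaN similarities, and the similarity values are all hypothetical.

```python
import math


def summed_similarity(similarities):
    """Score a candidate by the sum of its similarities, skipping NaN entries."""
    return sum(s for s in similarities if not math.isnan(s))


def rank_candidates(candidates, n):
    """Return the n candidate item IDs with the highest summed similarity."""
    scored = [(item, summed_similarity(sims)) for item, sims in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# For each candidate item: its similarity to each item the user already has.
# Values are made up; one entry is NaN to mimic an undefined similarity.
candidates = {
    1: [0.1, 0.2],
    23: [0.4, float("nan")],
    168: [0.8, 0.7],
}

ranking = rank_candidates(candidates, 3)
ranked_items = [item for item, _ in ranking]
print(ranking)
```

Unlike the weighted average, this score differs between candidates, so sorting by it gives an ordering that reflects how strongly each candidate is tied to the user's existing items. The NaN case also shows why summing and averaging are only "nearly" equivalent: a candidate with an undefined similarity contributes fewer terms.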
|
Hi,
I applied the changes of MAHOUT-553 (thanks Sebastian!) against mahout-0.4. Everything makes sense now. I've tried it with different similarities (SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE) and it works fine (i.e. I get good recommendations with different scores), but when I tried SIMILARITY_PEARSON_CORRELATION I got an empty part-00000 file. Is that normal?

On Fri, Nov 26, 2010 at 7:50 PM, Sean Owen <[hidden email]> wrote:
> See GenericBooleanPrefItemBasedRecommender. It's actually just summing
> the weights. This is nearly the same thing since the number of items
> participating in the average is the same for all estimates.
|
Pearson correlation and boolean data don't fit together: all co-occurring ratings have the value 1, so the compared vectors are identical and no correlation can be computed.

--sebastian

On 28.11.2010 11:28, Jordi Abad wrote:
> [...] when I tried SIMILARITY_PEARSON_CORRELATION, I got an empty
> part-00000 file. Is it normal?
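A short Python sketch shows why the correlation is undefined here. It implements the textbook Pearson formula, not Mahout's similarity class, and returning NaN for the degenerate case is a choice made for this example; how Mahout reports that case is not shown in this thread beyond the empty output file. A vector whose entries are all 1 has zero variance, so the denominator of the correlation is zero.

```python
import math


def pearson(xs, ys):
    """Textbook Pearson correlation; returns NaN when a vector has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(variance_x * variance_y)
    if denominator == 0.0:
        return float("nan")  # correlation is undefined for constant vectors
    return covariance / denominator


# Boolean data: the co-occurring ratings of two items are all 1.
boolean_result = pearson([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])

# Graded ratings (made-up values) vary, so a correlation exists.
graded_result = pearson([5.0, 3.0, 1.0], [4.0, 2.0, 1.0])

print(boolean_result, graded_result)
```

If no item pair has a defined similarity, there is nothing to base recommendations on, which is consistent with the empty part-00000 file reported above.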
|
Ok Sebastian, thanks for the explanation. I'll study each similarity in more detail.

On Sun, Nov 28, 2010 at 11:37 AM, Sebastian Schelter <[hidden email]> wrote:
> Pearson-Correlation and boolean data don't fit, all cooccurring ratings
> will have value 1 and therefore no correlation can be computed as the
> compared vectors are identical.
