[gsoc] Collaborative filtering algorithms

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[gsoc] Collaborative filtering algorithms

Allen Chue
Hi mahout folks,

For this year's GSoC, I'm particularly interested in CF-related
algorithms running on MapReduce-like environments. Will anyone tell me
about the current status of recommender algorithms in Mahout please?
Does it need any improvement?

Thanks a lot.

--
Yin Qiu
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Sean Owen
Yes there is a framework in the code for running a Recommender across
machines in Hadoop, and a Hadoop job which distributes part of the
processing for a slope one recommender.

Both could use testing, refinement and enhancement.

I do not know of an algorithm which is by nature efficiently distributable.
Finding and implementing such a thing would be great.

I would be the person to contact about this so feel free to run your
proposals by me.

On 4 Mar 2009, 7:15 AM, "QIU, Yin" <[hidden email]> wrote:

Hi mahout folks,

For this year's GSoC, I'm particularly interested in CF-related
algorithms running on MapReduce-like environments. Will anyone tell me
about the current status of recommender algorithms in Mahout please?
Does it need any improvement?

Thanks a lot.

--
Yin Qiu
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Grant Ingersoll-2

On Mar 4, 2009, at 3:55 AM, Sean Owen wrote:

> Yes there is a framework in the code for running a Recommender across
> machines in Hadoop, and a Hadoop job which distributes part of the
> processing for a slope one recommender.
>
> Both could use testing, refinement and enhancement.
>
> I do not know of an algorithm which is by nature efficiently  
> distributable.
> Finding and implementing such a thing would be great.
>
> I would be the person to contact about this so feel free to run your
> proposals by me.

Sean definitely is the expert, but please don't have discussions off  
list (if that is what is implied here).  We'll all benefit from the  
discussion and I know I've learned a lot about CF already just from  
what has taken place so far.

Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Sean Owen
(Oops, of course. Didn't mean to imply there should be a side conversation
but that's how it came out. I just mean there is definitely at least one
person here who could and would 'mentor' such a project.)

On 4 Mar 2009, 12:20 PM, "Grant Ingersoll" <[hidden email]> wrote:

On Mar 4, 2009, at 3:55 AM, Sean Owen wrote: > Yes there is a framework in
the code for running a ...
Sean definitely is the expert, but please don't have discussions off list
(if that is what is implied here).  We'll all benefit from the discussion
and I know I've learned a lot about CF already just from what has taken
place so far.
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Allen Chue
In reply to this post by Sean Owen
Hi!

> Yes there is a framework in the code for running a Recommender across
> machines in Hadoop, and a Hadoop job which distributes part of the
> processing for a slope one recommender.

I don't know slope one recommender yet. Maybe I should read that first
to know how you manage to divide the tasks. However, a little
explanation in advance would be appreciated.

> Both could use testing, refinement and enhancement.

By "refinement and enhancement", could you be more specific?

> I do not know of an algorithm which is by nature efficiently distributable.
> Finding and implementing such a thing would be great.

Actually I don't know either. But I have two naive clues.

First, as Hofmann introduced pLSA into CF [3] and I heard SVD on
MapReduce had been tackled (is that true?), is it possible to port his
algorithm to Mahout?

Another one. I know that Canny proposed an algorithm [1] that runs on
different nodes, theoretically without a central database, though for
the sake of privacy. Wang et al. also suggested CF for P2P systems
[2]. But I don't know if they are helpful for defining Hadoop jobs.

> I would be the person to contact about this so feel free to run your
> proposals by me.

I get it. And I won't let this discussion go off the list. :-)


[1] J. Canny. Collaborative Filtering with Privacy. In Proceedings of
IEEE Symposium on Security and Privacy, 2002.
[2] J. Wang, J. Pouwelse, R. L. Lagendijk and M. J. T. Reinders.
Distributed collaborative filtering for peer-to-peer file sharing
systems. In SAC '06: Proceedings of the 2006 ACM symposium on Applied
computing, p/p. 1026-1030. 2006.
[3] T. Hofmann. Latent semantic models for collaborative filtering.
ACM Transactions on Information Systems, volume 22, p.p. 89-115. 2004.

--
Yin Qiu

3
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Sean Owen
On Thu, Mar 5, 2009 at 7:27 AM, QIU, Yin <[hidden email]> wrote:
> I don't know slope one recommender yet. Maybe I should read that first
> to know how you manage to divide the tasks. However, a little
> explanation in advance would be appreciated.

http://en.wikipedia.org/wiki/Slope_One explains slope one pretty well.

> By "refinement and enhancement", could you be more specific?

Really, I have never run this code in a real Hadoop environment. There
could be bugs, or improvements, that fall out from that. For example
there might be some more efficient way to use Hadoop that I don't see.
I don't have anything specific in mind -- these are unknown-unknowns
to me. But I think this could form part of a decent project.

> First, as Hofmann introduced pLSA into CF [3] and I heard SVD on
> MapReduce had been tackled (is that true?), is it possible to port his
> algorithm to Mahout?

This would be a fantastic project, implementing a Recommender based on
this approach . I tried implementing an SVD technique a couple years
ago and it was waaay too slow on one machine. Revisiting with Hadoop
sounds great.

> Another one. I know that Canny proposed an algorithm [1] that runs on
> different nodes, theoretically without a central database, though for
> the sake of privacy. Wang et al. also suggested CF for P2P systems
> [2]. But I don't know if they are helpful for defining Hadoop jobs.

It's interesting, and I personally find this a worthy project too. On
my list of priorities, I don't find a Recommender that prioritizes
privacy or minimizing information sharing as compelling. In most
real-world cases where exposing preference data might be a concern, I
think it can be solved by just using opaque user/item IDs or
something. But, I wouldn't object if someone thought they could
implement this usefully.
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Allen Chue
Hi Sean,

> Really, I have never run this code in a real Hadoop environment. There
> could be bugs, or improvements, that fall out from that. For example
> there might be some more efficient way to use Hadoop that I don't see.
> I don't have anything specific in mind -- these are unknown-unknowns
> to me. But I think this could form part of a decent project.

Okay. I won't comment on this before I get to know slope one.

> This would be a fantastic project, implementing a Recommender based on
> this approach . I tried implementing an SVD technique a couple years
> ago and it was waaay too slow on one machine. Revisiting with Hadoop
> sounds great.

Glad that you are so positive about this. I just googled and found the
article addressing parallel SVD [1], which was devised by Google. I
shall spend some time reading this. If we are really going to do this
project, implementing only the SVD part would be, in my opinion, good
enough. We can leave implementation of those algorithm relying on SVD
as later work.

> It's interesting, and I personally find this a worthy project too. On
> my list of priorities, I don't find a Recommender that prioritizes
> privacy or minimizing information sharing as compelling. In most
> real-world cases where exposing preference data might be a concern, I
> think it can be solved by just using opaque user/item IDs or
> something. But, I wouldn't object if someone thought they could
> implement this usefully.

Privacy was not my concern. I was talking about whether we can get
some inspiration from the idea that the CF process can be distributed
across multiple nodes, though unfortunately, I haven't got a clue :(


[1] Gengxin Miao, Yangqiu Song, Dong Zhang, and  Hongjie Bai. Parallel
Spectral Clustering Algorithm for Large-Scale Community Data Mining.
http://yqsong.googlepages.com/swsm08_submission_16.pdf. 2008.

--
Yin Qiu
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Sean Owen
On Thu, Mar 5, 2009 at 1:08 PM, QIU, Yin <[hidden email]> wrote:
> Glad that you are so positive about this. I just googled and found the
> article addressing parallel SVD [1], which was devised by Google. I
> shall spend some time reading this. If we are really going to do this
> project, implementing only the SVD part would be, in my opinion, good
> enough. We can leave implementation of those algorithm relying on SVD
> as later work.

I agree that getting a parallel SVD running is in and of itself
probably a good project in terms of size. On the other hand it would
be better to end up with a basic recommender as a final product. But
even if SVD by itself doesn't make up a complete unit by itself for
collaborative filtering purposes, it does seem interesting enough as a
unit within the broader mandate of Mahout as a machine learning
project. So I personally could support this as a project indeed.

I suppose I'd say the first step is to see if anyone's done SVD on
Hadoop yet, and if so, finish the recommender. If not, SVD is useful
by itself.


> Privacy was not my concern. I was talking about whether we can get
> some inspiration from the idea that the CF process can be distributed
> across multiple nodes, though unfortunately, I haven't got a clue :(

I think you have a good point. Skimming the paper again I get the
sense it itself is a sort of distributed SVD approach, so perhaps it
is the same idea as above.

All in all this sounds like a great area for a project, in my opinion.
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Jason Rennie-2
In reply to this post by Sean Owen
On Thu, Mar 5, 2009 at 4:24 AM, Sean Owen <[hidden email]> wrote:

> This would be a fantastic project, implementing a Recommender based on
> this approach . I tried implementing an SVD technique a couple years
> ago and it was waaay too slow on one machine. Revisiting with Hadoop
> sounds great.


SVD (at least how it is used for CF---matrix approximation) is really just
least squares, which is not hard (quite simple if you have a Non-linear
Conjugate Gradients solver) and easy to regularize to get even better
answers.  I'd be happy to discuss further if anyone is interested (did a
large portion of my PhD work in this area).

Cheers,

Jason

--
Jason Rennie
Research Scientist, ITA Software
http://www.itasoftware.com/
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Ted Dunning
In reply to this post by Sean Owen
Simple co-occurrence counting is at the heart of most large-scale
recommendation systems.  Counting plus simple (but sound) statistical
filtering suffices for a broad range of recommendation tasks with very high
quality results.  For statistical filtering, I typically recommend the G^2
statistic as a heuristic score (see my blog about surprise and
coincidence<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html>for
details).  These are completely scalable algorithms but Mahout doesn't
have an implementation.

On Wed, Mar 4, 2009 at 12:55 AM, Sean Owen <[hidden email]> wrote:

> I do not know of an algorithm which is by nature efficiently distributable.
> Finding and implementing such a thing would be great.
>



--
Ted Dunning, CTO
DeepDyve
Reply | Threaded
Open this post in threaded view
|

Re: [gsoc] Collaborative filtering algorithms

Ted Dunning
In reply to this post by Jason Rennie-2
SVD or a cousin was a very common feature among the leading netflix
entries.  SVD is, indeed, very slow if you do a complete decomposition.  The
point, of course, for large sparse matrices is that you want an
approximation so you only compute the first few singular vectors/values.  To
do this efficiently on large sparse matrices you need to use something akin
to Lanczos algorithm or a good solver as Jason suggests.  The key trick here
is to only compute a few singular vectors and to avoid fill-in of the
original matrix.

The key step in either the solver or Lanczos algorithms is multiplication by
the observation matrix.  This is easily parallelized and by far dominates
the computation.

One really interesting parallel is that the Gibb's sampler for MDCA or LDA
has a very similar structure to power methods like Lanczos algorithm, just
with a different cost function.  Since these probabilistic approaches are
known to be considerably better than LSA that is an exciting parallel.  The
improvement from LSA to LDA is similar to the improvement from Pearson's
chi^2 tests to G^2 as I described in my paper mentioned above.  The deal is
that LSA (and Pearson's chi^2) both use an assumption of normal errors that
is unjustified for most small count applications.  Recommendation systems
are definitely in the small count regime.

On Thu, Mar 5, 2009 at 5:52 AM, Jason Rennie <[hidden email]> wrote:

> On Thu, Mar 5, 2009 at 4:24 AM, Sean Owen <[hidden email]> wrote:
>
> > This would be a fantastic project, implementing a Recommender based on
> > this approach . I tried implementing an SVD technique a couple years
> > ago and it was waaay too slow on one machine. Revisiting with Hadoop
> > sounds great.
>
>
> SVD (at least how it is used for CF---matrix approximation) is really just
> least squares, which is not hard (quite simple if you have a Non-linear
> Conjugate Gradients solver) and easy to regularize to get even better
> answers.  I'd be happy to discuss further if anyone is interested (did a
> large portion of my PhD work in this area).
>
> Cheers,
>
> Jason
>
> --
> Jason Rennie
> Research Scientist, ITA Software
> http://www.itasoftware.com/
>



--
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)