Welcome Jake Mannix


Welcome Jake Mannix

Grant Ingersoll-2
I'm pleased to announce that the Lucene PMC has elected to grant committer status to Jake Mannix.  Jake has been doing some really great work with Mahout recently and I am sure I speak on everyone's behalf when I say I look forward to working more with Jake on Mahout.

Jake, it is customary in Lucene when adding a new committer that the committer provide a little background on themselves, so feel free to jump in!

Cheers,
Grant


--------------
Want to be a Mahout committer?  See http://cwiki.apache.org/MAHOUT/howtobecomeacommitter.html for more information.

Re: Welcome Jake Mannix

Jake Mannix
Hey Grant, Mahouts,

  I'm very happy for the opportunity to really dig in and help build this
project towards its potential, and to no longer feel as much guilt about
filing a gazillion JIRA tickets which I know will fall on somebody else's
head to review and commit. :)

  Background, where to begin... "I was born in London...", no, wait, that's
a bit far back.  Here we go: Originally trained in math and physics, to the
point of being enrolled in enough PhD programs (none completed, oh well)
that my financial counselor wonders whether I'll have my student loans paid
off before my kids start college (current bet: 2:1 odds against).  I got
diverted into the software world during TechBubble_1.0, where I've been ever
since.  I've worked at a bunch of startups (as well as a few bigger
companies I think are more aptly described as an "endup" in
comparison), where I learned a bunch about IR/search and a little about
NLP.  Now I work at LinkedIn, where I originally helped build our
distributed real-time search engine (on top of Lucene as well as some other
extensions we built in-house and subsequently open-sourced, such as
the zoie <http://zoie.googlecode.com/> project for the real-time search
piece, and bobo-browse <http://bobo-browse.googlecode.com> for faceting),
which uses
our social graph as a key component in search relevance.

   Now I'm responsible for recommender systems at LinkedIn, and am building
a generalized entity-to-entity recommendation engine platform (think: when
you post a job on our jobs portion of the site, we recommend some profiles
of members who would be a good match for the job [Job as Query for Members],
or the reverse: jobs you might want to look at [Member as query for Jobs],
or news articles you might like [Members as query for News], or personalized
ad targeting [Member as query for Ads], or questions relevant to an interest
group [Group as query for Q&A], or people like this person [Member as query
for Members], etc... ).  This current role is what kick-started me to
collect the bits of code I'd written over the years to do massive matrix
computations and put them in one place - the decomposer
project <http://decomposer.googlecode.com> is where I put them, until I
realized that matrix decompositions and dimensional reduction aren't
really a "sexy" enough project to get much of a community around all by
themselves, so I'm currently porting into Mahout all of that code which
isn't already here (follow MAHOUT-180 for more details as that
progresses, although now that I can assign JIRA tasks to myself [once my
account is set up], I may break that into many more sub-tasks, to break
off more bite-sized chunks for Mahout to digest).

  So enough blathering on.  If you want to know more about me, my LinkedIn
profile <http://www.linkedin.com/in/jakemannix> has a more detailed
professional view, Twitter has me in roughly
1/(fine-structure-constant-of-QED)-byte-sized
snippets <http://www.twitter.com/pbrane>,
and I've got the beginnings of a blog <http://www.decomposer.org/blog> as
well, but I'm not updating that terribly much lately, because just as Sean
is hard-at-work on a book on Mahout, I'm supposed to be spending all of my
"authoring" time writing a book for Manning as well: "Lucene in Depth", as a
kind-of follow-on / advanced topics book to go beyond Hatcher, Gospodnetic
and McCandless' awesome Lucene in Action introductory text.

  Looking forward to working with y'all more in the future.

  -jake

  p.s. For those of you in the S.F. Bay Area, LinkedIn is hiring both
Analytics Scientists (to research ways of using our Big Data to make
interesting new products), as well as IR and ML software engineers to work
on our distributed platforms (including hadoop and voldemort), search
infrastructure, and working with me on recommender systems.  Email me for
details if you're interested in hearing more!

On Tue, Dec 8, 2009 at 3:13 AM, Grant Ingersoll <[hidden email]> wrote:

> I'm pleased to announce that the Lucene PMC has elected to grant committer
> status to Jake Mannix.  Jake has been doing some really great work with
> Mahout recently and I am sure I speak on everyone's behalf when I say I look
> forward to working more with Jake on Mahout.
>
> Jake, it is customary in Lucene when adding a new committer that the
> committer provide a little background on themselves, so feel free to jump
> in!
>
> Cheers,
> Grant
>
>
> --------------
> Want to be a Mahout committer?  See
> http://cwiki.apache.org/MAHOUT/howtobecomeacommitter.html for more
> information.

Re: Welcome Jake Mannix

Grant Ingersoll-2

On Dec 8, 2009, at 10:22 AM, Jake Mannix wrote:

> although now that I can assign JIRA tasks
> to myself [once my account is set up], I may break that into many more
> sub-tasks to break of more bite-sized chunks for Mahout to digest).

You should be able to assign yourself in JIRA now.

Re: Welcome Jake Mannix

Jake Mannix
On Tue, Dec 8, 2009 at 9:28 AM, Grant Ingersoll <[hidden email]> wrote:

>
> On Dec 8, 2009, at 10:22 AM, Jake Mannix wrote:
>
> > although now that I can assign JIRA tasks
> > to myself [once my account is set up], I may break that into many more
> > sub-tasks to break of more bite-sized chunks for Mahout to digest).
>
> You should be able to assign yourself in JIRA now.
>

Rockin.

Re: Welcome Jake Mannix

Sean Owen
In reply to this post by Jake Mannix
On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix <[hidden email]> wrote:
> Where are these hooks you're describing here?  The kind of general
> framework I would imagine would be nice to have is something like this:
> users and items themselves live as (semi-structured) documents (e.g. like
> a Lucene Document, or more generally a Map<String, Map<String, Float>>,
> where the first key is the "field name", and the values are bag-of-words
> term-vectors or phrase vectors).

In particular I'm referring to the ItemSimilarity interface. You stick
that into an item-based recommender (which is really what Ted has been
describing). So to do content-based recommendation, you just implement
the notion of similarity based on content and send it in this way.

Same with UserSimilarity and user-based recommenders.
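To make the shape of this concrete, here's a minimal, self-contained sketch (plain Java, not the actual Taste interfaces - the real hook is org.apache.mahout.cf.taste's ItemSimilarity, and all names below are illustrative) of a content-based similarity computed over the fielded term-vector representation Jake describes, which an item-based recommender could then plug in:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: items as semi-structured documents
 * (field name -> bag-of-words term vector), with a content-based
 * similarity of the kind an item-based recommender could plug in.
 * Not the real Taste API; class and method names are made up.
 */
public class ContentItemSimilarity {

  /** Cosine similarity between two sparse term vectors. */
  public static double cosine(Map<String, Float> a, Map<String, Float> b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (Map.Entry<String, Float> e : a.entrySet()) {
      na += e.getValue() * e.getValue();
      Float bv = b.get(e.getKey());
      if (bv != null) dot += e.getValue() * bv;
    }
    for (float v : b.values()) nb += v * v;
    return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  /** Item similarity: average of per-field cosine over shared fields. */
  public static double itemSimilarity(Map<String, Map<String, Float>> item1,
                                      Map<String, Map<String, Float>> item2) {
    double sum = 0.0;
    int fields = 0;
    for (Map.Entry<String, Map<String, Float>> e : item1.entrySet()) {
      Map<String, Float> other = item2.get(e.getKey());
      if (other != null) {
        sum += cosine(e.getValue(), other);
        fields++;
      }
    }
    return fields == 0 ? 0.0 : sum / fields;
  }

  public static void main(String[] args) {
    Map<String, Float> skillsA = new HashMap<String, Float>();
    skillsA.put("lucene", 2.0f);
    skillsA.put("hadoop", 1.0f);
    Map<String, Float> skillsB = new HashMap<String, Float>();
    skillsB.put("lucene", 1.0f);

    Map<String, Map<String, Float>> jobA = new HashMap<String, Map<String, Float>>();
    jobA.put("skills", skillsA);
    Map<String, Map<String, Float>> jobB = new HashMap<String, Map<String, Float>>();
    jobB.put("skills", skillsB);

    System.out.println("sim = " + itemSimilarity(jobA, jobB));
  }
}
```

The averaging across fields is the simplest possible combining function; the later discussion in this thread is about making exactly that step pluggable and trainable.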

I imagine this problem can be reduced to a search problem. Maybe vice
versa. I suppose my take on it -- and the reality of it -- is that
what's there is highly specialized for CF. I think it's a good thing,
since the API will be more natural and I imagine it'll be a lot
faster. On my laptop I can do recommendations in about 10ms over 10M
ratings.


> Now the set of users by themselves, instead of just being labels on the
> rows of the preference matrix, is a users-by-terms matrix, and the items,
> instead of being just labels on the columns of the preference matrix, is
> also a items-by-terms matrix.

Yes, this is a fundamentally offline approach right? What exists now
is entirely online. A change in data is reflected immediately. That's
interesting and simple and powerful, but doesn't really scale -- my
rule of thumb is that past 100M data points the non-distributed code
isn't going to work. Below that size -- and that actually describe

The way forward is indeed to write exactly what you and Ted are
talking about: something distributable. And yeah it's going to be a
matrix-based sort of approach. I've started that.

What exists now is more a literal translation of the canonical CF
algorithms, which aren't really rocket science. It's more accessible
than matrix-based, Hadoop-based approaches. But we now need those too.


> The real magic in here is in this last piece, and in an implied piece
> in generating the content-based matrix: different semi-structured
> fields for both the items and users can be pegged against each
> other in different ways, with different weights - let's be concrete,
> and imagine the item-to-item content-based calculation:

It'll be a challenge to integrate content-based approaches to a larger
degree than they already are: what can you really do but offer a hook
to plug in some notion of similarity?

But yes I think we want to re-use the UserSimilarity/ItemSimilarity
hooks even in matrix-based approaches for consistency. So...


> Calculating the text-based similarity of *unstructured* documents is
> one thing, and resolves just to figuring out whether you're doing
> BM25, Lucene scoring, pure cosine - just a Similarity decision.

Exactly and this is already implemented in some form as
PearsonCorrelationSimilarity, for example. So the same bits of ideas
are in the existing non-distributed code, it just looks different.


Basically you are clearly interested in
org.apache.mahout.cf.taste.hadoop, and probably don't need to care
about the rest unless you wish to. That's good because the new bits
are the bits that aren't written and that I don't know a lot about.

For example look at .item: this implements Ted's ideas. It's not quite
complete -- I'm not normalizing the recommendation vector yet for
example. So maybe that's a good place to dive in.


... we might even consider naming the distributed CF stuff something
else, since it's actually a totally different implementation than
"cf.taste"

Re: Welcome Jake Mannix

Ted Dunning
Sean,

It is clear also that I am woefully uninformed about Taste other than what I
imagine to be how it works based on what you say and my estimate that you
have generally good sense.

Can you send chapters of your book as you write them to me and Jake so we
can know your frame of reference better?

On Fri, Dec 11, 2009 at 3:01 PM, Sean Owen <[hidden email]> wrote:

> On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix <[hidden email]>
> wrote:
> > Where are these hooks you're describing here?  The kind of general
> > framework I would imagine would be nice to have is something like this:
> > users and items themselves live as (semi-structured) documents (e.g. like
> > a Lucene Document, or more generally a Map<String, Map<String, Float>>,
> > where the first key is the "field name", and the values are bag-of-words
> > term-vectors or phrase vectors).
>
> In particular I'm referring to the ItemSimilarity interface. You stick
> that into an item-based recommender (which is really what Ted has been
> describing). So to do content-based recommendation, you just implement
> the notion of similarity based on content and send it in this way.
>
> Same with UserSimilarity and user-based recommenders.
>
> I imagine this problem can be reduced to a search problem. Maybe vice
> versa. I suppose my take on it -- and the reality of it -- is that
> what's there is highly specialized for CF. I think it's a good thing,
> since the API will be more natural and I imagine it'll be a lot
> faster. On my laptop I can do recommendations in about 10ms over 10M
> ratings.
>
>
> > Now the set of users by themselves, instead of just being labels on the
> > rows of the preference matrix, is a users-by-terms matrix, and the items,
> > instead of being just labels on the columns of the preference matrix, is
> > also a items-by-terms matrix.
>
> Yes, this is a fundamentally offline approach right? What exists now
> is entirely online. A change in data is reflected immediately. That's
> interesting and simple and powerful, but doesn't really scale -- my
> rule of thumb is that past 100M data points the non-distributed code
> isn't going to work. Below that size -- and that actually describe
>
> The way forward is indeed to write exactly what you and Ted are
> talking about: something distributable. And yeah it's going to be a
> matrix-based sort of approach. I've started that.
>
> What exists now is more a literal translation of the canonical CF
> algorithms, which aren't really rocket science. It's more accessible
> than matrix-based, Hadoop-based approaches. But we now need those too.
>
>
> > The real magic in here is in this last piece, and in an implied piece
> > in generating the content-based matrix: different semi-structured
> > fields for both the items and users can be pegged against each
> > other in different ways, with different weights - let's be concrete,
> > and imagine the item-to-item content-based calculation:
>
> It'll be a challenge to integrate content-based approaches to a larger
> degree than they already are: what can you really do but offer a hook
> to plug in some notion of similarity?
>
> But yes I think we want to re-use the UserSimilarity/ItemSimilarity
> hooks even in matrix-based approaches for consistentcy. So...
>
>
> > Calculating the text-based similarity of *unstructured* documents is
> > one thing, and resolves just to figuring out whether you're doing
> > BM25, Lucene scoring, pure cosine - just a Similarity decision.
>
> Exactly and this is already implemented in some form as
> PearsonCorrelationSimilarity, for example. So the same bits of ideas
> are in the existing non-distributed code, it just looks different.
>
>
> Basically you are clearly interested in
> org.apache.mahout.cf.taste.hadoop, and probably don't need to care
> about the rest unless you wish to. That's good because the new bits
> are the bits that aren't written and that I don't know a lot about.
>
> For example look at .item: this implement Ted's ideas. It's not quite
> complete -- I'm not normalizing the recommendation vector yet for
> example. So maybe that's a good place to dive in.
>
>
> ... we might even consider naming the distributed CF stuff something
> else, since it's actually a totally different implementation than
> "cf.taste"
>



--
Ted Dunning, CTO
DeepDyve

Re: Welcome Jake Mannix

Sean Owen
Yes the editor says it's OK to share drafts with a handful of project
people. I think their main concern is that you'd make good reviewers,
and they'd want to be involved in recording and managing your
feedback.

But there seems to be benefit, and little harm, in showing you the
relatively-finished drafts of the key chapters, 2-5. I will mail them
to you separately.

Sean

On Fri, Dec 11, 2009 at 11:28 PM, Ted Dunning <[hidden email]> wrote:
> Sean,
>
> It is clear also that I am woefully uninformed about Taste other than what I
> imagine to be how it works based on what you say and my estimate that you
> have generally good sense.
>
> Can you send chapters of your book as you write them to me and Jake so we
> can know your frame of reference better?

Re: Welcome Jake Mannix

Jake Mannix
In reply to this post by Sean Owen
On Fri, Dec 11, 2009 at 3:01 PM, Sean Owen <[hidden email]> wrote:

> On Fri, Dec 11, 2009 at 10:23 PM, Jake Mannix <[hidden email]>
> wrote:
> > Where are these hooks you're describing here?  The kind of general
> > framework I would imagine would be nice to have is something like this:
> > users and items themselves live as (semi-structured) documents (e.g. like
> > a Lucene Document, or more generally a Map<String, Map<String, Float>>,
> > where the first key is the "field name", and the values are bag-of-words
> > term-vectors or phrase vectors).
>
> In particular I'm referring to the ItemSimilarity interface. You stick
> that into an item-based recommender (which is really what Ted has been
> describing). So to do content-based recommendation, you just implement
> the notion of similarity based on content and send it in this way.
>

Ok, this kind of hook is good, but it leaves all of the work to the
user - it would be nice to extend it along the lines I described,
whereby developers can define how to pull out various features of their
items (or users), and then give them a set of Similarities between
those features, as well as interesting combining functions among those.


> Same with UserSimilarity and user-based recommenders.
>
> I imagine this problem can be reduced to a search problem. Maybe vice
> versa. I suppose my take on it -- and the reality of it -- is that
> what's there is highly specialized for CF. I think it's a good thing,
> since the API will be more natural and I imagine it'll be a lot
> faster. On my laptop I can do recommendations in about 10ms over 10M
> ratings.
>

Yeah, this is viewing it as a search problem, and similarly, you can
often search over 10-50M documents in Lucene at under even that latency,
so there's no reason why the two could not be tied nicely together to
provide a blend of content- and usage-based recommendations/searches.


> > Now the set of users by themselves, instead of just being labels on the
> > rows of the preference matrix, is a users-by-terms matrix, and the items,
> > instead of being just labels on the columns of the preference matrix, is
> > also a items-by-terms matrix.
>
> Yes, this is a fundamentally offline approach right? What exists now
> is entirely online. A change in data is reflected immediately. That's
> interesting and simple and powerful, but doesn't really scale -- my
> rule of thumb is that past 100M data points the non-distributed code
> isn't going to work. Below that size -- and that actually describe
>

Well, computing the user-item content-based similarity matrix *can*
be done offline, and once you have it, it can be used to produce
recommendations online, but another way to do it (and the way we do
it at LinkedIn), is to keep the items in Voldemort, and store them
"in transpose" in a Lucene index, and then compute similar items in
real time as a Lucene query.  Doing item-based recommendations this
way is just grabbing the sparse set of items a user prefers, OR'ing
these together (with boosts which encode the preferences), and
firing away a live search request.
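The transposed-index trick can be sketched as a toy, in-memory stand-in (no actual Lucene or Voldemort here; all names are illustrative): items are stored "in transpose" as term-to-item postings, and a user's preferred items are OR'ed into one preference-boosted query.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy stand-in for the "transposed index" approach: a postings map from
 * term -> (itemId, weight), queried with the OR of a user's preferred
 * items' term vectors, boosted by preference strength. Illustrative
 * only; the real deployment uses a Lucene index, with Voldemort as the
 * item store.
 */
public class TransposedIndexRecommender {

  private final Map<String, Map<Integer, Float>> postings =
      new HashMap<String, Map<Integer, Float>>();
  private final Map<Integer, Map<String, Float>> items =
      new HashMap<Integer, Map<String, Float>>();

  /** Index an item's term vector, storing it "in transpose". */
  public void index(int itemId, Map<String, Float> termVector) {
    items.put(itemId, termVector);
    for (Map.Entry<String, Float> e : termVector.entrySet()) {
      Map<Integer, Float> plist = postings.get(e.getKey());
      if (plist == null) {
        plist = new HashMap<Integer, Float>();
        postings.put(e.getKey(), plist);
      }
      plist.put(itemId, e.getValue());
    }
  }

  /** Score candidate items against the OR of the preferred items' vectors. */
  public Map<Integer, Double> recommend(Map<Integer, Float> preferences) {
    // Build the boosted "query": preference-weighted sum of item vectors.
    Map<String, Double> query = new HashMap<String, Double>();
    for (Map.Entry<Integer, Float> pref : preferences.entrySet()) {
      Map<String, Float> vec = items.get(pref.getKey());
      if (vec == null) continue;
      for (Map.Entry<String, Float> t : vec.entrySet()) {
        Double w = query.get(t.getKey());
        query.put(t.getKey(),
            (w == null ? 0.0 : w) + pref.getValue() * t.getValue());
      }
    }
    // "Fire the search": walk the postings list of each query term.
    Map<Integer, Double> scores = new HashMap<Integer, Double>();
    for (Map.Entry<String, Double> q : query.entrySet()) {
      Map<Integer, Float> plist = postings.get(q.getKey());
      if (plist == null) continue;
      for (Map.Entry<Integer, Float> p : plist.entrySet()) {
        if (preferences.containsKey(p.getKey())) continue; // already preferred
        Double s = scores.get(p.getKey());
        scores.put(p.getKey(), (s == null ? 0.0 : s) + q.getValue() * p.getValue());
      }
    }
    return scores;
  }

  public static void main(String[] args) {
    TransposedIndexRecommender r = new TransposedIndexRecommender();
    Map<String, Float> v1 = new HashMap<String, Float>();
    v1.put("lucene", 1f); v1.put("search", 1f);
    Map<String, Float> v2 = new HashMap<String, Float>();
    v2.put("search", 1f); v2.put("nlp", 1f);
    Map<String, Float> v3 = new HashMap<String, Float>();
    v3.put("cooking", 1f);
    r.index(1, v1); r.index(2, v2); r.index(3, v3);
    Map<Integer, Float> prefs = new HashMap<Integer, Float>();
    prefs.put(1, 2f);  // user strongly prefers item 1
    System.out.println(r.recommend(prefs));  // item 2 shares "search"; item 3 shares nothing
  }
}
```

In the real version, the boosted OR query is what Lucene's scoring machinery evaluates at search time, which is what makes the item-based recommendation "online" despite the large item set.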

> It'll be a challenge to integrate content-based approaches to a larger
> degree than they already are: what can you really do but offer a hook
> to plug in some notion of similarity?
>

There are a ton of pluggable pieces: there's the hook for field-by-field
similarity (and not just the hook, but a bunch of common
implementations), sure, but then there's also a "feature processing /
extracting" phase, which will be very domain specific, and then the
scoring hook, where pairwise similarities among fields can be combined
nontrivially (via logistic regression, via some nonlinear kernel function,
etc...), as well as a separate system for people to actually *train* those
scorers - that in itself is a huge component.
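The "nontrivial combination" step mentioned above might look something like this minimal sketch: per-field similarities fed through a logistic function whose weights would come out of the separate training system. The weights, bias, and field layout here are entirely made up for illustration.

```java
/**
 * Sketch of combining per-field similarities into one item score via a
 * trained logistic model. The weights would come from an offline
 * training phase; the values below are invented for illustration.
 */
public class TrainedFieldCombiner {

  /** Logistic combination: sigmoid(bias + weights . fieldSims). */
  public static double score(double[] weights, double bias, double[] fieldSims) {
    if (weights.length != fieldSims.length) {
      throw new IllegalArgumentException("weights and similarities must align");
    }
    double z = bias;
    for (int i = 0; i < weights.length; i++) {
      z += weights[i] * fieldSims[i];
    }
    return 1.0 / (1.0 + Math.exp(-z));
  }

  public static void main(String[] args) {
    // e.g. fields: [title, skills, industry]; weights from offline training
    double[] w = {0.5, 2.0, 0.8};
    double[] sims = {0.3, 0.9, 0.1};
    System.out.println("combined score = " + score(w, -1.0, sims));
  }
}
```

A nonlinear kernel would replace the inner dot product, but the pluggable shape (field similarities in, one trained score out) stays the same.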


> > Calculating the text-based similarity of *unstructured* documents is
> > one thing, and resolves just to figuring out whether you're doing
> > BM25, Lucene scoring, pure cosine - just a Similarity decision.
>
> Exactly and this is already implemented in some form as
> PearsonCorrelationSimilarity, for example. So the same bits of ideas
> are in the existing non-distributed code, it just looks different.
>

Again - the combination of "field" similarities into a whole Item similarity
is a piece which isn't as simple as Pearson / Cosine / Tanimoto - it's
a choice of parametrized function which may need to be trained, this
part is a new idea (to our recommenders).


> Basically you are clearly interested in
> org.apache.mahout.cf.taste.hadoop, and probably don't need to care
> about the rest unless you wish to. That's good because the new bits
> are the bits that aren't written and that I don't know a lot about.
>
> For example look at .item: this implement Ted's ideas. It's not quite
> complete -- I'm not normalizing the recommendation vector yet for
> example. So maybe that's a good place to dive in.
>

Yep, I'll look at those shortly; I'm definitely interested in this.

  -jake

Re: Welcome Jake Mannix

Sean Owen
On Sat, Dec 12, 2009 at 12:28 AM, Jake Mannix <[hidden email]> wrote:
> Ok, this kind of hook is good, but it leaves all of the work to the user -
> it would
> be nice to extend it along the lines I described, whereby developers can
> define how to pull out various features of their items (or users), and then
> give them a set of Similarities between those features, as well as
> interesting
> combining functions among those.

Maybe we mean different things. How can you write in, say, a notion of
content similarity for books, or travel destinations, or fruit,
without opening the door to a million domain-specific subprojects in
the library? I imagine you're just saying, well, at least you could
take one step, and write something that computes similarity in terms
of abstract features. Yeah, that's a great addition.


> Yeah, this is viewing it as a search problem, and similarly, you can do
> search over 10-50M documents with often even under that latency with
> Lucene, so there's no reason why the two could not be tied nicely together
> to provide a blend of content and usage-based recommendations/searches.

You're saying you want to build out a new style of content-based
recommender? That's good indeed. I imagine it's not really a new
Recommender but a new ItemSimilarity/UserSimilarity framework. Which
is good news, just means it's simpler. If it leverages Lucene, great,
but is that a big dependency to bring in?


> Well, computing the user-item content-based similarity matrix *can*
> be done offline, and once you have it, it can be used to produce
> recommendations online, but another way to do it (and the way we do
> it at LinkedIn), is to keep the items in Voldemort, and store them
> "in transpose" in a Lucene index, and then compute similar items in
> real time as a Lucene query.  Doing item-based recommendations this
> way is just grabbing the sparse set of items a user prefers, OR'ing
> these together (with boosts which encode the preferences), and
> firing away a live search request.

Right now in my mind there are two distinct breeds of recommender in
the framework: the existing online non-distributed bits, and the
forthcoming mostly-offline distributed bits. I'm trying to place the
direction you're going into one of those buckets. It could go into
both in different ways. Which do you have in mind?

While it would be nice to integrate this approach harmoniously into
the existing item-based recommender implementation, it's no big deal
to add in a different style of item-based recommender. Just hoping to
avoid repetition where possible; the project is already becoming a
rich and varied but intimidating bag of tools to solve the same
problem.


> There are a ton of pluggable pieces: there's the hook for field-by-field
> similarity (and not just the hook, but a bunch of common
> implementations), sure, but then there's also a "feature processing /
> extracting" phase, which will be very domain specific, and then the
> scoring hook, where pairwise similarities among fields can be combined
> nontrivially (via logistic regression, via some nonlinear kernel function,
> etc...), as well as a separate system for people to actually *train* those
> scorers - that in itself is a huge component.

The feature processing bit feels outside of scope to me, purely because
I can't see how you would write a general framework to extract
features from 'things' where 'things' are musical instruments,
parties, restaurants, etc. How would you? But everything past that
point is obviously in scope and does not exist yet and should.

Re: Welcome Jake Mannix

Ted Dunning
We already have it as part of the Lucene vector creator thing.

On Fri, Dec 11, 2009 at 5:48 PM, Sean Owen <[hidden email]> wrote:

> If it leverages Lucene, great,
> but is that a big dependency to bring in?
>



--
Ted Dunning, CTO
DeepDyve

Re: Welcome Jake Mannix

Ted Dunning
In reply to this post by Sean Owen
On Fri, Dec 11, 2009 at 5:48 PM, Sean Owen <[hidden email]> wrote:

> While it would be nice to integrate this approach harmoniously into
> the existing item-based recommender implementation, it's no big deal
> to add in a different style of item-based recommender. Just hoping to
> avoid repetition where possible; the project is already becoming a
> rich and varied but intimidating bag of tools to solve the same
> problem.
>

It should be pretty harmonious as far as the off-line part is concerned, but
the on-line part is likely to be considerably less flexible, if only because
scoring has to fit the Lucene mold.

This is still a very powerful deployment strategy.


> ... then there's also a "feature processing /
> > extracting" phase, which will be very domain specific, and then the
> > scoring hook, where pairwise similarities among fields can be combined
> > nontrivially (via logistic regression, via some nonlinear kernel
> function,
> > etc...), as well as a separate system for people to actually *train*
> those
> > scorers - that in itself is a huge component.
>
> The feature processing bit feel outside of scope to me, purely because
> I can't see how you would write a general framework to extract
> features from 'things' where 'things' are musical instruments,
> parties, restaurants, etc. How would you? But everything past that
> point is obviously in scope and does not exist yet and should.
>

You are right that general similarity is a bit much, but anything that can
be expressed as a set of terms is in-scope.  Recommendations can be
meta-data or history driven.  Both are valuable and both are easily combined
in a Lucene-ish framework.



--
Ted Dunning, CTO
DeepDyve

Re: Welcome Jake Mannix

Sean Owen
On Sat, Dec 12, 2009 at 8:09 AM, Ted Dunning <[hidden email]> wrote:

> On Fri, Dec 11, 2009 at 5:48 PM, Sean Owen <[hidden email]> wrote:
>
>> While it would be nice to integrate this approach harmoniously into
>> the existing item-based recommender implementation, it's no big deal
>> to add in a different style of item-based recommender. Just hoping to
>> avoid repetition where possible; the project is already becoming a
>> rich and varied but intimidating bag of tools to solve the same
>> problem.
>>
>
> It should be pretty harmonious as far as the off-line part is concerned, but
> the on-line part is likely to be considerably less flexible if only because
> scoring has to fit the lucene mold.
>
> This is still a very powerful deployment strategy.

I'm into it because it does, it seems, address a gap in content-based
recommendation. It's also not as if it means everyone using any
implementation needs Lucene now. And if you do implement Recommender
then it fits very cleanly into the framework and you get benefits
there. The idea of the recommender and the backing store (DataModel)
are purposely quite separate so there's no assumption to work around
there.


> You are right that general similarity is a but much, but anything that can
> be expressed as a set of terms in in-scope.  Recommendations can be
> meta-data or history driven.  Both are valuable and both are easily combined
> in a Lucene-ish framework.

I think we're all on the same page then; I agree that this is also a
gap and a need. It goes without saying I'd be pleased for anyone to do
some work in that area as I think it all completely fits into the
framework.