Clustering techniques, tips and tricks


Clustering techniques, tips and tricks

Grant Ingersoll-2
As some of you may know, I'm working on a book (it's a long time coming, but I'm getting there) about open source techniques for working with text.  One of my chapters is on clustering and in it, I want to talk about generic clustering approaches and then show concrete examples of them in action.   I've got the concrete side of it down.

Based on my research, it seems people typically divide up the clustering space into two approaches: hierarchical and flat/partitioning.  In overlaying that knowledge with what we have for techniques in Mahout, I'm a bit stumped about where things like LDA and Dirichlet fit into those two approaches, or is there perhaps a third that I'm missing?  They don't seem particularly hierarchical, but they don't seem flat either, if that makes any sense, given the probabilistic/mixture nature of the algorithms.  Perhaps I should forgo the traditional division that previous authors have taken and just talk about a suite of techniques at a little lower level?  Thoughts?

The other thing I'm interested in is people's real-world feedback on using clustering to solve their text-related problems.  For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?  What didn't work?  Any and all insight is welcome, and I don't particularly care if it is Mahout-specific (for instance, part of the chapter is about search result clustering using Carrot2, and so Mahout isn't applicable there).

Thanks in advance and Happy New Year,
Grant

Re: Clustering techniques, tips and tricks

Ted Dunning
On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <[hidden email]> wrote:

> As some of you may know, I'm working on a book (it's a long time coming,
> but I'm getting there) about open source techniques for working with text.
> ...
> Based on my research, it seems people typically divide up the clustering
> space into two approaches: hierarchical and flat/partitioning.


There are a number of ways of dividing the space of all clustering
techniques.  Whether the final result is hierarchical is an important
criterion.

Other important characteristics include:

- are cluster memberships hard or soft? (k-means = hard, LDA and Dirichlet =
soft)

- can the clustering algorithm be viewed in a probabilistic framework?
(k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
neighbors = not so much)

- is the definition of a cluster abstract enough that a cluster can be
almost any model, or does it impose stronger limits?
(k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
probabilistic model)

- is the algorithm updateable, or does it have to run from scratch? (k-means,
Dirichlet = yes, agglomerative clustering = not easily)

- is the algorithm non-parametric? (which for clustering pretty much reduces
to whether the number and complexity of clusters can increase without bound
as the amount of data increases; Dirichlet = yes)

- does the algorithm take linear time in the size of the data in practice?
(k-means = yes, LDA = not sure, Dirichlet = pretty much, agglomerative
clustering = no for most algorithms)

- can the algorithm make use of pre-existing knowledge or user adjustments?
(k-means = yes, Dirichlet = yes)

Note that it is pretty easy to adapt several algorithms like k-means to be
hierarchical.


> In overlaying that knowledge with what we have for techniques in Mahout,
> I'm a bit stumped about where things like LDA and Dirichlet fit into those
> two approaches or is there, perhaps a third that I'm missing?  They don't
> seem particularly hierarchical but they don't seem flat either, if that
> makes any sense, given the probabilistic/mixture nature of the algorithms.
>  Perhaps I should forgo the traditional division that previous authors have
> taken and just talk about a suite of techniques at a little lower level?
>  Thoughts?
>

I think that some of these distinctions are interesting but I think it is
also very easy to confuse newcomers with too many distinctions.  A big part
of the confusion has to do with the fact that none of these distinctions is
comprehensive, nor are any of these completely clear cut.


> The other thing I'm interested in is people's real world feedback on using
> clustering to solve their text related problems.  For instance, what type of
> feature reduction did you do (stopword removal, stemming, etc.)?  What
> algorithms worked for you?  What didn't work?


My experience has been that almost any reasonable implementation of an
algorithm in the k-means family (which effectively includes LDA and
Dirichlet) will work reasonably well, and none of them is completely
excellent.  Success with text clustering depends strongly on setting
expectations correctly in the interface.  If the user expects the clustering
to be useful but not definitive, then they tend to find clustering to be high
value.  If the user expects the clustering to be absolutely definitive, they
will be sorely disappointed.  For cases where perfection is expected, it is
often important to have good integration between clustering and human edits.

The carrot2 implementation is among the best that I have used in terms of
the quality of clustering small batches of results such as search result
lists.

Re: Clustering techniques, tips and tricks

Bogdan94202
Hi guys,

First of all - Happy New Year - I wish you a healthy and successful 2010!

I would like to give some feedback. And ask some questions as well :).

I find the classification above very useful (I actually found some details
on the behavior of the k-means and Dirichlet algorithms that I had not seen
before in the docs around Mahout).
I played around with Carrot2 for 2 weeks and it really has a great level of
usability and simplicity, but I had to give up on it since my very first
practical clustering task required clustering 23K+ documents. While I was
able to run the Lingo algorithm on 5,000 docs with 6 GB of heap memory in
approx. 2 minutes, it "failed" when I tried to cluster 8,000 docs (it just
ran for 30 minutes and I had to kill the app, since that was definitely not
practical compared to 5,000 docs clustered in 2 minutes).
That was the point where I really had to give up on Carrot2 and started
learning Mahout/Hadoop.

I have managed to do some clustering on my 23,000+ docs with Mahout/k-means
in something like 10 minutes (in standalone mode - no parallel processing at
all; I have not even used all of my (3 :-) ) cores yet with Hadoop/Mahout),
but I am still learning and still trying to analyze whether the resulting
clusters are really meaningful for my docs.

One thing I can tell already is that I definitely, desperately, need
word-stopping (which worked like a charm in Carrot2), but I am struggling a
little bit with the Mahout code to come up with the right way to apply
stopwords. As far as I know there is no such feature in the code yet. I
could use any piece of advice from you guys on where and how to apply
stopwords in Mahout clustering algorithms. Or maybe I should do it during
the text-to-vector phase? But it would be valuable for me to be able to come
back later to the complete context of a document (i.e. with the stopwords
inside). Maybe that is a question of its own - how can I easily go back from
clusters to the original docs (and not just vectors)? I do not know, maybe
some kind of mapper which maps vectors to the original documents somehow
(e.g. a sort of URL for a document based on the vector id/index or
something?).

So, I definitely need to apply stopwords - without that I can't make any use
of my clusters. At least the ones I got after running (one iteration of)
k-means over my original docs (Lucene-derived vectors) were not really
practical, as I ended up with lots of clusters based on words that should
have been stopped.

Another question I have (maybe I simply overlooked something in the docs):
as far as I understand it, one should really run the Canopy algorithm first
to get some "good" seeds for the initial clusters and only then go for
k-means clustering, but I could not really find an example doing so. As far
as I can tell, the examples show how to run the different algorithms
(Canopy, k-means) only separately and not chained one after the other. Did I
miss something? Can I get more guidance on how to chain Canopy and k-means?
Yet another question - I read in some docs that k-means actually has to be
run multiple times somehow, but I could not find that in the examples. How
can I do that?

One thing is for sure - I cannot go back to Carrot2. It is very useful for
small volumes of docs but won't work for what I am heading toward: my
initial attempt is to cluster 23K+ docs, but then I would like to move to my
complete set of 100K+ docs.

I would definitely like to be in control of the (parameterized) number of
clusters.
I think I will get better results if I can also apply stemming. What would
be your recommendation when using Mahout? Should I do the stemming again
somewhere in the input vector forming? Or do you expect this to be part of
the clustering algorithms?

It is also really essential for me to have "updateable" algorithms, as I am
adding new documents on a daily basis and I would definitely like to have
them clustered immediately (incrementally). I do not know if this is what is
called "classification" in Mahout; I have not reached those examples yet (I
wanted to really get acquainted with the clustering first).
And that is not all - I do not only want to have new documents clustered
against existing clusters; what I want in addition is that the clusters
themselves could actually change as new docs come in.
Of course one could not expect new clusters to pop up after a single new doc
is added to the analysis, but clusters should really be adaptable/updateable
with new docs.
Would that be possible with k-means, for example?

I can give more feedback, and possibly more questions :), when I get a
little bit further with this.

Best regards,
Bogdan


--
Bogdan Vatkov
email: [hidden email]
phone: +359 889 197 756

Re: Clustering techniques, tips and tricks

Ted Dunning
On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[hidden email]> wrote:

>
> I would like to give some feedback. And ask some questions as well :).
>

Thank you!

Very helpful feedback.


> ... Carrot2 for 2 weeks ... has great level of
> usability and simplicity but ...I had to give up on it since my very first
> practical clustering task required to cluster 23K+ documents.


Not too surprising.


> ...
> I have managed to do some clustering on my 23 000+ docs with Mahout/k-means
> for something like 10 min (in standalone mode - no parallel processing at
> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but
> I
> am still learning and still trying to analyze if the result clusters are
> really meaningful for my docs.
>

I have seen this effect before where a map-reduce program run sequentially
is much faster than an all-in-memory implementation.


> One thing I can tell already now is that I definitely, desperately, need
> word-stopping


You should be able to do this in the document -> vector conversion.  You
could also do this at the vector level by multiplying the coordinates of all
stop words by zero, but that is not as nice a solution.
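To make that concrete, here is a minimal, self-contained sketch (plain Java, no Mahout or Lucene dependencies; the class name and the tiny stop list are purely illustrative) of dropping stop words during the document -> term-count step, before any vector is built:

import java.util.*;

public class StopWordTermCounter {
    // Toy example: tokenize, drop stop words, and count the remaining terms.
    // The stop list here is just an illustration, not a real list.
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the", "to", "in", "is"));

    public static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue; // stop words never make it into the vector
            }
            Integer c = counts.get(token);
            counts.put(token, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termCounts("The quick brown fox and the lazy dog"));
    }
}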


> ... But it would be valuable for me to be able
> to come back later to the complete context of a document (i.e. with the
> stopwords inside) - maybe it is a question on its own - how can I easily go
> back from clusters->original docs (an not just vectors), I do not know
> maybe
> some kind of mapper which maps vectors to the original documents somehow
> (e.g. sort of URL for a document based on the vector id/index or
> something?).
>

To do this, you should use the document ID and just return the original
content from some other content store.  Lucene or especially SOLR can help
with this.


> ...
> I think I will get better results if I can also apply stemming. What would
> be you recommendation when using mahout? Should I do the stemming again
> somewhere in the input vector forming?


Yes.  That is exactly correct.

It is also really essential for me to have "updateable" algorithms as I am
> adding new documents on daily basis, and I definitely like to have them
> clustered immediately (incrementally) - I do not know if this is what is
> called "classification" in Mahout and I did not reach these examples yet (I
> wanted to really get acquainted with the clustering first).
>

I can't comment on exactly how this should be done, but we definitely need
to support this use case.


> And that is not all - I do not only want to have new documents clustered
> against existing clusters but what I want in addition is that clusters
> could
> actually change with new docs coming.
>

Exactly.  This is easy algorithmically with k-means.  It just needs to be
supported by the software.


> Of course one could not observe new clusters popping up after a single new
> doc is added to the analysis but clusters should really be
> adaptable/updateable with new docs.
>

Yes.  It is eminently doable.  Occasionally you should run back through all
of the document vectors so you can look at old documents in light of new
data but that should be very, very fast in your case.

Re: Clustering techniques, tips and tricks

Grant Ingersoll-2

On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:

> On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[hidden email]>wrote:
>
>>
>> I would like to give some feedback. And ask some questions as well :).
>>
>
> Thank you!
>
> Very helpful feedback.
>
>
>> ... Carrot2 for 2 weeks ... has great level of
>> usability and simplicity but ...I had to give up on it since my very first
>> practical clustering task required to cluster 23K+ documents.
>
>
> Not too surprising.

Right, Carrot2 is designed for clustering search results, and of those, mainly the title and snippet.  While it can do larger docs, they are specifically not the target.  Plus, C2 is an in-memory tool designed to be very fast for search results.


>
>
>> ...
>> I have managed to do some clustering on my 23 000+ docs with Mahout/k-means
>> for something like 10 min (in standalone mode - no parallel processing at
>> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout) but
>> I
>> am still learning and still trying to analyze if the result clusters are
>> really meaningful for my docs.
>>
>
> I have seen this effect before where a map-reduce program run sequentially
> is much faster than an all-in-memory implementation.
>
>
>> One thing I can tell already now is that I definitely, desperately, need
>> word-stopping
>
>
> You should be able to do this in the document -> vector conversion.  You
> could also do this at the vector level by multiplying the coordinates of all
> stop words by zero, but that is not as nice a solution.

Right, or if you are using the Lucene extraction method, at Lucene indexing time.


>
>
>> ... But it would be valuable for me to be able
>> to come back later to the complete context of a document (i.e. with the
>> stopwords inside) - maybe it is a question on its own - how can I easily go
>> back from clusters->original docs (an not just vectors), I do not know
>> maybe
>> some kind of mapper which maps vectors to the original documents somehow
>> (e.g. sort of URL for a document based on the vector id/index or
>> something?).
>>
>
> To do this, you should use the document ID and just return the original
> content from some other content store.  Lucene or especially SOLR can help
> with this.

Right, Mahout's vector can take labels.

>
>
>> ...
>> I think I will get better results if I can also apply stemming. What would
>> be you recommendation when using mahout? Should I do the stemming again
>> somewhere in the input vector forming?
>
>
> Yes.  That is exactly correct.

Again, really easy to do if you use the Lucene method for creating vectors.


>
> It is also really essential for me to have "updateable" algorithms as I am
>> adding new documents on daily basis, and I definitely like to have them
>> clustered immediately (incrementally) - I do not know if this is what is
>> called "classification" in Mahout and I did not reach these examples yet (I
>> wanted to really get acquainted with the clustering first).
>>
>
> I can't comment on exactly how this should be done, but we definitely need
> to support this use case.

Don't people usually check whether the new docs fit into an existing cluster and, if they are a good fit, add them there; otherwise, maybe put them in the best match and kick off a new job?


>
>
>> And that is not all - I do not only want to have new documents clustered
>> against existing clusters but what I want in addition is that clusters
>> could
>> actually change with new docs coming.
>>
>
> Exactly.  This is easy algorithmically with k-means.  It just needs to be
> supported by the software.

Makes sense and shouldn't be that hard to do.  I'd imagine we just need to be able to use the centroids from the previous run as the seeds for the new run.
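As a rough illustration of that idea (not Mahout's actual driver API - just a toy in-memory sketch with illustrative names), a single k-means pass seeded with the previous run's centroids might look like:

import java.util.*;

public class KMeansWarmStart {
    // One assign-and-update pass of k-means, seeded with centroids from a
    // previous run. Repeating this until assignments stop changing gives the
    // usual batch algorithm; a single pass is often enough to fold in a small
    // batch of new documents.
    public static double[][] onePass(double[][] points, double[][] seeds) {
        int k = seeds.length, dim = seeds[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int i = 0; i < dim; i++) {
                    double diff = p[i] - seeds[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) sums[best][i] += p[i];
        }
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++) {
            // keep the old centroid if nothing was assigned to it
            for (int i = 0; i < dim; i++)
                updated[c][i] = counts[c] == 0 ? seeds[c][i] : sums[c][i] / counts[c];
        }
        return updated;
    }
}

Running a handful of such passes whenever a batch of new documents arrives is essentially the incremental behavior discussed above.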

>
>
>> Of course one could not observe new clusters popping up after a single new
>> doc is added to the analysis but clusters should really be
>> adaptable/updateable with new docs.
>>
>
> Yes.  It is eminently doable.  Occasionally you should run back through all
> of the document vectors so you can look at old documents in light of new
> data but that should be very, very fast in your case.


Re: Clustering techniques, tips and tricks

Bogdan94202
On Fri, Jan 1, 2010 at 3:24 PM, Grant Ingersoll <[hidden email]> wrote:

>
> On Jan 1, 2010, at 5:00 AM, Ted Dunning wrote:
>
> > On Thu, Dec 31, 2009 at 10:41 PM, Bogdan Vatkov <[hidden email]
> >wrote:
> >
> >>
> >> I would like to give some feedback. And ask some questions as well :).
> >>
> >
> > Thank you!
> >
> > Very helpful feedback.
> >
> >
> >> ... Carrot2 for 2 weeks ... has great level of
> >> usability and simplicity but ...I had to give up on it since my very
> first
> >> practical clustering task required to cluster 23K+ documents.
> >
> >
> > Not too surprising.
>
> Right, Carrot2 is designed for clustering search results, and of that
> mainly the title and snippet.  While it can do larger docs, they are
> specifically not the target.  Plus, C2 is an in-memory tool designed to be
> very fast for search results.
>
>
> >
> >
> >> ...
> >> I have managed to do some clustering on my 23 000+ docs with
> Mahout/k-means
> >> for something like 10 min (in standalone mode - no parallel processing
> at
> >> all, I didn't even use all of my (3:-) ) cores yet with Hadoop/Mahout)
> but
> >> I
> >> am still learning and still trying to analyze if the result clusters are
> >> really meaningful for my docs.
> >>
> >
> > I have seen this effect before where a map-reduce program run
> sequentially
> > is much faster than an all-in-memory implementation.
> >
> >
> >> One thing I can tell already now is that I definitely, desperately, need
> >> word-stopping
> >
> >
> > You should be able to do this in the document -> vector conversion.  You
> > could also do this at the vector level by multiplying the coordinates of
> all
> > stop words by zero, but that is not as nice a solution.
>
> Right, or if you are using the Lucene extraction method, at Lucene indexing
> time.
>
>
Ok, so it seems I have to use the stop word feature of Lucene itself, right?
I just saw there is something about stop words in Lucene, but I have yet to
find out how to use that capability.


> >
> >
> >> ... But it would be valuable for me to be able
> >> to come back later to the complete context of a document (i.e. with the
> >> stopwords inside) - maybe it is a question on its own - how can I easily
> go
> >> back from clusters->original docs (an not just vectors), I do not know
> >> maybe
> >> some kind of mapper which maps vectors to the original documents somehow
> >> (e.g. sort of URL for a document based on the vector id/index or
> >> something?).
> >>
> >
> > To do this, you should use the document ID and just return the original
> > content from some other content store.  Lucene or especially SOLR can
> help
> > with this.
>
> Right, Mahout's vector can take labels.
>
> >
> >
> >> ...
> >> I think I will get better results if I can also apply stemming. What
> would
> >> be you recommendation when using mahout? Should I do the stemming again
> >> somewhere in the input vector forming?
> >
> >
> > Yes.  That is exactly correct.
>
> Again, really easy to do if you use the Lucene method for creating vectors.
>
Do you mean I have to apply stemming during the vector creation or already
in Lucene indexing? Maybe from a clustering POV it is the same, but what
would you recommend?


> >
> > It is also really essential for me to have "updateable" algorithms as I
> am
> >> adding new documents on daily basis, and I definitely like to have them
> >> clustered immediately (incrementally) - I do not know if this is what is
> >> called "classification" in Mahout and I did not reach these examples yet
> (I
> >> wanted to really get acquainted with the clustering first).
> >>
> >
> > I can't comment on exactly how this should be done, but we definitely
> need
> > to support this use case.
>
> Don't people usually see if the new docs fit into an existing cluster and
> if they are a good fit, add them there, otherwise, maybe put them in the
> best match and kick off a new job.
>
Actually this question goes back to the original attempt - to analyze
documents automatically by the machine, and not by people :). One of my
goals is to not read the new document but rather to have the system tell me
whether I should read it ;) - e.g. if it gets clustered/classified against a
given cluster/topic which I am interested (or not interested) in, I can then
make a more informed decision whether to read it (or not).

>
> >
> >
> >> And that is not all - I do not only want to have new documents clustered
> >> against existing clusters but what I want in addition is that clusters
> >> could
> >> actually change with new docs coming.
> >>
> >
> > Exactly.  This is easy algorithmically with k-means.  It just needs to be
> > supported by the software.
>
> Makes sense and shouldn't be that hard to do.  I'd imagine we just need to
> be able to use the centroids from the previous run as the seeds for the new
> run.
>
> >
> >
> >> Of course one could not observe new clusters popping up after a single
> new
> >> doc is added to the analysis but clusters should really be
> >> adaptable/updateable with new docs.
> >>
> >
> > Yes.  It is eminently doable.  Occasionally you should run back through
> all
> > of the document vectors so you can look at old documents in light of new
> > data but that should be very, very fast in your case.
>
>
I do not know how this updatable clustering works (using previous results as
centroids for new clusterings); is there an example I could see in action?
Additionally, I would like to see an example of how one could combine Canopy
and k-means; I have only seen this described in theory and could not find an
example of it.

Best regards,
Bogdan

Re: Clustering techniques, tips and tricks

Ted Dunning
Either would work.  Spilling vectors from a Lucene index is more developed
at this point in time and thus would probably be easier.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[hidden email]> wrote:

> Do you mean I have to apply stemming during the vector creation or already
> in Lucene indexing? Maybe from clustering POV it is the same but what would
> you recommend?
>



--
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Ted Dunning
In reply to this post by Bogdan94202
If you start gathering data about whether a particular user would like a
document then you might also be interested in the on-line learning
algorithms that are currently being developed.  These should be able to let
the user say whether they wanted to read a document or not.  The clustering
could be used initially, but as interest information is collected, you can
build a per user profile of interest.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[hidden email]> wrote:

> > Don't people usually see if the new docs fit into an existing cluster and
> > if they are a good fit, add them there, otherwise, maybe put them in the
> > best match and kick off a new job.
> >
> > Actually this question goes back to the original attempt - to analyze
> documents automatically by the machine, and not by people :). One of my
> goals is to not read the new document but rather the system to tell me if I
> should read it ;) - e.g. if it gets clustered/classified against given
> cluster/topic which I am interested (not interested) in I could then take
> more informed decision whether to read it (not to read it).




--
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Ted Dunning
In reply to this post by Bogdan94202
I don't think that this code exists yet.  It is possible in theory, but not
yet reduced to practice.

On Fri, Jan 1, 2010 at 7:45 AM, Bogdan Vatkov <[hidden email]> wrote:

> I do not know how this updatable clustering works (using previous results
> as
> centroids for new clusterings), is there an example I could see in action?
>



--
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Shashikant Kore
In reply to this post by Grant Ingersoll-2
On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[hidden email]> wrote:
>
> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means
on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
single host.

Using vector normalization like the L2 norm helped quite a bit. Thanks to
Ted for this suggestion. In text clustering, you have lots of small
documents. This results in very sparse vectors (a total of 100K features,
with each vector having about 200 features). Using vanilla TF-IDF weights
doesn't work as nicely.
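For readers unfamiliar with it, L2 normalization just scales each document vector to unit Euclidean length so long and short documents become comparable; a minimal sketch over a sparse term-weight map (names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class L2Norm {
    // Scale a sparse term-weight vector to unit Euclidean length.
    public static Map<String, Double> normalize(Map<String, Double> weights) {
        double sumSq = 0.0;
        for (double w : weights.values()) sumSq += w * w;
        double norm = Math.sqrt(sumSq);
        Map<String, Double> out = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : weights.entrySet())
            out.put(e.getKey(), norm == 0.0 ? 0.0 : e.getValue() / norm);
        return out;
    }
}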

Even if we don't do explicit stop word removal, the document-count
thresholds do the same thing in a better fashion. If you exclude the
features which are extremely common (say, in more than 40% of documents) or
extremely rare (say, in fewer than 50 documents in a corpus of 100K docs),
you get a meaningful set of features. The current K-Means already accepts
these parameters.
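A sketch of that document-frequency pruning, independent of Mahout's own options (the 40%/50-doc thresholds are the ones mentioned above, not recommendations; class and method names are illustrative):

import java.util.*;

public class DocFreqPruning {
    // Keep terms that occur in at least minDocs documents and in no more than
    // maxFraction of the corpus; everything else is pruned from the features.
    public static Set<String> keepTerms(List<Set<String>> docs,
                                        double maxFraction, int minDocs) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (Set<String> doc : docs)
            for (String term : doc) {
                Integer c = df.get(term);
                df.put(term, c == null ? 1 : c + 1);
            }
        int maxDocs = (int) (maxFraction * docs.size());
        Set<String> keep = new HashSet<String>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            if (e.getValue() >= minDocs && e.getValue() <= maxDocs)
                keep.add(e.getKey());
        return keep;
    }
}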

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find the prominent features of the resulting cluster
centroid, the features may not be meaningful. For example, if a prominent
feature is "beautiful", when you get it back, you will get "beauti." Ouch.

I tried fuzzy k-means for soft clustering, but I didn't get good results.
Maybe the corpus was the issue.

One observation about the clustering process is that it is geared, by
accident or by design, towards batch processing. There is no support for
real-time clustering. There needs to be glue which ties all the components
together to make the process seamless. I suppose someone in need of this
feature will contribute it to Mahout.

Grant, if I recall more, I will mail it to you.

--shashi

Re: Clustering techniques, tips and tricks

Grant Ingersoll-2
In reply to this post by Bogdan94202

On Jan 1, 2010, at 10:45 AM, Bogdan Vatkov wrote:
> I do not know how this updatable clustering works (using previous results as
> centroids for new clusterings), is there an example I could see in action?
> Additionally I would like to see an example of how could one combine Canopy
> and k-means,  I just saw this described in theory somehow but could not find
> an example of it.

You should just be able to run Canopy and then use the output of that as the value for --clusters on the KMeansDriver.
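The exact driver invocation differs between Mahout versions, so rather than guess at command-line flags, here is an in-memory sketch of what the Canopy step contributes: cheap, approximate centers controlled by two thresholds (T1 > T2), which are then handed to k-means as its initial centroids (the role of the --clusters input). Class and method names are illustrative only.

import java.util.*;

public class CanopySeeds {
    // Toy canopy pass (t1 > t2): pick a point, average everything within t1 of
    // it to form a canopy center, and remove everything within t2 so it cannot
    // seed another canopy. The surviving centers become k-means' initial
    // centroids.
    public static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<double[]>();
        LinkedList<double[]> remaining = new LinkedList<double[]>(points);
        while (!remaining.isEmpty()) {
            double[] picked = remaining.removeFirst();
            double[] sum = picked.clone();
            int n = 1;
            for (double[] p : remaining) {
                if (distance(picked, p) < t1) {
                    for (int i = 0; i < sum.length; i++) sum[i] += p[i];
                    n++;
                }
            }
            for (int i = 0; i < sum.length; i++) sum[i] /= n;
            centers.add(sum);
            for (Iterator<double[]> it = remaining.iterator(); it.hasNext();) {
                if (distance(picked, it.next()) < t2) it.remove();
            }
        }
        return centers;
    }

    private static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}

In Mahout the same chaining amounts to pointing the k-means job's initial-clusters input at the Canopy job's output; consult the version you are running for the exact options.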
Reply | Threaded
Open this post in threaded view
|

Re: Clustering techniques, tips and tricks

Grant Ingersoll-2
In reply to this post by Shashikant Kore

On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[hidden email]> wrote:
>>
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>>
>
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
>
> Using vector normalization like L2 norm helped quite a bit. Thanks to
> Ted for this suggestion. In text clustering, you have lots of small
> documents. This results into very sparse vectors (total of 100K
> features with each vector having 200 features.) Using vanilla TFIDF
> weights doesn't work as nicely.
>
> Even if we don't do explicit stop word removal, the threshold values
> for document count does that in a better fashion. If you exclude the
> features which are extremely common (say more than 40% documents) or
> extremely rare (say in less than 50 documents in a corpus of 100K
> docs), you have a meaningful set of features. The current K-Means
> already accepts these parameters.

You mean the Lucene Driver that creates the vectors, right?

>
> Stemming can be used for feature reduction, but it has a minor issue.
> When you want to find out prominent features of the resulting cluster
> centroid, the feature may not be meaningful. For example,  if a
> prominent feature is "beautiful", when you get it back, you will get
> "beauti." Ouch.

Right, but this is easily handled via something like Lucene's highlighter functionality.  I bet it could be made to work on Mahout's vectors (+ a dictionary) fairly easily.
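One simple way to get readable labels back (a hypothetical helper, not existing Mahout or Lucene functionality) is to record, while building vectors, which surface form most often produced each stem and report that form when printing centroid features:

import java.util.HashMap;
import java.util.Map;

public class StemLabels {
    // Remember, for each stem, the most frequent surface form seen while
    // building vectors, so cluster centroids can be reported with readable
    // words ("beautiful") instead of stems ("beauti").
    private final Map<String, Map<String, Integer>> counts =
        new HashMap<String, Map<String, Integer>>();

    public void record(String stem, String surfaceForm) {
        Map<String, Integer> forms = counts.get(stem);
        if (forms == null) {
            forms = new HashMap<String, Integer>();
            counts.put(stem, forms);
        }
        Integer c = forms.get(surfaceForm);
        forms.put(surfaceForm, c == null ? 1 : c + 1);
    }

    public String label(String stem) {
        Map<String, Integer> forms = counts.get(stem);
        if (forms == null) return stem;
        String best = stem;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : forms.entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        return best;
    }
}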


>
> I tried fuzzy K-Means for soft clustering, but I didn't get good
> results. May be the corpus had the issue.
>
> One observation about the clustering process is that it is geared, by
> accident or by design, towards batch processing. There is no
> support for real-time clustering. There needs to be glue which ties
> all the components together to make the process seamless. I suppose,
> someone in need of this feature will contribute it to Mahout.

Right.  This should be pretty easy to remedy, though.  One could simply use the previous results as the --clusters option, right?

>
> Grant,  If I recall more, I will mail it to you.

Great!  Thank you.

>
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search


Re: Clustering techniques, tips and tricks

Grant Ingersoll-2
In reply to this post by Shashikant Kore

On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[hidden email]> wrote:
>>
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>>
>
>
> Using vector normalization like L2 norm helped quite a bit.

As I recall, it is important that the choice of norms aligns with the choice of distance measures, as well as data source (http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors and http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors#3d8310376b6cdf6b)
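One concrete instance of that interaction: for two L2-normalized vectors a and b (so \|a\| = \|b\| = 1), squared Euclidean distance is just a rescaled cosine dissimilarity,

\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2\,(1 - \cos\theta)

so distances between unit-length vectors are a monotone function of their cosine similarity, whereas mixing unnormalized vectors of very different lengths with a Euclidean-style centroid computation measures something quite different.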

Re: Clustering techniques, tips and tricks

Grant Ingersoll-2
In reply to this post by Ted Dunning

On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote:


> - can the clustering algorithm be viewed in a probabilistic framework
> (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> neighbors = not so much)
>
> - is the definition of a cluster abstract enough to be flexible with regard
> to whether a cluster is a model or does it require stronger limits.
> (k-means = symmetric Gaussian with equal variance, Dirichlet = almost any
> probabilistic model)

Can you elaborate a bit more on these two?  I can see a bit on the probability side, as those approaches play a factor in how similarity is determined, but I don't get the significance of "cluster as a model".  Is it just a simplification that then makes it easier to ask: does this document fit into the model?

-Grant

Re: Clustering techniques, tips and tricks

Ted Dunning
Clustering in the k-means style can be viewed as fitting a mixture model to
the data.  A mixture model is a model of the generation of random data where
you first pick one of n sub-models and then generate a point from the
sub-model.  There are parameters that express the probability of each
sub-model and then the parameters for each sub-model.

With k-means, you don't explicitly compute the mixing parameters and you
assume that the sub-models are symmetric Gaussians with fixed variance.

One of the simplest algorithms for this is the so-called E-M algorithm.  For
mixture distributions, this algorithm breaks into two passes.  The first
assigns data points to models, the second estimates each sub-model from the
data assigned to that model.  This algorithm is easy to implement and works
well in some cases, particularly for k-means.

Unfortunately, the E-M algorithm as it stands is doing what is called a
maximum likelihood fit.  If you take k-means and try to use a more elaborate
model than identical Gaussians, the use of maximum likelihood instantly
causes horrible problems because if you allow the variance to vary, the
algorithm will position some clusters on a single point and drive the
variance to zero.  The result is some point-like clusters that each describe
a single observation, and a few diffuse clusters to handle the rest of the
data.  Generalization goes out the window because the point-like clusters
won't describe new data at all.

So... back to your question.  k-means, LDA and the Dirichlet clustering are
all implementations of E-M algorithms for fitting mixture distributions (or
first cousins to avoid the max-likelihood traps).  This allows very nice
mathematical frameworks to be brought to bear on the problem of how to make
these algorithms work well.  For the second question, the distinction is
based on the fact that k-means is pretty inflexible with regard to the
sub-models because of the maximum likelihood problems.  The Dirichlet
clustering tackles this problem head-on and thus allows much more general
models.

This transition from purely distance based metaphors to probabilistic
mixture models as a fundamental metaphor for clustering is a difficult one
to make.  You can work to interpret the mixture models in terms of the
distance based metaphor, but I find that counter-productive because the
analogies used lead to some very dangerous algorithms without obvious
repair.  Viewing the problem as a statistical learning problem not only
makes the extension of the algorithms to new domains easier, but it provides
nice fixes to the problems encountered.  If you want to stick with the
distance (similarity) based metaphor, you can view the probability p(X |
m_i) of an observed data point X given the i-th model m_i as the similarity
of the data X to model m_i.  Then the estimation of the new version of the
model from a number of data points can be considered to be the analog of the
centroid computation in k-means.
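For reference, a compact way to write down the mixture model and the two E-M passes described above, for spherical Gaussian sub-models with fixed variance \sigma^2 (this is the textbook form, not a description of Mahout's exact implementation):

p(x) = \sum_{i=1}^{k} \pi_i \, p(x \mid m_i), \qquad p(x \mid m_i) = \mathcal{N}(x \mid \mu_i, \sigma^2 I)

E-step (assign data points to models, softly):

r_{ni} = \frac{\pi_i \, p(x_n \mid m_i)}{\sum_{j} \pi_j \, p(x_n \mid m_j)}

M-step (re-estimate each sub-model from the data assigned to it):

\mu_i = \frac{\sum_n r_{ni}\, x_n}{\sum_n r_{ni}}, \qquad \pi_i = \frac{1}{N} \sum_n r_{ni}

As \sigma \to 0 the responsibilities r_{ni} collapse to hard 0/1 assignments and the \mu_i update becomes the ordinary centroid computation, which is the sense in which k-means fixes the sub-models and does not explicitly compute the mixing parameters.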


--
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Ian Holsman (Lists)
In reply to this post by Shashikant Kore
On 1/2/10 6:15 PM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll<[hidden email]>  wrote:
>    
>> The other thing I'm interested in is people's real world feedback on using clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout isn't applicable)
>>
>>      
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
>
> Using vector normalization like L2 norm helped quite a bit. Thanks to
> Ted for this suggestion. In text clustering, you have lots of small
> documents. This results into very sparse vectors (total of 100K
> features with each vector having 200 features.) Using vanilla TFIDF
> weights doesn't work as nicely.
>    

I'm not sure what the L2 norm is, but wouldn't the frequent pattern mining
feature (from MAHOUT-157) help here? I was hoping to use it for feature
reduction.



RE: Clustering techniques, tips and tricks

Pallavi Palleti-2
In reply to this post by Shashikant Kore
Here are my two cents.

I experimented with both k-means and fuzzy k-means on data similar to MovieLens, and what I observed is that the initial cluster seeds matter a lot. Initially, I used canopy clustering seeds as the initial seeds, but the results weren't good, and the number of clusters depends on the distance thresholds we give as input. Later, I tried randomly selecting some points from the input dataset as initial seeds. Again, the results were not good. Now, I choose initial seeds from the input set in such a way that the points are far from each other, and I have observed better clustering with fuzzy k-means. I have not implemented a map-reducible version of this seed selection yet; I will implement one soon and submit a patch.
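A minimal sketch of that "seeds far from each other" selection (a deterministic farthest-first variant; k-means++ does the same thing but samples proportionally to squared distance) - plain Java, class and method names are illustrative only:

import java.util.ArrayList;
import java.util.List;

public class FarthestFirstSeeds {
    // Pick k seeds so that each new seed is the point farthest from all seeds
    // chosen so far (farthest-first traversal).
    public static List<double[]> pick(List<double[]> points, int k) {
        List<double[]> seeds = new ArrayList<double[]>();
        seeds.add(points.get(0)); // any starting point will do
        while (seeds.size() < k) {
            double[] farthest = null;
            double farthestDist = -1;
            for (double[] p : points) {
                double nearest = Double.MAX_VALUE;
                for (double[] s : seeds) nearest = Math.min(nearest, dist(p, s));
                if (nearest > farthestDist) { farthestDist = nearest; farthest = p; }
            }
            seeds.add(farthest);
        }
        return seeds;
    }

    private static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}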

Also, regarding Canopy, I found that the way canopy clustering is defined in theory differs from what we have in Mahout. Perhaps someone can clarify that.

Thanks
Pallavi
 

Re: Clustering techniques, tips and tricks

Ted Dunning
Pallavi,

This is very useful feedback.

What you have done is very similar to the k-means++ algorithm and it is
clearly a very good thing.

There is already an issue for tracking a k-means++ implementation:
http://issues.apache.org/jira/browse/MAHOUT-153

Could you post your patch there?

On Mon, Jan 4, 2010 at 4:03 AM, Palleti, Pallavi <
[hidden email]> wrote:

> Initially, I used canopy clustering seeds as initial seeds but the results
> weren't good and the number of clusters depends on the distance thresholds
> we give as input. Later, I have considered randomly selecting some points
> from the input dataset and consider them as initial seeds. Again, the
> results were not good. Now, I have chosen initial seeds from input set in
> such a way that the points are far from each other and I have observed
> better clustering using Fuzzy Kmeans. I have not implemented a map-reducable
> version for this seed selection. I will soon implement a map-reducable
> version and submit a patch.
>



--
Ted Dunning, CTO
DeepDyve

Re: Clustering techniques, tips and tricks

Dawid Weiss-2
In reply to this post by Ted Dunning
> The carrot2 implementation is among the best that I have used in terms of
> the quality of clustering small batches of results such as search result
> lists.

(blushing). Thanks, it's a big compliment coming from you, Ted.

RE: Clustering techniques, tips and tricks

Vaijanathrao
In reply to this post by Pallavi Palleti-2
Hi,

I have another suggestion that I am working on; it's not yet complete, but I should be able to submit a patch in the coming weeks.

The idea is to have two distances, one each for
A) across the cluster
B) within the cluster

These distances drive the inclusion or exclusion of an input data point in a cluster.

In iteration 1 I allow infinite distance for both of these distances and run the regular clustering logic. At the end of it, I calculate the min and max for both distances.

For the remaining iterations I do the following:

A) Find the lowest-distance cluster for the input point. If the distance of the input point exceeds the previously established within-cluster distance, ignore that assignment and create a new cluster with this point. If it does not violate the older distance, keep it with the following probability:
        1. 1, if the distance of the input point is less than the min distance.
        2. otherwise, the ratio (max - distance of the input from the center) / max_distance.

B) If there are two clusters whose inter-cluster distance is less than the minimum cluster distance, merge them and recalculate the within-cluster and across-cluster distances.

I have observed that this has helped me in getting good clusters. Currently I have a single-machine version of the code with the above calculation, and I am working on getting this into map-reduce; I should be able to submit a patch very soon.
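One possible reading of the rule in A), sketched in code (the "otherwise start a new cluster" branch is an assumption; the description above does not say what happens when the keep probability fails):

import java.util.Random;

public class TwoDistanceRule {
    // Given the distance d from a point to its closest cluster center and the
    // min/max within-cluster distances observed in the previous iteration,
    // decide whether to keep the point in that cluster or split it off.
    private static final Random RANDOM = new Random();

    public static String decide(double d, double minDist, double maxDist) {
        if (d > maxDist) return "new-cluster";   // too far: start a new cluster
        if (d <= minDist) return "keep";         // clearly inside: always keep
        double keepProbability = (maxDist - d) / maxDist;
        return RANDOM.nextDouble() < keepProbability ? "keep" : "new-cluster";
    }
}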

Let me know if you think this will generally work.

--Thanks and Regards
Vaijanath
