[
https://issues.apache.org/jira/browse/MAHOUT4?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12586417#action_12586417 ]
Ted Dunning commented on MAHOUT4:

EM clustering is very seriously prone to overfitting if you give the clusters reasonable flexibility.
An important adjustment is to put a reasonable prior on the distributions being mixed; the prior acts as regularization that helps avoid the problem. K-means (sort of) avoids the problem by assuming that all clusters are symmetric with identical variance.
EM clustering can also be changed very slightly by assigning each point to a single cluster chosen at random according to its probability of membership. This turns EM clustering into Gibbs sampling. The important property that changes is that you can now sample over the distribution of possible clusterings, which can matter a great deal when some parts of your data are well defined and other parts are not.
A further extension is to assume an infinite mixture model. The implementation is only slightly more difficult, and the result is a (nearly) nonparametric clustering algorithm. I will attach an R implementation for reference.
> Simple prototype for Expectation Maximization (EM)
> 
>
> Key: MAHOUT4
> URL: https://issues.apache.org/jira/browse/MAHOUT4
> Project: Mahout
> Issue Type: New Feature
> Reporter: Ankur
> Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization (EM) that demonstrates the algorithm's functionality given a set of (user, clickurl) data.
> The prototype should be functionally complete and should serve as a basis for the MapReduce version of the EM algorithm.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.