[
https://issues.apache.org/jira/browse/MAHOUT4?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12584544#action_12584544 ]
Ankur commented on MAHOUT4:

Hi,
So after scratching my head for past couple of days and
understanding EM from a general perspective I built a mental model
for EM in the context of clustering.
I thought its better to share my understanding for getting inputs before
turning out the code.
So here is a short writeup in my words, please feel free to
fill any gaps/errors found
Expectation Maximization for clustering

Let
z = unobserved data, clusters in our case.
y = observed data, points in our case.
We start by initializing the model paramters p(yz) and p(z) to appropriately
normalized random values.
For eg: let number of points = 4 and number of clusters = 2
then for cluster z1
p(y1z1) + p(y2z1) + p(y3z1) + p(y4z1) = 1
p(z1) + p(z2) = 1
EStep.

Calculate posteriori estimates of probabilities for various
values of the unknown z as follows:
p(y,z) p(yz)*p(z)
p(zy) =  = 
p(y) summation over z { p(yz)*p(z) }
Calculate expected value of log likelihood as follows:
Q(y) = summation over z { p(zy) * log(p(y,z))}
Q(y) remains unchanged if the calue calculated is = previous Q(y) value.
The algorithm terminates when no improvement is seen for Q(y) for all y
Use Q(y) as the value for p(y,z) for MStep to recompute model parameters
p(yz) and p(z)
MStep

p(y,z) (from EStep)
p(yz) = 
summation over y { p(y,z) } (from EStep)
p(z) = summation over y { p(y,z) } (from EStep)
Questions
=========
1. When and how do we recompute the cluster centers ?
2. As per my understanding points and clusters are simply labels with some
conditional probability assigned to them. A distance metric like one
used in Kmeans is nowhere involved, is that correct ?
> Simple prototype for Expectation Maximization (EM)
> 
>
> Key: MAHOUT4
> URL:
https://issues.apache.org/jira/browse/MAHOUT4> Project: Mahout
> Issue Type: New Feature
> Reporter: Ankur
> Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization  EM that demonstrates the algorithm functionality given a set of (user, clickurl) data.
> The prototype should be functionally complete and should serve as a basis for the MapReduce version of the EM algorithm.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.