[jira] Commented: (MAHOUT-399) LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.


Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894640#action_12894640 ]

Ted Dunning commented on MAHOUT-399:

I think that this needs more study.  I got an email from Mike (quoted below), and it does seem reasonably likely that there is still a serious problem.  The difficulty is that I respect both Mike's and David's opinions pretty highly, and they seem to draw incompatible conclusions.  That still leaves me with the feeling that a problem is reasonably likely (> 10% chance at least).

Hi Ted,

I have implemented a parallel version of LDA in C# that separates the processing, but not the data.  It is based on collapsed Gibbs sampling.  And it converges to the correct solution on the overlapping pyramids dataset.
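
For readers unfamiliar with collapsed Gibbs sampling for LDA, a minimal single-threaded sketch in Python follows. It is not the C# implementation mentioned above and is not Mahout's code; the function name, the hyperparameter defaults (alpha, beta), and the iteration count are all assumptions chosen for illustration.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (topic_word_counts, doc_topic_counts).
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of each token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                # Record the newly sampled assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_kw, n_dk
```

Sorting each row of the returned topic-word counts in descending order gives a per-topic word ranking of the same kind as the listings in the issue description below.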

The last e-mail from David Hall indicated to me that he did not think the result for the dataset was conclusive evidence that there is a bug.  I disagree.  The statistics of the dataset are overwhelming.  And when you look at the computed likelihood of the corpus, it typically reaches its maximum at 5 topics.

It took me a while to get Hadoop up and running on EC2 and then to get the Mahout examples running.  After David's e-mail indicating he did not think the result was conclusive, I decided to implement something for the environment I am working in.

I did not see much in the way of documentation for the Mahout implementation, but my guess at the algorithm was that it was using a variational method.  Since I have not implemented that approach, I do not have an idea where the bug is yet.

Blei's C implementation converges as well.  On rare occasions it does not, but rerunning it almost always yields convergence.

I have run David Hall's implementation for different numbers of topics and repeatedly for each number of topics.  It has never converged.

I did send a document along describing the dataset and providing a sample so that someone else could corroborate the result.  I may have made a procedural error in running LDA even though I think I ran everything correctly.  

I would be interested in looking at the variational approach and then trying to debug the current algorithm, but I do not have time to do that at the moment.  Another option would be to convince David Hall to take a second look.

I hope that helps a little.  I would be happy to talk to anyone in more detail.


> LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>                 Key: MAHOUT-399
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-399
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.3
>         Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
>            Reporter: Michael Lazarus
>            Priority: Critical
>         Attachments: olt.tar, Overlapping Pyramids Toy Dataset.pdf
> Hello,
> Apologies if I have not labeled this correctly.
> I have run a toy problem on Mahout 0.3 (locally) for LDA that I used to test Blei's C version of LDA that he posts on his site. It has an exact solution that LDA should converge to.  Please see the attached PDF that describes the intended output.
> Is LDA working?  The following output indicates some sort of collapsing behavior to me.
> T0 T1 T2 T3 T4
> x w x u x
> u u g j n
> l r i m l
> j q h h p
> v p e i q
> e t f g v
> d s d f o
> b c b n k
> y f c l m
> w v u v u
> c d p y t
> k o l r r
> i b j k j
> f e k e f
> g x y s y
> t y w b w
> h i s p s
> o l v x d
> q j t d i
> n k o t b
> The intended output is (again, please see attached):
> D I N S X
> d i n s x
> c h m t y
> e j o r w
> b k l u v
> f g p q a
> a f k p b
> g l q v u
> h m j w t
> y u r o c
> n s d d i
> s e x f f
> r q i i n
> m v w c o
> o w u a h
> q n s h g
> p t c x d
> t x f e l
> x d e j s
> w y g b j
> i r y n r
> u o h y m
> k b t l e
> v c a m k
> j a b g p
> l p v k q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.
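
One way to make the comparison between the two listings above quantitative is sketched below in Python. It takes the top five words of each column exactly as printed, tries every pairing of Mahout topics with intended topics, and reports the best mean Jaccard overlap. The helper names and the choice of top-five Jaccard overlap as the metric are assumptions made for illustration; this is not a test that Mahout itself is known to run.

```python
from itertools import permutations

# Top five words per topic, copied from the first five rows of each listing.
MAHOUT = {"T0": "xuljv", "T1": "wurqp", "T2": "xgihe",
          "T3": "ujmhi", "T4": "xnlpq"}
INTENDED = {"D": "dcebf", "I": "ihjkg", "N": "nmolp",
            "S": "struq", "X": "xywva"}

def jaccard(a, b):
    """Overlap of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_alignment(found, expected):
    """Topic labels are arbitrary, so try every pairing of found topics
    with expected topics and keep the one with the highest mean overlap."""
    names = list(expected)
    best_score, best_map = -1.0, None
    for perm in permutations(found):
        score = sum(jaccard(found[f], expected[e])
                    for f, e in zip(perm, names)) / len(names)
        if score > best_score:
            best_score, best_map = score, dict(zip(perm, names))
    return best_score, best_map

score, mapping = best_alignment(MAHOUT, INTENDED)
print(f"best mean top-5 overlap: {score:.2f}")
print(mapping)
```

A run that recovers the intended topics scores 1.0 under this measure regardless of how the topics are labeled, so a regression test could assert a score above some threshold chosen by the implementer.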

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.