[jira] Updated: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-11:

    Attachment: MAHOUT-11-kmeans-cleanup.patch

Attached a patch that takes Isabel's original patch to remove static fields in kmeans clustering, makes the discussed change for the output collectors, and cleans up some warnings and an unused instance of the convergenceDelta variable. It also fixes the RandomSeedGenerator in kmeans clustering and adds a unit test for it. Also, KMeansClusterer no longer extends Cluster -- it wasn't necessary to do so.

Isabel, are you planning on taking a crack at the rest of the clustering code that uses static fields? I'm finding this issue a great way to become familiar with the code, and if you're not already intending to work on it, I could give it a try.

> Static fields used throughout clustering code (Canopy, K-Means).
> ----------------------------------------------------------------
>                 Key: MAHOUT-11
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-11
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>             Fix For: 0.3
>         Attachments: MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
> I'm filing this as a bug, even though I'm not 100% sure it is one. In the current code, information is exchanged via static fields (for example, the distance measure and thresholds for Canopies are static fields). Is it always true in Hadoop that one job runs inside one JVM with exclusive access? I haven't seen it anywhere in Hadoop documentation and my impression was that everything uses JobConf to pass configuration to jobs, but jobs are configured on a per-object basis (a job is an object, a mapper is an object and everything else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is a limitation and bug in our code that needs to be addressed.
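
To illustrate the concern raised in the description, here is a minimal sketch of how static configuration can clash when two jobs share one JVM, and how per-instance configuration avoids it. Every name in it (StaticFieldSketch, StaticCanopy, InstanceCanopy, the "t1" key) is hypothetical and is not taken from the Mahout code or the attached patches; a plain Map stands in for Hadoop's JobConf so the example runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these are not Mahout's actual classes.
public class StaticFieldSketch {

    // Problematic variant: the threshold is a static field, so there is
    // exactly one value per JVM, shared by every job that runs in it.
    static class StaticCanopy {
        static double t1;

        static void configure(Map<String, String> conf) {
            t1 = Double.parseDouble(conf.get("t1"));
        }

        boolean covers(double distance) {
            return distance < t1;
        }
    }

    // Per-instance variant: each object reads its own job's configuration
    // and keeps the threshold in an instance field.
    static class InstanceCanopy {
        private final double t1;

        InstanceCanopy(Map<String, String> conf) {
            this.t1 = Double.parseDouble(conf.get("t1"));
        }

        boolean covers(double distance) {
            return distance < t1;
        }
    }

    // Builds a tiny stand-in for a job configuration (assumed key: "t1").
    static Map<String, String> conf(double t1) {
        Map<String, String> c = new HashMap<>();
        c.put("t1", Double.toString(t1));
        return c;
    }

    public static void main(String[] args) {
        // Job A wants t1 = 3.0, job B wants t1 = 1.0, both in one JVM.
        StaticCanopy.configure(conf(3.0));
        StaticCanopy jobA = new StaticCanopy();
        StaticCanopy.configure(conf(1.0)); // job B overwrites job A's value

        // Job A now sees job B's threshold: 2.0 < 1.0 is false.
        System.out.println("static, job A covers 2.0:   " + jobA.covers(2.0));

        InstanceCanopy a = new InstanceCanopy(conf(3.0));
        InstanceCanopy b = new InstanceCanopy(conf(1.0));

        // Each instance keeps its own threshold.
        System.out.println("instance, job A covers 2.0: " + a.covers(2.0));
        System.out.println("instance, job B covers 2.0: " + b.covers(2.0));
    }
}
```

Whether Hadoop ever runs two jobs in one JVM is exactly the open question in the description; the sketch only shows what would go wrong if it did, and why reading configuration into instance fields sidesteps the question.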

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.