[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786144#action_12786144 ]

Drew Farris commented on MAHOUT-11:

Thanks very much for the review Isabel.

Regarding the canopyId adjustments in TestMeanShift: Prior to the patch, classes like Canopy and MeanShiftCanopy assigned their ids in their constructors using a static field. I've changed this so that classes that create Canopy/MeanShiftCanopy objects are responsible for assigning cluster ids. As a result, there is no longer a need to offset the canopy id's in TestMeanShift because both the reference set and map/reduce (clusterer) generated canopies start are able to generate canopies starting at an id of 0.

Using the static fields in the Classes themselves and then having to work around that in the unit tests felt a little funny to me. It could be limiting in cases where id's must be assigned from some external source, or perhaps to be consistent in a distributed environment. More specifically it seemed to go along with the spirit of the original issue and pushing the responsibility for id management up to the users of the classes seems like a good thing.

FWIW, While reviewing this, I happened to noticed that the nextCanopyId field could be removed from MeanShiftCanopyClusterer entirely. Can you fix this prior to commit Isabel, or should I submit another patch?

> Static fields used throughout clustering code (Canopy, K-Means).
> ----------------------------------------------------------------
>                 Key: MAHOUT-11
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-11
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Dawid Weiss
>             Fix For: 0.3
>         Attachments: MAHOUT-11-all-cleanup-20091128.patch, MAHOUT-11-kmeans-cleanup.patch, MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
> I file this as a bug, even though I'm not 100% sure it is one. In the currect code the information is exchanged via static fields (for example, distance measure and thresholds for Canopies are static field). Is it always true in Hadoop that one job runs inside one JVM with exclusive access? I haven't seen it anywhere in Hadoop documentation and my impression was that everything uses JobConf to pass configuration to jobs, but jobs are configured on a per-object basis (a job is an object, a mapper is an object and everything else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is a limitation and bug in our code that needs to be addressed.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.