Using Canopy algorithm to create clusters from command line--Issues

ashokharnal
I am trying to cluster the following example set of (x,y) coordinates:

(1,1) , (2,1) , (1,2), (2,2), (3,3), (8,8), (8,9), (9,8), (9,9)

These coordinates should form two clusters:
(1,1) , (2,1) , (1,2), (2,2), (3,3)
AND
(8,8), (8,9), (9,8), (9,9)

This is how I proceeded:
Step 1: Stored the coordinates as tab-separated data in a file on Hadoop as:
1 1
2 1
1 2
2 2
3 3
8 8
8 9
9 8
9 9
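For reference, a minimal sketch of how such a file could be written and checked locally before uploading. The file name my.data is the one used in Step 2; the upload line is left commented out because it assumes a working Hadoop setup and a target path that is not stated here.

```shell
# Write the nine points, one per line, with a literal tab between x and y.
# printf reuses the format string for each pair of arguments.
printf '%s\t%s\n' 1 1 2 1 1 2 2 2 3 3 8 8 8 9 9 8 9 9 > my.data

# Sanity checks: count the lines, and count lines that do not have
# exactly two tab-separated fields.
lines=$(wc -l < my.data | tr -d ' ')
bad=$(awk -F'\t' 'NF != 2' my.data | wc -l | tr -d ' ')
echo "lines=$lines malformed=$bad"

# Upload step (assumption: default HDFS home directory as the target):
# hadoop fs -put my.data my.data
```

With the data above this should report lines=9 malformed=0.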
Step 2:
Converted this file into a sequence file as:
mahout seqdirectory --input my.data --output kdraft -c UTF-8

Step 3:
Created a sparse vector file as:
mahout seq2sparse -i kdraft -o kfinal -wt tf

Step 4:
Ran the Canopy algorithm to generate clusters as:
mahout canopy -i kfinal/tf-vectors --clustering -o xz -t1 5 -t2 2 -ow

Step 5:
Dumped the clusters with clusterdump as:
mahout clusterdump --input xz/clusters-0-final  --pointsDir xz/clusteredPoints/ --output /home/ashokharnal/data/c.txt

The output appears in c.txt as:

C-0{n=1 c=[4.000, 2.000, 4.000, 4.000, 4.000] r=[]}
        Weight : [props - optional]:  Point:
        1.0: [4.000, 2.000, 4.000, 4.000, 4.000]
       
I want to know which coordinates are in which cluster, but I cannot tell from this output.
I would be grateful for any help.
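
One observation on the output, offered as a guess rather than a confirmed diagnosis: counting how often each token (1, 2, 3, 8, 9) occurs in the input file gives four counts of 4 and one count of 2, which are the same numbers as in the single vector in c.txt, only in a different order. This suggests that the whole file may have been treated as one text document and turned into one term-frequency vector, instead of nine 2-D points. The sketch below only reproduces the counts; the file name tokens_check.data is hypothetical.

```shell
# Recreate the nine points in a scratch file (hypothetical name).
printf '%s\t%s\n' 1 1 2 1 1 2 2 2 3 3 8 8 8 9 9 8 9 9 > tokens_check.data

# Count how many times each token appears anywhere in the file,
# ignoring which line or column it came from.
counts=$(awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
              END { for (t in count) print t, count[t] }' tokens_check.data | sort -n)
echo "$counts"
```

If the counts match the vector components, the clustering saw one 5-dimensional vector (one dimension per distinct token), which would explain n=1 in the dump.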