

This post has NOT been accepted by the mailing list yet.
Hi,
I'm new to Mahout (and machine learning) but have done quite a lot of reading, especially "Mahout in Action".
I'm trying to cluster users based on their profiles.
By profile I mean attributes such as age, gender, location and a set of interests.
All the examples I have seen so far were about vectors whose dimensions are all of a similar type (e.g. occurrence of words in text), but in my case each dimension is of a different "type" and seems to require a different distance measure.
* Gender has two possible values: what distance measure should I use here?
* Age has a larger set of possible values: euclidean distance?
* Location, expressed as latitude and longitude: euclidean distance, but between pairs of points?
* Interests, if expressed as a subset of a finite set: the distance would be based on the number of items shared between two vectors. I assume I can write a custom distance measure for it.
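To make the four cases above concrete, here is a minimal sketch of one possible distance per attribute type. It is plain Java and does not use any Mahout API; the class and method names are made up for illustration. Two choices are assumptions rather than anything from the question: location uses the haversine (great-circle) distance instead of plain euclidean distance on latitude/longitude, and interests use the Jaccard distance, which turns "number of shared items" into a value between 0 and 1.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of Mahout.
final class AttributeDistances {
    private AttributeDistances() {}

    // Gender: simple mismatch distance, 0 when equal and 1 otherwise.
    static double gender(int a, int b) {
        return a == b ? 0.0 : 1.0;
    }

    // Age: absolute difference in years (1-D euclidean distance).
    static double age(double a, double b) {
        return Math.abs(a - b);
    }

    // Location: haversine great-circle distance in kilometres.
    // Inputs are in degrees; the Earth radius of 6371 km is an approximation.
    static double location(double lat1, double lon1, double lat2, double lon2) {
        final double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(h));
    }

    // Interests: Jaccard distance, 1 - |A intersect B| / |A union B|.
    // Two empty sets are treated as identical (distance 0).
    static double interests(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }
}
```

Note that these four functions return values on very different scales (0 to 1, years, kilometres), so they would need rescaling or weighting before being added together.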
Beyond choosing the correct distance measures, I'm not sure how to combine them into one clustering process.
As I said, I'm new to this field so any help would be much appreciated.
Thanks,
Raviv


Looking at the problem from a developer's perspective, the question is this:
Can I develop a custom vector where each dimension has a different data type (where the type can be complex, e.g. Set<String>)
and use a different distance measure class for each dimension?
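As a rough illustration of what that question is asking for, here is a sketch of a typed profile object with one distance function per field, combined as a weighted sum. Everything here is hypothetical (Profile, CompositeDistance, the weights, the assumption that ages fall within 0..100) and none of it is Mahout code. Mahout's own clustering works on its numeric Vector type, so an object like this would most likely need to be flattened into numbers before it could be passed to Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical typed profile: each field keeps its natural Java type.
final class Profile {
    final boolean female;
    final int age;
    final Set<String> interests;

    Profile(boolean female, int age, Set<String> interests) {
        this.female = female;
        this.age = age;
        this.interests = interests;
    }
}

// Hypothetical composite measure: a weighted sum of per-field distances.
final class CompositeDistance {
    private final List<ToDoubleBiFunction<Profile, Profile>> parts = new ArrayList<>();
    private final List<Double> weights = new ArrayList<>();

    CompositeDistance add(double weight, ToDoubleBiFunction<Profile, Profile> part) {
        weights.add(weight);
        parts.add(part);
        return this;
    }

    double distance(Profile a, Profile b) {
        double sum = 0.0;
        for (int i = 0; i < parts.size(); i++) {
            sum += weights.get(i) * parts.get(i).applyAsDouble(a, b);
        }
        return sum;
    }
}

final class CompositeDemo {
    // Builds a measure with illustrative weights: 1 for gender and age, 2 for interests.
    static CompositeDistance build() {
        return new CompositeDistance()
            .add(1.0, (a, b) -> a.female == b.female ? 0.0 : 1.0)
            .add(1.0, (a, b) -> Math.abs(a.age - b.age) / 100.0) // assumes ages within 0..100
            .add(2.0, (a, b) -> {
                // Count the interests held by exactly one of the two profiles.
                Set<String> either = new HashSet<>(a.interests);
                either.addAll(b.interests);
                Set<String> common = new HashSet<>(a.interests);
                common.retainAll(b.interests);
                either.removeAll(common);
                return (double) either.size();
            });
    }

    public static void main(String[] args) {
        Profile p = new Profile(true, 41, new HashSet<>(Arrays.asList("B")));
        Profile q = new Profile(false, 31, new HashSet<>(Arrays.asList("B", "C")));
        System.out.println(build().distance(p, q));
    }
}
```

The design choice here is that each field's distance lives in its own function, so a measure for one field can be swapped without touching the others.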


If you write a custom vector, you might need to update other classes as well, not just the distance measure. But it looks like your vectors can be brought to a plain vector<int>:
* Gender has two possible values: encode it as an integer, e.g. 1 or 2.
* Age has a larger set of possible values: another integer.
* Location, expressed as latitude and longitude: I think these can be separate X and Y values, each going in as its own dimension.
* Interests, if expressed as a subset of a finite set: this can be a boolean vector like [0,1,0,0,0,1].
There is a way to assign a weight to each of these dimensions when calculating the distance, so that the [0,1,0,0,0,1] entries are not overweighted. Please check this online.
If the interest set is going to be too big, then this is probably not a good idea.
Abin
On Thu, Jan 12, 2012 at 3:40 PM, Raviv Pavel [via Lucene] <[hidden email]> wrote:
Looking at the problem from a developer's perspective, the question is this:
Can I develop a custom vector where each dimension has a different data type (where the type can be complex, e.g. Set<String>)
and use a different distance measure class for each dimension?
Abin Varghese
Software Engineer
NY
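The per-dimension weighting mentioned in this reply could look roughly like the sketch below: a euclidean distance in which each squared difference is multiplied by a weight. This is plain Java written for illustration, with a made-up class name. Mahout's distance package does include a WeightedEuclideanDistanceMeasure, which may be what the reply has in mind, but this sketch does not call it and makes no claim about how it is implemented.

```java
// Hypothetical weighted euclidean distance over flat numeric vectors.
final class WeightedEuclidean {
    private final double[] weights;

    WeightedEuclidean(double[] weights) {
        this.weights = weights.clone();
    }

    // sqrt( sum_i w_i * (a_i - b_i)^2 )
    double distance(double[] a, double[] b) {
        if (a.length != weights.length || b.length != weights.length) {
            throw new IllegalArgumentException("vector length must match weights");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += weights[i] * d * d;
        }
        return Math.sqrt(sum);
    }
}
```

For example, giving each of N interest dimensions a weight of 1/N would keep a long interest block from dominating age or location. That weighting scheme is an assumption, not something the reply specifies.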


My initial plan was to do exactly that: use 0 and 1 for gender, age as is, lat and lon in two dimensions, and one dimension per possible interest holding 0 or 1 (each interest value is mapped to an offset in the vector).
For simplicity let's assume I have 3 types of interests, so a vector of a person would look like this:
d[0] = 1 (gender)
d[1] = 15.5 (latitude)
d[2] = 50.5 (longitude)
d[3] = 41 (age)
d[4] = 0 (not interested in A)
d[5] = 1 (interested in B)
d[6] = 0 (not interested in C)
I'm probably misunderstanding something here, but with this approach no single built-in distance measure will take into account that dimensions 1 and 2 should be compared as a pair using euclidean distance, and dimensions 4, 5 and 6 should be compared by counting the common values between two vectors.
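One way to handle this is a single custom measure that knows the layout of the flat vector and treats each block of dimensions differently. The sketch below uses the d[0]..d[6] layout from this post. It is plain Java on double[] arrays, not a Mahout class, and the weights are hypothetical parameters. In Mahout the same logic would presumably sit inside a class implementing the DistanceMeasure interface. Note one deliberate change: instead of counting common interest values, it counts the positions where the two vectors differ, so that the result grows as the profiles become less alike.

```java
// Hypothetical block-aware distance for the layout in the post:
// d[0] gender, d[1] latitude, d[2] longitude, d[3] age, d[4..] interests.
final class ProfileVectorDistance {
    static final int GENDER = 0;
    static final int LAT = 1;
    static final int LON = 2;
    static final int AGE = 3;
    static final int INTERESTS_START = 4;

    private ProfileVectorDistance() {}

    static double distance(double[] a, double[] b,
                           double wGender, double wLocation,
                           double wAge, double wInterests) {
        if (a.length != b.length || a.length < INTERESTS_START) {
            throw new IllegalArgumentException("vectors must share the profile layout");
        }
        // Gender: 0 if equal, 1 otherwise.
        double gender = a[GENDER] == b[GENDER] ? 0.0 : 1.0;

        // Location: dimensions 1 and 2 compared as one 2-D point.
        double dLat = a[LAT] - b[LAT];
        double dLon = a[LON] - b[LON];
        double location = Math.sqrt(dLat * dLat + dLon * dLon);

        // Age: absolute difference.
        double age = Math.abs(a[AGE] - b[AGE]);

        // Interests: number of positions where the 0/1 flags differ.
        int mismatches = 0;
        for (int i = INTERESTS_START; i < a.length; i++) {
            if (a[i] != b[i]) {
                mismatches++;
            }
        }
        return wGender * gender + wLocation * location
             + wAge * age + wInterests * mismatches;
    }
}
```

The weights matter because the blocks are on different scales: an age gap of 10 would otherwise outweigh any difference in gender or in a handful of interests.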


Hi,
I am having the same problem. Please let me know if you have found a solution.
Thanks
Nirmal


Remind me, what was the problem?




It was a long time ago, but I think I created a custom distance measure.

