Clustering user profiles

Raviv Pavel
Hi,

I'm new to Mahout (and machine learning) but have done quite a lot of reading, especially "Mahout in Action".

I'm trying to cluster users based on their profiles.
By profile I mean attributes such as age, gender, location and a set of interests.

All the examples I have seen so far were about vectors whose dimensions are all of a similar type (e.g. occurrence of words in text), but in my case each dimension is of a different "type" and seems to require a different distance measure.

* Gender has two possible values - what distance measure should I use here?
* Age has a larger set of possible values - euclidean distance?
* Location, expressed as latitude and longitude - euclidean distance, but between pairs of points?
* Interests, if expressed as a subset of a finite set - so the distance would be based on the number of items shared between two vectors. I assume I can write a custom distance measure for it.

Other than deciding on correct distance measures, I'm not sure how to combine them into one clustering process.
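
As a rough illustration of the question above, here is a minimal sketch of one distance per attribute type, each scaled to the range 0 to 1 and then combined as a weighted average (a Gower-style combination, which is one option among several and not something the thread settles on). It is plain Java and does not use the Mahout API. The class names, the field names, the 100-year age cap and the choice of haversine for location and Jaccard for interests are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical profile holder; class and field names are illustrative only. */
final class UserProfile {
    final int gender;              // 0 or 1
    final double age;              // years
    final double latDeg;           // latitude in degrees
    final double lonDeg;           // longitude in degrees
    final Set<String> interests;

    UserProfile(int gender, double age, double latDeg, double lonDeg, Set<String> interests) {
        this.gender = gender;
        this.age = age;
        this.latDeg = latDeg;
        this.lonDeg = lonDeg;
        this.interests = interests;
    }
}

/** One distance per attribute, each in [0, 1], plus a weighted average of them. */
final class MixedProfileDistance {
    private static final double EARTH_RADIUS_KM = 6371.0;
    // Largest possible great-circle distance: half the circumference.
    private static final double MAX_DISTANCE_KM = Math.PI * EARTH_RADIUS_KM;
    // Assumed cap used to scale age differences.
    private static final double MAX_AGE_GAP = 100.0;

    /** 0 if the genders match, 1 otherwise. */
    static double genderDistance(UserProfile a, UserProfile b) {
        return a.gender == b.gender ? 0.0 : 1.0;
    }

    /** Absolute age difference, scaled and clamped to [0, 1]. */
    static double ageDistance(UserProfile a, UserProfile b) {
        return Math.min(1.0, Math.abs(a.age - b.age) / MAX_AGE_GAP);
    }

    /** Haversine great-circle distance, scaled to [0, 1]. */
    static double locationDistance(UserProfile a, UserProfile b) {
        double dLat = Math.toRadians(b.latDeg - a.latDeg);
        double dLon = Math.toRadians(b.lonDeg - a.lonDeg);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double h = sinLat * sinLat
                + Math.cos(Math.toRadians(a.latDeg)) * Math.cos(Math.toRadians(b.latDeg))
                * sinLon * sinLon;
        double km = 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(h)));
        return km / MAX_DISTANCE_KM;
    }

    /** Jaccard distance: 1 - |intersection| / |union|; 0 when both sets are empty. */
    static double interestDistance(UserProfile a, UserProfile b) {
        Set<String> union = new HashSet<>(a.interests);
        union.addAll(b.interests);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a.interests);
        common.retainAll(b.interests);
        return 1.0 - (double) common.size() / union.size();
    }

    /** Weighted average; weights are ordered gender, age, location, interests. */
    static double combined(UserProfile a, UserProfile b, double[] w) {
        double total = w[0] + w[1] + w[2] + w[3];
        if (total <= 0) {
            throw new IllegalArgumentException("weights must sum to a positive value");
        }
        double sum = w[0] * genderDistance(a, b)
                + w[1] * ageDistance(a, b)
                + w[2] * locationDistance(a, b)
                + w[3] * interestDistance(a, b);
        return sum / total;
    }
}
```

Scaling every component to the same range before combining keeps one attribute (for example age in years) from dominating the others simply because of its units; the weights then express how much each attribute should matter.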

As I said, I'm new to this field so any help would be much appreciated.

Thanks,
Raviv

Re: Clustering user profiles

Raviv Pavel
Looking at the problem from a developer's perspective, the question is this:
Can I develop a custom vector where each dimension has a different data type (where the type can be complex, e.g. Set<String>)
and use a different distance measure class for each dimension?


Re: Clustering user profiles

mail2abin
If you write a custom vector, you might need to update other classes as well, not just the distance measure.
But it looks like your vectors can be reduced to a vector<int>:

* Gender has two possible values - what distance measure should I use here? - An integer with two values, e.g. [1, 2].
* Age has a larger set of possible values - euclidean distance? - Another integer.
* Location, expressed as latitude and longitude - euclidean distance, but between pairs of points? - I think this can be split into X and Y, which can go in as separate numeric values.
* Interests, if expressed as a subset of a finite set - so the distance number of items, shared between two vectors. I assume I can write a custom distance measure for it. - This can be a boolean vector like [0,1,0,0,0,1]. There is a way to assign a weight to each of these dimensions when calculating the distance, so that the [0,1,0,0,0,1] entries are not overweighted. Please check this online.

If the interest subset is going to be too big, then this is probably not a good idea.
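
The encoding and per-dimension weighting suggested here can be sketched roughly as follows. This is plain Java written for illustration, not Mahout code; the class name, the method names and the dimension order are assumptions. Mahout's distance package does include weighted measures such as WeightedEuclideanDistanceMeasure, but how to configure it is outside this sketch.

```java
/** Illustrative only: flat numeric encoding plus a weighted Euclidean distance. */
final class WeightedEuclidean {

    /** Encodes a profile as [gender, age, lat, lon, interest_0, ..., interest_n-1]. */
    static double[] encode(int gender, double age, double lat, double lon, boolean[] interests) {
        double[] v = new double[4 + interests.length];
        v[0] = gender;
        v[1] = age;
        v[2] = lat;
        v[3] = lon;
        for (int i = 0; i < interests.length; i++) {
            v[4 + i] = interests[i] ? 1.0 : 0.0;
        }
        return v;
    }

    /** sqrt(sum_i w[i] * (a[i] - b[i])^2); a, b and w must have the same length. */
    static double distance(double[] a, double[] b, double[] w) {
        if (a.length != b.length || a.length != w.length) {
            throw new IllegalArgumentException("vectors and weights must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }
}
```

With many interest dimensions, giving each of them a small weight (for example 1 divided by the number of interests) is one way to keep the boolean block from outweighing the other attributes.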


-Abin



On Thu, Jan 12, 2012 at 3:40 PM, Raviv Pavel [via Lucene] <[hidden email]> wrote:
Looking at the problem from a developer's perspective, the question is this:
Can I develop a custom vector where each dimension has a different data type (where the type can be complex, e.g. Set<String>)
and use a different distance measure class for each dimension?





Abin Varghese
Software Engineer
NY

Re: Clustering user profiles

Raviv Pavel
My initial plan was to do exactly that, use 0 & 1 for gender, age as is, lat & lon in two dimensions, and one dimension holding 0 or 1 per possible interest (each value is mapped to an offset in the dimension)
For simplicity let's assume I have 3 types of interests, so a vector of a person would look like this:

d[0] = 1 (gender)
d[1] = 15.5 (latitude)
d[2] = 50.5 (longitude)
d[3] = 41 (age)
d[4] = 0 (not interested in A)
d[5] = 1 (interested in B)
d[6] = 0 (not interested in C)


I'm probably misunderstanding something here, but with this approach no single built-in distance measure will take into account that dimensions 1 & 2 should be compared as a pair using Euclidean distance, and dimensions 4, 5 and 6 should be compared by counting the common values between two vectors.
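
One way to handle this is a single custom measure that knows the index layout and treats each block of dimensions differently. The sketch below is plain Java over a double[] with the layout from the post (0 gender, 1-2 latitude/longitude, 3 age, 4 onward interests). The class name, the four weights and the decision to sum the parts are assumptions for illustration, and it does not implement Mahout's DistanceMeasure interface.

```java
/** Illustrative block-wise distance over the flat layout d[0..6] described above. */
final class FlatProfileDistance {
    static final int GENDER = 0;
    static final int LAT = 1;
    static final int LON = 2;
    static final int AGE = 3;
    static final int FIRST_INTEREST = 4;

    static double distance(double[] a, double[] b,
                           double wGender, double wLocation, double wAge, double wInterests) {
        if (a.length != b.length || a.length < FIRST_INTEREST) {
            throw new IllegalArgumentException("vectors must share the expected layout");
        }
        // Gender: simple mismatch, 0 or 1.
        double gender = a[GENDER] == b[GENDER] ? 0.0 : 1.0;
        // Location: dimensions 1 and 2 compared as one pair (Euclidean, in degrees).
        double location = Math.hypot(a[LAT] - b[LAT], a[LON] - b[LON]);
        // Age: absolute difference in years.
        double age = Math.abs(a[AGE] - b[AGE]);
        // Interests: number of positions where the two 0/1 flags differ.
        int differing = 0;
        for (int i = FIRST_INTEREST; i < a.length; i++) {
            if (a[i] != b[i]) {
                differing++;
            }
        }
        return wGender * gender + wLocation * location + wAge * age + wInterests * differing;
    }
}
```

The four parts are in different units (a flag, degrees, years, a count), so the weights have to do the balancing. Euclidean distance on raw degrees is also only a rough stand-in for real geographic distance.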

Re: Clustering user profiles

nirmal1kumar
Hi,

I am having the same problem. Please let me know if you have found any solution.

Thanks
Nirmal

Re: Clustering user profiles

Raviv Pavel
Remind me, what was the problem?



On Thu, Oct 17, 2013 at 3:43 PM, nirmal1kumar [via Lucene] <[hidden email]> wrote:
Hi,

I am having the same problem. Please let me know if you have found any solution.

Thanks
Nirmal




Re: Clustering user profiles

nirmal1kumar
My initial plan was to do exactly that, use 0 & 1 for gender, age as is, lat & lon in two dimensions, and one dimension holding 0 or 1 per possible interest (each value is mapped to an offset in the dimension)
For simplicity let's assume I have 3 types of interests, so a vector of a person would look like this:

d[0] = 1 (gender)
d[1] = 15.5 (latitude)
d[2] = 50.5 (longitude)
d[3] = 41 (age)
d[4] = 0 (not interested in A)
d[5] = 1 (interested in B)
d[6] = 0 (not interested in C)


I'm probably misunderstanding something here, but with this approach no single built-in distance measure will take into account that dimensions 1 & 2 should be compared as a pair using Euclidean distance, and dimensions 4, 5 and 6 should be compared by counting the common values between two vectors.

Re: Clustering user profiles

Raviv Pavel
It was a long time ago but I think I created a custom distance measure.


On Fri, Oct 18, 2013 at 1:56 PM, nirmal1kumar [via Lucene] <[hidden email]> wrote:
My initial plan was to do exactly that, use 0 & 1 for gender, age as is, lat & lon in two dimensions, and one dimension holding 0 or 1 per possible interest (each value is mapped to an offset in the dimension)
For simplicity let's assume I have 3 types of interests, so a vector of a person would look like this:

d[0] = 1 (gender)
d[1] = 15.5 (latitude)
d[2] = 50.5 (longitude)
d[3] = 41 (age)
d[4] = 0 (not interested in A)
d[5] = 1 (interested in B)
d[6] = 0 (not interested in C)


I'm probably misunderstanding something here, but with this approach no single built-in distance measure will take into account that dimensions 1 & 2 should be compared as a pair using Euclidean distance, and dimensions 4, 5 and 6 should be compared by counting the common values between two vectors.

