kMeans


kMeans

Marko Novakovic
Is it a good idea to apply with a project for integrating the kMeans
algorithm for clustering web pages?



Re: kMeans

shunkai.fu
There is already a project called Carrot2 that focuses on this topic.

-----Original Message-----
From: Marko Novakovic [mailto:[hidden email]]
Sent: March 27, 2008 7:03
To: [hidden email]
Subject: kMeans

Is good idea to apply project for integrating kMeans
algorithm to clustering web pages?


 

Re: kMeans

Ted Dunning-3
In reply to this post by Marko Novakovic

K-means can be used to cluster web sites if you use a cosine measure of
similarity based on content.
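
As a rough illustration of the cosine-based approach, here is a minimal
Python sketch of k-means over unit-length term-frequency vectors (often
called spherical k-means). It is not Mahout code. The function names, the
whitespace tokenizer, the fixed seed indices and the sample pages are all
assumptions made for the example.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse term -> weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(docs, seeds, iterations=10):
    """Cluster documents by cosine similarity.

    `seeds` holds the indices of the documents used as initial centroids
    (a hypothetical, deterministic initialisation for this sketch).
    """
    vectors = [normalize(Counter(d.lower().split())) for d in docs]
    centroids = [vectors[i] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: sum the member vectors and renormalize to unit length.
        for c in range(len(centroids)):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # an empty cluster keeps its previous centroid
                centroids[c] = normalize(total)
    return labels


# Made-up page contents, for illustration only.
pages = [
    "python code tutorial python",
    "python code examples",
    "football match score",
    "football score results match",
]
labels = spherical_kmeans(pages, seeds=[0, 2])
```

With these sample pages, the two programming pages end up in one cluster
and the two football pages in the other. Real web pages would need proper
tokenization and term weighting first.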

You can also use the first few eigenvectors of the linkage graph to do
spectral clustering (this will essentially be a strongly connected component
analysis).
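
To make the eigenvector idea concrete, here is a small Python sketch that
splits a link graph in two using the sign of a single eigenvector. It
rests on several assumptions that are not in the message above: links are
treated as undirected, the graph is connected, only one eigenvector is
used (the one for the second-smallest eigenvalue of the graph Laplacian),
and it is found by plain power iteration. The function name and the sample
graph are hypothetical.

```python
import math


def fiedler_partition(n, edges, iterations=200):
    """Split an undirected, connected graph into two groups of nodes."""
    neighbours = [set() for _ in range(n)]
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    degree = [len(s) for s in neighbours]
    # 2 * max degree is an upper bound on the largest Laplacian eigenvalue.
    shift = 2 * max(degree)

    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Project out the constant vector (the eigenvector for eigenvalue 0).
        mean = sum(x) / n
        x = [v - mean for v in x]
        # Multiply by (shift * I - L), where L = D - A is the graph Laplacian.
        x = [
            (shift - degree[i]) * x[i] + sum(x[j] for j in neighbours[i])
            for i in range(n)
        ]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    # The sign of each component decides which side a node falls on.
    return [1 if v > 0 else 0 for v in x]


# Two tightly linked groups of pages joined by a single link (made-up data).
links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
groups = fiedler_partition(6, links)
```

On this sample graph the split falls on the single link between the two
triangles. Using several eigenvectors, as suggested above, would mean
running k-means on the rows of the eigenvector matrix instead of taking
one sign.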

Browse logs can also give you clusters if you look at which pages are
viewed together during particular sessions.  This should mostly replicate
the linkage graph analysis.
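
The session-based idea can be sketched as counting how often two pages
are viewed in the same session. The log format (one list of page paths
per session), the function name and the sample data below are assumptions
made for the example.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count, for each pair of pages, the sessions in which both were viewed."""
    counts = Counter()
    for pages in sessions:
        # Deduplicate within a session and sort so each pair has one key.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical browse log: one list of viewed pages per session.
sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b", "/a"],
    ["/c", "/d"],
]
counts = co_view_counts(sessions)
```

The resulting counts can serve as edge weights of a page graph, which can
then be clustered in the same way as the link graph.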


On 3/26/08 4:02 PM, "Marko Novakovic" <[hidden email]> wrote:

> Is good idea to apply project for integrating kMeans
> algorithm to clustering web pages?

Re: kMeans

Khalil Honsali
Hello,

Are there any relevant papers or work on index clustering (as opposed to
search results clustering)? I wonder whether queries would be affected if
the index were clustered and distributed somehow.

K. Honsali

On 27/03/2008, Ted Dunning <[hidden email]> wrote:

>
>
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
>
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected
> component
> analysis).
>
> Using browse logs can also give you clusters if you look at common viewing
> of pages during particular sessions.  This should mostly replicate the
> linkage graph analysis.

Re: kMeans

Marko Novakovic
In reply to this post by Ted Dunning-3
OK, thanks for the information.

I have written an application on this topic.
I would like to know whether this topic is acceptable for Google
Summer of Code.


Re: kMeans

Marko Novakovic
In reply to this post by Khalil Honsali
Try looking at IEEE Computer from 2007. It covers a
lot of phenomena related to indexing results. Maybe you
can find a good reference there about
index clustering.


Re: Re: kMeans

Dawid Weiss
In reply to this post by shunkai.fu

Carrot2 is for clustering web search results -- it's not exactly the same thing.

D.


Re: kMeans

Marko Novakovic
Is it an acceptable solution for Google Summer of Code?


Re: kMeans

Dawid Weiss

Hi Marko,

> Is it acceptable solution for Google Summer of Code?

I don't think it's an acceptable project for Mahout -- Mahout's goals lie in
processing large data sets, supported by Map-Reduce. Clustering search results
is usually in-memory, on-line clustering with few information sources (titles,
snippets) and, as a result, high noise.

That said, what I envisage could be done is to work on data structures that
could _support_ sensible on-line faceting/clustering of search results,
similarly to what Google supposedly does behind the scenes to reorder search
results (similar concept clustering). Building semantic relationships between
terms or detecting frequently recurring phrases with significantly different
meanings is definitely interesting and challenging (if not done naively),
especially at large scale.

Dawid

Re: kMeans

Karl Wettin
In reply to this post by Marko Novakovic
Marko Novakovic wrote:
> Is good idea to apply project for integrating kMeans
> algorithm to clustering web pages?

(Your question is better suited to the users forum than the dev forum.)

It depends on your needs, so you need to be more specific about how you
plan to use the results in order to get a good answer.

But choosing an algorithm to extract clusters is only half of your
problem. You also need to transform the web pages into instance data
accepted by the clusterer. How much effort do you want to put into that?
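
To give a feel for that transformation step, here is a minimal Python
sketch that strips markup from a page, tokenizes the remaining text and
builds TF-IDF weighted term vectors. It is only one possible pipeline.
The class and function names, the tokenization rule and the sample pages
are assumptions, and a real crawler would also have to deal with
encodings, boilerplate navigation text, stop words and stemming.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokens(html):
    """Return the lower-cased alphanumeric tokens of a page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())


def tfidf(pages):
    """Turn raw HTML pages into sparse term -> weight vectors."""
    counts = [Counter(tokens(p)) for p in pages]
    # Document frequency: the number of pages each term occurs in.
    df = Counter(t for c in counts for t in c)
    n = len(pages)
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]


# Made-up pages, for illustration only.
sample_pages = [
    "<html><body><h1>Apache Mahout</h1><script>var x = 1;</script></body></html>",
    "<p>Apache Lucene</p>",
]
vectors = tfidf(sample_pages)
```

Terms that occur on every page (here "apache") get weight zero, so the
vectors emphasise what distinguishes one page from another.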


     karl


Re: kMeans

Karl Wettin
In reply to this post by Marko Novakovic
At the same time as I sent my reply, I received all the other replies,
which I had not read yet :)

Index clustering (was: kMeans)

Karl Wettin
In reply to this post by Khalil Honsali
Khalil Honsali wrote:
> Hello,

Hi Khalil,

> Is there any relevant papers/work about index-clustering (not search results
> clustering) ? I wonder if it will impact queries if index is clustered and
> distributed somehow?

LUCENE-1025 is a hierarchical clusterer that I later refactored to
persist the tree in a BDB, so I could build a cluster of a complete index
that could come up with "more like this" suggestions in an instant. It
was sort of slow, but the results were not too bad. I never compared it
with anything else, though. It never became more than a proof of concept.
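
For readers unfamiliar with hierarchical clustering, here is a naive
Python sketch of building such a tree from pairwise similarities. It is
not the LUCENE-1025 code. Single linkage, the in-memory nested-tuple tree,
the function name and the sample similarity matrix are all assumptions
made for the example, and the approach shown is quadratic or worse in the
number of documents.

```python
def agglomerate(names, sim):
    """Merge the two most similar clusters until one tree remains.

    `sim[i][j]` is the similarity between documents i and j. Clusters are
    compared by their most similar pair of members (single linkage).
    """
    # Each cluster is a pair: (tree built so far, indices of its documents).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = max(
                    sim[i][j] for i in clusters[a][1] for j in clusters[b][1]
                )
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        merged = (
            (clusters[a][0], clusters[b][0]),
            clusters[a][1] + clusters[b][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up similarities between four documents.
docs = ["a", "b", "c", "d"]
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
tree = agglomerate(docs, similarity)
```

In a tree like this, the documents sharing the closest subtree with a
given document are natural "more like this" candidates, which is why
looking them up can be fast once the tree has been built.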

I'm looking at reimplementing this for Mahout, but I have a hard time
figuring out whether building the tree is something one wants to do (or
even can do) using map reduce. The more I think about it, the more I want
to solve it with a grid.



     karl