Document clustering using lucene


Document clustering using lucene

Prasenjit Mukherjee-3
I want to do some document clustering on a corpus of ~100,000
documents, with an average document size of about 7 KB. I have looked
into carrot2, but it seems to work only for relatively short documents
and has some scaling issues for a large corpus. For a corpus of this
size one certainly cannot use a purely memory-based clustering
algorithm, hence the possible use of Lucene.

I was thinking of using Lucene to create the similarity matrix between
documents. Before adding a document D-k to the Lucene index, we can
compute its similarity to every existing document by building a Query
out of D-k and searching the existing index. The score of each hit can
serve as the similarity between that document and D-k. The result is a
symmetric, sparse matrix, which can then be fed to any
similarity-based clustering algorithm.
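
Once such a matrix exists, the "feed it to any similarity-based
clustering algorithm" step could look like the following minimal
sketch: single-link grouping via union-find over a hand-filled sparse
matrix. This is plain Java with no Lucene dependency; the class name,
the threshold, and the toy matrix are all mine, and in practice each
row would come from the per-document search described above.

```java
import java.util.*;

public class SimilarityClustering {

    // Union-find over document ids, with path halving.
    static int[] parent;

    static int find(int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    static void union(int a, int b) { parent[find(a)] = find(b); }

    // Group documents whose pairwise similarity exceeds `threshold`
    // (single-link: any strong edge merges the two clusters).
    static Map<Integer, List<Integer>> cluster(double[][] sim, double threshold) {
        int n = sim.length;
        parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (sim[i][j] > threshold) union(i, j);
        Map<Integer, List<Integer>> clusters = new TreeMap<>();
        for (int i = 0; i < n; i++)
            clusters.computeIfAbsent(find(i), k -> new ArrayList<>()).add(i);
        return clusters;
    }

    public static void main(String[] args) {
        // Toy 4-document matrix: docs 0/1 are similar, docs 2/3 are similar.
        double[][] sim = {
            {1.0, 0.8, 0.1, 0.0},
            {0.8, 1.0, 0.0, 0.1},
            {0.1, 0.0, 1.0, 0.7},
            {0.0, 0.1, 0.7, 1.0},
        };
        System.out.println(cluster(sim, 0.5).values()); // prints [[0, 1], [2, 3]]
    }
}
```

Because the matrix is symmetric, only the upper triangle is scanned;
a real implementation would iterate a sparse row representation rather
than a dense double[][].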

I would like to know whether anyone has worked along similar lines and
is happy to share their experiences.

thanks,
Prasen



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Document clustering using lucene

Paul Elschot
On Thursday 15 June 2006 13:50, Prasenjit Mukherjee wrote:

> I want to do some document clustering on a corpus of ~100,000
> documents, with an average document size of about 7 KB. I have looked
> into carrot2, but it seems to work only for relatively short documents
> and has some scaling issues for a large corpus. For a corpus of this
> size one certainly cannot use a purely memory-based clustering
> algorithm, hence the possible use of Lucene.
>
> I was thinking of using Lucene to create the similarity matrix between
> documents. Before adding a document D-k to the Lucene index, we can
> compute its similarity to every existing document by building a Query
> out of D-k and searching the existing index. The score of each hit can
> serve as the similarity between that document and D-k. The result is a
> symmetric, sparse matrix, which can then be fed to any
> similarity-based clustering algorithm.
>
> I would like to know whether anyone has worked along similar lines and
> is happy to share their experiences.

Did you look into indexing a TermVector for each document?
It is easy to compute an element of a similarity matrix from two
term vectors.
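
A minimal sketch of that computation: given the term-to-frequency maps
that two stored term vectors expose, one matrix element is just their
cosine. Plain Java maps stand in for Lucene's term vector API here,
and the class and method names are illustrative, not Lucene's.

```java
import java.util.*;

public class TermVectorCosine {

    // Cosine similarity of two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer fb = b.get(e.getKey());
            if (fb != null) dot += e.getValue() * (double) fb;
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double s = 0.0;
        for (int f : v.values()) s += (double) f * f;
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = Map.of("lucene", 3, "index", 2, "cluster", 1);
        Map<String, Integer> d2 = Map.of("lucene", 1, "cluster", 2, "corpus", 4);
        System.out.printf("sim = %.3f%n", cosine(d1, d2));
    }
}
```

Only terms the two documents share contribute to the dot product, so
the cost of one matrix element is linear in the smaller vector, which
is what makes the full sparse matrix feasible at this corpus size.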

Regards,
Paul Elschot



RE: Document clustering using lucene

John Hamilton
In reply to this post by Prasenjit Mukherjee-3
I've been thinking about a similar problem. However, it seems that the similarity score returned by a search is only meaningful within that search's result set; you can't compare the scores from two different searches. I think you will have to compute the similarities yourself using the term vectors.
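
One way to see why the term-vector route gives comparable numbers
where raw retrieval scores do not: the cosine is bounded in [0, 1] and
invariant to document length, so values from different document pairs
live on the same scale. A small illustrative sketch (plain Java; the
names are mine, not a Lucene API):

```java
import java.util.*;

public class ScoreComparability {

    // Cosine similarity of two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer fb = b.get(e.getKey());
            if (fb != null) dot += (double) e.getValue() * fb;
        }
        for (int f : b.values()) nb += (double) f * f;
        return dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = Map.of("lucene", 2, "cluster", 1);
        // The same document repeated twice: every frequency doubled.
        Map<String, Integer> doubled = Map.of("lucene", 4, "cluster", 2);
        // Cosine ignores length, so the pair compares as identical.
        System.out.println(cosine(doc, doubled)); // prints 1.0
    }
}
```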

-John


