How to get the most frequent words for a set of documents in Lucene?

Gucko
Hello all,

I'm trying to cluster documents that were indexed using Lucene 4.3. The
result of the clustering algorithm is a set of clusters, where each cluster
contains the most similar documents (I only store their docIDs in each
cluster). What I want is to get the most frequent words for each cluster.
So I query the Lucene index for the set of documents in a cluster, and then
I want to get the most frequent words across these documents. How can I do
this in Lucene? I especially need an efficient way, because I'm clustering
tweets in real time.
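
To make the goal concrete, here is a minimal sketch of the aggregation step in
plain Java, without any Lucene calls. The class ClusterTopTerms, the method
topTerms, and the map termFreqsByDoc are hypothetical names used only for
illustration. The sketch assumes that the per-document term frequencies are
already available; in Lucene 4.x they could probably be read from term vectors
(for example via IndexReader.getTermVector(docID, field)), but only if term
vectors were stored at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class ClusterTopTerms {

    // termFreqsByDoc maps docID -> (term -> frequency in that document).
    // Assumption: this map was filled beforehand, e.g. from stored term vectors.
    static List<Map.Entry<String, Integer>> topTerms(
            Map<Integer, Map<String, Integer>> termFreqsByDoc,
            List<Integer> clusterDocIds,
            int k) {
        // Sum the frequencies of every term over all documents of the cluster.
        Map<String, Integer> totals = new HashMap<>();
        for (int docId : clusterDocIds) {
            Map<String, Integer> freqs = termFreqsByDoc.get(docId);
            if (freqs == null) {
                continue; // no frequencies known for this document
            }
            for (Map.Entry<String, Integer> e : freqs.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Sort by total count, highest first; ties are broken alphabetically.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Integer>> tf = new HashMap<>();
        Map<String, Integer> doc0 = new HashMap<>();
        doc0.put("lucene", 2);
        doc0.put("index", 1);
        tf.put(0, doc0);
        // Print the two most frequent terms of a cluster holding only doc 0.
        System.out.println(topTerms(tf, Arrays.asList(0), 2));
    }
}
```

Sorting all terms is the simplest option; for large vocabularies a bounded
priority queue of size k would avoid the full sort.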

What I was thinking of is to create a RAMDirectory for each cluster, index
that cluster's documents into it, and then read the statistics for each term.
However, this is slow and uses a lot of memory!
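
For comparison, a different technique from the RAMDirectory idea would be to
keep running term counts per cluster and update them whenever a tweet is
assigned to a cluster, so nothing has to be re-indexed. The sketch below is
plain Java and only an illustration: ClusterTermCounter, addDocument, and count
are hypothetical names, and it assumes the tweet text has already been
tokenized (for example by the same analyzer that is used for indexing).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: running term counts per cluster.
class ClusterTermCounter {

    // clusterId -> (term -> number of occurrences seen so far in that cluster)
    private final Map<Integer, Map<String, Integer>> countsByCluster = new HashMap<>();

    // Adds the tokens of one document to the counts of the given cluster.
    // Assumption: the tokens were produced by an analyzer beforehand.
    void addDocument(int clusterId, Iterable<String> tokens) {
        Map<String, Integer> counts =
                countsByCluster.computeIfAbsent(clusterId, id -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
    }

    // Returns how often the term has been seen in the cluster (0 if never).
    int count(int clusterId, String term) {
        Map<String, Integer> counts = countsByCluster.get(clusterId);
        return counts == null ? 0 : counts.getOrDefault(term, 0);
    }
}
```

This trades memory for speed in a different way: only one counter map per
cluster is kept, but the counts would have to be adjusted if a document is
later moved to another cluster.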


Thanks in advance!


Gucko