How to get the most frequent words for a set of documents in Lucene?
I'm trying to cluster documents that were indexed using Lucene 4.3. The
result of the clustering algorithm is a set of clusters, where each cluster
contains the most similar documents (I only store their docIDs in each
cluster). What I want is the most frequent words for each cluster.
So I query the Lucene index for each set of documents and then want to get
the most frequent words in those documents. How can I do this in Lucene?
I especially need an efficient way, because I'm clustering tweets in
What I was thinking about is to create a RAMDirectory, index each set of
documents into it, and then read the statistics for each term.
However, this is slow and uses a lot of memory!
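
To make the goal concrete, here is a minimal sketch of the per-cluster aggregation in plain Java (Java 8+), not Lucene API. It assumes the term counts of each document in a cluster are already available as a map from term to frequency, for instance read from stored term vectors if the field was indexed with them. The class name ClusterTopTerms and the method topTerms are hypothetical names used only for illustration.

```java
import java.util.*;

public class ClusterTopTerms {

    /**
     * Sums per-document term counts for one cluster and returns the k most
     * frequent terms, most frequent first.
     * Assumption: docTermCounts holds one term -> frequency map per document.
     */
    public static List<Map.Entry<String, Integer>> topTerms(
            Collection<Map<String, Integer>> docTermCounts, int k) {

        // Accumulate the total frequency of every term across the cluster.
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> doc : docTermCounts) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        // Orders entries from weakest to strongest: lower count first,
        // ties broken so that the alphabetically later term is weaker.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int c = Integer.compare(a.getValue(), b.getValue());
            return c != 0 ? c : b.getKey().compareTo(a.getKey());
        };

        // Bounded min-heap: keeps at most k entries, evicting the weakest.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll();
            }
        }

        // Return the survivors sorted from most to least frequent.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }
}
```

The bounded heap keeps only k candidates at a time, so selecting the top terms never requires sorting the whole vocabulary of a cluster. Only the summed counts for one cluster need to be held in memory, and no second index is built.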