I am currently trying to create clusters from a group of 50,000 strings that contain product descriptions (around 70-100 characters each).
That group of 50,000 consists of roughly 5,000 individual products with ten varying product descriptions per product. The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc.
What would be a good approach to maximise the number of clusters found? (The best possible result would be 5,000 clusters with 10 descriptions each.)
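To make that target concrete, here is a minimal sketch of how a result could be compared against the ideal outcome. It assumes the cluster assignments have been exported as a plain list with one cluster id per description; that format and the helper name `cluster_summary` are hypothetical and are not something Mahout produces directly.

```python
from collections import Counter


def cluster_summary(assignments, expected_size=10):
    """Summarise a clustering given one cluster id per description.

    `assignments` is a hypothetical export format: a list in which
    element i is the cluster id assigned to description i.
    """
    sizes = Counter(assignments)  # cluster id -> number of members
    exact = sum(1 for n in sizes.values() if n == expected_size)
    return {
        "clusters": len(sizes),                    # ideal: number of products
        "exact_size": exact,                       # clusters with exactly expected_size members
        "largest": max(sizes.values(), default=0), # a large value hints at merged products
    }


# Toy data: 3 products with 10 descriptions each, perfectly separated.
ideal = [product_id for product_id in range(3) for _ in range(10)]
print(cluster_summary(ideal))

# Toy data: two of the three products were merged into one cluster.
merged = [0] * 20 + [1] * 10
print(cluster_summary(merged))
```

On the real data, the ideal result would report 5,000 clusters, all of them with exactly 10 members.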
I adapted the Reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count.
The question is: which of the available Mahout settings do I need to tweak so that the clusters are created as precisely as possible?