Lucene search clusters

Lorenzo Viscanti
I'm writing to find people interested in creating a 'general purpose'
clustering extension for Lucene search results.
I wrote a simple implementation of clustering, and I would like to
contribute to Lucene development by releasing it as open source. I know
that each project may need a different implementation, but this would be
a useful basis for everyone developing their own project.
Is anyone interested?
Lorenzo

Re: Lucene search clusters

Lorenzo Viscanti
Some people have just replied, but I forgot the most important thing...
I'm thinking of this project as part of Google's Summer of Code program,
so I'm looking for other students.
I sent an email to Erik, and he told me that we can propose this as part
of Google's SoC if we find other people interested in it.
Lorenzo

Re: Lucene search clusters

Daniel Stephan
I am currently writing something about text retrieval using EM clustering.
The approach represents documents as high-dimensional vectors, but it is
not related to Lucene (yet?).
How would you add clustering to Lucene? I think it may be a very
interesting technique for improving search results, if it works. In my
experience so far, it scales rather badly to larger document collections.

I don't think I will take part in Google's SoC, as I have my own "summer
of code" right now. But I would certainly like to take part in discussions
about the topic, or at least read them and throw in my 2 cents now and then.

cheers
Daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Lucene search clusters

Lorenzo Viscanti
My approach uses the same technique, but I'm mostly using HAC (hierarchical
agglomerative) clustering.
I did manage to add clustering support to a Lucene-based application (a
customized solution), but I'd like to try to create a 'general purpose'
library. I know it ain't easy!
I've run into many scaling issues, but I saw that with optimized algorithms
you can get pretty good results. Reading Carrot2- and Lucene-related
messages, I figured out that I can cluster only the first n results,
avoiding performance issues that way.
Lucene offers good support for a clustering framework based on tf-idf
analysis (I haven't considered k-means or EM so far).
The most interesting problem is designing the architecture for such a
system, so that it is general purpose but also very efficient.
Thanks,
Lorenzo
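
As an illustration of the approach described in this message, here is a
minimal Python sketch. It is not Lucene code and not Lorenzo's
implementation: it takes the first n hits as plain strings, builds tf-idf
vectors, and merges clusters with average-link agglomerative clustering
until the best similarity drops below a threshold. The function names, the
whitespace tokenizer, and the default threshold are assumptions made for
the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse tf-idf vectors (dict term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # A term that occurs in every document gets weight 0 (log(1) == 0).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(vectors, threshold):
    """Average-link agglomerative clustering; returns lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]

    def link(c1, c2):
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # nothing left that is similar enough to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def cluster_top_n(hits, n=100, threshold=0.1):
    """Cluster only the first n hits, which bounds the pairwise comparisons."""
    docs = [h.lower().split() for h in hits[:n]]
    return hac(tfidf_vectors(docs), threshold)
```

The naive merge loop recomputes every pairwise link on each pass, so it is
only practical for a few hundred hits, which is exactly why restricting the
input to the first n results matters.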

Re: Lucene search clusters

Dawid Weiss
In reply to this post by Lorenzo Viscanti

Hi Lorenzo,

Search the list's archives -- I posted some glue code that lets Lucene
results be clustered with Carrot2 clusterers (there are a few
implementations there).

http://java2.5341.com/msg/82310.html

The official Web site of the project is at:

http://carrot2.sourceforge.net/

You'll find some links to demo pages there as well.

Let me know if you have any questions or concerns (or if you want to
contribute in any way).

Dawid



Re: Lucene search clusters

Lorenzo Viscanti
In reply to this post by Daniel Stephan
Daniel, could you explain to me why you are using EM clustering? Is there
a particular field or use case where that technique works best?
I don't have any experience with EM and would like to learn more about it
(I'm just studying some papers...)
Thanks,
Lorenzo

Re: Lucene search clusters

Falko Guderian
In reply to this post by Lorenzo Viscanti
You can add the WEKA packages (http://www.cs.waikato.ac.nz/ml/weka/).
WEKA has an EM clusterer.

-Falko

Re: Lucene search clusters

Lorenzo Viscanti
I see some noise about clustering and Lucene, but I'm still waiting for
someone who will help me create a clustering extension.
I know both Carrot2 and WEKA (the first can be integrated with Lucene, the
latter maybe can too - Falko, can you tell me?), but I would like to write
something that could be included in the sandbox (or similar), using
whichever implementation we find best for a general purpose environment.
Maybe Carrot2 or another library will turn out to be the best one (I really
hope so, I'm a lazy coder ;-) ), in which case we will simply ask Dawid to
extend his code, but first I want to run some tests.
bye
Lorenzo

Re: Lucene search clusters

Dawid Weiss

You should state your requirements clearly:

1. What data do you want to cluster? (whole index / search results)
2. What is the role of the extension? How is it going to be used?
(front-end clusters, query refinement, etc.)
3. Do you need the implementation or an API for clustering in the
source code? (I'd personally stick to the API; there are many products
out there that perform clustering. Carrot2 is no exception -- there is
an excellent (in my humble opinion :) open source clustering algorithm,
Lingo, but there is also a commercial component that is much faster and
more customizable. You can start off with an open source clusterer and
then switch to a commercial product if you want higher scalability or
different functionality. I implemented such an API in Nutch -- take a
look at its source code for hints.)
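
One way to read the "API rather than implementation" advice in point 3 is
sketched below in Python. The names `Clusterer`, `Cluster`,
`FirstWordClusterer`, and `render` are hypothetical and are not taken from
Nutch or Carrot2; the point is only that the search front end depends on a
small interface, so one clustering engine can be swapped for another.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A labelled group of hits, referenced by position in the result list."""
    label: str
    members: list = field(default_factory=list)


class Clusterer(ABC):
    """Hypothetical extension point: any clustering engine implements this."""

    @abstractmethod
    def cluster(self, hits):
        """Group a list of (title, snippet) pairs into Cluster objects."""


class FirstWordClusterer(Clusterer):
    """Toy stand-in for a real algorithm: groups hits by the first title word."""

    def cluster(self, hits):
        groups = {}
        for position, (title, _snippet) in enumerate(hits):
            words = title.lower().split()
            key = words[0] if words else ""
            groups.setdefault(key, Cluster(label=key)).members.append(position)
        return list(groups.values())


def render(hits, clusterer):
    """Front-end code sees only the Clusterer interface, never the engine."""
    return {c.label: [hits[i][0] for i in c.members] for c in clusterer.cluster(hits)}
```

Replacing `FirstWordClusterer` with a wrapper around an open source or
commercial engine would leave `render` unchanged, which is the property the
message argues for.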

Dawid

Re: Lucene search clusters

Lorenzo Viscanti
First, thanks for your reply.
I was thinking about adding some extra clustering functionality to
Lucene. I wrote a clustering engine that applies HAC/AHC and k-means
algorithms to Lucene search results. That work is part of a customized
solution, so I decided to write some general code. Right now I'm looking
at the class com/mwroblewski/carrot/filter/ahcfilter/AHCFilter from Carrot2
and found it to be very similar to my work ;-)
My aim is to provide some basic clustering functionality for Lucene search
results. Carrot2 offers a lot of functionality, such as many inputs; I'm
just trying to offer a simpler (much simpler!) clustering option for Lucene
users.
I hope I can get some good advice from you!
ciao,
Lorenzo


Re: Lucene search clusters

Dawid Weiss

Lorenzo... Did you take a look at the mail I posted before? There was a
ready-to-use clustering integration for Lucene there. It _is_ simple. I
don't know what you mean by "much simpler" -- much simpler to use? You
really don't have to know all of the Carrot2 code to use it. You build or
fetch a precompiled clustering component and add an input to it. This is
basically what I did before, and it took just a few lines of code to
integrate clustering into Lucene, so I can hardly call it difficult.

AHC and k-means are all classic clustering algorithms, so their
implementation will always look similar. They, unfortunately, need a lot
of tuning to get the results right. With Lingo clustering you can avoid
much of that work.

But anyway, if you're up to the challenge, feel free to contribute in
any way you want to. I'll be glad to compare your clustering to ours.

D.

Re: Lucene search clusters

Daniel Stephan
In reply to this post by Lorenzo Viscanti
My experience is also limited and stems mostly from having read some
papers with promising results. I moved from k-means to EM because I was
hoping that it would be able to model more complex relationships in my
data. After all, EM uses multivariate Gaussians, so its results should
mirror reality more closely (in other words: in reality there is little
black and white, it's all shades of gray, and the Gaussian probability
distributions should model that).
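
The "black and white" versus "shades of gray" distinction can be
illustrated with a small Python sketch. It is not a full EM implementation:
it only contrasts the hard assignment step of k-means with the soft
membership probabilities computed in the E-step of EM. For brevity it
assumes one-dimensional data, equal mixing weights, and a shared variance;
those simplifications and the function names belong to this example only.

```python
import math


def hard_assign(x, means):
    """k-means style: the point belongs entirely to the nearest mean."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))


def soft_assign(x, means, variance=1.0):
    """EM E-step style: a membership probability for every component.

    Assumes Gaussian components with equal weights and a shared variance,
    so the normalizing constants cancel. Points very far from all means can
    underflow to zero; production code would work in log space instead.
    """
    densities = [math.exp(-((x - m) ** 2) / (2.0 * variance)) for m in means]
    total = sum(densities)
    return [d / total for d in densities]
```

A point halfway between two means is forced into one cluster by
`hard_assign`, while `soft_assign` reports it as belonging equally to both.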

My results are currently rather mixed, though. I think I may have too
much noise in the data. It seems to be very important to get the input
right, shit in - shit out :-).

cheers
Daniel



Re: Lucene search clusters

Dawid Weiss

> right, shit in - shit out :-).

True. But in most cases clustering of search results can yield sensible
clusters. Try, for example:

http://demo.carrot-search.com/carrot2-remote-controller/newsearch.do?query=chips&processingChain=carrot2.process.lingo-cluster-odp&resultsRequested=200

We in fact use Lucene for this demo (indexing ODP categories) --

http://www.carrot-search.com/demos.html

An open source clustering component isn't much worse (with Google
serving as the data source):

http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=chips&processingChain=carrot2.process.lingo-google-en&resultsRequested=100

Compare it with (same algorithm) AllTheWeb:

http://carrot.cs.put.poznan.pl/carrot2-remote-controller/newsearch.do?query=chips&processingChain=carrot2.process.lingo-alltheweb-en&resultsRequested=100

As you said -- much depends on the data, but there is also a lot of
room for the clustering algorithm to make a difference (try identical
inputs with different algorithms and you'll see).

D.

Re: Lucene search clusters

Daniel Stephan
Your application works very well, congrats! May I ask what the input
looks like? How are the terms selected, and how do you model phrases? Do
you handle titles differently from the short summaries?

What I am doing is this: I remove stopwords, stem terms using Snowball's
default English stemmer, and then build feature vectors for the selected
terms. I don't have information about phrases in there yet.
I ask because the descriptions of your clusters are very nice. How are
they done? (I know you are using SVD to do it, and so am I, but I only
have single terms, and you have nicely formulated phrases.)
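
The preprocessing steps listed in this message (stopword removal, stemming,
feature vectors) can be sketched in Python as follows. The stopword list is
a tiny illustrative sample, and `crude_stem` is a deliberately naive suffix
stripper standing in for the Snowball English stemmer, whose output will
differ. All names here are made up for the example.

```python
from collections import Counter

# Tiny illustrative stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def crude_stem(word):
    """Naive suffix stripping; a placeholder for a real Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, drop stopwords, stem what is left."""
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    return [crude_stem(t) for t in tokens if t and t not in STOPWORDS]


def feature_vectors(texts):
    """Return the sorted vocabulary and one term-count vector per text."""
    docs = [Counter(preprocess(t)) for t in texts]
    vocabulary = sorted(set().union(*docs)) if docs else []
    return vocabulary, [[d[term] for term in vocabulary] for d in docs]
```

Because each token is handled on its own, word order is discarded, which
is why vectors built this way carry no phrase information.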

Cheers
Daniel


Re: Lucene search clusters

Dawid Weiss

> Your application works very well, congrats! May I ask how the input is
> looking? How are the terms selected, how do you model phrases? Do you
> handle titles different from the short summaries?

Only search results (snippets and titles) are used.

> I ask, because the descriptions of your clusters are very nice. How are
> they done?

You'll find some answers here:

http://www.cs.put.poznan.pl/dweiss/site/publications/download/iipwm-osinski-weiss-stefanowski-2004-lingo.pdf

D.


Re: Lucene search clusters

Falko Guderian-2
In reply to this post by Lorenzo Viscanti
You have to combine Lucene and WEKA on your own.
I don't know of an open source implementation or any other tool that does it.
Sorry, you have to write a wrapper.
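
Such a wrapper would mostly be a matter of moving term vectors from one
side to the other. As one possible starting point, the Python sketch below
renders term-count vectors as text in WEKA's ARFF file format, which WEKA
can load. How the vectors are obtained from a Lucene index is left out, and
the function name and the relation name used in the example are made up.

```python
def to_arff(relation, vocabulary, vectors):
    """Render dense term-count vectors as ARFF text, one numeric attribute per term."""
    lines = ["@relation " + relation, ""]
    for term in vocabulary:
        # Assumes terms are plain words without quotes, commas, or whitespace;
        # anything else would need quoting before it is written out.
        lines.append("@attribute " + term + " numeric")
    lines.append("")
    lines.append("@data")
    for vector in vectors:
        if len(vector) != len(vocabulary):
            raise ValueError("vector length does not match vocabulary size")
        lines.append(",".join(str(value) for value in vector))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file with an `.arff` extension gives a
data set that can be opened in WEKA and passed to its clusterers.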

-Falko
