Returning a minimum number of clusters

Returning a minimum number of clusters

Marvin Humphrey
Greets,

I'm toying with the idea of implementing clustering of search results  
based on comparison of document vectors constrained by field.  For  
instance, you could cluster based on "topic", or "domain", or  
"content".  "domain" would be easy, as it's presumably a single value  
field.  "content" would be much more involved.

The problem I'm trying to solve is how to return a minimum number of  
clusters from a search.  Say the most relevant 100 documents for a  
query are all from the same domain, but you want a maximum of two  
results per domain, a la Google.  I don't see any alternative to  
rerunning the query an indeterminate number of times until you've  
accumulated sufficient clusters, because the search logic doesn't  
know what cluster a document belongs to until the document vector is  
retrieved.

Is there a better way?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Returning a minimum number of clusters

Grant Ingersoll
You might be interested in the Carrot project, which has some Lucene
support.  I don't know if it solves your second problem, but it already
implements clustering and may help you get to an answer more quickly.
I have just recently started using it for a clustering task I am
working on related to search results.  I think the author of Carrot is
on the user list from time to time.




--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886



Re: Returning a minimum number of clusters

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:
> The problem I'm trying to solve is how to return a minimum number of  
> clusters from a search.  Say the most relevant 100 documents for a  
> query are all from the same domain, but you want a maximum of two  
> results per domain, a la Google.  I don't see any alternative to  
> rerunning the query an indeterminate number of times until you've  
> accumulated sufficient clusters, because the search logic doesn't  know
> what cluster a document belongs to until the document vector is  retrieved.
>
> Is there a better way?

Nutch implements host-deduping roughly as follows:

To fetch the first 10 hits it first asks for the top-scoring 20 or so.
Then it uses a field cache to reduce this to just two from each host.
If it runs out of raw hits, then it re-runs the query, this time for the
top scoring 40 hits.  But the query is modified this time to exclude
matches from hosts that have already returned more than two hits.
(Nutch also automatically converts clauses like "-host:foo.com" into
cached filters when "foo.com" occurs in more than a certain percentage
of documents.) Thus, in the worst case, it could take five queries to
return the top ten hits, but in practice I've never seen more than
three, and the re-query rate is usually quite low.  Since raw hits are
cheap to compute and, with a field cache, the host filtering is also
fast, one can reduce the re-query rate with little performance impact
simply by starting with a larger number of raw hits.
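
In rough Java it looks something like this -- an illustrative sketch,
not the actual Nutch code; the field name, the initial over-fetch
factor, and the doubling policy are all simplifications:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class HostDeduper {
      /** Return numHits hits, keeping at most maxPerHost per host. */
      public static ScoreDoc[] topDeduped(IndexSearcher searcher,
                                          Query query, int numHits,
                                          int maxPerHost)
          throws IOException {
        IndexReader reader = searcher.getIndexReader();
        // One host string per document, loaded once and cached.
        String[] hosts = FieldCache.DEFAULT.getStrings(reader, "host");
        int raw = numHits * 2;                  // over-fetch to start
        while (true) {
          TopDocs top = searcher.search(query, null, raw);
          Map<String, Integer> perHost = new HashMap<String, Integer>();
          List<ScoreDoc> keep = new ArrayList<ScoreDoc>();
          for (int i = 0;
               i < top.scoreDocs.length && keep.size() < numHits; i++) {
            String host = hosts[top.scoreDocs[i].doc];
            Integer seen = perHost.get(host);
            int n = (seen == null) ? 0 : seen.intValue();
            if (n < maxPerHost) {               // room left for this host
              keep.add(top.scoreDocs[i]);
              perHost.put(host, new Integer(n + 1));
            }
          }
          // Done if the page is full, or the index has no more raw hits.
          if (keep.size() >= numHits || top.scoreDocs.length < raw)
            return keep.toArray(new ScoreDoc[keep.size()]);
          raw *= 2;  // re-run deeper; Nutch also excludes saturated
                     // hosts with -host: clauses at this point
        }
      }
    }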

BTW, clustering in Information Retrieval usually implies grouping by
vector distance using statistical methods:

http://en.wikipedia.org/wiki/Data_clustering

Doug


Re: Returning a minimum number of clusters

Marvin Humphrey

On May 1, 2006, at 10:38 AM, Doug Cutting wrote:

> Nutch implements host-deduping roughly as follows:
>
> To fetch the first 10 hits it first asks for the top-scoring 20 or  
> so. Then it uses a field cache to reduce this to just two from each  
> host. If it runs out of raw hits, then it re-runs the query, this  
> time for the top scoring 40 hits.  But the query is modified this  
> time to exclude matches from hosts that have already returned more  
> than two hits. (Nutch also automatically converts clauses like "-
> host:foo.com" into cached filters when "foo.com" occurs in more  
> than a certain percentage of documents.)

Is that an optimization which only works for Nutch and hosts, or is  
it something that could be generalized and implemented sanely in Lucene?

> Thus, in the worst case, it could take five queries to return the  
> top ten hits, but in practice I've never seen more than three, and  
> the re-query rate is usually quite low.  Since raw hits are cheap
> to compute and, with a field cache, the host filtering is also
> fast, one can reduce the re-query rate with little performance
> impact simply by starting with a larger number of raw hits.

Great, thanks, it's good to know that in practice rerunning the  
queries is not much of a concern.

> BTW, clustering in Information Retrieval usually implies grouping  
> by vector distance using statistical methods:
>
> http://en.wikipedia.org/wiki/Data_clustering

Exactly.  I'd scanned this, but I haven't yet familiarized myself  
with the different models.

It may be possible for both keyword fields, e.g. "host", and
non-keyword fields, e.g. "content", to be clustered using the same
algorithm behind an interface like Hits.cluster(String fieldname, int
docsPerCluster): retrieve each hit's term vector for the specified
field, map the docs into a unified term space, then cluster.  For
"host" or any other keyword field, the boundaries will be stark and
the cost of calculation negligible.  For "content", a more
sophisticated model will be required to group the docs and the cost
will be greater.
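
Concretely, the vector-building half might look like this in Java -- a
sketch against Lucene's TermFreqVector API, with names of my own
invention; the clustering step itself is left as the hard part:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class FieldVectors {

      // Map one doc's term vector for a field into term -> frequency.
      public static Map<String, Integer> vector(IndexReader reader,
                                                int docId, String field)
          throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        Map<String, Integer> v = new HashMap<String, Integer>();
        if (tfv != null) {
          String[] terms = tfv.getTerms();
          int[] freqs = tfv.getTermFrequencies();
          for (int i = 0; i < terms.length; i++)
            v.put(terms[i], new Integer(freqs[i]));
        }
        return v;
      }

      // Cosine similarity in the unified term space.  For a keyword
      // field like "host" this is simply 1.0 or 0.0; for "content"
      // it is graded, and a real clusterer takes over from here.
      public static double cosine(Map<String, Integer> a,
                                  Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
          Integer fb = b.get(e.getKey());
          int fa = e.getValue().intValue();
          if (fb != null) dot += fa * fb.intValue();
          normA += fa * fa;
        }
        for (Integer f : b.values()) normB += f.intValue() * f.intValue();
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
    }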

It is more expensive to calculate similarity based on the entire  
document's contents rather than just a snippet chosen by the  
Highlighter.  However, it's presumably more accurate, and having the  
Term Vectors pre-built at index time should help quite a bit.  As the  
number of terms increases, there is presumably a point at which the  
cost becomes too great, but it might be a pretty large number of  
terms.  Dunno yet.  It might make sense to have a "clusterContent"
field, a truncated version of "content" that is vectored but neither
stored nor indexed.
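
In Lucene terms that might look like the sketch below.  One wrinkle:
Lucene only records term vectors for indexed fields, so the nearest
approximation to "vectored but neither stored nor indexed" is unstored
but tokenized, and the truncation length here is an arbitrary guess:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ClusterField {
      // Add a truncated copy of "content" that exists only to carry
      // a term vector.  The 10,000-char cap is a placeholder.
      public static void addClusterField(Document doc, String content) {
        String truncated =
            content.substring(0, Math.min(content.length(), 10000));
        doc.add(new Field("clusterContent", truncated,
                          Field.Store.NO,          // don't store the text
                          Field.Index.TOKENIZED,   // vectors need indexing
                          Field.TermVector.YES));  // build the term vector
      }
    }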

After that, there's also the issue of generating cluster labels.  
Lots of problems to be solved.  But it seems to me that if the term  
vectors are already there, that's an excellent start -- and if you're  
using them for highlighting, you get the disk seeks for free.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Returning a minimum number of clusters

Marvin Humphrey
In reply to this post by Grant Ingersoll

On May 1, 2006, at 10:21 AM, Grant Ingersoll wrote:

> You might be interested in the Carrot project, which has some  
> Lucene support.  I don't know if it solves your second problem, but  
> it already implements clustering and may allow you to get to an  
> answer for the second problem quicker.  I have, just recently,  
> started using it for a clustering task I am working on related to  
> search results.

I tracked down this demo...

http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

From what I can tell, it doesn't use Lucene's term vectors.  I think  
it should be possible to exploit those Term Vectors, perhaps yielding  
a good result without having to build a summary for each document.  
Dunno if the benefits justify the development effort.  :) I have to  
implement host-deduping in a KinoSearch-based app anyway, though, so  
I think I'll try this technique and see how well things work if I  
extend it for use with non-keyword fields.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: Returning a minimum number of clusters

Doug Cutting
In reply to this post by Marvin Humphrey
Marvin Humphrey wrote:

> On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
>> Nutch implements host-deduping roughly as follows:
>>
>> To fetch the first 10 hits it first asks for the top-scoring 20 or  
>> so. Then it uses a field cache to reduce this to just two from each  
>> host. If it runs out of raw hits, then it re-runs the query, this  
>> time for the top scoring 40 hits.  But the query is modified this  
>> time to exclude matches from hosts that have already returned more  
>> than two hits. (Nutch also automatically converts clauses like "-
>> host:foo.com" into cached filters when "foo.com" occurs in more  than
>> a certain percentage of documents.)
>
> Is that an optimization which only works for Nutch and hosts, or is  it
> something that could be generalized and implemented sanely in Lucene?

It's probably generalizable.

The stuff that optimizes queries into filters is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup

The deduping logic is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java?view=markup
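
Reduced to a sketch, the clause-to-filter trick amounts to a Filter
that excludes documents matching host:<value> and caches its bit set
per reader, much as QueryFilter does for positive clauses.  This is
illustrative only; the linked files have the real logic:

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class ExcludeHostFilter extends Filter {
      private final String host;
      private final WeakHashMap<IndexReader, BitSet> cache =
          new WeakHashMap<IndexReader, BitSet>();

      public ExcludeHostFilter(String host) { this.host = host; }

      public synchronized BitSet bits(IndexReader reader)
          throws IOException {
        BitSet cached = cache.get(reader);
        if (cached != null) return cached;      // computed once per reader
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(0, reader.maxDoc());           // start with everything
        TermDocs td = reader.termDocs(new Term("host", host));
        try {
          while (td.next()) bits.clear(td.doc()); // drop matching docs
        } finally {
          td.close();
        }
        cache.put(reader, bits);
        return bits;
      }
    }

Pass an instance as the Filter argument to Searcher.search(query,
filter, n); Nutch applies this only when the term's docFreq exceeds
some fraction of maxDoc, since for rare terms the plain clause is
cheaper.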

Doug


Re: Returning a minimum number of clusters

Bob Carpenter
In reply to this post by Marvin Humphrey
CONTENTS DELETED
The author has deleted this message.

Re: Returning a minimum number of clusters

Marvin Humphrey

On May 2, 2006, at 1:55 PM, [hidden email] wrote:

> This is an issue of scaling the different dimensions.

>> It is more expensive to calculate similarity based on the entire  
>> document's contents rather than just a snippet chosen by the  
>> Highlighter.  However, it's presumably more accurate, and having  
>> the  Term Vectors pre-built at index time should help quite a bit.
>
> This varies, actually, depending on the document.  If
> you grab HTML from a portal, and use it all, pages from
> that portal will tend to cluster together.  If you just
> use snippets of text around document passages that
> match your query, you can actually get more accurate clustering  
> relative
> to your query.  It really depends if the documents are
> single-topic and coherent.  If so, use them all; if not,
> use snippets.  [You can see this problem leading the
> Google news classifier astray on occasion.]

That's both helpful and deflating.  :\  I can imagine that if you
used the complete document vector from an HTML document that included
navigation text, the navigation text would dominate the clustering.
That navigation text, which cannot practically be expunged at
spidering/indexing time if you are naive about the document
structure, is unlikely to show up in a snippet.

> A typical way to approximate is by only taking high TF/IDF
> terms.

Another strike against using the existing Term Vectors, as you'd have
to look each term up in the term dictionary to get its document
frequency.  A stoplist could narrow things down some, but it would
have to be applied at index-time if the terms were stemmed.
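
To put a finer point on that cost: pruning a doc's vector to its top
TF/IDF terms takes one docFreq() lookup per term.  A sketch, with
names of my own:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;

    public class VectorPruner {
      /** Keep only the k highest-TF.IDF terms of one doc's vector. */
      public static String[] topTerms(IndexReader reader, int docId,
                                      String field, int k)
          throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, field);
        if (tfv == null) return new String[0];
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        int numDocs = reader.numDocs();
        final double[] weights = new double[terms.length];
        for (int i = 0; i < terms.length; i++) {
          // This docFreq() call is the term-dictionary lookup at issue.
          int df = reader.docFreq(new Term(field, terms[i]));
          weights[i] = freqs[i] * Math.log((double) numDocs / (df + 1));
        }
        // Sort term indices by descending weight, keep the top k.
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) order[i] = new Integer(i);
        Arrays.sort(order, new Comparator<Integer>() {
          public int compare(Integer a, Integer b) {
            return Double.compare(weights[b.intValue()],
                                  weights[a.intValue()]);
          }
        });
        int n = Math.min(k, terms.length);
        String[] top = new String[n];
        for (int i = 0; i < n; i++) top[i] = terms[order[i].intValue()];
        return top;
      }
    }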

> Principal component methods are also popular (e.g.
> latent semantic indexing) to reduce dimensionality (usually
> with a least-squares fit criterion).

I imagine that reducing dimensionality isn't necessary if you're  
using only snippets.  And if you were to pre-compute LSI or similar  
at index-time, wouldn't you run into the same problems if your docs  
aren't single-topic and coherent to begin with?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

