diversity in results

diversity in results

Jason Rennie-2
Is there any option in Solr to encourage diversity in the results?  Our Solr
index has millions of products, many of which are quite similar to each
other.  Even something simple, like a maximum of 50% text overlap between
successive results, would be valuable.  Does something like this exist in
Solr, or are there any plans to add it?

Thanks,

Jason

--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: diversity in results

Grant Ingersoll-2
See https://issues.apache.org/jira/browse/SOLR-236 and
http://wiki.apache.org/solr/FieldCollapsing, but I gather it has been
languishing.  I also don't think it will do anything as extensive as the
text similarity question you are asking (50% overlap), but I have not tried
it.

-Grant


On Aug 4, 2008, at 12:50 PM, Jason Rennie wrote:

> Is there any option in solr to encourage diversity in the results?  Our solr
> index has millions of products, many of which are quite similar to each
> other.  Even something simple like max 50% text overlap in successive
> results would be valuable.  Does something like this exist in solr or are
> there any plans to add it?
>
> Thanks,
>
> Jason



Re: diversity in results

Brian Whitman
In reply to this post by Jason Rennie-2
On Aug 4, 2008, at 12:50 PM, Jason Rennie wrote:

> Is there any option in solr to encourage diversity in the results?  Our solr
> index has millions of products, many of which are quite similar to each
> other.  Even something simple like max 50% text overlap in successive
> results would be valuable.  Does something like this exist in solr or are
> there any plans to add it?
>

Not out of the box, but I would use the MLT (MoreLikeThis) handler on the
first result and remove the documents that appear in both the MLT response
and the query response.
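
The filtering step described above could be sketched roughly as follows. This is only an illustration in Python, not something Solr provides: `drop_near_duplicates` and the `more_like_this` callback are hypothetical names, and the callback stands in for whatever call fetches MLT results for a document ID. It is stubbed here with a small dictionary so the example runs without a Solr server.

```python
from typing import Callable, Iterable, List


def drop_near_duplicates(
    query_ids: List[str],
    more_like_this: Callable[[str], Iterable[str]],
) -> List[str]:
    """Keep the top hit, then drop later hits that MLT also returns for it."""
    if not query_ids:
        return []
    top = query_ids[0]
    # IDs that the (hypothetical) MLT lookup considers similar to the top hit.
    similar = set(more_like_this(top))
    # Preserve the original ranking of everything that is not a near-duplicate.
    return [top] + [doc_id for doc_id in query_ids[1:] if doc_id not in similar]


# Stub data standing in for real query and MLT responses (assumed values).
fake_mlt = {"shoe-1": ["shoe-2", "shoe-3", "shoe-9"]}
results = ["shoe-1", "shoe-2", "hat-7", "shoe-3", "bag-4"]

diverse = drop_near_duplicates(results, lambda doc_id: fake_mlt.get(doc_id, []))
print(diverse)  # ['shoe-1', 'hat-7', 'bag-4']
```

This only de-duplicates against the first result, as suggested; repeating it for each surviving result would cost one extra MLT lookup per result.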

B


Re: diversity in results

Jason Rennie-2
In reply to this post by Grant Ingersoll-2
Thanks for the pointers.  Looks interesting, at least as a starting point
for something more sophisticated.

Cheers,

Jason

On Mon, Aug 4, 2008 at 4:38 PM, Grant Ingersoll <[hidden email]> wrote:

> See https://issues.apache.org/jira/browse/SOLR-236 and
> http://wiki.apache.org/solr/FieldCollapsing, but I gather it has been
> languishing.  I also don't think it will do anything as extensive as the
> text similarity question you are asking (50% overlap) but I have not tried
> it.
>
> -Grant

Re: diversity in results

Jason Rennie-2
In reply to this post by Brian Whitman
Does the MLT handler simply select a few high TF-IDF terms from the doc and
use them as a query?  Sounds like a useful tool.  Do you know anything about
relevant performance issues?  I noticed that the Solr MoreLikeThis wiki page
recommends turning on TermVectors for the corresponding fields.  Is that
because Lucene cannot easily return term counts for a document with the
standard indexing, since it's term-based (i.e. "inverted")?  Does
TermVectors=true cause Solr/Lucene to store an additional doc-based index?

Thanks,

Jason

On Mon, Aug 4, 2008 at 5:06 PM, Brian Whitman <[hidden email]> wrote:

> not out of the box, but I would use the mlt handler on the first result and
> remove all the ones that appear in both the MLT and query response.
>
> B
>
>

Re: diversity in results

Otis Gospodnetic-2
In reply to this post by Jason Rennie-2
Hi Jason,


Yes, term vectors (TV) will store additional data in the index.  Using fields with TV=true simply makes it easier to get to the significant terms.  And yes, in the end those terms are used to perform a normal query and get the most similar docs.  This is based on my use of MLT a long while back, but I don't think things have changed that much in the last few years.
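
To make the "pick the significant terms, then run a normal query" idea concrete, here is a toy Python sketch. It is not Lucene's MoreLikeThis implementation: the function names, the `tf * log(N / df)` weighting (a common textbook TF-IDF variant), and the sample document frequencies are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def top_tfidf_terms(
    doc_tokens: List[str],
    doc_freq: Dict[str, int],
    num_docs: int,
    max_terms: int = 3,
) -> List[str]:
    """Rank a document's terms by tf * log(N / df) and return the best few."""
    tf = Counter(doc_tokens)
    scores = {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:max_terms]


def build_query(terms: List[str]) -> str:
    """Join the selected terms into a plain boolean OR query string."""
    return " OR ".join(terms)


# Assumed sample data: one tokenized document and corpus-wide document counts.
tokens = ["red", "leather", "boot", "leather", "the", "the", "the"]
doc_freq = {"red": 100, "leather": 10, "boot": 20, "the": 1000}

terms = top_tfidf_terms(tokens, doc_freq, num_docs=1000)
print(build_query(terms))  # leather OR boot OR red
```

Note that "the" scores zero here despite being the most frequent token, because it appears in every document of the assumed corpus.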

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Jason Rennie <[hidden email]>
> To: [hidden email]
> Sent: Monday, August 4, 2008 6:17:28 PM
> Subject: Re: diversity in results
>
> Does the MLT handler simply select a few high tfidf terms from the doc and
> use them as a query?  Sounds like a useful tool.  Do you know anything about
> relevant performance issues?  I noticed that the Solr MoreLikeThis wiki page
> recommends turning on TermVectors for corresponding fields.  Can lucene not
> easily return term counts for a document with the standard indexing b/c it's
> term-based (i.e. "inverted").  Does TermVectors=true cause solr/lucene to
> store an additional doc-based index?
>
> Thanks,
>
> Jason
>
> On Mon, Aug 4, 2008 at 5:06 PM, Brian Whitman wrote:
>
> > not out of the box, but I would use the mlt handler on the first result and
> > remove all the ones that appear in both the MLT and query response.
> >
> > B
> >
> >

Re: diversity in results

Grant Ingersoll-2
In reply to this post by Jason Rennie-2

On Aug 4, 2008, at 6:17 PM, Jason Rennie wrote:

> Does the MLT handler simply select a few high tfidf terms from the doc and
> use them as a query?  Sounds like a useful tool.  Do you know anything about
> relevant performance issues?  I noticed that the Solr MoreLikeThis wiki page
> recommends turning on TermVectors for corresponding fields.  Can lucene not
> easily return term counts for a document with the standard indexing b/c it's
> term-based (i.e. "inverted").

Correct.

> Does TermVectors=true cause solr/lucene to
> store an additional doc-based index?

Yes, that is correct.
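
The difference between the two structures can be illustrated with a toy in-memory model in Python. This is a sketch of the concept only and says nothing about Lucene's actual on-disk format; the documents and names (`inverted`, `term_vectors`, and the two helper functions) are made up for the example.

```python
from collections import Counter, defaultdict
from typing import Dict

# Two tiny example documents (assumed data).
docs = {
    "d1": "red leather boot",
    "d2": "red leather leather bag",
}

# Inverted index: term -> {doc_id: count}. Efficient for "which docs contain X?".
inverted: Dict[str, Dict[str, int]] = defaultdict(dict)
# Term vectors: doc_id -> {term: count}. An extra, document-oriented structure.
term_vectors: Dict[str, Dict[str, int]] = {}

for doc_id, text in docs.items():
    counts = Counter(text.split())
    term_vectors[doc_id] = dict(counts)
    for term, n in counts.items():
        inverted[term][doc_id] = n


def counts_from_inverted(doc_id: str) -> Dict[str, int]:
    """Recover a document's term counts by scanning every term's postings."""
    return {
        term: postings[doc_id]
        for term, postings in inverted.items()
        if doc_id in postings
    }


def counts_from_term_vector(doc_id: str) -> Dict[str, int]:
    """Read the same counts directly from the per-document structure."""
    return term_vectors[doc_id]


print(counts_from_inverted("d2") == counts_from_term_vector("d2"))  # True
```

Both lookups return the same counts, but the inverted-index version has to visit every term in the vocabulary, while the term-vector version is a single lookup paid for with the extra stored data.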