solr and diversification

Diego Ceccarelli (BLOOMBERG/ LONDON)
Hi,

I'm considering writing a component for diversifying the results. I know that diversification can be achieved by using grouping, but I'm thinking about something different and query-biased.
The idea is to have something that gets applied after the normal retrieval and selects the k most diverse documents based on some distance metric:

Example:
Imagine that you are asking for 10 rows, and you set diversify.rows=3, diversity.metric=tfidf, diversify.field=body.

Solr might retrieve the top 10 rows as usual, extract tf-idf vectors for the bodies, and select the 3 stories that are most distant from each other according to cosine similarity.
This would be different from grouping because documents would be 'collapsed' or not based on the subset of documents retrieved for the query.
Do you think it would make sense to have it as a component? Any feedback or ideas?
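
To make the proposal concrete, here is a minimal sketch of the post-retrieval step in plain Python. It is not Solr code: the function names are hypothetical, the tf-idf weights are computed over the retrieved documents only (which is what makes the result query-biased), and the greedy "farthest from what is already selected" rule is just one assumed way to pick the most distant stories.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse tf-idf vectors for tokenised docs, computed over the retrieved set only."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        # Smoothed idf so that terms present in every document keep a non-zero weight.
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def diversify(docs, k):
    """Return the indices of k documents, given docs in relevance order.

    Assumption: the top-ranked document is always kept, then the document
    least similar to anything already selected is added, until k are chosen.
    """
    vectors = tfidf_vectors(docs)
    selected = [0] if docs and k > 0 else []
    remaining = list(range(1, len(docs)))
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda i: max(cosine(vectors[i], vectors[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Here `docs` plays the role of the 10 retrieved bodies (already tokenised) and `k` the role of diversify.rows.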



Re: solr and diversification

Joel Bernstein
I've thought about this problem a little bit. What I was considering was
using k-means clustering to cluster the top 50 docs, then pulling the top
scoring doc from each cluster as the top documents. This should be fast and
effective at getting diversity.
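
A rough sketch of that idea in plain Python, under some assumptions: each of the top docs is already represented as a dense feature vector (for example tf-idf), the docs are listed in relevance order, and the k-means initialisation simply takes evenly spaced docs from that list so the result is deterministic. The function names are hypothetical and nothing here is a Solr API.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns the cluster index assigned to each point."""
    # Assumed deterministic initialisation: k points spread evenly through the list.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def top_doc_per_cluster(points, scores, k):
    """Cluster the docs, then keep the highest-scoring doc of each cluster."""
    assignment = kmeans(points, k)
    best = {}
    for idx, cluster in enumerate(assignment):
        if cluster not in best or scores[idx] > scores[best[cluster]]:
            best[cluster] = idx
    # Return the cluster representatives ordered by relevance score.
    return sorted(best.values(), key=lambda i: scores[i], reverse=True)
```

With 50 docs and a small k this is only a handful of passes over the vectors, which is why it should be cheap as a post-retrieval step.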


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Sep 27, 2018 at 1:20 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
[hidden email]> wrote:


Re: solr and diversification

Diego Ceccarelli (BLOOMBERG/ LONDON)
Yeah, I think k-means might be one way to implement the "top 3 stories that are more distant", but you could also use a more naïve (and faster) strategy, such as:
 - send a threshold with the request
 - scan the documents in order of relevance score
 - select the top documents whose diversity exceeds the threshold.

I would allow the strategy to be defined and selected from the request.
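
The naïve threshold strategy could look roughly like the sketch below. It assumes that "diversity" means the cosine distance between a candidate and every document already selected, and that the documents arrive as sparse term-weight vectors in relevance order. The function names and the sample threshold are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity for two sparse vectors stored as dicts or Counters."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def select_by_threshold(vectors, threshold, k):
    """Single pass in relevance order; keep a doc only if it is far enough
    from every doc kept so far, and stop once k docs are selected."""
    selected = []
    for idx, vec in enumerate(vectors):
        if all(cosine_distance(vec, vectors[j]) > threshold for j in selected):
            selected.append(idx)
            if len(selected) == k:
                break
    return selected


# Raw term counts stand in for tf-idf weights here to keep the example short.
ranked = [
    Counter("apple launches new phone".split()),
    Counter("apple launches new phone today".split()),
    Counter("bank raises rates".split()),
    Counter("bank raises rates again".split()),
    Counter("storm hits coast".split()),
]
print(select_by_threshold(ranked, threshold=0.5, k=3))
```

Unlike clustering, this needs at most one comparison per candidate and selected document, and it can return fewer than k documents when the threshold is strict.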

From: [hidden email] At: 09/27/18 18:25:43 To: Diego Ceccarelli (BLOOMBERG/ LONDON), [hidden email]
Subject: Re: solr and diversification


Re: solr and diversification

Joel Bernstein
Yeah, I think your plan sounds fine.

Do you have a specific use case for diversity of results? I've been
wondering if diversity of results would provide better perceived relevance.

Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Sep 27, 2018 at 1:39 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
[hidden email]> wrote:


Re: solr and diversification

Tim Allison
If you haven’t already, you might want to check out maximal marginal
relevance (MMR). The original paper is by Carbonell and Goldstein.
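
For readers who have not seen it, MMR builds the result list greedily: at each step it picks the candidate that maximises lambda * relevance(d) - (1 - lambda) * (maximum similarity between d and any document already selected). The sketch below is a minimal Python illustration of that rule; the function name, the default lambda and the precomputed similarity matrix are assumptions made for the example.

```python
def mmr(relevance, similarity, k, lam=0.7):
    """Greedy maximal marginal relevance re-ranking.

    relevance[i]     : relevance score of doc i for the query
    similarity[i][j] : similarity between docs i and j
    lam              : 1.0 means pure relevance, lower values favour diversity
    """
    candidates = list(range(len(relevance)))
    selected = []
    while candidates and len(selected) < k:

        def marginal(i):
            # Penalise a candidate by its similarity to the closest selected doc.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr(relevance, similarity, k=2, lam=0.5))
```

In this toy input, docs 0 and 1 are near-duplicates, so with lam=0.5 the second pick is the less relevant but different doc 2, while lam=1.0 falls back to plain relevance order.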

On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein <[hidden email]> wrote:


Re: solr and diversification

Joel Bernstein
Interesting, I had not heard of MMR.


Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Sep 28, 2018 at 10:43 AM Tim Allison <[hidden email]> wrote:


Re: solr and diversification

Diego Ceccarelli (BLOOMBERG/ LONDON)
The use case is ranking news, Joel. And yes, I have the feeling that it might improve relevance; in 2011/2012 there was a lot of work on this in academia.

Thanks Tim, I'll check out MMR.

From: [hidden email] At: 09/28/18 20:24:44 To: [hidden email]
Subject: Re: solr and diversification
