result grouping?


Ryan McKinley
Is it possible to group the results from a solr query?  I have indexed
the content from many web pages on many sites.  I'd like to return
only two results from each site.

schema.xml:

   <field name="uri"    type="string"   indexed="true"  stored="true"/>
   <field name="site"  type="string"   indexed="true"  stored="true"/>
   <field name="content"  type="text"   indexed="true"  stored="true"/>

for example
  uri: http://en.wikipedia.org/wiki/James_Madison
  site: wikipedia.org

How do I get results grouped by site?

Is this possible with the standard query?  The website lists: "Support
for Dynamic Result Grouping and Filtering."  Is it referring to
faceted browsing or this?

If it's not supported off the shelf, what is the best way to implement
result grouping?

thanks
ryan

Re: result grouping?

Ricardo Borillo-2
Hi,

I don't know if solr can manage grouping. But you can do it using an XSLT
stylesheet:

http://www.jenitennison.com/xslt/grouping/muenchian.html

Hope it helps :)


On 1/2/07, Ryan McKinley <[hidden email]> wrote:

> Is it possible to group the results from a solr query?  I have indexed
> the content from many web pages on many sites.  I'd like to return
> only two results from each site.


--
Salut,
====================================
Ricardo Borillo Domenech
Analista/Programador - Servei d'Informàtica
Universitat Jaume I
http://xml-utils.com

Re: result grouping?

Ryan McKinley
Thanks.  Yes, the presentation layer could group results, but that is
not practical if I want to show the first 20 results out of 200,000
matches.

Nutch groups the results by site.  Any idea how they do it?

thanks
ryan


On 1/3/07, Ricardo Borillo <[hidden email]> wrote:

> Hi,
>
> I don't know if solr can manage grouping. But you can do it using an XSLT
> stylesheet:
>
> http://www.jenitennison.com/xslt/grouping/muenchian.html
>
> Hope it helps :)

Re: result grouping?

Yonik Seeley-2
On 1/3/07, Ryan McKinley <[hidden email]> wrote:
> thanks.  Yes, the presentation layer could group results, but that is
> not practical if I want to show the first 20 results out of 200,000
> matches.
>
> Nutch groups the results by site.  Any idea how they do it?

Good question.
Off the top of my head, one could use a priority queue that can change
its size dynamically.  One could increment a group count for each hit
(like faceted search with the FieldCache) and, if the group count
exceeds "n", increment the size of the priority queue to
allow an additional item to be collected to compensate.

-Yonik

Re: result grouping?

Luis Neves-3
Yonik Seeley wrote:

> On 1/3/07, Ryan McKinley <[hidden email]> wrote:
>> thanks.  Yes, the presentation layer could group results, but that is
>> not practical if I want to show the first 20 results out of 200,000
>> matches.
>>
>> Nutch groups the results by site.  Any idea how they do it?
>
> Good question.
> Off the top of my head, one could use a priority queue that can change
> its size dynamically.  One could increment a group count for each hit
> (like faceted search with the FieldCache) and if the group count
> exceeds "n", then you increment the size of the priority queue to
> allow an additional item to be collected to compensate.
>
> -Yonik

You might as well say that I have to change the dilithium crystals in the flux
capacitor :-)

One of the reasons I like Solr so much is because I get impressive results
without having to know Lucene, which is something that will have to change
because I also need this feature.

Not knowing much about the internals of Solr/Lucene, I had a look at the Facet
code in search of ideas, but from what I could see the facet counts are
calculated after the Documents are added to the response. It seems to me that
any kind of grouping has to be done before that... right?

Could you explain in more detail where I should look?

Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?

I'm immersing myself in Lucene but it will take some time.

Side note: Over here, besides Solr, we also use the "FAST" search platform, and
they call this feature "Field collapsing":
<http://www.fastsearch.com/glossary.aspx?m=48&amid=299>
I like the syntax they use:
"&collapseon=<fieldname>&collapsenum=N" -> Collapse, but keep N number of
collapsed documents
For some reason they can only collapse on numeric fields (int32).

Regards,
Luis Neves


Re: result grouping?

Yonik Seeley-2
On 1/4/07, Luis Neves <[hidden email]> wrote:

> Yonik Seeley wrote:
> > Off the top of my head, one could use a priority queue that can change
> > its size dynamically.  One could increment a group count for each hit
> > (like faceted search with the FieldCache) and if the group count
> > exceeds "n", then you increment the size of the priority queue to
> > allow an additional item to be collected to compensate.
> >
> > -Yonik
>
> You might as well say that I have to change the dilithium crystals in the flux
> capacitor :-)

Heh...
When someone asks for the top 10 documents, we create a priority queue
of size 10 and put all of the hits through it (with a performance
shortcut if the only sort is by score).  After we are all done, the
queue contains the top 10 documents by the sort criteria.

Now let's say we are limiting the number of results from any "site" to 2.
If we add another document to the priority queue and it will be the
3rd from a specific site, there are two things we could do:
1) remove the lowest ranking document from the 3 documents matching that site
2) increase the size of the priority queue to 11 since we will be
   throwing one of the documents away later.

At first blush, option (2) seemed easier to me, with the added step of
discarding the extra documents as you pull them from the queue.
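
A minimal Python sketch of option (2), offered as an illustration only and not as Solr or Lucene code. It assumes hits arrive as (doc_id, site, score) tuples in any order; the name `collapse_top_k` and its parameters are hypothetical. The queue gains one slot each time it admits a document beyond the per-site limit, and the surplus documents are discarded while draining. It follows the plain score-order reading, in which lower-ranked documents are not moved up to sit with their group.

```python
import heapq
from collections import Counter


def collapse_top_k(hits, k, per_site):
    """Return up to k doc ids in score order, with at most per_site docs per site.

    hits: iterable of (doc_id, site, score) tuples in any order (assumed format).
    """
    if k <= 0:
        return []
    heap = []             # min-heap of (score, doc_id, site); heap[0] is the weakest
    in_queue = Counter()  # number of queued docs per site
    capacity = k
    for doc_id, site, score in hits:
        # A full queue rejects anything that cannot beat its weakest entry.
        if len(heap) >= capacity and score <= heap[0][0]:
            continue
        heapq.heappush(heap, (score, doc_id, site))
        in_queue[site] += 1
        if in_queue[site] > per_site:
            capacity += 1  # one queued doc will be thrown away later
        if len(heap) > capacity:
            _, _, evicted_site = heapq.heappop(heap)
            in_queue[evicted_site] -= 1

    # Drain best-first, discarding docs beyond the per-site limit.
    result, taken = [], Counter()
    for _score, doc_id, site in sorted(heap, reverse=True):
        if taken[site] < per_site:
            taken[site] += 1
            result.append(doc_id)
            if len(result) == k:
                break
    return result
```

The early rejection test is a design choice in this sketch: without it, every low-scoring document from an already full site would enlarge the queue, which could grow toward the full match count when one site dominates.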

> One of the reasons I like Solr so much is because I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.
>
> Not knowing much about the internals of Solr/Lucene, I had a look at the Facet
> code in search of ideas, but from what I could see the facet counts are
> calculated after the Documents are added to the response. It seems to me that
> any kind of grouping has to be done before that... right?

Right.

> Could you explain in more detail where I should look?
>
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?

That's currently how the top docs are collected in Lucene (these
separate classes were added later, and Solr doesn't currently use
them).

SolrIndexSearcher.getDocListNC() is the lowest level of doc collection
that would need to be modified or duplicated.

> Side note: Over here, besides Solr, we also use the "FAST" search platform, and
> they call this feature "Field collapsing":
> <http://www.fastsearch.com/glossary.aspx?m=48&amid=299>
> I like the syntax they use:
> "&collapseon=<fieldname>&collapsenum=N" -> Collapse, but keep N number of
> collapsed documents
> For some reason they can only collapse on numeric fields (int32).

Cool, thanks for the reference.

There are still some things underspecified though.

Let's take an example of collapseon=site, collapsenum=2

The list of un-collapsed matches and their relevancy scores (sort order) is:
doc=51, site=A, score=100
doc=52, site=B, score=90
doc=53, site=C, score=80
doc=54, site=B, score=70
doc=55, site=D, score=60
doc=56, site=E, score=50
doc=57, site=B, score=40
doc=58, site=A, score=30

1)  If I ask for the top 4 docs, should I get [51,52,53,54] or
[51,52,54,53]?  Are lower ranking docs moved up in the rankings to be
in their higher ranking "group"?

2)  If I ask for the top 3 docs, should I get [51,52,53] because those
are the top 3 scoring docs, or should I get [51,58,52] because
documents were first grouped and then ranked (and 51 and 58 go
together)?  Another way of asking this is related to (1): should docs
outside the "window" be moved up in the rankings to be in their higher
ranking "group"?

3) Should the number of documents in a "group" change the relevancy?
Should site=B rank higher than site=A?

4) Is the collapsing only in the returned results, or just within a
page of results?  If I ask for docs 4 through 7, should doc 57 be in
that list or not?

Defining things to make sense while retaining the ability to page
through the results seems to be the challenge.

-Yonik

Re: result grouping?

Luis Neves-3
Yonik Seeley wrote:

> There are still some things underspecified though.
>
> Let's take an example of collapseon=site, collapsenum=2
>
> The list of un-collapsed matches and their relevancy scores (sort order)
> is:
> doc=51, site=A, score=100
> doc=52, site=B, score=90
> doc=53, site=C, score=80
> doc=54, site=B, score=70
> doc=55, site=D, score=60
> doc=56, site=E, score=50
> doc=57, site=B, score=40
> doc=58, site=A, score=30
>
> 1)  If I ask for the top 4 docs, should I get [51,52,53,54] or
> [51,52,54,53].  Are lower ranking docs moved up in the rankings to be
> in their higher ranking "group"?

The docs move up the ranking.
You should get [51,58,52,54] ... or one could make the case that you should get
[51,58,52,54,53,55], to get the somewhat equivalent behaviour of a SQL
"quota-query". In that case the "top 4" would not refer to the number
of documents but to the number of distinct values for the field you are collapsing on.
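
The two readings above can be reproduced with a short Python sketch over the example list. The names `group_first` and `flatten` are hypothetical, and the function sorts every hit, so it only illustrates the ordering semantics and says nothing about how a collector should compute them efficiently.

```python
HITS = [
    (51, "A", 100), (52, "B", 90), (53, "C", 80), (54, "B", 70),
    (55, "D", 60), (56, "E", 50), (57, "B", 40), (58, "A", 30),
]


def group_first(hits, per_group):
    """Rank groups by their best score; keep the best per_group docs in each."""
    groups = {}  # dicts keep insertion order, so groups stay in best-score order
    for doc_id, group, _score in sorted(hits, key=lambda h: h[2], reverse=True):
        members = groups.setdefault(group, [])
        if len(members) < per_group:
            members.append(doc_id)
    return list(groups.values())


def flatten(groups):
    return [doc_id for members in groups for doc_id in members]


groups = group_first(HITS, per_group=2)
top_4_docs = flatten(groups)[:4]    # "top 4" counts documents
top_4_groups = flatten(groups[:4])  # "top 4" counts distinct sites
print(top_4_docs)    # [51, 58, 52, 54]
print(top_4_groups)  # [51, 58, 52, 54, 53, 55]
```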


> 2)  If I ask for the top 3 docs, should I get [51,52,53] because those
> are the top 3 scoring docs, or should I get [51,58,52] because
> documents were first grouped and then ranked (and 51 and 58 go
> together)?  Another way of asking this is related to (1): should docs
> outside the "window" be moved up in the rankings to be in their higher
> ranking "group"?

See above.


>
> 3) Should the number of documents in a "group" change the relevancy?
> Should site=B rank higher than site=A?

I don't think so... don't know if that is what *should* be done, but that's not
what FAST does.


> 4) Is the collapsing only in the returned results, or just within a
> page of results.  If I ask for docs 4 through 7, should doc 57 be in
> that list or not?

With "FAST" that is an option. The default behaviour is to remove the documents
from the result set, so doc 57 would not be on the list, but you can choose
not to remove them, and in that case they are presented last.

> Defining things to make sense while retaining the ability to page
> through the results seems to be the challenge.


I'm beginning to think that this is a little too complex for a first project with
Lucene. In my particular case all I want is to group results by category (from a
predetermined, and small, category list), so I think I will just make one request
per category and accept the latency.

--
Luis Neves

Re: result grouping?

Mike Klaas
In reply to this post by Luis Neves-3
On 1/4/07, Luis Neves <[hidden email]> wrote:
> Yonik Seeley wrote:

> One of the reasons I like Solr so much is because I get impressive results
> without having to know Lucene, which is something that will have to change
> because I also need this feature.

<>

> Could you explain in more detail where I should look?
>
> Can the TopFieldDocCollector/TopFieldDocs classes be used to this end?
>
> I'm immersing myself in Lucene but it will take some time.

We use Solr in a nutch-like manner (index distributed over a
collection of servers, results are merged and similar documents
collapsed).  We have to do the collapsing outside of Solr due to the
result combining, but I think it is a viable strategy for a
single-instance too.  Just slightly over-request the desired number of
docs, collapse using arbitrary logic, and request more if necessary.

The main disadvantage is if the user skips ahead several pages, all
the intermediate results must be generated.
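
A rough Python sketch of this over-request approach, under stated assumptions: `search(start, rows)` is a hypothetical callable standing in for a request to the search server and is assumed to return (doc_id, site) pairs in rank order. The name `collapsed_page`, the `overfetch` factor, and the simple per-site limit used as the collapsing rule are illustrative choices, not anything described above.

```python
def collapsed_page(search, page_size, per_site, overfetch=2.0):
    """Collect page_size docs, keeping at most per_site docs from each site.

    search(start, rows) is a hypothetical callable returning a list of
    (doc_id, site) pairs in rank order.
    """
    kept, seen = [], {}
    start = 0
    rows = max(1, int(page_size * overfetch))  # slightly over-request
    while len(kept) < page_size:
        batch = search(start, rows)
        if not batch:
            break  # no more matches
        for doc_id, site in batch:
            if seen.get(site, 0) < per_site:
                seen[site] = seen.get(site, 0) + 1
                kept.append(doc_id)
                if len(kept) == page_size:
                    break
        start += len(batch)  # request more if the page is still short
    return kept
```

Serving a later page with this sketch would mean running the same loop from the start and skipping the documents kept for earlier pages, which is the skip-ahead cost noted above.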

-Mike

Re: result grouping?

Yonik Seeley-2
In reply to this post by Luis Neves-3
On 1/5/07, Luis Neves <[hidden email]> wrote:

> Yonik Seeley wrote:
>
> > There are still some things underspecified though.
> >
> > Let's take an example of collapseon=site, collapsenum=2
> >
> > The list of un-collapsed matches and their relevancy scores (sort order)
> > is:
> > doc=51, site=A, score=100
> > doc=52, site=B, score=90
> > doc=53, site=C, score=80
> > doc=54, site=B, score=70
> > doc=55, site=D, score=60
> > doc=56, site=E, score=50
> > doc=57, site=B, score=40
> > doc=58, site=A, score=30
> >
> > 1)  If I ask for the top 4 docs, should I get [51,52,53,54] or
> > [51,52,54,53].  Are lower ranking docs moved up in the rankings to be
> > in their higher ranking "group"?
>
> The docs move up the ranking.

After thinking on this a little further (since someone submitted a
patch), this makes things significantly more expensive.

The issue is that even if you are only interested in the top 10 docs,
you can't use the normal priority queue method to discard low scores,
because the last document you score could be very high scoring, and be
in the same group as the lower previously-discarded scores.

One way is to keep a priority queue per field value (very expensive if
there are many field values).
Another way is to use two phases... the first collects the top n
documents, and the second grabs

Another issue is how to implement start + offset.

-Yonik

Re: result grouping?

Yonik Seeley-2
On 6/4/07, Yonik Seeley <[hidden email]> wrote:
> Another way is to use two phases... the first collects the top n
> documents, and the second grabs
... other members of each group in the list of docs to return.
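
One possible reading of the two-phase idea, sketched in Python with hypothetical names (`two_phase`, `per_group`) and hits assumed to be (doc_id, group, score) tuples. The second phase is written as a sort over all hits purely for brevity; a real collector would presumably make a second pass restricted to the groups found in phase one.

```python
import heapq


def two_phase(hits, n, per_group):
    """hits: list of (doc_id, group, score) tuples (assumed format)."""
    # Phase 1: the ordinary top-n collection by score.
    top = heapq.nlargest(n, hits, key=lambda h: h[2])
    members = {group: [] for _doc, group, _score in top}  # groups in rank order

    # Phase 2: grab the best per_group members of those groups,
    # including docs that fell outside the top n.
    for doc_id, group, _score in sorted(hits, key=lambda h: h[2], reverse=True):
        if group in members and len(members[group]) < per_group:
            members[group].append(doc_id)

    ordered = [doc_id for docs in members.values() for doc_id in docs]
    return ordered[:n]
```

With the example list from earlier in the thread, n=3 and per_group=2 give [51, 58, 52]. A limitation of this reading: when more than per_group of the top n documents share a group, the result comes back shorter than n, so phase one would have to collect extra documents or count groups instead.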

-Yonik