Aggregating/Grouping Document Search Results on a Field

Bradford Stephens
Greetings,

We've been experimenting with grouping fields returned from document
search results in Lucene, and the results so far haven't been very
encouraging. Basically, the more results we return, the longer it
takes -- tens of seconds, probably because we're doing expensive disk
seeks. I'm hoping the Solr crew out there can provide some insight :)

What we're trying to do is similar to SQL's "GROUP BY". Let's say we
have documents indexed by keyword for a content body, and also indexed
by an author name. If I search our (very large) document store for the
word "laptop", I would like to be able to find the 10 authors who
appear most often in the matching documents.
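
For reference, the aggregation described above can be sketched in a few lines of Python over an in-memory list of hits. This is only an illustration of the desired result, not of how Lucene or Solr computes it; the `hits` list, the `author` key, and the `top_authors` helper are all hypothetical names. It corresponds roughly to a SQL query that groups by author, counts rows, orders by the count descending, and keeps the first 10.

```python
from collections import Counter


def top_authors(hits, n=10):
    """Return the n most frequent authors among the matching documents.

    `hits` is assumed to be an iterable of dicts with an "author" key;
    this shape is hypothetical and chosen only for illustration.
    """
    counts = Counter(doc["author"] for doc in hits)
    # most_common sorts by count, highest first
    return counts.most_common(n)


# Hypothetical search results for the query "laptop"
hits = [
    {"author": "alice"},
    {"author": "bob"},
    {"author": "alice"},
    {"author": "carol"},
    {"author": "alice"},
    {"author": "bob"},
]

top = top_authors(hits, n=2)
print(top)
```

Doing this naively means touching every hit to read its author value, which would be consistent with the time growing as the result set grows.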

I've done some searching through the mailing list, but couldn't glean
much insight. What do you think?

--
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science

Re: Aggregating/Grouping Document Search Results on a Field

shb-2
You can look at Solr's faceted search; that might help you.


Re: Aggregating/Grouping Document Search Results on a Field

Bradford Stephens
It looks like field collapsing may be the key:
http://issues.apache.org/jira/browse/SOLR-236

But it doesn't seem to be finalized yet. I wonder how well it
performs on indexes of 50 million+ documents?


Re: Aggregating/Grouping Document Search Results on a Field

Bradford Stephens
Oh, wow... I think faceted search is the right path, especially
after seeing this amazing article:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

I hope it's performant over hundreds of thousands of search results :)


Re: Aggregating/Grouping Document Search Results on a Field

Bradford Stephens
Does the facet aggregation take place on the Solr search server, or
the Solr client?

It's pretty slow for me -- on a machine with 8 cores and 8 GB RAM, with
a 50-million-document index (about 36M unique values in the "author"
field), a query that returns 131,000 hits takes about 20 seconds to
calculate the top 50 authors. The query I'm running is this:

http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname:
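
As a hedged sketch, a facet request like the one above could be built programmatically as follows. `facet.limit`, `facet.mincount`, and `rows` are standard Solr request parameters; the `build_facet_url` helper is a hypothetical name, and the `/solr/select` path is an assumption about how the handler is mounted. Setting `rows=0` asks only for the facet counts and skips returning the documents themselves.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_field, top_n=50):
    """Build a Solr select URL that requests only the top facet counts.

    `base_url` is assumed to point at the select handler; the helper
    name and defaults are illustrative, not part of Solr.
    """
    params = {
        "q": query,
        "rows": 0,                   # no documents, only facet counts
        "facet": "true",
        "facet.field": facet_field,
        "facet.limit": top_n,        # return at most top_n facet values
        "facet.mincount": 1,         # skip values with zero matching docs
    }
    return base_url + "?" + urlencode(params)


url = build_facet_url("http://dttest10:8983/solr/select", "java", "authorname")
print(url)
```

Whether limiting the returned values changes the computation time depends on the faceting method in use, so this mainly trims the response rather than guaranteeing a speedup.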




Re: Aggregating/Grouping Document Search Results on a Field

Avlesh Singh
> Does the facet aggregation take place on the Solr search server, or the
> Solr client?

On the Solr server.

Faceting is an expensive operation by nature, especially when the number
of hits is large. Solr caches these values once they are computed. You might
want to tweak the cache-related parameters in your Solr config for better
performance. Read up on the caching section (
http://wiki.apache.org/solr/SolrConfigXml#head-ffe19c34abf267ca2d49d9e7102feab8c79b5fb5)
for details.

Cheers
Avlesh


Re: Aggregating/Grouping Document Search Results on a Field

Shalin Shekhar Mangar
On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens <
[hidden email]> wrote:

> Does the facet aggregation take place on the Solr search server, or
> the Solr client?
>
> It's pretty slow for me -- on a machine with 8 cores/ 8 GB RAM, 50
> million document index (about 36M unique values in the "author"
> field), a query that returns 131,000 hits takes about 20 seconds to
> calculate the top 50 authors. The query I'm running is this:
>
> http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname:

Is the author field tokenized? Is it multi-valued? For faceting, it is
best to use an untokenized field.
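
To illustrate why tokenization matters here, the sketch below counts facet values for the same three documents twice: once treating each author name as a single term, and once after splitting it into lowercased words. The whitespace split is only a crude stand-in for a real analyzer, and the sample names are made up.

```python
from collections import Counter

# Hypothetical author values for three documents
docs = ["John Smith", "John Smith", "Jane Smith"]

# Untokenized: each full author name is one facet value
untokenized = Counter(docs)

# Tokenized: each lowercased word becomes its own facet value
tokenized = Counter(tok for name in docs for tok in name.lower().split())

print(untokenized.most_common())
print(tokenized.most_common())
```

With the tokenized field, the counts are per word ("smith" is counted for all three documents) rather than per author, and the number of distinct terms to count over changes as well.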

Solr 1.4 has huge improvements in faceting performance, so you can try that
and see if it helps. See Yonik's blog post about this:
http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/

--
Regards,
Shalin Shekhar Mangar.

Re: Aggregating/Grouping Document Search Results on a Field

Bradford Stephens
Thanks for this -- we're also trying out bobo-browse for Lucene, and
early results look pretty enticing. They've greatly sped up reading
documents in from disk, among other things:
http://bobo-browse.wiki.sourceforge.net/


Re: Aggregating/Grouping Document Search Results on a Field

Jason Rutherglen
Solr 1.4 has a new feature (https://issues.apache.org/jira/browse/SOLR-475)
that speeds up faceting on fields with many terms by adding an
UnInvertedField. Bobo uses a custom field cache as well. It may be useful
to benchmark the three different approaches (bitsets, SOLR-475, Bobo).
Perhaps this could become a wiki page explaining the differences between
them?
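
The general idea behind un-inverting a field can be sketched as follows. This is a deliberately simplified, single-valued illustration and not Solr's actual UnInvertedField code or Bobo's field cache; `uninvert` and `top_facets` are hypothetical helper names. Each document stores a small integer (an ordinal into a term dictionary), so counting facets for a result set becomes an array lookup per hit instead of a stored-field read.

```python
def uninvert(doc_terms):
    """Build a sorted term dictionary and a per-document ordinal array.

    `doc_terms[i]` is the single field value of document i (assumption:
    the field is single-valued, which keeps the sketch short).
    """
    terms = sorted(set(doc_terms))
    ord_of = {term: i for i, term in enumerate(terms)}
    doc_to_ord = [ord_of[term] for term in doc_terms]
    return terms, doc_to_ord


def top_facets(terms, doc_to_ord, hit_doc_ids, n):
    """Count term ordinals over the hits and return the n largest."""
    counts = [0] * len(terms)
    for doc_id in hit_doc_ids:
        counts[doc_to_ord[doc_id]] += 1
    # Sort by count descending, then by term for a stable order
    ranked = sorted(range(len(terms)), key=lambda o: (-counts[o], terms[o]))
    return [(terms[o], counts[o]) for o in ranked[:n] if counts[o] > 0]


# Hypothetical index of six documents and a query matching four of them
authors = ["ann", "bob", "ann", "cy", "bob", "ann"]
terms, doc_to_ord = uninvert(authors)
result = top_facets(terms, doc_to_ord, [0, 2, 3, 4], 2)
print(result)
```

The ordinal array is built once and reused across queries, which is where the memory cost goes; with tens of millions of unique values the counts array itself becomes large, so real implementations are more elaborate than this.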


Re: Aggregating/Grouping Document Search Results on a Field

John Wang-9
Hi Brad:
    We (Bobo) have since added some perf tests that let you do some
benchmarking very quickly:

http://code.google.com/p/bobo-browse/wiki/BoboPerformance

    Let me know if you need help setting up.

-John
