Facet performance with heterogeneous 'facets'?


Facet performance with heterogeneous 'facets'?

Michael Imbeault
Been playing around with the new 'facets search' and it works very
well, but it's really slow for some particular applications. I've been
trying to use it to display the most frequent authors of articles; this
is from a huge (15 million articles) database and names of authors are
rare and heterogeneous. On a query that takes (without facets) 0.1
seconds, it jumps to ~20 seconds with just 1% of the documents indexed
(I've been getting java.lang.OutOfMemoryError with the full index). ~40
seconds for a faceted search on 2 (string) fields. Range queries on a
slong field are more acceptable (even with a dozen of them, query time is
still in the subsecond range).

Am I trying to do something which isn't what faceted search was made
for? It would be understandable; after all, I guess the facets engine
has to check every doc in the index and sort... which shouldn't yield
good performance no matter what, sadly.

Is there any other way I could achieve what I'm trying to do? Just a
list of the most frequent (top 5) authors present in the results of a query.

Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what, whether there's one document in the
result set or I do a query that returns all 130000 documents.

It seems something isn't right... it looks like Solr is doing the faceted
search on the whole index, no matter what the result set is, when
faceting on a string field. I must be doing something wrong?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
In reply to this post by Michael Imbeault
On 9/18/06, Michael Imbeault <[hidden email]> wrote:
> Been playing around with the new 'facets search' and it works very
> well, but it's really slow for some particular applications. I've been
> trying to use it to display the most frequent authors of articles

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a set of
documents is either not possible, or not fast (assuming many documents
match a query).
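
To make the asymmetry concrete, here is a minimal Java sketch of a toy inverted index. It is an illustration only, not Lucene's data structures or API; the names `InvertedIndexSketch`, `docsFor` and `termsFor` are made up for this example. Getting the documents for a term is one map lookup, while getting the terms of a document has to scan the postings of every term.

```java
import java.util.*;

// Hypothetical toy index (not Lucene): maps each term to the sorted set of doc ids containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Fast direction: a single lookup returns every document for a term.
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    // Slow direction: finding the terms of one document scans every term's postings.
    List<String> termsFor(int docId) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, SortedSet<Integer>> e : postings.entrySet()) {
            if (e.getValue().contains(docId)) {
                terms.add(e.getKey());
            }
        }
        Collections.sort(terms);
        return terms;
    }
}
```

With 400 000 distinct authors, `termsFor` would touch 400 000 postings lists for each document, which is why the engine does not work that way by default.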

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
In reply to this post by Michael Imbeault
On 9/18/06, Michael Imbeault <[hidden email]> wrote:
> Just a little follow-up - I did a little more testing, and the query
> takes 20 seconds no matter what - If there's one document in the results
> set, or if I do a query that returns all 130000 documents.

Yes, currently the same strategy is always used.
   intersection_count(docs_matching_query, docs_matching_author1)
   intersection_count(docs_matching_query, docs_matching_author2)
   intersection_count(docs_matching_query, docs_matching_author3)
   etc...
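
The strategy above can be sketched in Java roughly as follows. This is a hedged illustration that uses `java.util.BitSet` in place of Solr's DocSet classes; `IntersectionFacetSketch` and its methods are hypothetical names, not Solr code. The point it shows is that the loop runs once per distinct term in the field, regardless of how many documents the query matched.

```java
import java.util.*;

// Illustrative only: doc sets are plain java.util.BitSet here, not Solr's DocSet classes.
class IntersectionFacetSketch {
    // Number of documents present in both sets.
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet copy = (BitSet) a.clone();
        copy.and(b);
        return copy.cardinality();
    }

    // One intersection per distinct author, whatever the size of the query result.
    static Map<String, Integer> facetCounts(BitSet docsMatchingQuery,
                                            Map<String, BitSet> docsByAuthor) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, BitSet> e : docsByAuthor.entrySet()) {
            int n = intersectionCount(docsMatchingQuery, e.getValue());
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }
}
```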

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.
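
Why the hit rate drops all the way to 0% rather than degrading gradually can be seen with a small simulation. The sketch below is an assumption-laden toy: it builds an LRU cache on `LinkedHashMap` (a stand-in, not Solr's LRUCache class) and replays facet requests that touch every term in the same order. As soon as there is one more term than cache slots, each entry is evicted just before it is needed again.

```java
import java.util.*;

// Toy LRU cache on LinkedHashMap in access order; a stand-in for illustration only.
class LruThrashSketch {
    static class Lru<K, V> extends LinkedHashMap<K, V> {
        private final int maxSize;

        Lru(int maxSize) {
            super(16, 0.75f, true); // true = order entries by most recent access
            this.maxSize = maxSize;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxSize;
        }
    }

    // Simulates `passes` facet requests that each touch terms 0..distinctTerms-1 in order
    // and returns the total number of cache hits.
    static int countHits(int cacheSize, int distinctTerms, int passes) {
        Lru<Integer, Boolean> cache = new Lru<>(cacheSize);
        int hits = 0;
        for (int p = 0; p < passes; p++) {
            for (int term = 0; term < distinctTerms; term++) {
                if (cache.get(term) != null) {
                    hits++;
                } else {
                    cache.put(term, Boolean.TRUE);
                }
            }
        }
        return hits;
    }
}
```

For example, `countHits(512, 512, 3)` finds every term cached on the second and third pass, while `countHits(512, 513, 3)` never gets a single hit.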

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:
> I noticed this too, and have been thinking about ways to fix it.
> The root of the problem is that Lucene, like all full-text search
> engines, uses inverted indices.  It's fast and easy to get all
> documents for a particular term, but getting all terms for a set of
> documents is either not possible, or not fast (assuming many documents
> match a query).
Yeah that's what I've been thinking; the index isn't built to handle
such searches, sadly :( It would be very nice to be able to rapidly
search by most frequent author, journal, etc.
> For cases like "author", if there is only one value per document, then
> a possible fix is to use the field cache.  If there can be multiple
> occurrences, there doesn't seem to be a good way that preserves exact
> counts, except maybe if the number of documents matching a query is
> low.
>
I have one value per document (I have fields for authors, last_author
and first_author, and I'm doing faceted search on the first and last
author fields). How would I use the field cache to fix my problem? Also,
would it be better to store a unique number (for each possible author) in
an int field along with the string, and do the faceted searching on the
int field? Would this be faster / require less memory? I'd guess yes, and
I'll test that when I have the time.

>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what - If there's one document in the results
>> set, or if I do a query that returns all 130000 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filtercache, the effective cache hit
> rate will be 0%
>
> -Yonik
So more memory would fix the problem? Also, I was under the impression
that it was only searching / sorting for authors that it knows are in
the result set... in the case of only one document (1 result), it seems
strange that it takes the same time as for 130 000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting with the whole document set at first, and
not docs_matching_query like you said.

Thanks for the support,

Michael

Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
In reply to this post by Yonik Seeley-2
Another followup: I bumped all the caches in solrconfig.xml to

      size="1600384"
      initialSize="400096"
      autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on last and
first author fields, + 12 range date facets, sub 0.3 seconds for
queries). I'll check on the full index tomorrow (it's indexing right
now, 400docs/sec!). However, I still don't know what these values
represent, or how I should estimate what to set them to. Originally I
thought it was the size of the cache in kb, and someone on the list told
me it was the number of items, but I don't quite get it. Better
documentation on that would be welcome :)

Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries if the docset is
too large...

Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
In reply to this post by Michael Imbeault
On 9/18/06, Michael Imbeault <[hidden email]> wrote:

> Yonik Seeley wrote:
> > For cases like "author", if there is only one value per document, then
> > a possible fix is to use the field cache.  If there can be multiple
> > occurrences, there doesn't seem to be a good way that preserves exact
> > counts, except maybe if the number of documents matching a query is
> > low.
> >
> I have one value per document (I have fields for authors, last_author
> and first_author, and I'm doing faceted search on first and last authors
> fields). How would I use the field cache to fix my problem?

Unless you want to dive into Solr development, you don't :-)
It requires extensive changes to the faceting code and doing things a
different way in some cases.

The FieldCache is the fastest way to "uninvert" single-valued
fields... it's currently only used for sorting, where one needs to
quickly know the field value given the document id.
The downside is high memory use, and that it's not a general
solution... it can't handle fields with multiple tokens (tokenized
fields or multi-valued fields).

So the strategy would be to step through the documents, get the value
for the field from the FieldCache, increment a counter for that value,
then find the top counters when we are done.
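
A minimal Java sketch of that counting strategy, under stated assumptions: `valueByDoc` is a plain array indexed by document id that stands in for what a FieldCache-style structure would provide for a single-valued field. It is not the real Lucene FieldCache API, and `FieldCacheFacetSketch` is a hypothetical name.

```java
import java.util.*;

// Sketch only: `valueByDoc` stands in for a FieldCache-style array (one value per doc id).
class FieldCacheFacetSketch {
    static List<Map.Entry<String, Integer>> topValues(String[] valueByDoc,
                                                      int[] matchingDocs, int limit) {
        // Step through the matching documents and count each field value.
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            String value = valueByDoc[doc];
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        // Find the top counters: highest count first, ties broken alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }
}
```

The work here grows with the number of matching documents rather than with the number of distinct authors, which is the opposite trade-off from the intersection approach.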

> Also, would
> it be better to store a unique number (for each possible author) in an
> int field along with the string, and do the faceted searching on the int
> field?

It won't really help.  It wouldn't be faster, and it would require
only slightly less memory.

> >> Just a little follow-up - I did a little more testing, and the query
> >> takes 20 seconds no matter what - If there's one document in the results
> >> set, or if I do a query that returns all 130000 documents.
> >
> > Yes, currently the same strategy is always used.
> >   intersection_count(docs_matching_query, docs_matching_author1)
> >   intersection_count(docs_matching_query, docs_matching_author2)
> >   intersection_count(docs_matching_query, docs_matching_author3)
> >   etc...
> >
> > Normally, the docsets will be cached, but since the number of authors
> > is greater than the size of the filtercache, the effective cache hit
> > rate will be 0%
> >
> > -Yonik
> So more memory would fix the problem?

Yes, if your collection size isn't that large...  it's not a practical
solution for many cases though.

> Also, I was under the impression
> that it was only searching / sorting for authors that it knows are in
> the result set...

That's the problem... it's not necessarily easy to know *what* authors
are in the result set.  If we could quickly determine that, we could
just count them and not do any intersections or anything at all.

>  in the case of only one document (1 result), it seems
> strange that it takes the same time as for 130 000 results. It should
> just check the results, see that there's only one author, and return
> that? And in the case of 2 documents, just sort 2 authors (or 1 if
> they're the same)? I understand your answer (it does intersections), but
> I wonder why its intersecting from the whole document set at first, and
> not docs_matching_query like you said.

It is just intersecting docs_matching_query.  The problem is that it's
intersecting that set with all possible author sets since it doesn't
know ahead of time what authors are in the docs that match the query.

There could be optimizations when docs_matching_query.size() is small,
so we start somehow with terms in the documents rather than terms in
the index.  That requires termvectors to be stored (medium speed), or
requires that the field be stored and that we re-analyze it (very
slow).

More optimization of special cases hasn't been done simply because no
one has done it yet... (as you note, faceting is a new feature).


-Yonik

Re: Facet performance with heterogeneous 'facets'?

Joachim Martin
In reply to this post by Michael Imbeault
Michael Imbeault wrote:

> Also, are there any plans to add an option not to run a facet search if
> the result set is too big? To avoid 40-second queries if the docset
> is too large...


You could run one query with facet=false, check the result size and then
run it again (should be fast because it is cached) with
facet=true&rows=0 to get facet results only.
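
On the client side, the two-step approach could look something like the Java sketch below. It only builds the request parameters; the class name, the method names and the `maxDocsForFacets` threshold are hypothetical choices for illustration, and the query is assumed to be URL-encoded already. Only `facet`, `rows` and `facet.field` come from this thread.

```java
import java.util.Optional;

// Hypothetical client-side helper; it builds parameter strings and sends nothing.
class TwoStepFacetSketch {
    // First request: normal search with faceting switched off.
    static String firstPass(String encodedQuery) {
        return "q=" + encodedQuery + "&facet=false";
    }

    // Second request: facet counts only (rows=0), skipped when the first pass
    // matched more documents than the application is willing to facet over.
    static Optional<String> secondPass(String encodedQuery, String facetField,
                                       long numFound, long maxDocsForFacets) {
        if (numFound > maxDocsForFacets) {
            return Optional.empty();
        }
        return Optional.of("q=" + encodedQuery
                + "&facet=true&rows=0&facet.field=" + facetField);
    }
}
```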

I would think that the decision to run/not run facets would be highly
custom to your collection and not easily developed as a configurable
feature.

--Joachim

Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
In reply to this post by Michael Imbeault
I just updated the comments in solrconfig.xml:

   <!-- Cache used by SolrIndexSearcher for filters (DocSets),
         unordered sets of *all* documents that match a query.
         When a new searcher is opened, its caches may be prepopulated
         or "autowarmed" using data from caches in the old searcher.
         autowarmCount is the number of items to prepopulate.  For LRUCache,
         the autowarmed items will be the most recently accessed items.
       Parameters:
         class - the SolrCache implementation (currently only LRUCache)
         size - the maximum number of entries in the cache
         initialSize - the initial capacity (number of entries) of
           the cache.  (see java.util.HashMap)
         autowarmCount - the number of entries to prepopulate from
           an old cache.
         -->
    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>
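
Since `size` counts entries rather than kilobytes, a rough way to reason about memory is to estimate the cost of one entry. The Java sketch below assumes, purely for illustration, that every cached filter is held as a plain bitset with one bit per document in the index; real implementations may store small sets more compactly, so treat the result as an upper bound. The class and method names are hypothetical.

```java
// Back-of-the-envelope estimate only; assumes one bit per document for every cached entry.
class FilterCacheSizeSketch {
    // Bytes for one doc set held as a bitset over an index of maxDoc documents.
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    // Upper bound for a cache in which every entry is a full bitset.
    static long worstCaseBytes(long entries, long maxDoc) {
        return entries * bitSetBytes(maxDoc);
    }
}
```

Under that assumption, one entry for a 15 million document index is about 1.9 MB, so 512 entries is already close to 1 GB in the worst case.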

On 9/18/06, Michael Imbeault <[hidden email]> wrote:

> Another followup: I bumped all the caches in solrconfig.xml to
>
>       size="1600384"
>       initialSize="400096"
>       autowarmCount="400096"
>
> It seemed to fix the problem on a very small index (facets on last and
> first author fields, + 12 range date facets, sub 0.3 seconds for
> queries). I'll check on the full index tomorrow (it's indexing right
> now, 400docs/sec!). However, I still don't know what these values
> represent, or how I should estimate what to set them to. Originally I
> thought it was the size of the cache in kb, and someone on the list told
> me it was the number of items, but I don't quite get it. Better
> documentation on that would be welcome :)
>
> Also, are there any plans to add an option not to run a facet search if
> the result set is too big? To avoid 40-second queries if the docset is
> too large...

I'd like to speed up certain corner cases, but you can always set
timeouts in whatever frontend is making the request to Solr too.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Chris Hostetter-3
In reply to this post by Yonik Seeley-2

Quick Question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change did your filterCache hit rate increase? .. do you
have any evictions (you can check on the "Statistics" page)

: > Also, I was under the impression
: > that it was only searching / sorting for authors that it knows are in
: > the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization i anticipated from the beginning would
probably be useful in the situation Michael is describing ... if there is
a "long tail" of authors (and in my experience, there typically is) we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).  when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other Author, because no
other author can possibly have a higher facet constraint count than the
ones on our list (since they haven't even written that many documents)
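
A hedged Java sketch of this early-termination idea follows. It is one possible reading of the description, not Solr code: doc sets are `java.util.BitSet`, the term list is assumed to be pre-sorted by docFreq (highest first), and all class and field names are hypothetical. The sketch stops as soon as the weakest count it is keeping is at least the docFreq of the next term, because a term's facet count can never exceed its docFreq.

```java
import java.util.*;

// Sketch of the idea only; class, field and method names are hypothetical.
class TopTermsFacetSketch {
    static final class TermFreq {
        final String term;
        final int docFreq; // number of documents containing the term in the whole index
        TermFreq(String term, int docFreq) { this.term = term; this.docFreq = docFreq; }
    }

    static final class Result {
        final Map<String, Integer> counts = new HashMap<>();
        int termsTested; // intersections actually computed
    }

    // termsByDocFreqDesc must be sorted by docFreq, highest first.
    static Result facet(List<TermFreq> termsByDocFreqDesc, Map<String, BitSet> docsByTerm,
                        BitSet docsMatchingQuery, int limit) {
        Result result = new Result();
        if (limit <= 0) {
            return result;
        }
        // Min-heap on count: the head is the weakest of the current top `limit` entries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (TermFreq tf : termsByDocFreqDesc) {
            // No remaining term can displace the weakest kept entry: stop here.
            if (heap.size() == limit && heap.peek().getValue() >= tf.docFreq) {
                break;
            }
            BitSet both = (BitSet) docsByTerm.get(tf.term).clone();
            both.and(docsMatchingQuery);
            result.termsTested++;
            int count = both.cardinality();
            if (count == 0) {
                continue;
            }
            heap.add(new AbstractMap.SimpleEntry<>(tf.term, count));
            if (heap.size() > limit) {
                heap.poll(); // drop the weakest entry to keep only `limit` winners
            }
        }
        for (Map.Entry<String, Integer> e : heap) {
            result.counts.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

The sketch stops on ties as well (`>=`), which keeps the counts exact but means a tied term further down the list is not reported.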



-Hoss


Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
On 9/19/06, Chris Hostetter <[hidden email]> wrote:

>
> Quick Question: did you say you are faceting on the first name field
> separately from the last name field? ... why?
>
> You'll probably see a sharp increase in performance if you have a single
> untokenized author field containing the full name and you facet on that --
> there will be far fewer unique terms to use when computing DocSets and
> intersections.
>
> Second: you mentioned increasing the size of your filterCache
> significantly, but we don't really know how heterogeneous your index is ...
> once you made that change did your filterCache hit rate increase? .. do you
> have any evictions (you can check on the "Statistics" page)
>
> : > Also, I was under the impression
> : > that it was only searching / sorting for authors that it knows are in
> : > the result set...
> :
> : That's the problem... it's not necessarily easy to know *what* authors
> : are in the result set.  If we could quickly determine that, we could
> : just count them and not do any intersections or anything at all.
>
> another way to look at it is that by looking at all the authors, the work
> done for generating the facet counts for query A can be completely reused
> for the next query B -- presuming your filterCache is large enough to hold
> all of the author filters.
>
> : There could be optimizations when docs_matching_query.size() is small,
> : so we start somehow with terms in the documents rather than terms in
> : the index.  That requires termvectors to be stored (medium speed), or
> : requires that the field be stored and that we re-analyze it (very
> : slow).
> :
> : More optimization of special cases hasn't been done simply because no
> : one has done it yet... (as you note, faceting is a new feature).
>
> the optimization i anticipated from the beginning would
> probably be useful in the situation Michael is describing ... if there is
> a "long tail" of authors (and in my experience, there typically is)

> we
> can cache an ordered list of the top N most prolific authors, along with
> the count of how many documents they have in the index (this info is easy
> to get from TermEnum.docFreq).

Yeah, I've thought about a fieldInfoCache too.  It could also cache
the total number of terms in order to make decisions about what
faceting strategy to follow.

> when we facet on the authors, we start with
> that list and go in order, generating their facet constraint count using
> the DocSet intersection just like we currently do ... if we reach our
> facet.limit before we reach the end of the list and the lowest constraint
> count is higher than the total doc count of the last author in the list,
> then we know we don't need to bother testing any other Author, because no
> other author can possibly have a higher facet constraint count than the
> ones on our list

This works OK if the intersection counts are high (as a percentage of
the facet sets).  I'm not sure how often this will be the case though.

Another tradeoff is to allow getting inexact counts with multi-token fields by:
 - simply faceting on the most popular values
   OR
 - do some sort of statistical sampling by reading term vectors for a
fraction of the matching docs.
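
The sampling option could be sketched in Java as below. To keep the example deterministic it uses systematic sampling (every `step`-th matching document) rather than random sampling, which is a substitution made for this illustration. `termsByDoc` is a plain map standing in for stored term vectors, and the names are hypothetical.

```java
import java.util.*;

// Illustration only: `termsByDoc` stands in for stored term vectors.
class SampledFacetSketch {
    // Reads every `step`-th matching doc and scales each count by `step`,
    // so the returned counts are estimates, not exact values.
    static Map<String, Integer> estimate(int[] matchingDocs,
                                         Map<Integer, List<String>> termsByDoc, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < matchingDocs.length; i += step) {
            List<String> terms =
                    termsByDoc.getOrDefault(matchingDocs[i], Collections.emptyList());
            for (String term : terms) {
                counts.merge(term, step, Integer::sum); // one sampled doc represents `step` docs
            }
        }
        return counts;
    }
}
```

Rare values can be missed entirely when the documents that carry them are not sampled, which is the price of the inexact counts.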

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Chris Hostetter-3

: > when we facet on the authors, we start with
: > that list and go in order, generating their facet constraint count using
: > the DocSet intersection just like we currently do ... if we reach our
: > facet.limit before we reach the end of the list and the lowest constraint
: > count is higher than the total doc count of the last author in the list,
: > then we know we don't need to bother testing any other Author, because no
: > other author can possibly have a higher facet constraint count than the
: > ones on our list
:
: This works OK if the intersection counts are high (as a percentage of
: the facet sets).  I'm not sure how often this will be the case though.

well, keep in mind "N" could be very big, big enough to store the full
list of Terms sorted in docFreq order (it shouldn't take up much space
since it's just the Term and an int) ... for any query that returns a
"large" number of results, you probably won't need to reach the end of the
list before you can tell that all the remaining Terms have a lower docFreq
than the current last constraint count in your facet.limit list.  For
queries that return a "small" number of results, it wouldn't be as
useful, but that's where a switch could be flipped to start with the values
mapped to the docs (using FieldCache -- assuming single-value fields)

: Another tradeoff is to allow getting inexact counts with multi-token fields by:
:  - simply faceting on the most popular values
:    OR
:  - do some sort of statistical sampling by reading term vectors for a
: fraction of the matching docs.

i loathe inexact counts ... i think of them as "Astrology" to the Astronomy
of true Faceted Searching ... but i'm sure they would be "good enough" for
some people's use cases.



-Hoss


Re: Facet performance with heterogeneous 'facets'?

Chris Hostetter-3
In reply to this post by Yonik Seeley-2

: I just updated the comments in solrconfig.xml:

I've tweaked the SolrCaching wiki page to include some of this info as
well, feel free to add any additional info you think would be helpful to
other people (or ask any questions about it if any of it still doesn't seem
clear to you)...

        http://wiki.apache.org/solr/SolrCaching

: > now, 400docs/sec!). However, I still don't know what these values
: > represent, or how I should estimate what to set them to. Originally I
: > thought it was the size of the cache in kb, and someone on the list told
: > me it was the number of items, but I don't quite get it. Better
: > documentation on that would be welcome :)



-Hoss


Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
In reply to this post by Yonik Seeley-2
Thanks for all the great answers.

>> Quick Question: did you say you are faceting on the first name field
>> separately from the last name field? ... why?
You misunderstood. I'm doing faceting on first author, and last author
of the list. Life science papers have author lists, and the first one is
usually the guy who did most of the work, and the last one is usually
the boss of the lab. I already have untokenized author fields for that
using copyField.
>> Second: you mentioned increasing the size of your filterCache
>> significantly, but we don't really know how heterogeneous your index
>> is ...
>> once you made that change did your filterCache hit rate increase? ..
>> do you
>> have any evictions (you can check on the "Statistics" page)
It was at the default (16000) and it hit the ceiling so to speak. I did
maxSize=16000000 (for testing purposes) and now size : 17038 and 0
evictions. For a single facet field (journal name) with a limit of 5 and
12 faceted query fields (range on publication date), I now have
0.5-second searches, which is not too bad. The filterCache size is pretty
much constant no matter how many queries I do.

However, if I try to add another facet field (such as first_author),
something strange happens. 99% CPU, the filter cache is filling up
really fast, hit ratio goes to hell, no disk activity, and it can stay
that way for at least 30 minutes (didn't test longer, no point really).
It turns out that journal_name has 17038 different tokens, which is
manageable, but first_author has > 400 000. I don't think this will ever
yield good performance, so I might only do journal_name facets.

Any reason why faceting tries to preload every term in the field?

I have noticed that facets are not cached. Facets off, cached queries take
0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds.
Any plans for a facets cache? I know that faceting is still a very early
feature, but it's already awesome; my application is maybe unrealistic.

Thanks,
Michael


Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
On 9/21/06, Michael Imbeault <[hidden email]> wrote:
> It turns out that journal_name has 17038 different tokens, which is
> manageable, but first_author has > 400 000. I don't think this will ever
> yield good performance, so i might only do journal_name facets.

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
Dude, stop being so awesome (and the whole Solr team). Seriously! Every
problem / request (MoreLikeThis class, change AND/OR preference
programmatically, etc) I've submitted to this mailing list has received a
quick, more-than-I-ever-expected answer.

I'll subscribe to the dev list (been reading it off and on), but I'm
afraid I couldn't code my way out of a paper bag in Java. I'll contribute to
the Solr wiki (the SolrPHP part in particular) as soon as I can. That's
the least I can do!

Btw, any plans for a facets cache?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
On 9/21/06, Michael Imbeault <[hidden email]> wrote:
> Btw, Any plans for a facets cache?

Maybe a partial one (like caching top terms to implement some other
optimizations).  My general philosophy on caching in Solr has been to
cache things the client can't: elemental things, or *parts* of
requests to make many different requests faster (most
bang-for-the-buck).

Caching complete requests/responses is generally less useful since it
requires even more memory, has a worse hit ratio, and can be done
anyway by the client or a separate process like squid.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
In reply to this post by Yonik Seeley-2
On 9/21/06, Yonik Seeley <[hidden email]> wrote:
> Hang in there Michael, a fix is on the way for your scenario (and
> subscribe to solr-dev if you want to stay on the bleeding edge):

OK, the optimization has been checked in.  You can check out from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Michael Imbeault
I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow. 800 seconds for a query with a single facet on first_author,
15 million documents total, and the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop, not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filterCache
doing millions of lookups (and tons of evictions once it hits the maxsize).

Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>

This is the query:
q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212




Re: Facet performance with heterogeneous 'facets'?

Yonik Seeley-2
On 9/22/06, Michael Imbeault <[hidden email]> wrote:
> I upgraded to the most recent Solr build (9-22) and sadly it's still
> really slow. 800 seconds for a query with a single facet on first_author,
> 15 million documents total, and the query returns 180. Maybe I'm doing
> something wrong? Also, this is on my personal desktop, not on a server.
> Still, I'm getting 0.1-second queries without facets, so I don't think
> that's the cause. In the admin panel I can still see the filterCache
> doing millions of lookups (and tons of evictions once it hits the maxsize).

The fact that you see all the filtercache usage means that the
optimization didn't kick in for some reason.

> Here's the field i'm using in schema.xml :
> <field name ="first_author" type="string" indexed="true" stored="true"/>

That looks fine...

> This is the query :
> q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

That looks OK too.
I assume that you didn't change the fieldtype definition for "string",
and that the schema has version="1.1"?  Before 1.1, all fields were
assumed to be multiValued (there was no checking or enforcement).

-Yonik