facet optimizing


Re: facet optimizing

Chris Hostetter-3

I freely admit that I'm totally lost on most of what you're suggesting ...
it seems like you're suggesting that organizing the terms in a facet field
into a tree structure would help us know which terms to compute the
counts for first for a given query -- but it's not clear to me why that
would be the case -- the terms don't necessarily have any correlation.

but on this point...

: pruning by df is going to be one of the most important features
: here... so a variation (or optimization) would be to keep a list of
: the highest terms by df, and then build the facet tree excluding those
: top terms.  That should lower the dfs in the tree nodes and allow more
: pruning.

I'm not sure why excluding high DF terms helps your approach, but one of
the optimizations I anticipated when we first added term faceting was to
build a cache of the high DF terms and test them first -- if you only want
the 20 facet terms with the highest counts, and after computing counts for
the 100 highest DF terms you find your lowest count (#20) to be 678, and
the DF of the 100th highest DF term in your cache was 677, then you are
guaranteed you don't need to check any other terms (which by definition
have lower DFs).

...so if extracting the high DF terms helps whatever complex tree walking
you are thinking of, then checking those high DF terms first might save
you the hassle of walking the tree at all.
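
In rough Java terms, that check might look something like this (just a
sketch; the (terms, dfs) arrays and the TermCounter callback are
hypothetical stand-ins for the DF cache and the index access, not actual
Solr APIs):

    import java.util.PriorityQueue;

    class DfBoundFacets {
        interface TermCounter { int count(String term); }  // docs matching term AND query

        /** terms sorted by descending df; dfs[i] is the df of terms[i]. */
        static PriorityQueue<Integer> topCounts(String[] terms, int[] dfs,
                                                TermCounter counter, int wanted) {
            // min-heap: the smallest of the current top counts sits at the head
            PriorityQueue<Integer> best = new PriorityQueue<Integer>();
            for (int i = 0; i < terms.length; i++) {
                // a facet count can never exceed the term's df, so once the queue
                // is full and the next df can't beat the current minimum, no
                // remaining term can make the top N
                if (best.size() == wanted && dfs[i] <= best.peek()) break;
                int count = counter.count(terms[i]);
                if (best.size() < wanted) best.add(count);
                else if (count > best.peek()) { best.poll(); best.add(count); }
            }
            return best;
        }
    }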



-Hoss


Re: facet optimizing

Yonik Seeley-2
On 2/9/07, Chris Hostetter <[hidden email]> wrote:
> I freely admit that I'm totally lost on most of what you're suggesting ...
> it seems like you're suggesting that organizing the terms in a facet field
> into a tree structure would help us know which terms to compute the
> counts for first for a given query -- but it's not clear to me why that
> would be the case -- the terms don't necessarily have any correlation.

They don't necessarily need to be correlated (I think)... we propagate information up the tree
in terms of max_df,  and in terms of the union of all the children.
There may be other info we can propagate as well... but I'm not sure what yet.
We need a set-theory mathematician or something :-)

It would be best, of course, to group the lowest level nodes by the
amount they have in common (the correlation you were looking for), but
that sounds prohibitively expensive.
It would *not* be expensive, however, to group the lowest level nodes
by max_df or union_size.
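
For concreteness, a node in such a tree might carry something like this
(a hypothetical sketch; java.util.BitSet stands in for whatever DocSet
representation we'd actually use):

    import java.util.BitSet;
    import java.util.List;

    /** Facet-tree node: leaves hold a bucket of terms, and each parent
     *  propagates the max df and the union of its children's doc sets. */
    class FacetNode {
        List<FacetNode> children; // null at the leaves
        String[] terms;           // non-null only at the leaves
        int maxDf;                // highest df of any term below this node
        BitSet union;             // union of the doc sets of all terms below

        static FacetNode parentOf(List<FacetNode> kids) {
            FacetNode p = new FacetNode();
            p.children = kids;
            p.union = new BitSet();
            for (FacetNode k : kids) {
                p.maxDf = Math.max(p.maxDf, k.maxDf);
                p.union.or(k.union); // propagate the union up the tree
            }
            return p;
        }
    }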

> : pruning by df is going to be one of the most important features
> : here... so a variation (or optimization) would be to keep a list of
> : the highest terms by df, and then build the facet tree excluding those
> : top terms.  That should lower the dfs in the tree nodes and allow more
> : pruning.
>
> I'm not sure why excluding high DF terms helps your approach, but one of
> the optimizations I anticipated when we first added term faceting was to
> build a cache of the high DF terms and test them first -- if you only want
> the 20 facet terms with the highest counts, and after computing counts for
> the 100 highest DF terms you find your lowest count (#20) to be 678, and
> the DF of the 100th highest DF term in your cache was 677, then you are
> guaranteed you don't need to check any other terms (which by definition
> have lower DFs)
>
> ...so if extracting the high DF terms helps whatever complex tree walking
> you are thinking of, then checking those high DF terms first might save
> you the hassle of walking the tree at all.

Yes, that's a required part of it.  If you exclude both the high df
terms from the tree and the "bits" they contribute, then it becomes
mandatory to calculate the intersections for those high df terms.  It
also will hopefully act as a good bootstrap to raise the min_df of the
queue and allow you to prune earlier in the tree based on the node's
max_df or intersectionSize(union, baseDocSet).

The tree-building code I put in the JIRA bug already extracts the top
terms and excludes them from the tree.

As you note, some of the effectiveness will depend on the distribution
of the terms.
If everything is flat, you don't get to prune much, but hopefully at
least at the bottom of the tree we can avoid checking every bucket of
terms by checking the intersection count of the union.
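
Using the FacetNode sketch above, the per-node pruning test might look
roughly like this (hypothetical; minCount is the smallest count currently
in the top-N queue, and it rises as the queue fills):

    // base = documents matching the current query
    static void walk(FacetNode node, BitSet base, int minCount) {
        // df bound: no term below this node can have a count above maxDf
        if (node.maxDf <= minCount) return;
        // intersection bound: counts are also capped by |union ∩ base|
        BitSet overlap = (BitSet) node.union.clone();
        overlap.and(base);
        if (overlap.cardinality() <= minCount) return;
        if (node.terms != null) {
            // leaf: compute the real count for each term in this bucket
            // and offer it to the top-N queue (elided)
        } else {
            for (FacetNode child : node.children) walk(child, base, minCount);
        }
    }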

I really don't know how well it will work/scale, but I do think it
will do better than what we have now for high cardinality facets, so I
intend to find out.

-Yonik

Re: facet optimizing

Yonik Seeley-2
On 2/9/07, Yonik Seeley <[hidden email]> wrote:
> If you exclude both the high df
> terms from the tree and the "bits" they contribute, then it becomes
> mandatory to calculate the intersections for those high df terms.  It
> also will hopefully act as a good bootstrap to raise the min_df of the
> queue and allow you to prune earlier in the tree based on the node's
> max_df or intersectionSize(union, baseDocSet).

It occurs to me that iterating over all the high-df terms and getting
their intersection counts could also be made more efficient... (if you take
intersection counts with many top terms, say 1000, that will take up a
lot of time right there before you even get to the tree).

The high df terms could also be put into a facet tree.  They might
even be added as nodes to the same facet tree - it depends on the
traversal algorithm (we may want the high-df terms checked first,
since those would be more efficient to access, using a specific term
and calling numDocs(), rather than being contained in a block of
terms.)
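
That first pass might be as simple as this (a sketch; I'm assuming
SolrIndexSearcher.numDocs(Query, DocSet) as the intersection count, with
the queue bookkeeping elided):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // Seed the top-N queue with the high-df terms before touching the tree;
    // the queue's minimum then serves as the initial pruning bound.
    static void seedQueue(SolrIndexSearcher searcher, String field,
                          List<String> highDfTerms, DocSet base) throws IOException {
        for (String t : highDfTerms) {
            int count = searcher.numDocs(new TermQuery(new Term(field, t)), base);
            // offer (t, count) to the top-N queue (elided)
        }
    }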

I probably won't implement it that way first... I want to see what
improvement we can get with a basic tree implementation.

-Yonik

RE: facet optimizing

Gunther, Andrew
In reply to this post by Erik Hatcher
Back on facet optimizing again.

Can someone post their magic formula for filterCache (Erik?)  We've hit
a plateau around 1.7 million docs and my response times have suffered when
filtering.  I have adjusted filterCache up and down all day but can't seem
to get a good handle on these values.  What does size actually correlate
to (the number of name-value pairs in the hashmap?)  Can adjusting JVM
memory on startup factor in at all?

Cheers,

Andrew

-----Original Message-----
From: Erik Hatcher [mailto:[hidden email]]
Sent: Wednesday, February 07, 2007 5:08 PM
To: [hidden email]
Subject: Re: facet optimizing


On Feb 7, 2007, at 4:42 PM, Yonik Seeley wrote:
> Solr relies on the filter cache for faceting, and if it's not big
> enough you're going to get a near 0% hit rate.  Check the statistics
> page and make sure there aren't any evictions after you do a query
> with facets.  If there are, make the cache larger.

Yonik - thanks!   I was too deep into other things to worry about the  
slowness of massive multiValued facets, mainly because I was going to  
use the mess of all those nasty values we have in typical library  
data to push back and have it cleaned up.  But, I just adjusted my  
filter cache settings and my responses went from 2000+ ms to 85 ms!  
Now it takes longer to render the pie charts than it does to get the  
results back :)

        Erik


Re: facet optimizing

Yonik Seeley-2
On 3/1/07, Gunther, Andrew <[hidden email]> wrote:
> Can someone post their magic formula for filterCache (Erik?)  We've hit
> a plateau around 1.7 million docs and my response times have suffered when
> filtering.

Is this for field faceting (facet.field)?

> I have adjusted filterCache up and down all day but can't seem
> to get a good handle on these values.  What does size actually correlate
> to (the number of name-value pairs in the hashmap?)

Yes.
Right now, field faceting isn't incredibly scalable if the number of
unique values for the field is high (it iterates through all the
values).  About all you can do is increase the filterCache to be
larger than the number of unique values in your field (check the admin
statistics to see if it's large enough).
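
For example, with on the order of 100K unique values in the field, the
filterCache entry in solrconfig.xml would need a size along these lines
(the numbers here are illustrative only, not a recommendation):

    <!-- size should exceed the number of unique values across faceted fields -->
    <filterCache
      class="solr.LRUCache"
      size="150000"
      initialSize="50000"
      autowarmCount="0"/>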

There are plans to try and remedy this (the ideas for tree faceting I
talked about in this thread previously), but I haven't had a chance to
work on it recently.  I will eventually though... this is the fun
stuff to work on :-)

http://issues.apache.org/jira/browse/SOLR-153

-Yonik