Re: [Solr Wiki] Update of "SolrFacetingOverview" by JJLarrea


Erik Hatcher
JJ:  Fantastic - this is excellent info, and sharing it helps a LOT!

        Erik


On Dec 27, 2006, at 7:25 PM, Apache Wiki wrote:

> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki"  
> for change notification.
>
> The following page has been changed by JJLarrea:
> http://wiki.apache.org/solr/SolrFacetingOverview
>
> The comment on the change is:
> Added page per 12/8/06 suggestion by Yonik
>
> New page:
> = Faceting Overview =
>
> Solr provides a [http://incubator.apache.org/solr/docs/api/org/ 
> apache/solr/request/SimpleFacets.html Simple Faceting toolkit]  
> which can be reused by various Request Handlers to include "Facet  
> counts" based on some simple criteria. Both the  
> StandardRequestHandler and the DisMaxRequestHandler currently use  
> these utilities.  Detailed descriptions of the parameters used to  
> control faceting can be found (along with several examples) at  
> [SimpleFacetParameters].
>
> This page briefly provides some general background information:
>
> = Facet Indexing =
>
> Faceting is done on __indexed__ rather than __stored__ values.  
> This is because the primary use for faceting is drilldown into a  
> subset of hits resulting from a query, and so the chosen facet  
> value is used to construct a filter query which literally matches  
> that value in the index.  For the stock Solr request handlers this  
> is done by adding an `fq=<facet-field>:<quoted facet-value>`  
> parameter and resubmitting the query.
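As a rough illustration of the drill-down step described above, here is a minimal Python sketch that appends an `fq` parameter to an existing query. The field name `author_facet`, the helper `drilldown_params`, and the quote-escaping rule are assumptions made for the example; they are not taken from the wiki page or from Solr itself.

```python
from urllib.parse import urlencode


def drilldown_params(base_params, facet_field, facet_value):
    """Return a new parameter list with an fq filter for the chosen facet value.

    Hypothetical helper: the name and escaping rule are illustrative only.
    """
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = facet_value.replace("\\", "\\\\").replace('"', '\\"')
    return list(base_params) + [("fq", f'{facet_field}:"{escaped}"')]


# "author_facet" is an assumed field name, not one defined by the wiki page.
params = drilldown_params([("q", "java")], "author_facet", "Schildt, Herbert")

# URL-encode the parameters so they can be appended to the request URL.
query_string = urlencode(params)
```

The original query parameters are left untouched, so the same helper can be applied repeatedly as the user drills down on further facets.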
>
> Because faceting fields are often specified to serve two purposes,  
> human-readable text and drill-down query value, they are frequently  
> indexed differently from fields used for searching and sorting:
>   * They are not tokenized into separate words
>   * They are not mapped into lower case
>   * Human-readable punctuation is not removed (other than double-
> quotes)
>   * There is often no need to store them, since stored values would  
> look much like indexed values and the faceting mechanism is used  
> for value retrieval.
>   * Depending on how the field is defined the SimpleFacets  
> mechanism may only allow for a single value per field per document  
> (see below)
>
> As an example, if I had a field with a list of authors, such as:
>
>   Schildt, Herbert; Wolpert, Lewis; Davies, P.
>
> I might want to index the same data differently in three different  
> fields (perhaps using the Solr [:SchemaXml#Copy Fields:copyField]  
> directive):
>   * For searching: Tokenized, case-folded, punctuation-stripped:
>       schildt / herbert / wolpert / lewis / davies / p
>   * For sorting: Untokenized, case-folded, punctuation-stripped:
>       schildt herbert wolpert lewis davies p
>   * For faceting: Primary author only, using a `solr.StrField`:
>       Schildt, Herbert
>
> Then when the user drills down on the "Schildt, Herbert" string I  
> would reissue the query with an added  
> `fq=<facet-field>:"Schildt, Herbert"` parameter.
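To make the three representations concrete, the following Python sketch derives them from the raw author string. It only imitates the effect of the analysis; in Solr this would be done by field types and analyzers in the schema, and the function names here are hypothetical.

```python
import re


def search_tokens(raw):
    """Tokenized, case-folded, punctuation-stripped (for searching)."""
    return re.findall(r"[a-z0-9]+", raw.lower())


def sort_key(raw):
    """Untokenized, case-folded, punctuation-stripped (for sorting)."""
    return " ".join(search_tokens(raw))


def facet_value(raw):
    """Primary author only, kept verbatim (for faceting)."""
    authors = [a.strip() for a in raw.split(";") if a.strip()]
    return authors[0] if authors else None


raw = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
tokens = search_tokens(raw)   # six separate lower-case tokens
key = sort_key(raw)           # one lower-case string
facet = facet_value(raw)      # "Schildt, Herbert", case and comma preserved
```

Only the facet value keeps its capitalization and punctuation, which is what lets it double as a display label and as an exact-match filter value.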
>
> = Facet Operation =
>
> Currently SimpleFacets has 3 modes of operation:
>
> == FacetQueries ==
>
> Any number of [:SimpleFacetParameters#facet.query:facet.query]  
> parameters can be passed to the request handler.  Each distinct  
> facet.query will first be executed against the entire index, with  
> the results cached as a hashed set (if the set is smaller than the  
> hashDocSet max size) or a bit set (if larger) of document IDs (see  
> [:SolrCaching#The hashDocSet Max Size:hashDocSet]).  Then every  
> time that facet.query  
> is used for faceting a query, the cached set will be intersected  
> against the set of document IDs returned by the query to count the  
> number of documents for which the facet.query condition is true.
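The cache-then-intersect behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `FacetQueryCounter`, the in-memory `index`, and the use of a Python predicate in place of a parsed facet.query are all hypothetical, and the hash-set versus bit-set distinction is not modelled.

```python
class FacetQueryCounter:
    """Toy model: cache one document-ID set per facet query, then intersect."""

    def __init__(self, index):
        self.index = index   # doc_id -> dict of field values
        self.cache = {}      # facet query label -> frozenset of doc IDs

    def doc_set(self, label, predicate):
        # The facet query runs against the entire index only on first use.
        if label not in self.cache:
            self.cache[label] = frozenset(
                doc_id for doc_id, doc in self.index.items() if predicate(doc)
            )
        return self.cache[label]

    def count(self, label, predicate, result_ids):
        # Count = size of (cached facet set) intersected with (query results).
        return len(self.doc_set(label, predicate) & set(result_ids))


index = {1: {"price": 5}, 2: {"price": 15}, 3: {"price": 8}, 4: {"price": 30}}
counter = FacetQueryCounter(index)


def is_cheap(doc):
    # Stand-in for a facet.query such as a price range.
    return doc["price"] <= 10


cheap_in_results = counter.count("cheap", is_cheap, {2, 3, 4})
cheap_in_all = counter.count("cheap", is_cheap, index.keys())
```

The second call reuses the cached set, so only the intersection is recomputed for each new query.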
>
> == FacetFields ==
>
> Any number of [:SimpleFacetParameters#facet.field:facet.field]  
> parameters can be passed to the request handler.  For each  
> facet.field, one of two approaches will be used:
>
>     * Field Queries:  If the facet field is defined in the schema  
> as multi-valued, boolean, or tokenized, then every indexed value  
> for the field will be iterated and a facet query will be executed  
> and cached (as described above).  This is excellent for fields  
> where there is a small set of distinct values.  For example,  
> faceting on a field of U.S. states, e.g. `Alabama, Alaska, ...  
> Wyoming` would lead to fifty cached queries which would be used  
> over and over again.  It also works in the case when the facet  
> field can have multiple values for each document.  However, it  
> requires excessive amounts of memory and time when the number of  
> field values is large and especially when it exceeds the filter  
> cache size defined in [:SolrCaching#filterCache:filterCache].
>
>     * Field Cache: If the facet field is not tokenized, not multi-
> valued, and not boolean, then a field-cache approach will be used.  
> This is currently implemented with the Lucene [http://
> lucene.apache.org/java/docs/api/org/apache/lucene/search/
> FieldCache.html FieldCache] mechanism used for results sorting.  An  
> array of integers (one for every document in the index) is  
> allocated, pre-filled with the first indexed value for that field  
> in each document (offset into a table of strings for fields indexed  
> as strings), and cached.  Every time that facet.field is used for  
> faceting a query, all the document IDs resulting from the query are  
> looked up in the field cache and any value found has its tally  
> incremented.  This is excellent for situations where the number of  
> indexed values for the field is too large to be practical using the  
> field queries mechanism, such as faceting against authors or  
> titles.  However, it is currently much slower and more  
> memory-intensive than the field queries mechanism for fields with a  
> small number of values.
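The two approaches above can be compared with a small Python sketch. It is a simplified model, not the SimpleFacets implementation: the function names and the list-based `docs` index are assumptions, and caching of the per-value document sets is left out.

```python
from collections import Counter

# Toy index: position in the list is the document ID, each entry holds
# the indexed values of the facet field (single-valued here).
docs = [["Alabama"], ["Alaska"], ["Alabama"], ["Wyoming"]]


def facet_by_field_queries(docs, result_ids):
    """One document set per distinct value, each intersected with the results."""
    term_sets = {}
    for doc_id, values in enumerate(docs):
        for value in values:
            term_sets.setdefault(value, set()).add(doc_id)
    counts = {}
    for term, ids in term_sets.items():
        hits = len(ids & result_ids)
        if hits:
            counts[term] = hits
    return counts


def facet_by_field_cache(docs, result_ids):
    """One integer per document (offset into a table of strings), then tally."""
    terms = sorted({values[0] for values in docs if values})
    ordinal = {term: i for i, term in enumerate(terms)}
    # -1 marks documents with no value for the field.
    doc_to_ord = [ordinal[values[0]] if values else -1 for values in docs]
    tallies = Counter()
    for doc_id in result_ids:
        ord_ = doc_to_ord[doc_id]
        if ord_ >= 0:
            tallies[terms[ord_]] += 1
    return dict(tallies)


results = {0, 2, 3}
by_queries = facet_by_field_queries(docs, results)
by_cache = facet_by_field_cache(docs, results)
```

On single-valued data both functions return the same counts. The field-queries version does work proportional to the number of distinct values, while the field-cache version does work proportional to the number of matching documents, which mirrors the trade-off described above.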
>
> Note that at this time there is no way to manually control whether  
> facet.field is handled via field queries or field cache other than  
> defining in the schema whether the field is single- or multi-valued  
> and the analyzer used: `solr.TextField` is always tokenized while  
> `solr.StrField` is never.  Control may be improved in the future,  
> along with a means to handle multi-valued fields with a variant of  
> the Field Cache mechanism.
> ----
> CategoryCategory