Multi-Valued Faceting

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Multi-Valued Faceting

jjlarrea
Many thanks to Hoss, Yonik, et al. for their excellent efforts to
bring faceted browsing to the masses!  In most cases it works great!

But for some fields I have a need which is unfortunately not filled
by the current faceting code.  These fields (Author Name, for
example) have too many discrete Term values to be handled by the
cached-filter mechanism for facet counting, but require counting
multiple Terms per document and so are not handled by the alternate
facet mechanism based on FieldCache.  So I think I need to dive into
the SOLR code, but I first wanted to check to make sure nobody else
is working on something like this, and secondly to get feedback on
the best implementation approach.

(As background to those not familiar with the internals, for a
single-valued non-tokenized non-Boolean field the current
SimpleFacets implementation uses a FieldCache borrowed from the
Lucene sorting mechanism to quickly retrieve the indexed Term value
for each Document; otherwise a series of queries cached as bitmaps
are done for each possible Term value, ANDed with a hitlist bitmap,
and the resulting bits counted)

My thought was that the simplest approach would be to subclass
FieldCacheImpl to introduce a getMultiStringIndex method derived from
getStringIndex, defining  and returning a MultiStringIndex class
which stores order as int[][] rather than int[]; a variant of
SimpleFacets.getFieldCacheCounts would simply need an inner loop to
tally each of the Document's Term indexes for that field.

With multi-valuedness no longer being a useful criterion for
automatically choosing between the filter-based and modified
FieldCache-based mechanisms, there then would need to be an alternate
criterion, either implicit or explicit. Does anyone have any ideas
how best to do that?  For example, is there a way to quickly
determine the number of distinct Term values for a field without
enumerating to the end, so the ratio of Terms to Documents can be
used?

An entirely alternate approach (briefly suggested in a comment in
SimpleFacets) for fields indexed with term vectors would be to simply
call getTermFreqVector, for each hit and store (term text, tally) in
a HashTable, or (term text, index) in a HT which could be cached with
tallies generated per-query in an array as they are now, in the
latter case building a field-cache dynamically based on actual query
results.  Does anyone have any insight on how efficient that may or
may not be?

And if I have gotten something dreadfully wrong in my understanding
of current implementation or proposed enhancement, I would appreciate
getting straightened out.

Thanks,
J.J. Larrea
Reply | Threaded
Open this post in threaded view
|

Re: Multi-Valued Faceting

Yonik Seeley-2
On 12/6/06, J.J. Larrea <[hidden email]> wrote:
> My thought was that the simplest approach would be to subclass
> FieldCacheImpl to introduce a getMultiStringIndex method derived from
> getStringIndex, defining  and returning a MultiStringIndex class
> which stores order as int[][] rather than int[]; a variant of
> SimpleFacets.getFieldCacheCounts would simply need an inner loop to
> tally each of the Document's Term indexes for that field.

I think something like that is the right approach, the only problem
being the size in memory this would take up.  It may need some clever
encoding to keep it reasonable.

> With multi-valuedness no longer being a useful criterion for
> automatically choosing between the filter-based and modified
> FieldCache-based mechanisms, there then would need to be an alternate
> criterion, either implicit or explicit. Does anyone have any ideas
> how best to do that?  For example, is there a way to quickly
> determine the number of distinct Term values for a field without
> enumerating to the end, so the ratio of Terms to Documents can be
> used?

I'd suggest a Solr fieldInfo cache that stored info about a field:
a) the number of unique terms in the field
b) (optionally) a sorted list by docfreq of the top terms in the field

> An entirely alternate approach (briefly suggested in a comment in
> SimpleFacets) for fields indexed with term vectors would be to simply
> call getTermFreqVector, for each hit and store (term text, tally) in
> a HashTable, or (term text, index) in a HT which could be cached with
> tallies generated per-query in an array as they are now, in the
> latter case building a field-cache dynamically based on actual query
> results.  Does anyone have any insight on how efficient that may or
> may not be?

For queries that don't have many hits, termvectors would be fine.
I don't think they would perform well with a lot of hits though.
There could be a different type of faceting that just uses the top "n"
results though.

> And if I have gotten something dreadfully wrong in my understanding
> of current implementation or proposed enhancement, I would appreciate
> getting straightened out.

Sounds like you have a pretty good handle on it!

-Yonik
Reply | Threaded
Open this post in threaded view
|

Re: Multi-Valued Faceting

Chris Hostetter-3

: > which stores order as int[][] rather than int[]; a variant of
: > SimpleFacets.getFieldCacheCounts would simply need an inner loop to
: > tally each of the Document's Term indexes for that field.
:
: I think something like that is the right approach, the only problem
: being the size in memory this would take up.  It may need some clever
: encoding to keep it reasonable.

yeah ... at some point you may wnat to consider the possibility
of "sampling" the values for the first N documents in your DocList and
then getting the counts for those values ... it's not a perfect solution,
but it should work efficiently no matter how many docs you have, or how
many unique field values there are in your index -- i don't know of any
other approach that can function as well for any data set.

: > how best to do that?  For example, is there a way to quickly
: > determine the number of distinct Term values for a field without
: > enumerating to the end, so the ratio of Terms to Documents can be
: > used?
:
: I'd suggest a Solr fieldInfo cache that stored info about a field:
: a) the number of unique terms in the field
: b) (optionally) a sorted list by docfreq of the top terms in the field

yeah ... this is what i'd orriginally invinisioned when i first wrote the
TermEnum based code in SimpleFacets before Yonik pointed out how usefull
the FieldCache could be ... keeping a list like yonik described in (b)
where the size of the list is sufficiently bigger them the typical "limit"
you put on your facet fields should provide a lot of wins -- if the way i
picture it in my head works out in reality, sizing that list with your
<HashDocSet maxSize="X"/> in mind might help you ensure that even if you
do wind up iterating over the full TermEnum, those terms all result in
HashDocSets which will be relatively small, so your filterCache can be
big.

computing the (a) value yonik mentioned is trivial to do while building up
(b) and (a) can be used to determine wether or not you really want to try
and build the MultiFieldCache or just walk the TermEnums.



-Hoss