[jira] Created: (SOLR-44) Basic Facet Count support

[jira] Created: (SOLR-44) Basic Facet Count support

Nick Burch (Jira)
Basic Facet Count support
-------------------------

                 Key: SOLR-44
                 URL: http://issues.apache.org/jira/browse/SOLR-44
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Hoss Man
         Assigned To: Hoss Man
         Attachments: simple-facets.patch

First pass at basic facet support.  initial patch includes utilities for use in RequestHandlers, and usage in StandardRequestHandler (DisMax should use SolrParams before attempting to add this)

Basic idea is that:
  * facet=true indicates facet counts are desired.
  * facetField=inStock indicates we want a count of the matching docs for each value in the field inStock
  * facetQuery=title:ipod indicates we want the count of matching docs also in the set of docs matching query title:ipod
  * if user wants to apply a facet constraint on subsequent queries, they can add an "fq" (filter query) param (support for this was added to StandardRequestHandler as well)
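
To make the facetQuery semantics concrete: the count reported for a facet query is the size of the intersection between the main result set and the set of docs matching that query. The following is only a minimal sketch, assuming java.util.BitSet as a stand-in for Solr's DocSet; the class and method names are hypothetical and are not taken from the patch.

```java
import java.util.BitSet;

// Hypothetical illustration only: BitSet stands in for Solr's DocSet.
final class FacetQueryCountSketch {

    // Number of docs that match both the main query and the facet query.
    static int facetQueryCount(BitSet mainResults, BitSet facetQueryDocs) {
        BitSet both = (BitSet) mainResults.clone(); // leave the inputs untouched
        both.and(facetQueryDocs);                   // set intersection
        return both.cardinality();
    }

    // Small helper to build a doc set from a list of doc ids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }
}
```

For example, a main result set of docs {1, 2, 3} and a facet query matching docs {2, 3, 9} would give a facet count of 2.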

Things marked TODO...
  * add support for per field facetLimit indicating that only the top N items in each facetField should be returned
  * add support for a per field facetZero boolean indicating that there is no reason to bother returning counts of 0 for facetFields (some clients may want to know the list, others don't care)
  * potential optimization when using facetLimit to cache the terms with the highest docFreq and see if they provide all the info we need without doing a full TermEnum

I'd like to get some feedback on the overall approach and params before i proceed too much farther.




--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Commented: (SOLR-44) Basic Facet Count support

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-44?page=comments#action_12431625 ]
           
Mike Klaas commented on SOLR-44:
--------------------------------

I haven't looked at the patch yet but in terms of the parameters, might it make sense to use a group name similar to the highlighter params?  e.g., facet, facet.fl, facet.query, facet.limit, etc.

Also, now that we have per-field override capability for params, we should document which params can be thus overridden (facet.zero, facet.limit?)


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3

: I haven't looked at the patch yet but in terms of the parameters, might
: it make sense to use a group name similar to the highlighter params?
: e.g., facet, facet.fl, facet.query, facet.limit, etc.
:
: Also, now that we have per-field override capability for params, we
: should document which params can be thus overridden (facet.zero,
: facet.limit?)

yeah, limit and zero are the two properties i planned on being per
field-able.

I started with facet, facetQuery and facetField based on the existing
highlight, highlightFields and highlightFormatterClass ... I suppose it
could be switched to "." notation, but it seems like it would be better to
only have one "." in a param name, with the string to the left being the
param and the string to the right being the field name.  People looking at
examples might get confused by "facet.limit=10" thinking "limit" is
a field name.



-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Mike Klaas
On 8/30/06, Chris Hostetter <[hidden email]> wrote:

> yeah, limit and zero are the two properties i planned on being per
> field-able.
>
> I started with facet, facetQuery and facetField based on the existing
> highlight, highlightFields and highlightFormatterClass ... I suppose it
> could be switched to "." notation, but it seems like it would be better to
> only have one "." in a param name, with the string to the left being the
> param and the string to the right being the field name.  People looking at
> examples might get confused by "facet.limit=10" thinking "limit" is
> a field name.

Well, the highlighter param names are changing to hl.<whatever>, and
the convention for field overrides being in general
f.<fieldname>.<parametername>=<value>, so there would already be two
periods in the name even if the original param was a single word.  For
highlighting, the field param overrides will have three:

f.title.hl.fragsize = 0
f.contents.hl.formatter = simple

If this is confusing we should fix this for both hl and facet params.

ciao,
-Mike

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3

: Well, the highlighter param names are changing to hl.<whatever>, and
: the convention for field overrides being in general
: f.<fieldname>.<parametername>=<value>, so there would already be two

oh crap ... yeah i totally misread what SolrParams.getFieldParam did

: periods in the name even if the original param was a single word.  For
: highlighting, the field param overrides will have three:
:
: f.title.hl.fragsize = 0
: f.contents.hl.formatter = simple

yeah ... this is all starting to look familiar now.  i can totally get on
board with that, so i'll make...

facet=true ... turn all facet counts on/off
facet.query=bar ... give a facet count for the constraint query "bar"
facet.field=foo ... give TermEnum based facet counts for field "foo"
facet.zero=true ... display zero counts for facet.field terms
facet.limit=30 ... display the top 30 terms for facet.field terms
f.foo.facet.zero=false ... override facet.zero for field foo
f.foo.facet.limit=20 ... override facet.limit for field foo
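
The lookup order implied by the list above (per-field override first, global value second) can be sketched roughly as follows. This is an illustration only: it uses a plain Map instead of SolrParams, and the class name and example values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a plain Map stands in for SolrParams.
final class FieldParamSketch {

    // Prefer "f.<field>.<param>", fall back to the global "<param>".
    static String getFieldParam(Map<String, String> params, String field, String param) {
        String override = params.get("f." + field + "." + param);
        return override != null ? override : params.get(param);
    }

    // The parameters from the list above, as a request might carry them.
    static Map<String, String> exampleParams() {
        Map<String, String> params = new HashMap<>();
        params.put("facet", "true");
        params.put("facet.field", "foo");
        params.put("facet.zero", "true");
        params.put("facet.limit", "30");
        params.put("f.foo.facet.zero", "false");
        params.put("f.foo.facet.limit", "20");
        return params;
    }
}
```

With these example params, the limit for field "foo" resolves to 20, while any other field (say "cat") falls back to the global 30.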


-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Yonik Seeley-2
On 8/30/06, Chris Hostetter <[hidden email]> wrote:
> facet=true ... turn all facet counts on/off
> facet.query=bar ... give a facet count for the constraint query "bar"
> facet.field=foo ... give TermEnum based facet counts for field "foo"
> facet.zero=true ... display zero counts for facet.field terms
> facet.limit=30 ... display the top 30 terms for facet.field terms

What about ties?  Important or not?

> f.foo.facet.zero=false ... override facet.zero for field foo
> f.foo.facet.limit=20 ... override facet.limit for field foo

-Yonik

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3

: > facet.limit=30 ... display the top 30 terms for facet.field terms
:
: What about ties?  Important or not?

good question ... i think we should go with the simplest possible behavior
and just pick one in a deterministic manner, most likely it will just be
whatever mechanism PriorityQueue uses to determine the order of equal items
(last in vs first in) combined with the lexicographical order of the Term
... but that's something important to keep in mind if i do the
cache of high docFreq Terms
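
One way to get deterministic ties is to put the tie-break into the comparator rather than rely on queue internals. The following is a minimal sketch under stated assumptions: it uses java.util.PriorityQueue (which makes no ordering promise for equal elements) rather than whichever PriorityQueue the patch uses, it breaks ties by the lexicographical order of the term, and all names and counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only: top-N terms by count with explicit tie-breaking.
final class TopTermsSketch {

    // Higher count sorts first; equal counts are ordered by term text.
    static int compareBestFirst(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        int byCount = Integer.compare(b.getValue(), a.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    }

    // Keep at most `limit` entries in a bounded heap whose head is the worst kept entry.
    static List<String> topTerms(Map<String, Integer> counts, int limit) {
        Comparator<Map.Entry<String, Integer>> worstFirst = (a, b) -> compareBestFirst(b, a);
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(worstFirst);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > limit) {
                heap.poll(); // evict the current worst entry
            }
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<>(heap);
        kept.sort(TopTermsSketch::compareBestFirst);
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : kept) {
            terms.add(entry.getKey());
        }
        return terms;
    }

    // Made-up counts with a tie between "graphics" and "card".
    static Map<String, Integer> exampleCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("graphics", 2);
        counts.put("card", 2);
        counts.put("electronics", 3);
        counts.put("music", 1);
        return counts;
    }
}
```

With the example counts and a limit of 2, "card" wins the tie against "graphics" regardless of insertion order, because the comparator decides it.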


-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Yonik Seeley-2
In reply to this post by Yonik Seeley-2
Hoss, do you have any example output?
I think the inputs & outputs are 90% of what people should be
reviewing here... the implementation can be more easily
changed/optimized in the future.

I'm pretty excited about this stuff... I think it will really help
build Solr's user base!

-Yonik

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Greg Ludington
In reply to this post by Mike Klaas
> I'd like to get some feedback on the overall approach and params before i proceed too much farther.

These comments are probably just confusion since the approach differs
from my home-grown faceting prototype, and my dev box is on a moving
truck right now, so I cannot try the patch, so please bear with me:

1) Should grouping of facets also be parameter-based?   Say, for
instance, I want to have multiple different ways to look into my
result set:

By Price (<$100, $100-$200, $200+)
By Manufacturer (Apple, Dell, HP, SONY)
By Status: (In Stock, Out of Stock)

I assume the first two would be 4 facetQuery params each, and the
third would be a single facetField, can the output format represent
these sorts of logical groupings, or should it be solely the client's
responsibility to parse and split?

2) If facets can be returned in such a logical grouping, it might also
be worthwhile to allow an optional sort order for the facets (e.g.
alphabetically, by count, etc).  While the client can certainly sort,
if there is a facet limit the client will not be able to sort on the
full set.

3)  We have one running application (not yet on Solr;  that is my
prototype :) ) where the boundaries on range-based facets are
calculated to achieve equal distribution among a known number of
facets.  (This is like facetField, except for ranges, not terms.)

4) Would it be possible to extend the types of facets?  Because,
admittedly, #3 is an application-specific case, I would not expect
some general-purpose solution to it.  However, when confronted with
such a need, it would be nice to be able to plug in a new facet
implementation type for a given facet without having to change Solr
internals and/or create and maintain nearly exact duplicates of
existing request handlers.  (In my prototype, I had the concept of a
Facet interface, which allows this, but in other respects is far less
flexible than what you have outlined.)

Thanks,
Greg

[jira] Commented: (SOLR-44) Basic Facet Count support

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/SOLR-44?page=comments#action_12431745 ]
           
Hoss Man commented on SOLR-44:
------------------------------

Per mailing list discussion...
  1) Mike's points about parameter names are dead on, and i'll be making changes.
  2) Yonik pointed out I wasn't very forthcoming with examples, my bad.

With the patch as it stands right now, a query like this (against the example schema/docs) ...

http://localhost:8983/solr/select/?q=video&facetQuery=inStock:true&facetQuery=price:[*+TO+500]&facet=true

...would match on 3 docs, and would contain the following additional data...

   <lst name="facet_counts">
     <int name="inStock:true">1</int>
     <int name="price:[* TO 500]">2</int>
   </lst>

The real powerful stuff comes into play when using facetField ...

http://localhost:8983/solr/select/?indent=1&q=video&facetField=inStock&facetField=cat&facetQuery=price:[*+TO+500]&facet=true

...to get...


<lst name="facet_counts">
 <int name="price:[* TO 500]">2</int>
 <lst name="inStock">
  <int name="true">1</int>
  <int name="false">2</int>
 </lst>
 <lst name="cat">
  <int name="search">0</int>
  <int name="memory">0</int>
  <int name="graphics">2</int>
  <int name="card">2</int>
  <int name="connector">0</int>
  <int name="software">0</int>
  <int name="electronics">3</int>
  <int name="copier">0</int>
  <int name="multifunction">0</int>
  <int name="camera">0</int>
  <int name="music">1</int>
  <int name="hard">0</int>
  <int name="scanner">0</int>
  <int name="monitor">0</int>
  <int name="drive">0</int>
  <int name="printer">0</int>
 </lst>
</lst>




Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3
In reply to this post by Greg Ludington

: These comments are probably just confusion since the approach differs

not at all confusing, you seem to have grasped everything just fine.

The one big thing to keep in mind is that this is an attempt at very
"simple" faceted searching support, my goals for these changes were:
  1) provide TermEnum based facet field support
  2) provide basic support for query based constraints
  3) serve as an example for people who want to build more customized
     request handlers to deal with any cases too complex for this to
     handle

(specifically, i was tackling the 'simple faceting' item from the TaskList
and not any of the more complex ideas from ComplexFacetingBrainstorming)


: 1) Should grouping of facets also be parameter-based?   Say, for
: instance, I want to have multiple different ways to look into my
: result set:
:
: By Price (<$100, $100-$200, $200+)
: By Manufacturer (Apple, Dell, HP, SONY)
: By Status: (In Stock, Out of Stock)
:
: I assume the first two would be 4 facetQuery params each, and the
: third would be a single facetField, can the output format represent
: these sorts of logical groupings, or should it be solely the client's
: responsibility to parse and split?

First off: depending on the way your schema is setup, even the
manufacturer facet could be field based if you have a non-tokenized
version of the field (which you'd need anyway if you wanted to sort by
manufacturer)

Your point about query based counts is definitely valid though, it's
something i was rolling over in my head a lot before i started working on
this ... it would be nice if there was an easy way to group these, but i
couldn't really think of a clean way to deal with it using simple
init/query params -- but as i type this, it occurs to me that one approach
would be to allow for "per field" usages of the "facet.query" param
to specify queries that would use a SolrQueryParser with the default
field set to the specified field, so that you could do things like...

        facet.query=foo:bar
        f.price.facet.query=[*+TO+100]
        f.price.facet.query=[101+TO+*]
        facet.field=cat

...and all of the "f.price.facet.query" counts would be grouped together
separate from the count for "facet.query=foo:bar"

...the hitch here is this isn't really a "per field override" of the
facet.query param ... so the API might confuse some people.  We would also
either need to change SolrParams so that it's possible to get a list of
all set param names matching a pattern, or we'd need another param name
listing which fields we should expect to find f.*.facet.query params for.

Did you have any thoughts on what a grouping API could look like for the
query based facets ?

: 2) If facets can be returned in such a logical grouping, it might also
: be worthwhile to allow an optional sort order for the facets (e.g.
: alphabetically, by count, etc).  While the client can certainly sort,
: if there is a facet limit the client will not be able to sort on the
: full set.

Right ... for the query based facets, i'm assuming they should all always
be returned (if it's too much data for the client to deal with, they
wouldn't have asked for it).

For the Field/TermEnum based facets, i'm assuming that any set small
enough that you don't care about the limit doesn't need to be sorted (the
client can deal with it) but if a limit is specified the constraints
should be collected in a bounded PriorityQueue with the higher counts
"first" ... i'm not really sure when sorting alphabetically would be
useful in these situations ... the client might want to sort
alphabetically after the list has been "limited" so that the display looks
nicer, but what use cases can you think of where it would make sense to
return the (alphabetically) first N Terms in a field with their counts?

: 3)  We have one running application (not yet on Solr;  that is my
: prototype :) ) where the boundaries on range-based facets are
: calculated to achieve equal distribution among a known number of
: facets.  (This is like facetField, except for ranges, not terms.)
:
: 4) Would it be possible to extend the types of facets?  Because,
: admittedly, #3 is an application-specific case, I would not expect
: some general-purpose solution to it.  However, when confronted with
: such a need, it would be nice to be able to plug in a new facet
: implementation type for a given facet without having to change Solr
: internals and/or create and maintain nearly exact duplicates of
: existing request handlers.  (In my prototype, I had the concept of a

This gets back to my original goal of targeting the simple stuff
first .. i think we can hit a nice 80/20 sweet spot of meeting a lot of
people's needs with a very small amount of code.

Beyond that, the tools for building really customizable faceting code are
really already there -- the DocSet class, TermEnum, ... these things are
the real work horses of the system.

One thing you made me think of that we can definitely do to improve the
reusability of the code is to add a...

  protected NamedList getFacetInfo(SolrQueryRequest req, SolrQueryResponse rsp, DocSet mainSet)

...method to StandardRequestHandler, and move the call to
SolrPluginUtils.doSimpleFacetCounts inside of it -- that way subclasses
could replace just the faceting aspects of the handler without needing to
cut and paste.  I'll also extract some of the code in doSimpleFacetCounts
into smaller granularity methods so that they can be reused more easily.
-- that should go a long way towards reusability.


How does that sound to everybody?




-Hoss


Re: [jira] Created: (SOLR-44) Basic Facet Count support

Erik Hatcher
In reply to this post by Nick Burch (Jira)

On Aug 29, 2006, at 9:57 PM, Hoss Man (JIRA) wrote:

> First pass at basic facet support.  initial patch includes  
> utilities for use in RequestHandlers, and usage in  
> StandardRequestHandler (DisMax should use SolrParams before  
> attempting to add this)
>
> Basic idea is that:
>   * facet=true indicates facet counts are desired.
>   * facetField=inStock indicates we want a count of the matching  
> docs for each value in the field inStock
>   * facetQuery=title:ipod indicates we want the count of matching  
> docs also in the set of docs matching query title:ipod
>   * if user wants to apply a facet constraint on subsequent  
> queries, they can add an "fq" (filter query) param (support for  
> this was added to StandardRequestHandler as well)
>
> Things marked TODO...
>   * add support for per field facetLimit indicating that only the  
> top N items in each facetField should be returned
>   * add support for a per field facetZero boolean indicating that  
> there is no reason to bother returning counts of 0 for facetFields  
> (some clients may want to know the list, others don't care)
>   * potential optimization when using facetLimit to cache the terms
> with the highest docFreq and see if they provide all the info we  
> need without doing a full TermEnum
>
> I'd like to get some feedback on the overall approach and params
> before i proceed too much farther.

Wow, Hoss.  Very cool.  I might be able to just rip out all the  
custom work I've done and go with a pure Solr build one of these days :)

One thing that my facet code does is compute the count for all items  
that have _no_ terms in a particular field, and makes an  
<unspecified> count as well.  It does this by putting all documents  
found into a DocSet as it iterates through all terms for a field, and  
then .andNot'ing it away from an all docs query.  Not pretty, but  
does work and works quite fast.

Do you think a catch all facet count could be added into your  
implementation somehow?

        Erik



Re: [jira] Created: (SOLR-44) Basic Facet Count support

Chris Hostetter-3

: One thing that my facet code does is compute the count for all items
: that have _no_ terms in a particular field, and makes an
: <unspecified> count as well.  It does this by putting all documents
: found into a DocSet as it iterates through all terms for a field, and
: then .andNot'ing it away from an all docs query.  Not pretty, but
: does work and works quite fast.

great idea ... as you describe it should be easy to add ... the nature of
a NamedList will make it easy to return (include an <int> with no name),
and an API to request that functionality can be something like
facet.missing=true with f.*.facet.missing allowing field overrides
(probably don't want it to always be on since not everyone will
need it and the andNot's could get expensive in a field with lots of
terms)
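
The "missing" count Erik describes can be sketched as: union the doc sets of every term in the field, then andNot that union away from a base set. This is an illustration only; java.util.BitSet stands in for Solr's DocSet, the base set is assumed to be the docs matching the main query, and the class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only: BitSet stands in for Solr's DocSet.
final class MissingCountSketch {

    // Count docs in baseDocs that have no term at all in the faceted field.
    static int countMissing(BitSet baseDocs, List<BitSet> docsPerTerm) {
        BitSet withAnyTerm = new BitSet();
        for (BitSet termDocs : docsPerTerm) {
            withAnyTerm.or(termDocs);    // accumulate docs seen for any term
        }
        BitSet missing = (BitSet) baseDocs.clone();
        missing.andNot(withAnyTerm);     // remove every doc that has a term
        return missing.cardinality();
    }

    // Small helper to build a doc set from a list of doc ids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }
}
```

The cost concern mentioned above shows up in the loop: one union per term, so a field with many terms means many set operations.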




-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Greg Ludington
In reply to this post by Chris Hostetter-3
> init/query params -- but as i type this, it occurs to me that one approach
> would be to allow for "per field" usages of the "facet.query" param
> to specify queries that would use a SolrQueryParser with the default
> field set to the specified field, so that you could do things like...
>
>         facet.query=foo:bar
>         f.price.facet.query=[*+TO+100]
>         f.price.facet.query=[101+TO+*]
>         facet.field=cat
>
> ...and all of the "f.price.facet.query" counts would be grouped together
> separate from the count for "facet.query=foo:bar"
>
> ...the hitch here is this isn't really a "per field override" of the
> facet.query param ... so the API might confuse some people.  We would also
> either need to change SolrParams so that it's possible to get a list of
> all set param names matching a pattern, or we'd need another param name
> listing which fields we should expect to find f.*.facet.query params for.
>
> Did you have any thoughts on what a grouping API could look like for the
> query based facets ?

I was hoping you would not call me out on this :) -- I think we have
been thinking along similar lines, but just about every alternative I
have tried leaves something to be desired, in that you either end up
with a lot of extra String parsing or a very awkward/brittle url
format.  One possibility might be to add a sort of namespace to the
params themselves:

facet.query.byprice=price:[*+TO+100]
facet.query.byprice=price:[101+TO+*]

This is similar to the f.<fieldname> approach, though the format would
be different enough to avoid confusion.  The downside, as you have
suggested, is that you have to add some manner of
getParamStartsWith(...) method, and that might not be too efficient.
The only way I could see to avoid that getParamStartsWith(..) method
would be to pass in separately a list of the groupings you want:

facet.query.groups=byprice
facet.query.byprice=price:[*+TO+100]
facet.query.byprice=price:[101+TO+*]

and then you look for those grouping fields with a getFieldParams(..)
method that could use "facet.query." as its prefix instead of "f." --
but at that point the request URL is getting very complicated.
Alternatively, you could have a simpler URL by putting it all on the
value side, as in:

facet.query=byprice|price:[*+TO+100]
facet.query=byprice|price:[101+TO+*]

and use splitList (like the highlighter) or some similar mechanism to
separate the group and query portions.  The obvious downside here is
making sure not to split incorrectly, and it limits adding additional
attributes later.
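
A rough sketch of the value-side split, to show where the fragility comes from. It is an assumption-laden illustration: it does not use the highlighter's splitList, it simply cuts at the first '|', and the class and field names are hypothetical.

```java
// Hypothetical illustration only: parses "group|query" values by the first '|'.
final class GroupedQuerySketch {
    final String group; // null when the value carries no group prefix
    final String query;

    private GroupedQuerySketch(String group, String query) {
        this.group = group;
        this.query = query;
    }

    // Everything before the first '|' is the group; the rest is the query.
    static GroupedQuerySketch parse(String value) {
        int bar = value.indexOf('|');
        if (bar < 0) {
            return new GroupedQuerySketch(null, value);
        }
        return new GroupedQuerySketch(value.substring(0, bar), value.substring(bar + 1));
    }
}
```

Parsing "byprice|price:[* TO 100]" yields group "byprice" and query "price:[* TO 100]". An ungrouped query that happens to contain a '|' would be split in the wrong place, which is the incorrect-split risk noted above.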

Getting back to what you said about the 80/20 rule, you certainly have
hit that sweet spot.  It may be that in just about every use case (or
at least 80% of them :) ) the client can, at worst, extract the field
name from the <lst> name attribute, and use that for grouping.  While
explicit grouping control would be nice, it may be overcomplicating
things, unless somebody has another approach, or ideas that overcome or
minimize the drawbacks above.

-Greg

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3

: Getting back to what you said about the 80/20 rule, you certainly have
: hit that sweet spot.  It may be that in just about every use case (or
: at least 80% of them :) ) the client can, at worst, extract the field
: name from the <lst> name attribute, and use that for grouping.  While
: explicit grouping control would be nice, it may be overcomplicating
: things, unless somebody has another approach, or ideas that overcome or
: minimize the drawbacks above.

yeah, the problem I keep coming back to is that even if we come up with a
simple way to identify which queries should be grouped together and
treated as one facet, the client still has to look at the original query
string in order to make sense of the data -- or to put a pretty label on
it for the end user, so why complicate the API with the grouping at all?

Consider an approach like the one you describe, let's assume for a minute
that we don't care about letting the user do really simple things like
"facet.query=inStock:true", so we make "facet.query" be the param that
specifies the names of the Query Facets we want to group by, and
facet.query.* is the pattern for finding individual queries, so something
like this...

  facet=true
  facet.field=category
  facet.query=price
  facet.query.price=[* TO 100]
  facet.query.price=[101 TO *]

...says we want two facets: one on the category field where each value is
a constraint, and one on the price field with two constraints based on the
specified ranges, when we return that data it can look like this...

<lst name="facet_counts">
 <lst name="price">
  <int name="[* TO 100]">2</int>
  <int name="[101 TO *]">5</int>
 </lst>
 <lst name="category">
  <int name="graphics">2</int>
  <int name="software">2</int>
  <int name="music">1</int>
 </lst>
</lst>

...but the client still needs to parse the labels like "[* TO 100]" to
make them pretty (ie: "under $100") ... if they have to do that much
parsing, why not let them parse "price:[* TO 100]" and make our lives
easier?

It's really just not enough to have an easy way to group queries into a
single facet -- to be worthwhile we'd need an easy way for the client to
specify a label for each query as well, and now i think we may definitely
be pushing the limits of the simple key=>val(s) nature of SolrParams.

I definitely think we want to support stuff like this out of the box in
the long run, i think it just needs to be based on specifying the Facet
info in a more robust way (ie: XML configuration)




-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Greg Ludington
> I definitely think we want to support stuff like this out of the box in
> the long run, i think it just needs to be based on specifying the Facet
> info in a more robust way (ie: XML configuration)
>

Not to threadjack, but this is actually the path I went down during my
faceting prototype, pushing all the facet configuration into a
facets.xml file, which is parsed at startup and the definitions stored
in a SolrCache.  Here is a snippet:

<facets>
        <facet id="manu" class="com.cyberego.solr.facet.DynamicTermQueryFacet">
                <str name="fieldName">manu_exact</str>
                <str name="displayName">By Manufacturer</str>
                <int name="maxNumberToDisplay">25</int>
        </facet>
       
        <facet id="instock">
                <str name="displayName">Instock</str>
                <item class="com.cyberego.solr.facet.StandardFacetItem">
                        <str name="displayName">In Stock</str>
                        <str name="queryString">inStock:true</str>
                </item>
                <item class="com.cyberego.solr.facet.StandardFacetItem">
                        <str name="displayName">Out of Stock</str>
                        <str name="queryString">inStock:false</str>
                </item>
        </facet>
</facets>

The client then asks for facets by name, e.g.
facet=manu&facet=instock, and gets back output that includes the name,
the count, and the queryString.  The client application cannot,
however, just ask for an arbitrary facetQuery, and I really like that
ability in your patch -- it probably hits 98/2 instead of just 80/20.

-Greg

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Yonik Seeley-2
In reply to this post by Chris Hostetter-3
> It's really just not enough to have an easy way to group queries into a
> single facet -- to be worthwhile we'd need an easy way for the client to
> specify a label for each query as well

Exactly.  If you can't put all of the info in the response, you need a
smarter client anyway.

But one thing we may want to consider now is what the ideal format
would look like coming from a custom query handler designed to do
facets. We might want to use the same format, as it would help create
more of a de facto standard.

-Yonik

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Yonik Seeley-2
Following up with what the ideal faceted browsing info might look like
(ignoring how label info is obtained for the moment):
- include grouping and group labels
- include constraint label, count, and the exact query needed to
filter by the constraint
- have the same structure regardless of faceting type (by field or by
query constraints)
- be relatively compact, while being easy for clients to manipulate

Here is an example:

<lst name="facets">
  <lst name="Categories">
    <arr name="cat:music">
      <int>24</int>
      <str>Music</str>
    </arr>
    <arr name="cat:electronics">
      <int>36</int>
      <str>Electronics</str>
    </arr>
  </lst>
  <lst name="Prices">
    <arr name="price:[0 TO 100]">
      <int>142</int>
      <str>less than $100</str>
    </arr>
    <arr name="price:[100 TO 200]">
      <int>70</int>
      <str>$100 - $200</str>
    </arr>
  </lst>
</lst>

Alternately, each facet constraint could have a triple (4 more chars
per entry than above)

  <arr>
   <int>70</int>
   <str>$100 - $200</str>
   <str>price:[100 TO 200]</str>
  </arr>

I think the only downside to standardizing something like this is the
increased response size (not over custom query handlers, but over the
simple facet handler that has no labels).

The upsides include:
  - the rendering part can be really dumb/stateless, since all the needed
    info is in the response
  - people can share facet presentation logic, and it can be built
    into Solr admin pages
  - custom handlers should be able to reuse the same structure
  - there could be convenience methods for custom handlers to add facet info
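
To illustrate the "dumb/stateless" rendering point, here is a rough
Python sketch of a client that walks the example structure using only
the standard library. The function name render_facets and the output
layout are made up for illustration; nothing here is part of the patch
or of any Solr API.

```python
import xml.etree.ElementTree as ET

# Sample response fragment in the proposed format (well-formed version).
SAMPLE = """
<lst name="facets">
  <lst name="Categories">
    <arr name="cat:music">
      <int>24</int>
      <str>Music</str>
    </arr>
    <arr name="cat:electronics">
      <int>36</int>
      <str>Electronics</str>
    </arr>
  </lst>
  <lst name="Prices">
    <arr name="price:[0 TO 100]">
      <int>142</int>
      <str>less than $100</str>
    </arr>
    <arr name="price:[100 TO 200]">
      <int>70</int>
      <str>$100 - $200</str>
    </arr>
  </lst>
</lst>
"""


def render_facets(xml_text):
    """Return display lines built only from what is in the response."""
    root = ET.fromstring(xml_text)
    lines = []
    for group in root.findall("lst"):
        # The group label is the name of the outer <lst>.
        lines.append(group.get("name"))
        for constraint in group.findall("arr"):
            count = int(constraint.findtext("int"))
            label = constraint.findtext("str")
            # The exact filter query is the name of the <arr>.
            query = constraint.get("name")
            lines.append(f"  {label} ({count}) -> fq={query}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_facets(SAMPLE)))
```

The client needs no knowledge of fields or query syntax: the label,
the count, and the query to send back as "fq" all come straight from
the response.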

Thoughts?

-Yonik

Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Chris Hostetter-3
: Following up with what the ideal faceted browsing info might look like
: (ignoring how label info is obtained for the moment):
: - include grouping and group labels
: - include constraint label, count, and the exact query needed to
: filter by the constraint
: - have the same structure regardless of faceting type (by field or by
: query constraints)
: - be relatively compact, while being easy for clients to manipulate

Compact is fine, but if we're going to try and have a general, reusable
structure, I'd rather it be flexible.  There's a lot of other information
that people might want to return as part of a "robust" faceting system.

With product data at CNET, we have things like the "display type" for
different facets (ie: always display these constraints in order of count,
display in Label order, display top N by count, but then sort by Label,
etc...).  For numeric facets we also have the "rank direction" which tells
the front end whether lower values are better or worse than high values
(ie: lower prices are better than high prices, but high hard disk sizes
are better than low disk sizes).

At the individual constraint level other information can be useful as
well: in some situations it's useful to know not only the count of
matching results that meet the constraint, but the total number of
results in the index that meet the constraint (or to rank constraints by
ratio instead of by count)

So if we really want to come up with an output structure that can grow
beyond "simple" facets, we should keep that in mind, something like...


 <lst name="facets">
   <lst name="cat">
     <str name="facetType">field</str>
     <str name="label">Category</str>
     <lst name="constraints">
       <lst name="music">
         <int name="count">24</int>
         <str name="label">Music</str>
         <str name="query">cat:music</str>
       </lst>
       <lst name="electronics">
         <int name="count">36</int>
         <str name="label">Electronics</str>
         <str name="query">cat:electronics</str>
       </lst>
       ...
     </lst>
   </lst>
   <lst name="foo">
     <str name="facetType">custom_crazy_type</str>
     <str name="label">Foo</str>
     <lst name="constraints">
       <lst name="someIdMaybe">
         <str name="who_knows_what">arbitrary data</str>
         <int name="count">142</int>
       </lst>
       ...
     </lst>
   </lst>
 </lst>
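
As a sketch of why named entries keep things flexible: a client can
turn a structure like this into nested dicts and read only the keys it
understands, so extra per-facet or per-constraint info does not break
it. This is Python with the standard library only; the helper name
to_python and the trimmed sample are assumptions for illustration, not
anything in the patch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the structure above, with the "..." placeholders removed.
SAMPLE = """
<lst name="facets">
  <lst name="cat">
    <str name="facetType">field</str>
    <str name="label">Category</str>
    <lst name="constraints">
      <lst name="music">
        <int name="count">24</int>
        <str name="label">Music</str>
        <str name="query">cat:music</str>
      </lst>
    </lst>
  </lst>
  <lst name="foo">
    <str name="facetType">custom_crazy_type</str>
    <str name="label">Foo</str>
    <lst name="constraints">
      <lst name="someIdMaybe">
        <str name="who_knows_what">arbitrary data</str>
        <int name="count">142</int>
      </lst>
    </lst>
  </lst>
</lst>
"""


def to_python(node):
    """Convert a named <lst>/<str>/<int> tree into nested dicts."""
    if node.tag == "lst":
        return {child.get("name"): to_python(child) for child in node}
    if node.tag == "int":
        return int(node.text)
    return node.text


facets = to_python(ET.fromstring(SAMPLE))
for facet_id, facet in facets.items():
    for key, constraint in facet["constraints"].items():
        # Keys the client does not know about are simply ignored;
        # a missing label falls back to the constraint's own name.
        label = constraint.get("label", key)
        print(facet["label"], label, constraint["count"])
```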




-Hoss


Re: [jira] Commented: (SOLR-44) Basic Facet Count support

Yonik Seeley-2
It occurs to me that a single level of hierarchy might be desirable too.

http://www.nabble.com/forum/Search.jtp?query=yonik
 Narrow Search Results

    * Software (2301)
          o Apache (2296)
                + Lucene (2252)
                + more...
          o Web Search (2253)
          o Jetty (3)
          o more...

Perhaps we should go relatively simple for this first iteration to get
it out the door, and we can upgrade it to the full-fledged format
later with a facet.format param?

-Yonik