'accumulate' copyField for faceting

Ryan McKinley
Faceting is much happier if you use a single valued field, but my apps
all require multivalued fields:
<doc>
 <arr name="subject">
  <str>aaa</str>
  <str>bbb</str>
  <str>ccc</str>
 </arr>
</doc>

I'd like to use copyField to accumulate the multivalued fields into a
single field that can be efficiently faceted.  (As written, it adds a
new field for each one and throws an error if multiValued="false")

The simplest thing i can think of is to check if the copyField target
is multivalued, if not, accumulate the values separated by some token
that the copyField target will split.

perhaps something like:

<fieldtype name="facetable" class="solr.StrField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.RegexTokenizerFactory">
      <str name="pattern">;</str>  <!-- tokens=input.split( ";" ) -->
    </tokenizer>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldtype>

<field name="subject" type="text" indexed="true" stored="true"
multiValued="true"/>
<field name="subject_facet" type="facetable" indexed="true"
stored="false" multiValued="false"/>

<copyField source="subject" dest="subject_facet" accumulate=";" />

If ';' is not in the input, this would work.  Is there some character
guaranteed not to be in any input?  Maybe i should call it
"facet_field" rather than "facetable" - i keep reading it as "face
table"

Any thoughts on this design would be great.

thanks
ryan
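
A minimal sketch of the accumulate-then-split idea proposed above, in plain Java rather than Solr code. The class and method names are hypothetical, and the `accumulate` attribute itself is only a proposal in this thread, not an existing copyField option. The sketch rejects values that contain the delimiter, since that is the case where the scheme breaks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class AccumulateSketch {
    // Hypothetical delimiter; the scheme only works if it never occurs in a value.
    static final String DELIMITER = ";";

    /** Joins the values of a multivalued field into one string. */
    static String accumulate(List<String> values) {
        for (String value : values) {
            if (value.contains(DELIMITER)) {
                throw new IllegalArgumentException("value contains delimiter: " + value);
            }
        }
        return String.join(DELIMITER, values);
    }

    /** Splits on the delimiter and trims each piece, like a split tokenizer plus a trim filter. */
    static List<String> tokenize(String accumulated) {
        List<String> tokens = new ArrayList<>();
        for (String part : accumulated.split(Pattern.quote(DELIMITER))) {
            String token = part.trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String joined = accumulate(Arrays.asList("aaa", "bbb", "ccc"));
        System.out.println(joined);
        System.out.println(tokenize(joined));
    }
}
```

The round trip keeps each original value as one token, which is what faceting needs; whether this is any cheaper than faceting on the multivalued field directly is the open question of the thread.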

Re: 'accumulate' copyField for faceting

Yonik Seeley-2
On 3/1/07, Ryan McKinley <[hidden email]> wrote:
> Faceting is much happier if you use a single valued field, but my apps
> all require multivalued fields:

If by "happy" you mean performance, things should be better in the
future though.

> <doc>
>  <arr name="subject">
>   <str>aaa</str>
>   <str>bbb</str>
>   <str>ccc</str>
>  </arr>
> </doc>
>
> I'd like to use copyField to accumulate the multivalued fields into a
> single field that can be efficiently faceted.

Not sure I understand...  you don't want counts for aaa, bbb, and ccc
separately, but you want counts for the combined values "aaa;bbb;ccc"?

I'm not sure I see the usecases for this.

-Yonik

Re: 'accumulate' copyField for faceting

Ryan McKinley
On 3/1/07, Yonik Seeley <[hidden email]> wrote:
> On 3/1/07, Ryan McKinley <[hidden email]> wrote:
> > Faceting is much happier if you use a single valued field, but my apps
> > all require multivalued fields:
>
> If by "happy" you mean performance, things should be better in the
> future though.
>

yes, performance.  The docs seem to say "avoid faceting on
multiValued fields if possible"

With SOLR-153, do you think that won't be an issue anymore?


> >
> > I'd like to use copyField to accumulate the multivalued fields into a
> > single field that can be efficiently faceted.
>
> Not sure I understand...  you don't want counts for aaa, bbb, and ccc
> separately, but you want counts for the combined values "aaa;bbb;ccc"?
>
> I'm not sure I see the usecases for this.
>

Maybe it's clearer if i say

<arr name="subject">
  <str>San Francisco</str>
  <str>San Diego</str>
  <str>DC</str>
</arr>

I want facets for "San Francisco", "San Diego" and "DC", not "san"
"francisco", "diego", "dc".  I want the faceting to be as efficient as
it could/should be.  If i search for "San Fran" (or San Leandro) this
doc should show up.

I was suggesting using copyField with accumulate to gather the cities
into a single field used for faceting:
  tokens[] = "San Francisco; San Diego; DC".split( ";" )

In my current setup, I have:

<field name="subject" type="string" indexed="true" stored="true"
multiValued="true"/>
<field name="subject_txt" type="text" indexed="true" stored="false"
multiValued="true"/>
<copyField source="subject" dest="subject_txt"  />

I facet on the multivalued field "subject" and search on the text
field "subject_txt" -- "subject" is stored as a "string" so that the
tokens resemble the input, and "subject_txt" is tokenized for search.
If i have to go through the overhead of copyField to make search and
faceting work nicely together, it may as well be configured to be as
efficient as possible.  Should I ignore the problem for now, and bank
on SOLR-153?

Am i missing something?

thanks
ryan

Re: 'accumulate' copyField for faceting

Mike Klaas
On 3/1/07, Ryan McKinley <[hidden email]> wrote:

> Am i missing something?

I think you're missing that the parameter that matters is the number
of unique values on which you facet.  Whether they come from a
single-valued, tokenized field, or a multi-valued, non-tokenized
field, makes no difference.

I'm using faceting on a multi-valued field with ~70 unique values, and
it is quite fast, once the filters have been cached.

-Mike

Re: 'accumulate' copyField for faceting

Ryan McKinley
On 3/1/07, Mike Klaas <[hidden email]> wrote:
> On 3/1/07, Ryan McKinley <[hidden email]> wrote:
>
> > Am i missing something?
>
> I think you're missing that the parameter that matters is the number
> of unique values on which you facet.  Whether they come from a
> single-valued, tokenized field, or a multi-valued, non-tokenized
> field, makes no difference.
>

aaaah.  that makes sense.  thanks

I just looked at this bit from SimpleFacets.java:

if (sf.multiValued() || ft.isTokenized() || ft instanceof BoolField) {
  counts = getFacetTermEnumCounts(...
} else {
  // TODO: future logic could use filters instead of the fieldcache if
  // the number of terms in the field is small enough.
  counts = getFieldCacheCounts(...
}

If i understand it correctly, with a large number of terms, it *is*
better if they are single-valued, non-tokenized fields.  But that does
not help the case i am considering.
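
To make the two branches concrete, here is a rough sketch of the difference in plain Java. It is not the SimpleFacets code: the class, the method names and the use of `java.util.BitSet` as a stand-in for a doc set are all assumptions for illustration. The first method does one set intersection per term; the second does one array lookup per matching document and only works when every document has exactly one value.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

final class FacetCountSketch {
    /** Builds one filter (set of doc ids) per distinct value of a single-valued field. */
    static Map<String, BitSet> buildFilters(String[] valueByDoc) {
        Map<String, BitSet> filters = new TreeMap<>();
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            filters.computeIfAbsent(valueByDoc[doc], k -> new BitSet()).set(doc);
        }
        return filters;
    }

    /** Term-enum style: cost grows with the number of terms (one intersection each). */
    static Map<String, Integer> countByTermFilters(Map<String, BitSet> termFilters, BitSet baseDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, BitSet> entry : termFilters.entrySet()) {
            BitSet hits = (BitSet) entry.getValue().clone();
            hits.and(baseDocs);
            if (hits.cardinality() > 0) {
                counts.put(entry.getKey(), hits.cardinality());
            }
        }
        return counts;
    }

    /** Field-cache style: cost grows with the number of matching docs (one lookup each). */
    static Map<String, Integer> countByFieldCache(String[] valueByDoc, BitSet baseDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = baseDocs.nextSetBit(0); doc >= 0; doc = baseDocs.nextSetBit(doc + 1)) {
            counts.merge(valueByDoc[doc], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] valueByDoc = {"DC", "San Diego", "DC", "San Francisco"};
        BitSet baseDocs = new BitSet();
        baseDocs.set(0, 3); // the base query matched docs 0, 1 and 2
        System.out.println(countByTermFilters(buildFilters(valueByDoc), baseDocs));
        System.out.println(countByFieldCache(valueByDoc, baseDocs));
    }
}
```

Both methods return the same counts on single-valued data; only the per-term approach extends to multivalued or tokenized fields, because there a document no longer maps to a single array slot.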


> I'm using faceting on a multi-valued field with ~70 unique values, and
> it is quite fast, once the filters have been cached.
>

Well, I'll let you all know how it goes to facet on the (>70)
cities/counties in the united states!

thanks
ryan

Re: 'accumulate' copyField for faceting

Yonik Seeley-2
On 3/1/07, Ryan McKinley <[hidden email]> wrote:
> Well, I'll let you all know how it goes to facet on the (>70)
> cities/counties in the united states!

Heh... how many documents?

I'll be interested in seeing some numbers.  The number of documents
matching the base query and filters will also factor in (small will be
HashDocSet, large will be BitDocSet).

Just make sure to run all of your facets, then check the statistics
page to see how big you need to make the filterCache to hold them all
(and add a little extra for random filters).  The access pattern for
the faceting code is worst case for the LRU cache, so it needs to
avoid any evictions.

-Yonik
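
The point about the LRU cache can be illustrated with a small stand-alone sketch. This is not Solr's filterCache; it is a generic LRU built on `java.util.LinkedHashMap`, and the class and method names are made up for the example. Faceting walks the terms in the same order on every request, so if the cache is even one entry too small, each entry is evicted just before it is needed again.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruThrashSketch {
    /** Minimal LRU cache: access-ordered LinkedHashMap that evicts the eldest entry. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = order entries by access, not insertion
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    /** Looks up terms 0..numTerms-1 in order, `passes` times, and returns the cache hits. */
    static int hitsWhenCycling(int capacity, int numTerms, int passes) {
        LruCache<Integer, Boolean> cache = new LruCache<>(capacity);
        int hits = 0;
        for (int pass = 0; pass < passes; pass++) {
            for (int term = 0; term < numTerms; term++) {
                if (cache.get(term) != null) {
                    hits++;
                } else {
                    cache.put(term, Boolean.TRUE); // miss: build and cache the "filter"
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("512 terms, capacity 512: " + hitsWhenCycling(512, 512, 3) + " hits");
        System.out.println("513 terms, capacity 512: " + hitsWhenCycling(512, 513, 3) + " hits");
    }
}
```

With the cache exactly large enough, every lookup after the first pass is a hit; with one term too many, there are no hits at all, which is why the cache should be sized to hold every facet filter plus some slack.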

RE: 'accumulate' copyField for faceting

Graham Stead-2
Sorry for interloping, but I have been wondering the same thing as Ryan. On
my current index with ~6.1M docs, I restarted Solr and ran a query that
included faceting on 4 fields:

QTime: 5712
numFound: 25908
filterCache stats:
        lookups : 0
        hits : 0
        hitratio : 0.00
        inserts : 1
        evictions : 0
        size : 1
        cumulative_lookups : 0
        cumulative_hits : 0
        cumulative_hitratio : 0.00
        cumulative_inserts : 1
        cumulative_evictions : 0

Then I added faceting on a 5th, multivalued field:

QTime: 65551
numFound: 25908
Filtercache stats:
        lookups : 1898314
        hits : 1
        hitratio : 0.00
        inserts : 1898314
        evictions : 1897802
        size : 512
        cumulative_lookups : 1898314
        cumulative_hits : 1
        cumulative_hitratio : 0.00
        cumulative_inserts : 1898314
        cumulative_evictions : 1897802


I realize there are a lot of different values in the 5th multivalued field.
But this is where I'm fuzzy: are we saying there would be no difference
using a tokenized, single valued field versus a multivalued field? Or are we
saying that multivalued is ok, as long as the number of values is less than
the filterCache size? [Unfortunately I don't have a single valued version of
this field to test with]

Thanks,
-Graham

> I'll be interested in seeing some numbers.  The number of
> documents matching the base query and filters will also
> factor in (small will be HashDocSet, large will be BitDocSet).
>
> Just make sure to run all of your facets, then check the
> statistics page to see how big you need to make the
> filterCache to hold them all (and add a little extra for
> random filters).  The access pattern for the faceting code is
> worst case for the LRU cache, so it needs to avoid any evictions.
>
> -Yonik



Re: 'accumulate' copyField for faceting

Mike Klaas
On 3/1/07, Graham Stead <[hidden email]> wrote:
> Sorry for interloping, but I have been wondering the same thing as Ryan. On
> my current index with ~6.1M docs, I restarted Solr and ran a query that
> included faceting on 4 fields:

<snip>

Non-tokenized, single valued.

> Then I added faceting on a 5th, multivalued field:
>
> QTime: 65551
> numFound: 25908
> Filtercache stats:
>         lookups : 1898314
>         hits : 1
>         hitratio : 0.00
>         inserts : 1898314
>         evictions : 1897802
>         size : 512
>         cumulative_lookups : 1898314
>         cumulative_hits : 1
>         cumulative_hitratio : 0.00
>         cumulative_inserts : 1898314
>         cumulative_evictions : 1897802
>
>
> I realize there are a lot of different values in the 5th multivalued field.
> But this is where I'm fuzzy: are we saying there would be no difference
> using a tokenized, single valued field versus a multivalued field? Or are we
> saying that multivalued is ok, as long as the number of values is less than
> the filterCache size? [Unfortunately I don't have a single valued version of
> this field to test with]

For anything other than single-valued, untokenized fields, all[1] that
matters is the number of "things" faceted on.  Whether these things are
arbitrary queries, tokens from tokenized fields or multiple values in
untokenized fields is moot.  You've got 2 million values, which implies
the construction of 2 million filters and an intersection with the main
query docset.  Even if you enlarge the filter cache to contain all 2
million filters, you still require time to do 2 million set
intersections.  This may take too long if the filters are all small.
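
As a rough back-of-the-envelope check using only the numbers Graham posted (QTime 5712 without the fifth field, 65551 with it, 1898314 filterCache inserts), the cost per term can be estimated as below. The class and method names are hypothetical, and the subtraction is only approximate, since the first query ran against a freshly restarted server.

```java
final class PerTermCostSketch {
    /** Extra query time attributed to the added field, divided by the filters it built. */
    static double millisPerTerm(long qtimeWithField, long qtimeWithoutField, long filtersBuilt) {
        return (double) (qtimeWithField - qtimeWithoutField) / filtersBuilt;
    }

    public static void main(String[] args) {
        double perTerm = millisPerTerm(65551, 5712, 1898314);
        System.out.printf("about %.1f microseconds per term%n", perTerm * 1000.0);
    }
}
```

Each individual filter is cheap under this estimate; the total is large only because the work is repeated for every one of roughly 1.9 million terms.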

As a point of comparison, here is a query that returned ~200k docs and
faceted against 70 facets with roughly 140k docs in each filter
(cached):

  329.0   total time
    0.0   set up/parsing
  125.0   main query
   46.0   faceting
  100.0   optimized pre-fetch
   58.0   debug

Times are in milliseconds.  I've found breaking down the timing rather
useful since I have huge stored docs and non-query-related tasks often
take up big chunks of time.  I could contribute it if anyone else
would find it useful.
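
For readers who want something similar in their own code, a per-stage timing breakdown can be sketched as follows. This is not the code Mike describes and not a Solr API; the class, its methods and the stage names in `main` are hypothetical, and the lambdas are stand-ins for real work.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

final class StageTimerSketch {
    // Insertion-ordered so the report lists stages in the order they first ran.
    private final Map<String, Double> millisByStage = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the named stage, and passes the result through. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            double elapsedMillis = (System.nanoTime() - start) / 1_000_000.0;
            millisByStage.merge(stage, elapsedMillis, Double::sum);
        }
    }

    Map<String, Double> stages() {
        return millisByStage;
    }

    double totalMillis() {
        double total = 0.0;
        for (double millis : millisByStage.values()) {
            total += millis;
        }
        return total;
    }

    /** Formats the total followed by one line per stage, in milliseconds. */
    String report() {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(Locale.ROOT, "%7.1f   total time%n", totalMillis()));
        for (Map.Entry<String, Double> entry : millisByStage.entrySet()) {
            sb.append(String.format(Locale.ROOT, "%7.1f   %s%n", entry.getValue(), entry.getKey()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StageTimerSketch timer = new StageTimerSketch();
        int numFound = timer.time("main query", () -> 25908); // stand-in for running the query
        timer.time("faceting", () -> "facet counts");          // stand-in for computing facets
        System.out.println("numFound: " + numFound);
        System.out.print(timer.report());
    }
}
```

Timing the same stage name more than once accumulates into a single line, which suits stages that run once per facet field.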

-Mike



[1] well, much, if not all.

Re: 'accumulate' copyField for faceting

Ryan McKinley
>
> 329.0   total time
>   0.0   set up/parsing
>   125.0         main query
>   46.0          faceting
>   100.0         optimized pre-fetch
>   58.0          debug
>
> Times are in milliseconds.  I've found breaking down the timing rather
> useful since I have huge stored docs and non-query-related tasks often
> take up big chunks of time.  I could contribute it if anyone else
> would find it useful.
>

Yes, this would be really helpful.  It would be nice to be able to
put this in the response output too.

thanks
ryan

Re: 'accumulate' copyField for faceting

Mike Klaas
On 3/2/07, Ryan McKinley <[hidden email]> wrote:

> Yes, this would be really helpful.  It would be nice to be able to
> put this in the response output too.

Two votes is enough for me.  I'll see if I can get to it this weekend.

-Mike

Re: 'accumulate' copyField for faceting

Mike Klaas
A patch is up at SOLR-176

On 3/2/07, Mike Klaas <[hidden email]> wrote:
> On 3/2/07, Ryan McKinley <[hidden email]> wrote:
>
> > Yes, this would be really helpful.  It would be nice to be able to
> > put this in the response output too.
>
> Two votes is enough for me.  I'll see if I can get to it this weekend.
>
> -Mike
>