Facets and running out of Heap Space


Facets and running out of Heap Space

David Whalen
Hi All.

I run a faceted query against a very large index on a
regular schedule.  Every now and then the query throws
an out of heap space error, and we're sunk.

So, naturally we increased the heap size and things worked
well for a while and then the errors would happen again.
We've increased the initial heap size to 2.5GB and it's
still happening.

Is there anything we can do about this?

Thanks in advance,

Dave W

Re: Facets and running out of Heap Space

Yonik Seeley-2
On 10/9/07, David Whalen <[hidden email]> wrote:

> I run a faceted query against a very large index on a
> regular schedule.  Every now and then the query throws
> an out of heap space error, and we're sunk.
>
> So, naturally we increased the heap size and things worked
> well for a while and then the errors would happen again.
> We've increased the initial heap size to 2.5GB and it's
> still happening.
>
> Is there anything we can do about this?

Try facet.enum.cache.minDf param:
http://wiki.apache.org/solr/SimpleFacetParameters
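
For example, something like this (illustrative only; substitute your own
host and field name):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=yourfield&facet.enum.cache.minDf=25

With that setting, filters for terms matching fewer than 25 documents
bypass the filterCache instead of each getting a cached bitset.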

-Yonik

RE: Facets and running out of Heap Space

David Whalen
Hi Yonik.

According to the doc:


> This is only used during the term enumeration method of
> faceting (facet.field type faceting on multi-valued or
> full-text fields).

What if I'm faceting on just a plain String field?  It's
not full-text, and I don't have multiValued set for it....


Dave



Re: Facets and running out of Heap Space

Yonik Seeley-2
In reply to this post by Yonik Seeley-2
On 10/9/07, David Whalen <[hidden email]> wrote:
> > This is only used during the term enumeration method of
> > faceting (facet.field type faceting on multi-valued or
> > full-text fields).
>
> What if I'm faceting on just a plain String field?  It's
> not full-text, and I don't have multiValued set for it....

Then you will be using the FieldCache counting method, and this param
is not applicable :-)
Are all the fields that you facet on like this?

The FieldCache entry might be taking up too much room, especially if the
number of entries is high and the entries are big.  The requests
themselves can also take up a good chunk of memory temporarily (4 bytes *
nValuesInField).

You could also try a memory profiling tool to see where all the memory is
being taken up.
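
For instance (a rough sketch; exact options depend on your JDK, and <pid>
is Solr's process id):

  jmap -histo:live <pid> | head -30

The classes near the top of that histogram give a decent idea of what is
occupying the heap.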

-Yonik

Re: Facets and running out of Heap Space

hossman
In reply to this post by David Whalen

: So, naturally we increased the heap size and things worked
: well for a while and then the errors would happen again.
: We've increased the initial heap size to 2.5GB and it's
: still happening.

is this the same 25,000,000 document index you mentioned before?

2.5GB of heap doesn't seem like much if you are also doing faceting ...
even if you were faceting on an int field, there would be ~95MB of
FieldCache for that field; you said this is a string field, so it's going
to be 95MB plus however much space is needed for all the terms.
presumably, if you are faceting on this field, not every doc has a unique
value, but even assuming a conservative 10% unique values of 10 characters
each, that's another ~50MB, so we're up to about 150MB of FieldCache just
to facet that one field -- and we haven't even started talking about how
big the index itself is, how big the filterCache gets, or how many other
fields you are faceting on.

how big is your index on disk? are you faceting or sorting on other fields
as well?

what does the LukeRequest handler tell you about the # of distinct terms
in each field that you facet on?




-Hoss


RE: Facets and running out of Heap Space

David Whalen
In reply to this post by Yonik Seeley-2
> Then you will be using the FieldCache counting method, and
> this param is not applicable :-) Are all the fields that you
> facet on like this?

Unfortunately yes.  Could I improve my situation by changing
them to multiValued?



_________________________________________________________________
david whalen
senior applications developer
eNR Services, Inc.
[hidden email]
203-849-7240
 


RE: Facets and running out of Heap Space

David Whalen
In reply to this post by hossman
> is this the same 25,000,000 document index you mentioned before?

Yep.

> how big is your index on disk? are you faceting or sorting on
> other fields as well?

running 'du -h' on my index directory returns 86G.  We facet
on almost all of our index fields (they were added to the index
solely for that purpose, otherwise we'd remove them).  Here's
the meaty part of the config again:

<field name="id" type="string" indexed="true" stored="true" />
<field name="content_date" type="date" indexed="true" stored="true" />
<field name="media_type" type="string" indexed="true" stored="true" />
<field name="location" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="text" type="text" indexed="true" stored="true" multiValued="true" />
<field name="content_source" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="site_id" type="string" indexed="true" stored="true" />
<field name="journalist_id" type="string" indexed="true" stored="true" />
<field name="blog_url" type="string" indexed="true" stored="true" />
<field name="created_date" type="date" indexed="true" stored="true" />

I'm sure we could stop storing many of these columns, especially
if someone told me that would make a big difference.


> what does the LukeRequest handler tell you about the # of
> distinct terms in each field that you facet on?

Where would I find that?  I could probably estimate that myself
on a per-column basis.  It ranges from 4 distinct values for
media_type to 30-ish for location to 200-ish for country_code
to almost 10,000 for site_id to almost 100,000 for journalist_id.

Thanks very much for your help so far, Chris!

Dave


 


Re: Facets and running out of Heap Space

Ryan McKinley
>
>> what does the LukeRequest handler tell you about the # of
>> distinct terms in each field that you facet on?
>
> Where would I find that?  

check:
http://wiki.apache.org/solr/LukeRequestHandler

Make sure you have:
<requestHandler name="/admin/luke"
class="org.apache.solr.handler.admin.LukeRequestHandler" />
defined in solrconfig.xml

for a large index, this can be very slow but the results are valuable.
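
For example (illustrative; the exact params may vary a bit by version):

  http://localhost:8983/solr/admin/luke

shows per-field statistics, including the distinct term counts Hoss asked
about, and something like ?fl=journalist_id&numTerms=20 narrows the output
to one field and its top terms.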

ryan

RE: Facets and running out of Heap Space

David Whalen
> Make sure you have:
> <requestHandler name="/admin/luke"
> class="org.apache.solr.handler.admin.LukeRequestHandler" />
> defined in solrconfig.xml

What's the consequence of me changing the solrconfig.xml file?
Doesn't that cause a restart of solr?

> for a large index, this can be very slow but the results are valuable.

In what way?  I'm still not clear on what this does for me....



Re: Facets and running out of Heap Space

Ryan McKinley
David Whalen wrote:
>> Make sure you have:
>> <requestHandler name="/admin/luke"
>> class="org.apache.solr.handler.admin.LukeRequestHandler" />
>> defined in solrconfig.xml
>
> What's the consequence of me changing the solrconfig.xml file?
> Doesn't that cause a restart of solr?
>

Editing solrconfig.xml does *not* restart Solr, but you do need to
restart Solr for any changes to solrconfig.xml to take effect.


>> for a large index, this can be very slow but the results are valuable.
>
> In what way?  I'm still not clear on what this does for me....
>

It gives you all kinds of index statistics that may or may not be
useful in figuring out how big the field caches will need to be.

It is just a diagnostics tool, not a fix.

ryan


Re: Facets and running out of Heap Space

Mike Klaas
In reply to this post by David Whalen
On 9-Oct-07, at 12:36 PM, David Whalen wrote:

> (schema snipped)
>
> I'm sure we could stop storing many of these columns, especially
> if someone told me that would make a big difference.

I don't think that it would make a difference in memory consumption,  
but storage is certainly not necessary for faceting.  Extra stored  
fields can slow down search if they are large (in terms of bytes),  
but don't really occupy extra memory, unless they are polluting the  
doc cache.  Does 'text' need to be stored?
>
>> what does the LukeRequest handler tell you about the # of
>> distinct terms in each field that you facet on?
>
> Where would I find that?  I could probably estimate that myself
> on a per-column basis.  it ranges from 4 distinct values for
> media_type to 30-ish for location to 200-ish for country_code
> to almost 10,000 for site_id to almost 100,000 for journalist_id.

Using the filter cache method on things like media type and
location will occupy ~2.3MB of memory _per unique value_, so it
should be a net win for those (although it's quite close in space
requirements for a ~30-value field on your index size).

-Mike

Re: Facets and running out of Heap Space

Stu Hood-2
In reply to this post by David Whalen
> Using the filter cache method on things like media type and
> location will occupy ~2.3MB of memory _per unique value_

Mike, how did you calculate that value? I'm trying to tune my caches, and any equations that could be used to determine some balanced settings would be extremely helpful. I'm in a memory limited environment, so I can't afford to throw a ton of cache at the problem.

(I don't want to thread-jack, but I'm also wondering whether anyone has any notes on how to tune cache sizes for the filterCache, queryResultCache and documentCache).

Thanks,
Stu



Re: Facets and running out of Heap Space

Mike Klaas
On 9-Oct-07, at 7:53 PM, Stu Hood wrote:

>> Using the filter cache method on things like media type and
>> location will occupy ~2.3MB of memory _per unique value_
>
> Mike, how did you calculate that value? I'm trying to tune my  
> caches, and any equations that could be used to determine some  
> balanced settings would be extremely helpful. I'm in a memory  
> limited environment, so I can't afford to throw a ton of cache at  
> the problem.

One bit per doc, so roughly (number of docs / 8) bytes per cached filter.
Note that HashDocSet filters will be smaller (cardinality < 3000).

> (I don't want to thread-jack, but I'm also wondering whether anyone  
> has any notes on how to tune cache sizes for the filterCache,  
> queryResultCache and documentCache).

I'll give the usual Solr answer: it depends <g>.  For me:

The filterCache is the most important.  I want my faceting filters to  
be there at all times, as well as the common fq's I throw at Solr.  I  
have this bumped up to 4096 or so.

The queryResultCache isn't too important.  I'm mostly interested in  
keeping around a few recent queries since they tend to be  
reexecuted.  There is generally not a whole lot of overlap, though,  
and I never page very far into the results (10 results over 100  
slaves is more than I typically would ever need).  Memory usage is  
quite low, though, so you might have success going nuts with this cache.

docCache? Make sure this is set to at least maxResults*<max  
concurrent queries>, since the query processing sometimes assumes  
fetching a document earlier in the request will let us retrieve it  
for free later in the request from the cache.  Other than that, it  
depends on your document usage overlap.  If you have a set of
documents needed for meta-data storage, it behooves you to make sure
these are always cached.
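
To make that concrete, the corresponding entries in solrconfig.xml look
roughly like this (the sizes are only illustrative starting points, not
recommendations for your index):

  <filterCache class="solr.LRUCache" size="4096" initialSize="1024" autowarmCount="512"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="128" autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="2048" initialSize="512" autowarmCount="0"/>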

cheers,
-Mike

RE: Facets and running out of Heap Space

David Whalen
In reply to this post by Stu Hood-2
It looks now like I can't use facets the way I was hoping
to because the memory requirements are impractical.

So, as an alternative I was thinking I could get counts
by doing rows=0 and using filter queries.  

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the
process?
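
(To illustrate what I mean, with a made-up value: one request per
distinct value, reading numFound from each response:

  http://localhost:8983/solr/select?q=*:*&rows=0&fq=media_type:blog

and so on for each media_type and location value.)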

DW

 


Re: Facets and running out of Heap Space

Mike Klaas
On 10-Oct-07, at 12:19 PM, David Whalen wrote:

> It looks now like I can't use facets the way I was hoping
> to because the memory requirements are impractical.

I can't remember if this has been mentioned, but upping the  
HashDocSet size is one way to reduce memory consumption.  Whether  
this will work well depends greatly on the cardinality of your facet  
sets.  facet.enum.cache.minDf set high is another option (will not  
generate a bitset for any value whose facet set is less than this
value).

Both options have performance implications.
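
For reference, the HashDocSet knob lives in solrconfig.xml; a sketch
(the maxSize here is only an example; tune it to your data):

  <HashDocSet maxSize="10000" loadFactor="0.75"/>

and minDf is just a request param, e.g. facet.enum.cache.minDf=1000.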

> So, as an alternative I was thinking I could get counts
> by doing rows=0 and using filter queries.
>
> Is there a reason to think that this might perform better?
> Or, am I simply moving the problem to another step in the
> process?

Running one query per unique facet value seems impractical, if that  
is what you are suggesting.  Setting minDf to a very high value  
should always outperform such an approach.

-Mike


RE: Facets and running out of Heap Space

David Whalen
According to Yonik I can't use minDf because I'm faceting
on a string field.  I'm thinking of changing it to a tokenized
type so that I can utilize this setting, but then I'll have to
rebuild my entire index.

Unless there's some way around that?


 


Re: Facets and running out of Heap Space

Mike Klaas
On 10-Oct-07, at 2:40 PM, David Whalen wrote:

> According to Yonik I can't use minDf because I'm faceting
> on a string field.  I'm thinking of changing it to a tokenized
> type so that I can utilize this setting, but then I'll have to
> rebuild my entire index.
>
> Unless there's some way around that?

For the fields that matter (many unique values), this is likely to
result in a performance regression.

It might be better to try storing less unique data.  For instance,
faceting on the blog_url field, or created_date, in your schema would
cause problems (they probably have millions of unique values).

It would be helpful to know which field is causing the problem.  One  
way would be to do a sorted query on a quiescent index for each  
field, and see if there are any suspiciously large jumps in memory  
usage.
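
A rough way to do that (just a sketch; the host and field name are
examples, and jstat behaviour varies by JDK):

  curl 'http://localhost:8983/solr/select?q=*:*&rows=1&sort=journalist_id+asc'
  jstat -gcutil <solr-pid> 1000 5

Run the sort once per faceted field and watch how much the old generation
grows after each one.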

-Mike


RE: Facets and running out of Heap Space

David Whalen
I'll see what I can do about that.

Truthfully, the most important facet we need is the one on
media_type, which has only 4 unique values.  The second
most important one to us is location, which has about 30
unique values.

So, it would seem like we actually need a counter-intuitive
solution.  That's why I thought filter queries might be the
answer.

Is there some reason to avoid setting multiValued to true
here?  It sounds like it would be the true cure-all....

Thanks again!

dave


 


Re: Facets and running out of Heap Space

Mike Klaas
On 10-Oct-07, at 3:46 PM, David Whalen wrote:

> I'll see what I can do about that.
>
> Truthfully, the most important facet we need is the one on
> media_type, which has only 4 unique values.  The second
> most important one to us is location, which has about 30
> unique values.
>
> So, it would seem like we actually need a counter-intuitive
> solution.  That's why I thought Field Queries might be the
> solution.
>
> Is there some reason to avoid setting multiValued to true
> here?  It sounds like it would be the true cure-all....

Should work.  It would cost about 100 MB on a 25m corpus for those  
two fields.

Have you tried setting multivalued=true without reindexing?  I'm not  
sure, but I think it will work.
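
Concretely, that would just be a tweak to the existing schema.xml
definitions, e.g. for the two fields that matter most (sketch based on
the schema posted earlier):

  <field name="media_type" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="location" type="string" indexed="true" stored="true" multiValued="true" />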

-Mike




Re: Facets and running out of Heap Space

Yonik Seeley-2
On 10/10/07, Mike Klaas <[hidden email]> wrote:
> Have you tried setting multivalued=true without reindexing?  I'm not
> sure, but I think it will work.

Yes, that will work fine.
One thing that will change is the response format for stored fields:
<arr name="foo"><str>val1</str></arr>
instead of
<str name="foo">val1</str>

Hopefully in the future we can specify a faceting method w/o having to
change the schema.

-Yonik