Limiting facets for huge data - setting indexed=false in schema.xml

Rahul R
Hello,
We are trying to get Solr to work for a really huge parts database. Details
of the database:
- 55 million parts
- 3700 properties (facets) in total, but a given record has values for only
some of them.
- Most of these facets are defined as dynamic fields within the Solr index

We were getting unacceptable response times while faceting and searching on
an index built from this database. With only one user on the system, query
times are in excess of 1 minute. With more concurrent users, response times
are even higher.

We thought that limiting the number of properties available for faceting
would improve performance. To test this, we enabled only 6 properties for
faceting by setting indexed=true (in schema.xml) for only these properties.
All other properties, which are defined as dynamic properties, had
indexed=false. The observations after this change:

- Index size shrank by only a meagre 5%
- Performance did not improve. In fact, during the PSR run we observed that
it degraded.

My questions:
- Will reducing the number of facets improve faceting and search
performance?
- Is there a better way to reduce the number of facets?
- Will having a large number of properties defined as dynamic fields reduce
performance?

Thank you.

Regards
Rahul

Re: Limiting facets for huge data - setting indexed=false in schema.xml

Erik Hatcher

On Jul 31, 2009, at 2:35 AM, Rahul R wrote:

> Hello,
> We are trying to get Solr to work for a really huge parts database.
> Details of the database
> - 55 million parts
> - Totally 3700 properties (facets). But each record will not have value
> for all properties.
> - Most of these facets are defined as dynamic fields within the Solr Index
>
> We were getting really unacceptable timing while doing faceting/searches
> on an index created with this database.

Were you accounting for cache warming?  Were your caches sized  
appropriately?  What kind of hardware and RAM were you using?  What  
were the JVM settings?

And certainly not least important - what version of Solr are you  
running?   The difference in faceting performance and scalability  
between Solr 1.3 and what will be Solr 1.4 is quite dramatic.

> We thought that by limiting the number of properties that are available
> for faceting, the performance can be improved. To test this, we enabled
> only 6 properties for faceting by setting indexed=true (in schema.xml)
> for only these properties. All other properties which are defined as
> dynamic properties had indexed=false.

These settings won't matter - what matters in this case is what facets  
you request, not what is actually in the index.


> My questions:
> - Will reducing the number of facets improve faceting and search
> performance ?

Reducing what fields you request will, of course.  But what you  
actually index has no effect on performance until you request it.
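A minimal sketch of what "requesting only a few facets" looks like at the query level. It assumes a Solr select handler at a hypothetical http://localhost:8983/solr/select and made-up dynamic field names; `q`, `rows`, `facet`, and `facet.field` are standard Solr request parameters, everything else is illustrative.

```python
from urllib.parse import urlencode

def build_facet_query(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that facets only on the given fields."""
    params = [("q", query), ("rows", rows), ("facet", "true")]
    # One facet.field parameter per requested facet; fields that are not
    # listed here are not faceted on, whatever the schema says about them.
    params += [("facet.field", name) for name in facet_fields]
    return base_url + "?" + urlencode(params)

# Hypothetical host and field names, for illustration only.
url = build_facet_query(
    "http://localhost:8983/solr/select",
    "*:*",
    ["manufacturer_s", "voltage_s"],
)
print(url)
```

The cost of the request grows with the number of `facet.field` entries, so the list passed in here is the thing to keep short.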

> - Is there a better way to reduce the number of facets ?

Hard to say without doing a deeper analysis of your needs.

> - Will having a large number of properties defined as dynamic fields,
> reduce performance ?

Dynamic fields versus statically named fields have no effect on  
performance.

        Erik


Re: Limiting facets for huge data - setting indexed=false in schema.xml

Rahul R
Erik,
I understand that caching is going to improve performance. In fact, we did a
PSR run with caches enabled and we got awesome results. But these wouldn't
be really representative, because the PSR scripts run the same searches
again and again. These would be cached and there would be virtually no
evictions. This is not a practical case.
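A toy illustration of the concern, using a simple LRU cache written for this example (it is not Solr's cache implementation, and the query mixes and cache size are made up). A script that cycles through a handful of queries hits the cache almost every time, while a workload with more distinct queries than cache slots keeps evicting.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache that only tracks hit, miss, and eviction counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return
        self.misses += 1
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least recently used
            self.evictions += 1

def replay(queries, capacity):
    """Run a query sequence through a fresh cache and return its counters."""
    cache = LRUCache(capacity)
    for query in queries:
        cache.lookup(query)
    return cache.hits, cache.misses, cache.evictions

# Scripted test: the same 5 queries repeated. Varied load: 50 distinct queries.
scripted = ["q%d" % (i % 5) for i in range(1000)]
varied = ["q%d" % (i % 50) for i in range(1000)]

print("scripted:", replay(scripted, capacity=10))
print("varied:  ", replay(varied, capacity=10))
```

Real traffic usually sits somewhere between these two extremes, which is why the scripted numbers alone say little about production.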

My hardware (in the PSR environment where I am testing) is pretty good: 12
CPUs, 24 GB RAM, UltraSPARC III 1.2 GHz processors, Solaris 10. We have
allocated 3.2 GB RAM for WebLogic (JVM). This is the maximum that I am able
to allocate for one JVM.
I think I need to go back and check whether my query is still using all the
fields. I understand that setting indexed=false alone does not keep a field
from participating in the query.

Thanks a lot for your response.

Regards
Rahul

Re: Limiting facets for huge data - setting indexed=false in schema.xml

Erik Hatcher

On Jul 31, 2009, at 7:17 AM, Rahul R wrote:

> Erik,
> I understand that caching is going to improve performance. Infact we did
> a PSR run with caches enabled and we got awesome results. But these
> wouldn't be really representative because the PSR scripts will be doing
> the same searches again and again. These would be cached and there would
> be virtually no evictions. This is not a practical case.

I don't understand how this is not practical.  Why wouldn't having the  
caches warmed and filled with the facets be practical for your needs?

        Erik


Re: Limiting facets for huge data - setting indexed=false in schema.xml

Rahul R
In a production environment, having the caches enabled makes a lot of sense.
And most definitely we will be enabling them. However, the primary idea of
this exercise is to verify if limiting the number of facets will actually
improve the performance.

An update on this. I did verify, and it looks like although I set
indexed=false for most of the properties, I had not blocked them from
participating in the query. I have now enabled only 7 properties for
faceting, so at any given time at most 7 facets participate in the query.
Performance has improved from the earlier 60 seconds to around 10 seconds.
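One way to block the other properties from participating is an allow-list applied before the query is built. The sketch below is only an assumption about how that could look; the field names and the `select_facet_fields` helper are hypothetical and not taken from our application.

```python
# Hypothetical allow-list: the few properties enabled for faceting.
ALLOWED_FACETS = (
    "manufacturer_s", "category_s", "voltage_s", "package_s",
    "tolerance_s", "material_s", "series_s",
)

def select_facet_fields(requested, allowed=ALLOWED_FACETS, max_facets=7):
    """Keep only allow-listed fields, drop duplicates, and cap the count."""
    selected = []
    for name in requested:
        if name in allowed and name not in selected:
            selected.append(name)
    return selected[:max_facets]

# A request that asks for a disallowed property and repeats one field.
requested = ["voltage_s", "obscure_prop_1234_s", "category_s", "voltage_s"]
print(select_facet_fields(requested))
```

Only the fields returned by this filter would then be sent to Solr as facet.field parameters.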

This really helped. Thanks a lot !

Regards
Rahul


Re: Limiting facets for huge data - setting indexed=false in schema.xml

Erik Hatcher
What version of Solr?   Try a nightly build if you're at Solr 1.3 or  
earlier and you'll be amazed at the difference.

        Erik



Re: Limiting facets for huge data - setting indexed=false in schema.xml

Rahul R
We are using 1.3.0. Thanks for the suggestion. I will see if I can try one of
the nightly builds.


Re: Limiting facets for huge data - setting indexed=false in schema.xml

Yao Ge
In reply to this post by Rahul R
Having a large number of fields is not the same as having a large number of facets. Facets are something you display to users as an aid for query refinement or navigation. There is no way for a user to use 3700 facets at the same time. So it is more a question of how to determine which facets to fetch at search time, based on the user's actions or on certain predefined configurations.

I have written an application with 30-some facetable fields on millions of records. I also ran into problems calculating all the facets, since the server's resources are limited by the cache capacity and the CPU cycles available for facet calculations. I then asked myself: why display all these facets regardless of whether the user wants to see them? I changed the approach to fetch only a minimum set of facets by default and to make the rest of the facet fields open on demand (using AJAX). I was able to dramatically improve the response time by spreading the facet loading over time.

There are still issues with the total facet cache size when you have a large number of available facets, but you need to realistically evaluate what it means for a user to have a large number of facets. I don't think that, on a typical user interface, showing more than 10 filters at the same time is any more effective than starting with a small number of filters and progressively showing more on demand (hierarchical facets?).
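A rough sketch of the default-plus-on-demand split, assuming hypothetical field names and helper functions (this is not the code from my application). `q`, `rows`, `facet`, and `facet.field` are standard Solr request parameters; `rows=0` asks for facet counts without returning documents.

```python
from urllib.parse import urlencode

# Hypothetical minimum set of facets shown with every search.
DEFAULT_FACETS = ["category_s", "manufacturer_s"]

def initial_search_params(query, rows=10):
    """Parameters for the main search: results plus the default facets only."""
    params = [("q", query), ("rows", rows), ("facet", "true")]
    params += [("facet.field", name) for name in DEFAULT_FACETS]
    return urlencode(params)

def on_demand_facet_params(query, field):
    """Parameters for a follow-up (e.g. AJAX) request for one extra facet."""
    # rows=0: the documents are already on screen, only counts are needed.
    return urlencode([
        ("q", query),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", field),
    ])

print(initial_search_params("resistor"))
print(on_demand_facet_params("resistor", "tolerance_s"))
```

The first call runs on every search; the second runs only when the user opens a collapsed facet, so its cost is paid only for facets someone actually looks at.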


Re: Limiting facets for huge data - setting indexed=false in schema.xml

Yonik Seeley
On Fri, Jul 31, 2009 at 3:19 PM, Yao Ge<[hidden email]> wrote:
> Having a large number of fields is not the same as having a large number of
> facets. To facets are something you would display to users as aid for query
> refinement or navigation. There is no way for a user to use 3700 facets at
> the same time.

Indeed... it may just be a terminology issue.  Likely it's one field
with 3700 possible values.

-Yonik
http://www.lucidimagination.com