Improve indexing time


Gurjot Singh-2
Hi,
We have a Solr index of 626 MB containing 141,810 documents. We have
configured an index-based spellchecker with the buildOnCommit option set to
true. The spellcheck index is 8.67 MB.

We use the DataImportHandler to create the index from scratch and also to
update it periodically. A scheduled job runs a full import once a week and a
delta import every 20 minutes. The full import takes about 38 minutes and the
delta import about 12 minutes. The index also serves search queries, even
while a delta import is running. On average, 25 to 30 documents change
between delta imports.

Is there a way to reduce the time the delta import takes to update the
index?
The system specs are:
MS Windows Server 2003 R2 Standard x64 Edition
8 GB RAM
Solr is set up on Tomcat 6.0

CPU utilization of tomcat.exe during the delta import is 60%.

In the data-config.xml file there are 6 root entities, one for each of 6
database tables, under the <document> element. The first root entity gets the
rows from table1, the second gets the rows from table2, and so on. The root
entities have several child entities that get fields from associated
tables.
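
A layout like the one described might look roughly as follows. This is only a sketch: the driver, JDBC URL, credentials, table names, and column names are hypothetical placeholders, not the actual configuration. It shows why child entities matter for import time: by default each child entity query runs once per parent row.

```xml
<!-- Hypothetical sketch: driver, URL, tables and columns are placeholders -->
<dataConfig>
  <dataSource driver="com.example.Driver"
              url="jdbc:example://dbhost/dbname"
              user="solr" password="changeme"/>
  <document>
    <!-- Root entity 1: one Solr document per row of table1 -->
    <entity name="table1" pk="id"
            query="SELECT id, title FROM table1">
      <!-- Child entity: this query is executed once per table1 row -->
      <entity name="table1_detail"
              query="SELECT detail FROM table1_detail
                     WHERE table1_id = '${table1.id}'"/>
    </entity>

    <!-- Root entity 2: same pattern against table2 -->
    <entity name="table2" pk="id"
            query="SELECT id, title FROM table2">
      <entity name="table2_detail"
              query="SELECT detail FROM table2_detail
                     WHERE table2_id = '${table2.id}'"/>
    </entity>

    <!-- Root entities 3 to 6 would follow the same pattern -->
  </document>
</dataConfig>
```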

The mergeFactor is set to 10 and ramBufferSizeMB is set to 32. The cache
settings are as follows:

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
<documentCache class="solr.LRUCache" size="16384" initialSize="16384" autowarmCount="0"/>
<enableLazyFieldLoading>true</enableLazyFieldLoading>

Is it advisable to use a master/slave configuration? Does an index size of
626 MB justify moving from the existing single Solr core (which runs a delta
import every 20 minutes and also serves search queries) to a master/slave
configuration, considering that the index will keep growing over time?
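
For reference, a master/slave setup of the kind asked about might be configured roughly as below. This is a sketch under assumptions: it presumes a Solr version that ships the Java-based ReplicationHandler (1.4 or later), and the host name, port, and poll interval are placeholders. The master would run the imports and the slaves would serve queries.

```xml
<!-- On the master (indexing) instance, in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Publish a new index version to slaves after every commit -->
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- On each slave (query) instance, in solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- Placeholder host and port; points at the master's handler -->
    <str name="masterUrl">http://master-host:8080/solr/replication</str>
    <!-- How often to poll the master, in HH:mm:ss -->
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```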

Is there any other way to improve the indexing time?

Thanks,
Gurjot



Re: Improve indexing time

Glen Newton
Try using LuSql to create the index. It is 4-10 times faster on a
multicore machine, and can run in 1/20th the heap size Solr needs.
See slides 22-25 in this presentation comparing Solr DIH with LuSql:
 http://code4lib.org/files/glen_newton_LuSql.pdf

LuSql: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql

Disclosure: I am the author of LuSql.

Glen Newton
http://zzzoot.blogspot.com/




Re: Improve indexing time

Noble Paul നോബിള്‍  नोब्ळ्-2
In reply to this post by Gurjot Singh-2
Considering that only 20 to 30 docs change per run, indexing is not the
bottleneck. The bottleneck is probably the DB and the time the queries take
to run. Are there deltaQueries in the sub-entities? If you can create a
VIEW in the DB to identify the delta, it could be faster.
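
One way the VIEW suggestion could be wired into DIH is sketched below. The view name table1_changes and its columns id and last_modified are hypothetical; the idea is that the view exposes one row per parent id with the latest modification time across the parent and its associated tables, so a single deltaQuery on the root entity replaces the per-sub-entity delta queries. This also assumes a DIH version that supports the deltaImportQuery attribute.

```xml
<!-- Hypothetical: table1_changes is a DB view returning (id, last_modified),
     where last_modified covers table1 and its associated tables -->
<entity name="table1" pk="id"
        query="SELECT id, title FROM table1"
        deltaQuery="SELECT id FROM table1_changes
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM table1
                          WHERE id = '${dataimporter.delta.id}'">
  <!-- Child entity without its own deltaQuery; it is re-read only for
       the parent rows that the view reported as changed -->
  <entity name="table1_detail"
          query="SELECT detail FROM table1_detail
                 WHERE table1_id = '${table1.id}'"/>
</entity>
```

Running the deltaQuery by hand against the database and timing it is a quick way to check whether the query, rather than the indexing, accounts for the 12 minutes.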




--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com