Question: index performance


James liu-2
I get an OutOfMemory error when I index more than 10k records.

So for now I index 10k records (5k / record).

If I use a for loop to index more data, it always shows OutOfMemory.


I use top to monitor, and find that when indexing finishes, free memory
is 125m, and sometimes it is 218m.

Does that mean Solr is still using some of the memory after indexing finishes?


How can I index more than 10k records without being stopped by OutOfMemory?

Tomcat is set to 512m of memory.


--
regards
jl

Re: Question: index performance

Yonik Seeley-2
On 4/13/07, James liu <[hidden email]> wrote:
> I get an OutOfMemory error when I index more than 10k records.
>
> So for now I index 10k records (5k / record).

In one request?  There's really no reason to put more than hundreds of
documents in a single add request.

If you are indexing using multiple requests, and always run into
problems at 10k records, you are probably hitting memory issues with
Lucene merging.  If that's the case, try lowering the mergeFactor so
fewer segments will be merged at the same time.

Some other things to be careful of:
- don't call commit after you add every batch of documents
- don't set maxBufferedDocs too high if you don't have the memory

-Yonik
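
A minimal sketch of the batching advice above, in Python: it splits the documents into add requests of 100 and issues a single commit at the end instead of one per batch. The helper names (batched, add_message, update_messages), the batch size of 100, and the dict-shaped documents are assumptions made for illustration; only the <add>, <doc>, <field> and <commit/> elements come from Solr's XML update format. Actually sending each message to the update handler is left out.

```python
from itertools import islice
from xml.sax.saxutils import escape, quoteattr


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def add_message(batch):
    """Build one Solr XML <add> message for a batch of dict documents."""
    parts = ["<add>"]
    for doc in batch:
        parts.append("<doc>")
        for name, value in doc.items():
            # Escape both the attribute and the text so the XML stays valid.
            parts.append(
                f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def update_messages(docs, batch_size=100):
    """Yield one <add> message per batch, then a single <commit/> at the end."""
    for batch in batched(docs, batch_size):
        yield add_message(batch)
    yield "<commit/>"
```

Because update_messages is a generator, only one batch of documents is held in memory on the client at a time, which keeps each request small regardless of how many records are indexed in total.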

Re: Question: index performance

galo-2
Hi there,

I'm building an index to which I'm sending a few hundred thousand
entries. I pull them off the database in batches of 25k and send them to
Solr, 100 documents at a time. I was doing a commit after each of those,
but after what Yonik says I will remove it and commit only after each
batch of 25k.

Q1: I've got autocommit set to 1000 now in solrconfig.xml. Should I
disable it in this scenario?

Q2: To decide which of those 25k are going to be indexed, we need to do
a query for each (this is the main reason to optimize before a new DB
batch is indexed). Each of these 25k queries takes around 30ms, which is
good enough for us, but I've observed that every ~30 queries the time of
one search goes up to 150ms or even 1200ms. Then it does another ~30, etc.
I guess there is something happening inside the server regularly that
causes it. Any clues what it can be and how I can minimize that time?

Q3: The 25k searches are done without any cumulative effect on
performance (avg/search is ~30ms from start to end). But if I start
posting documents to the index immediately afterwards, Tomcat's CPU
peaks. If I stop Tomcat, and then post the 25k documents without doing
those searches, they're very quick. Is there any reason why the searches
would affect Tomcat to justify this? Just to clarify, searches are NOT
done at the same time as indexing.

My tomcat is running with -server -Xmx512m -Xms512m

Cheers,

galo


Re: Question: index performance

Chris Hostetter-3

: Solr, 100 documents at a time. I was doing a commit after each of those,
: but after what Yonik says I will remove it and commit only after each
: batch of 25k.

Do the commit only when you think it's necessary to expose those docs to
your search clients, one of which may be "you" checking on the progress of
your index build.

: Q1: I've got autocommit set to 1000 now in solrconfig.xml. Should I
: disable it in this scenario?

I'm guessing you don't want that if you are doing full builds on a regular
basis. Its intent is for indexes that are being continuously updated, where
you just want to know that eventually a commit will happen (without
needing to ever call it explicitly).

: Q2: To decide which of those 25k are going to be indexed, we need to do
: a query for each (this is the main reason to optimize before a new DB
: batch is indexed). Each of these 25k queries takes around 30ms, which is
: good enough for us, but I've observed that every ~30 queries the time of
: one search goes up to 150ms or even 1200ms. Then it does another ~30, etc.
: I guess there is something happening inside the server regularly that
: causes it. Any clues what it can be and how I can minimize that time?

Are these queries happening simultaneously with the updates? The
autocommitting will be causing a newSearcher to be opened, and the first
search on it will have to pay some added cost.

Besides autocommit, there is nothing that happens automatically on a
recurring basis in Solr. There may be something else running on your box
that is using RAM, which takes away from the disk page cache, which
causes some searches to need to reread pages (pure speculation).
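
One way to tell these two explanations apart is to log the time of every query and look at how the slow ones are spaced: evenly spaced spikes point to something periodic such as autocommit, while irregular ones fit better with cache effects. A rough Python sketch, where the function names and the "3 times the median" threshold are arbitrary assumptions and not anything Solr provides:

```python
from statistics import median


def find_spikes(latencies_ms, factor=3.0):
    """Return indexes of queries slower than `factor` times the median."""
    if not latencies_ms:
        return []
    threshold = factor * median(latencies_ms)
    return [i for i, ms in enumerate(latencies_ms) if ms > threshold]


def spike_gaps(spike_indexes):
    """Return the number of queries between consecutive spikes."""
    return [b - a for a, b in zip(spike_indexes, spike_indexes[1:])]
```

Feeding it the per-query times from a run of the 25k searches and comparing the gaps against the ~30 query interval described above would show whether the pattern is really regular.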

: Q3: The 25k searches are done without any cumulative effect on
: performance (avg/search is ~30ms from start to end). But if I start
: posting documents to the index immediately afterwards, Tomcat's CPU
: peaks. If I stop Tomcat, and then post the 25k documents without doing
: those searches, they're very quick. Is there any reason why the searches
: would affect Tomcat to justify this? Just to clarify, searches are NOT
: done at the same time as indexing.

I'm having trouble understanding your question... how can you post
documents after stopping Tomcat?



-Hoss